Should a Hardware Firm Build a Public Cloud?

AI is HOT, and everybody wants to get their hands on the latest Nvidia GPU rigs to train LLMs. Of course, these rigs are crazy-expensive, and only the best-funded companies can afford to equip a data center of their own with them. Even those that can afford to buy might find themselves at the end of a long waitlist for delivery.

So if you simply must train LLMs on this kind of hardware right now, your only realistic option is cloud computing. In fact, even before the LLM craze, there was a great case to be made that most machine learning and model training at scale was best performed on a utility computing provider like AWS or Azure.

Currently, Nvidia pretty much owns the high-end GPU hardware space (although AMD and various companies designing custom silicon are rushing to catch up). I’ve seen people on Twitter and elsewhere ask why Nvidia doesn’t just “cut out the middleman” — AWS, Azure, etc. — and sell access to their GPU hardware as a cloud computing service themselves. After all, providing cloud platform services has so far proved to be a lucrative business model for Amazon and others, right?

You never know, maybe they will at some point. But there are a lot of compelling reasons why they might not want to. I’ll put aside the business question of whether it would be wise for Nvidia to compete with their best customers (usually a dubious proposition) and focus on the technical issue. And that issue is:

Having a great hardware story is only the first of many challenging steps to providing a cloud service.

Say you have racks and racks of A100s and H100s connected by a lot of fast networking. What do you have to do to make those available to paying customers in the cloud?

Obviously you need some datacenter space. That space needs cheap and reliable power, access to a fast fiber backbone, and physical security. For reliability, latency, and maybe data sovereignty reasons, you almost certainly need multiple datacenters that are geographically distributed. And all of this infrastructure will ultimately be run by people whom you will need to hire and train.

Then the real fun starts. You need to surround your hardware with a software platform: some sort of portal and APIs that expose your hardware to customers. Those customers expect things like security, compliance, governance, and accurate billing, and each of those problems is incredibly complicated. (Billing, yikes!) You’ll almost certainly also need supporting services like storage and some conventional compute.
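To make that concrete, here’s a minimal sketch in Python of the kind of tenant-and-provisioning model hiding behind “portal and APIs.” Everything in it is hypothetical (the SKU names, the rates, the quota logic), but even a toy version immediately drags in quota enforcement and metered billing:

```python
# A toy control plane for renting out GPUs. All names, types, and rates
# here are hypothetical illustrations, not any real provider's API.
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GpuInstance:
    instance_id: str
    gpu_type: str          # e.g. "a100-80gb" or "h100-sxm" (made-up SKUs)
    started_at: datetime
    hourly_rate_usd: float


@dataclass
class Tenant:
    tenant_id: str
    gpu_quota: int
    instances: list[GpuInstance] = field(default_factory=list)

    def provision(self, gpu_type: str, hourly_rate_usd: float) -> GpuInstance:
        # Quota enforcement: the first of many governance checks a real
        # platform needs, alongside authn/authz, audit logs, and compliance.
        if len(self.instances) >= self.gpu_quota:
            raise RuntimeError(f"GPU quota exceeded for tenant {self.tenant_id}")
        instance = GpuInstance(
            instance_id=str(uuid.uuid4()),
            gpu_type=gpu_type,
            started_at=datetime.now(timezone.utc),
            hourly_rate_usd=hourly_rate_usd,
        )
        self.instances.append(instance)
        return instance

    def current_bill_usd(self) -> float:
        # Naive metering: elapsed hours times hourly rate. Real billing also
        # has to handle proration, reservations, discounts, taxes, refunds,
        # and disputes, which is why billing gets its own "yikes" above.
        now = datetime.now(timezone.utc)
        return sum(
            (now - inst.started_at).total_seconds() / 3600 * inst.hourly_rate_usd
            for inst in self.instances
        )


if __name__ == "__main__":
    tenant = Tenant(tenant_id="acme-ai", gpu_quota=8)
    tenant.provision("h100-sxm", hourly_rate_usd=12.00)
    print(f"Current bill: ${tenant.current_bill_usd():.6f}")
```

And notice everything this toy hand-waves past: authentication, idempotent retries, capacity placement, failure handling, and the rest of what a real control plane has to get right.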

And whatever you put together will have to compete on (price × performance × features) with Amazon, Microsoft, Google, and others, who you can rest assured won’t be standing still waiting for you to catch up.

As a hardware design company, you most likely don’t have many engineers (if any) with experience building public cloud infrastructure, so you’ll have to go out and hire or poach some. Even after the tech recession, the ones you really need won’t work for cheap.

As an aside, I remember the days in Seattle when Oracle was paying top dollar to any AWS engineer who would talk to them, just to bootstrap some table-stakes services for Oracle Cloud. It cost them a lot of money, and people like to make fun of Oracle, but it’s actually quite impressive that they eventually built a respectable cloud platform after starting years behind.

Anyway, if you think that putting all the pieces together for a high-performance cloud service from scratch sounds like a bit of a costly headache, you’re probably right. If you already have a successful core business, it’s hard to see why you’d want to go there at this point.

Again, never say never, but I suspect you’d want to just keep partnering with the cloud incumbents who have already done the hard work of building a public cloud platform.
