The world is starved for GPU chips from the dominant artificial intelligence vendor, Nvidia. That has so far not produced a meaningful surge in chip sales by competitors Advanced Micro Devices and Intel. But it may be helping to build a new kind of computing model.
“It’s increasingly the case that there’s, sort of, one alternative to Nvidia,” said Andrew Feldman, co-founder and CEO of AI computing startup Cerebras Systems, which sells a massive AI computer, the CS-2, running the world’s largest chip.
Feldman and team began selling computers to compete with Nvidia’s GPUs four years ago. A funny thing happened on the way to market. Increasingly, Feldman finds his business is a hybrid one: some sales of individual systems, but far larger sales of massively parallel clusters that Cerebras builds over months and then operates on behalf of clients as a dedicated AI cloud computing service.
The business “has changed completely” for Cerebras, Feldman told ZDNET. “Rather than buying one or two machines, and putting a [computing] job on one machine for a week, customers would rather have it on 16 machines for a few hours” as a cloud service model.
The result for Cerebras is: “For hardware sales, you can do fewer, bigger deals, and you’re gonna spend a lot of time and effort in the management of your own cloud.”
On Monday, at a supercomputing conference in Denver called SC23, Feldman and team unveiled the latest achievement of that expanding AI cloud.
The company announced it has completed construction of a massive AI computer, Condor Galaxy 1, or “CG-1,” built for client G42, a five-year-old investment firm based in Abu Dhabi in the United Arab Emirates.
Condor Galaxy, announced earlier this year, is named for a spiral galaxy located 212 million light-years from Earth. The machine is a collection of 64 of Cerebras’s CS-2s. The total value of CG-1, said Feldman, is a bit less than the cost of an equivalent number of Nvidia GPUs, on the order of $150 million, based on the price of Nvidia’s eight-way “DGX” computer.
“This is a very good business,” said Feldman of such large-ticket sales. “We’re having a monster year” in terms of sales, he said.
The Condor Galaxy machine is not physically in Abu Dhabi, but rather installed at the facilities of Santa Clara, California-based Colovore, a hosting provider that competes in the market for cloud services with the likes of Equinix.
Cerebras is beginning construction of the second Condor Galaxy machine, “CG-2,” which will add another 64 computers and four more exaFLOPs of computing power. (An exaFLOP is a billion billion, or 10^18, floating-point operations per second.) That brings the combined Condor Galaxy total to 8 exaFLOPs.
The Condor Galaxy system is expected to total, in its final configuration, 36 exaFLOPs, using 576 CS-2 computers, overseen by 654,000 AMD CPU cores.
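The quoted figures are internally consistent, as a quick back-of-the-envelope sketch shows. This is purely illustrative arithmetic based on the numbers in the article, assuming each 64-machine stage contributes the same 4 exaFLOPs that CG-1 is rated at:

```python
# Back-of-the-envelope check of the Condor Galaxy figures quoted above
# (illustrative arithmetic only, not official Cerebras specifications).

EXAFLOP = 10**18          # one exaFLOP = a billion billion FLOP/s
CS2_PER_STAGE = 64        # CS-2 systems per Condor Galaxy stage
EXAFLOPS_PER_STAGE = 4    # exaFLOPs per 64-machine stage (CG-1's rating)

# Per-machine throughput implied by the quoted figures:
per_cs2 = EXAFLOPS_PER_STAGE * EXAFLOP / CS2_PER_STAGE
print(f"Implied per-CS-2 throughput: {per_cs2:.2e} FLOP/s")

# Final configuration: 576 CS-2s, i.e. nine 64-machine stages.
stages = 576 // CS2_PER_STAGE
total_exaflops = stages * EXAFLOPS_PER_STAGE
print(f"{stages} stages -> {total_exaflops} exaFLOPs")
```

Nine stages of 4 exaFLOPs each gives the 36-exaFLOP final configuration the article cites.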
In the new hybrid business model, said Feldman, the measure of success becomes not just system sales but also the rate of new customers renting capacity in the Cerebras cloud without any up-front purchases. “Sometimes you’d ship them hardware and they’d set it up, and you’d run a trial or prove it out on their premises, and now we give you a login,” explained Feldman of the new mode of sales.
Pharmaceutical giant GlaxoSmithKline, an early customer for CS-2 hardware, is also renting capacity in Cerebras’s cloud, said Feldman. “They have our gear on-premise, and then, when they want to do giant runs, they come into our cloud,” he explained. “And that’s a very interesting model.”
“We now have, sort of, so much super-compute capacity that other people are using our system in all sorts of creative ways,” said Feldman. “In the AI space, they’re training interesting models, and in the super-compute space, they’re doing interesting work — and this just isn’t the case with anybody else.”
Feldman cited as “unbelievably interesting” AI work done on Condor Galaxy the development of an open-source large language model, akin to OpenAI’s GPT. That program is the best-performing model with 3 billion neural network “parameters” on the machine learning repository Hugging Face, noted Feldman, with more than a billion downloads. The program is small enough to run on a smartphone for AI inference, which, said Feldman, is the intention.
As an example of scientific work, Feldman cited a research paper by scholars at King Abdullah University of Science and Technology in Saudi Arabia that was a finalist for the distinguished Gordon Bell Award given out by the Association for Computing Machinery, organizer of the SC23 event.
“We lent them time on Condor Galaxy so they could break records for seismic processing,” noted Feldman.
The first version of Condor Galaxy, CG-1, took 70 days to complete, said Feldman. The CG-2 machine will be finished “early next year.” The company is already planning Condor Galaxy 3, or “CG-3,” which will add another 64 machines and another four exaFLOPs, for a system total of 12 exaFLOPs.
One of the key advantages of a machine such as Condor Galaxy, both 1 and 2, said Feldman, is the systems’ engineering. Putting together an equivalent number of GPU chips is incredibly difficult, he told ZDNET. “The number of people that can network a thousand GPUs is very small,” said Feldman. “It’s, maybe, 25 companies.”
“It is very hard to get efficient use of that much distributed compute, it’s a very, very difficult problem,” said Feldman. “That’s one of the problems we fundamentally solve.”
Every CS-2 computer in Condor Galaxy 1 and 2 contains one of Cerebras’s novel AI chips, the “Wafer Scale Engine,” or WSE. Those chips, the biggest in the world, each contain 850,000 individual “cores” to process AI instructions in parallel, making them the equivalent of multiple GPU chips.
In addition, the CS-2 computers are supplemented by Cerebras’s special-purpose “fabric” switch, SwarmX, and its dedicated memory hub, MemoryX, which are used to cluster together the CS-2s.