Chip upstarts make a collective turn

In the grand arena of AI chips, large-scale training, once regarded as the technological holy grail, is quietly giving way to a lower-profile but more realistic market: inference.

Nvidia still leads the training chip market, while Cerebras continues to bet on ultra-large-scale computing platforms built around wafer-sized chips. But other players that once fought for training chips—Graphcore, Intel's Gaudi, SambaNova and others—are quietly turning to another battlefield: AI inference.

This trend is not accidental.

In an industry driven by capital, compute and software ecosystems, Nvidia's CUDA toolchain and mature GPU ecosystem work with virtually every mainstream framework, giving it almost complete say over the training chip market. Cerebras has taken a different path with its wafer-scale training platform, but its use is still largely confined to research institutions and a small number of commercial scenarios.

Under this pattern, new chip companies have almost no room to survive in the training market. "Training chips are not an arena for most players," one AI infrastructure entrepreneur admitted. "Just bidding for a large-model training order means burning tens of millions of dollars, and you still may not win."

Because of this, startups that once went head-to-head with Nvidia have begun looking for markets that are easier to enter and can scale. Inference chips are the obvious option.

Graphcore: Inference as a lifeline

Founded in 2016, the British AI chip unicorn Graphcore was once one of Nvidia's most formidable challengers. Its IPU (Intelligence Processing Unit) is built around a massively parallel processing architecture aimed at neural network training.

According to Graphcore, the IPU is a processor designed specifically for artificial intelligence and machine learning workloads. It differs from traditional CPUs and GPUs in both structure and processing model, and aims to run AI training and inference tasks more efficiently.

As global demand for AI chips kept surging, Graphcore rose rapidly and drew a great deal of investor attention in a short time. In 2020, Graphcore released the Colossus MK2 GC200 IPU, built on TSMC's 7-nanometer process and said to approach the performance of Nvidia's A100. The same year, it raised $222 million at a valuation of $2.8 billion, making it one of the most promising startups in the UK.

In the view of Graphcore co-founder and CTO Simon Knowles, it is unwise to compete with Nvidia head-on. On The Robot Brains Podcast he shared his core rule for founders: never build an improved version of a large company's existing product, because the incumbent's huge market base makes direct competition nearly impossible for a startup.

He believes AI will permeate every field of future technology, and that no single architecture can serve the needs of every industry. Graphcore only needs to make the IPU better than the GPU in specific areas to win a share of this rapidly growing market.

Thanks to this architectural difference, the IPU is particularly suited to high-performance computing tasks that today's CPUs and GPUs handle poorly, above all "sparse data" processing. Molecular modelling is a typical use case: molecules are irregularly arranged, behave in complex ways and are small in size, and the IPU's massively parallel structure is well suited to such irregular data structures.
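
As a rough illustration of why sparse, irregular data sits awkwardly on dense-matrix hardware, the sketch below compares the memory footprint of a dense versus a sparse representation of a mostly-empty adjacency matrix of the kind used for molecular graphs. This is plain Python with SciPy, not IPU or Poplar code, and the matrix size and sparsity are made-up illustrative numbers.

```python
# Illustrative only: dense vs. sparse storage for an irregular,
# mostly-zero adjacency matrix (e.g. a molecular graph).
# Sizes and density are arbitrary; this is not IPU or Poplar code.
import numpy as np
from scipy import sparse

n_nodes = 10_000          # hypothetical graph size
density = 0.0005          # ~0.05% of entries are non-zero

adj = sparse.random(n_nodes, n_nodes, density=density,
                    format="csr", random_state=0, dtype=np.float32)

dense_bytes = n_nodes * n_nodes * 4                                  # full float32 matrix
sparse_bytes = adj.data.nbytes + adj.indices.nbytes + adj.indptr.nbytes  # CSR arrays

print(f"dense representation : {dense_bytes / 1e6:.1f} MB")
print(f"sparse representation: {sparse_bytes / 1e6:.2f} MB")
```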

In practice, the IPU has performed strongly in chemical materials and healthcare, and has been used in COVID-19 research. In 2020, Microsoft's Sujeeth Bharadwaj integrated the IPU into Azure to identify COVID-19 in chest X-rays. He said: "The Graphcore chip can complete in 30 minutes the work of Nvidia's traditional chips."

On the business side, Graphcore packages IPUs into "pod" systems and sells them to cloud computing and server manufacturers. The most eye-catching deal came in November 2019, when Microsoft signed a processor purchase agreement with Graphcore, a windfall for a startup.

Unfortunately, reality was cruel. As the market kept raising the bar for training platforms, Graphcore's IPU systems struggled to shake Nvidia's position in large-scale AI training. In the spring of 2021, Microsoft ended its cooperation with Graphcore, and the startup began to decline. To cut costs, Graphcore announced layoffs in September 2022 and closed its Oslo office the following month.

In 2023, Graphcore was reported to have made deep layoffs, shut down its North American operations, and abandoned its IPO plans. Co-founder Simon Knowles admitted in an internal talk: "The training market is too concentrated; we need to turn to deployment scenarios that can actually bring in revenue."

In July 2024, Japan's SoftBank Group announced it had completed its acquisition of Graphcore, which then shifted its focus to efficient inference for enterprise AI deployments. The company re-optimized its Poplar SDK, launched a lightweight model inference acceleration solution, and now targets "high throughput, low power" AI inference for finance, healthcare and government scenarios.

For Graphcore, inference may be its last lifeline.

Intel Gaudi: No longer fighting the GPU

Founded in 2016, Habana Labs was once one of Israel's star companies, with products aimed at AI inference and training. In 2018 it launched its first product, the Goya inference processor, used mainly for AI inference and prediction; Gaudi, launched in 2019, targeted AI training. Before its acquisition, Habana had already assembled a fairly complete product line spanning both training and inference chips.

In 2019, Intel acquired Habana for $2 billion, and Gaudi became an important piece of its AI training strategy. In May 2022, Intel officially released the Gaudi2 and Greco deep learning accelerators, built on a 7nm process; according to Intel, Gaudi2's throughput is twice that of Nvidia's A100 GPU.

Although the Gaudi series could challenge Nvidia on some performance metrics, subsequent market feedback showed that adoption of the Gaudi training platform remained sluggish, even among cloud providers.

A former Intel executive admitted: "From the moment it acquired Habana, Intel never understood why it was running two departments developing competing architectures at the same time, Habana and the GPU group." Former Habana employees saw Intel's bureaucracy as a serious obstacle: "At Habana, a five-minute corridor conversation could settle a decision; at Intel, the same decision takes three meetings with dozens of participants and still goes nowhere."

Until 2022, Intel ran the two tracks in parallel: selling Gaudi processors while developing its competing product, the Ponte Vecchio GPU. But as generative AI models such as ChatGPT took off and Nvidia's market dominance solidified, Intel once again faced negative feedback from customers.

In mid-2023, Intel folded Gaudi into a newly established AI acceleration product line and repositioned Gaudi 3 around "training plus inference," with inference performance and cost-effectiveness as the new selling points.

When Gaudi 3 was released in early 2024, Intel focused its promotion on accelerating large language models in inference scenarios: running models such as Meta's Llama 2, it claims lower latency and higher energy efficiency than Nvidia's A100. More importantly, Intel pushed hard on cost, claiming "inference throughput per dollar" nearly 30% higher than comparable GPU chips.
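
To make the "throughput per dollar" framing concrete, the back-of-envelope calculation below shows how such a figure is typically derived. All token rates and hourly prices here are hypothetical placeholders chosen only to illustrate the arithmetic; they are not published Gaudi or GPU numbers.

```python
# Back-of-envelope "inference throughput per dollar" comparison.
# All figures below are hypothetical placeholders, not vendor data.
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens generated per dollar of accelerator rental time."""
    return tokens_per_second * 3600 / price_per_hour

gpu_like = tokens_per_dollar(tokens_per_second=1000, price_per_hour=4.00)
gaudi_like = tokens_per_dollar(tokens_per_second=900, price_per_hour=2.80)

print(f"GPU-like accelerator  : {gpu_like:,.0f} tokens/$")
print(f"Gaudi-like accelerator: {gaudi_like:,.0f} tokens/$")
print(f"relative advantage    : {gaudi_like / gpu_like - 1:.0%}")
```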

Eventually, Intel tried to consolidate: it merged Habana with its GPU division and set out to develop a new AI processor called Falcon Shores, a hybrid chip combining a GPU (Nvidia's territory) with a CPU (Intel's specialty). Habana employees questioned the move and joked bitterly: "Suddenly, they remembered us."

Earlier this year, Intel announced that Falcon Shores, the next-generation processor from the Habana line, had received negative feedback from customers and would not be sold commercially. Roughly six months before that, Intel had disclosed that Gaudi would miss its target of $500 million in revenue for 2024, and decided not to develop a successor to Gaudi 3.

To date, Gaudi 3 has been packaged into AI servers from manufacturers such as Supermicro, and is being used to deploy large models for enterprises in scenarios such as private semantic search, document summarization and customer service bots. For mid-sized and large enterprise customers who want to "partially replace public cloud inference APIs," Gaudi is becoming a price-friendly option.

For Intel, the GPU business, Gaudi included, keeps shrinking in importance, and in the future it too is likely to lean more toward inference than training.

Groq: Trading speed for market share

Groq, another company that started out in AI chips, has a story that traces back to inside Google. Its founder, Jonathan Ross, was the chief architect of Google's first-generation TPU (Tensor Processing Unit). After witnessing the TPU's breakthroughs in deep learning training and inference, Ross left Google in 2016 to found Groq, attempting to build a "universal AI processor" faster and more controllable than the TPU.

Groq's core technology is its self-developed LPU (Language Processing Unit) architecture. The LPU abandons traditional out-of-order execution and dynamic scheduling in favor of a "deterministic design": static scheduling, fixed data paths, and fully predictable execution. Groq claims this design achieves extremely low latency and high throughput, making it ideal for large-scale inference tasks.

At first, Groq also bet on training, trying early on to push the LPU into large-model training and claiming its architecture could deliver higher utilization and faster training cycles than GPUs. But reality was harsh: Nvidia's CUDA ecosystem moat is nearly unshakable, and competition in the training market is a game of "big ecosystem + big capital + big customers." For a chip startup, winning recognition from mainstream AI labs and cloud vendors is difficult.

At the same time, Groq's architecture had limited compatibility with mainstream AI frameworks such as PyTorch and TensorFlow and lacked a mature compiler toolchain, making the cost of migrating training workloads extremely high. These realities forced Groq to rethink its entry point into the market.

Starting in the second half of 2023, Groq pivoted decisively toward Inference-as-a-Service, building a complete "AI inference engine platform": not just chips, but ultra-low-latency APIs open to developers and enterprises, with the pitch of "results within milliseconds of text input."

In 2024, Groq demonstrated its system generating more than 300 tokens per second running the Llama 2 70B model, far faster than mainstream GPU systems. That advantage quickly attracted latency-sensitive users in vertical industries such as financial trading systems, military information processing, and real-time subtitle generation for voice and video.
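
For a sense of what 300 tokens per second means in interactive use, the short calculation below converts the claimed generation rate into end-to-end time for a chat-length reply. The 500-token reply length and the slower comparison rate are assumed example values, not benchmark results.

```python
# What a claimed generation rate means for a single response.
# Reply length and the comparison rate are assumptions, not benchmarks.
claimed_rate = 300       # tokens per second (Groq's claimed figure)
baseline_rate = 60       # tokens per second (assumed GPU serving rate)
reply_length = 500       # tokens in a typical chat answer (assumption)

print(f"at {claimed_rate} tok/s : {reply_length / claimed_rate:.1f} s per reply")
print(f"at {baseline_rate} tok/s  : {reply_length / baseline_rate:.1f} s per reply")
```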

Groq has also broadened its positioning from "AI chip" to "AI processing platform," giving developers API access through GroqCloud and integrating with ecosystems such as LangChain and LlamaIndex, in an effort to turn itself into a large-model inference cloud focused on speed.

Groq is currently working with several AI application startups as the provider of their low-latency back-end inference, with early deployments in lightweight assistants, embedded interactive devices and high-frequency Q&A systems.

For Groq, a focus on inference speed has made it stand out among AI chip startups.

SambaNova: From System as a Service to Inference as a Service

SambaNova is one of the few AI chip startups whose business rests on "selling systems" rather than "selling chips." Its Reconfigurable Dataflow Unit (RDU) architecture is built on dataflow computing, sells on high throughput, and has shown strong results training large Transformer models.

SambaNova once put great weight on training models on its hardware: it published articles on how to train on the platform, showcased its training performance, and highlighted training in official documentation. Many analysts and outside observers saw the ability to serve both the training and inference markets with one chip as a major advantage over competitors such as Groq, one of the first startups to pivot to inference.

The company also invested heavily in efficient training support. Around 2019-2021, SambaNova engineers spent considerable time implementing kernel code for the NAdam optimizer, a momentum-based optimizer commonly used to train large neural networks. Its software and hardware were designed and optimized for training, the message both inside and outside the company was about training, and training was long a core part of SambaNova's value proposition.
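
For readers unfamiliar with the optimizer mentioned above, the sketch below shows what an NAdam training step looks like at the framework level, using stock PyTorch rather than SambaNova's RDU kernels; the tiny model, data and hyperparameters are placeholders.

```python
# Minimal NAdam training step in stock PyTorch (not RDU kernel code).
# Model, data, and hyperparameters are placeholder values.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)                 # dummy batch
y = torch.randint(0, 10, (32,))          # dummy labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                          # gradients: the part inference never needs
optimizer.step()                         # Nesterov-momentum Adam update
print(f"loss: {loss.item():.4f}")
```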

Since 2022, however, SambaNova's sales focus has quietly shifted. The company launched the "SambaNova Suite" enterprise AI system, which no longer emphasizes model training but instead focuses on "AI inference as a service." Customers need neither complex hardware nor an AI engineering team; they simply call an API for large-model inference, while SambaNova supplies the compute and optimized models behind the scenes.

In late April this year, SambaNova Systems sharply changed course, announcing a 15% layoff and turning its focus entirely to AI inference, all but abandoning its earlier emphasis on training.

According to reports, its systems are particularly suited to sectors with strong demand for privately deployed models, such as government, finance, and healthcare. In these sectors, data is sensitive and subject to compliance requirements, and companies prefer to control the environment in which models run. SambaNova offers them a "turnkey large-model" solution: an inference platform focused on easy deployment, low latency, and compliance.

SambaNova has now established partnerships with a number of Latin American financial institutions and European energy companies, providing large-model inference services such as multilingual text analysis, intelligent Q&A and security auditing, and its commercialization path is gradually becoming clear.

After its share of setbacks, SambaNova too has found its position in the inference market.

Inference is the more welcoming market

In a report, one analyst pointed out that efficient training requires a complex memory hierarchy: on-chip SRAM, in-package HBM and off-chip DDR. AI startups struggle to obtain HBM, and find it even harder to integrate HBM into high-performance systems, so many AI chips, such as those from Groq and d-Matrix, lack the HBM or DDR capacity and bandwidth to train large models efficiently. Inference has no such problem: no gradients are needed, and activation values can be discarded once used. This greatly reduces the memory burden of inference workloads, and with it the complexity of the memory system an inference-only chip requires.
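
The memory difference the analyst describes is visible even at the framework level: under a no-gradient context, activations are not retained for backpropagation and no gradient buffers are allocated. A minimal PyTorch sketch, with an illustrative toy model rather than chip-level numbers:

```python
# Illustrates why inference needs far less memory than training:
# no gradients, no saved activations, nothing kept for a backward pass.
# Model and batch sizes are illustrative, not tied to any chip.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096))
x = torch.randn(64, 4096)

# Training-style forward: autograd records activations for backward,
# and gradient buffers (plus optimizer state, in a real loop) pile up.
loss = model(x).sum()
loss.backward()          # needs the saved activations and gradient buffers

# Inference-style forward: nothing is saved for a backward pass,
# so each layer's activations can be freed as soon as they are consumed.
with torch.no_grad():
    out = model(x)
```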

Another challenge is the network between chips. All gradients produced during training must be synchronized across every chip taking part, which requires a large, complex, fully interconnected network to train efficiently. Inference, by contrast, is a feedforward operation: each chip only needs to talk to the next chip in the inference chain. Many startups' chips have limited networking capability, unsuited to the fully connected fabric that training demands but more than enough for inference workloads. Nvidia, meanwhile, handles both the memory and the networking challenges of AI training well.
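
The two communication patterns can be sketched side by side. In data-parallel training, every chip must fold in every other chip's gradients (an all-reduce), while pipelined inference only hands activations to the next stage. The toy below simulates both patterns in a single process with PyTorch tensors; it is a schematic, not real multi-chip code, and the shapes are arbitrary.

```python
# Schematic contrast of training vs. inference communication,
# simulated in-process (no real interconnect); shapes are arbitrary.
import torch

num_chips = 4
grads = [torch.randn(1024) for _ in range(num_chips)]   # one gradient per chip

# Training: all-reduce — every chip must receive every other chip's
# contribution, so traffic and topology grow with the participant count.
reduced = torch.stack(grads).sum(dim=0) / num_chips
synced = [reduced.clone() for _ in range(num_chips)]    # all chips end up identical

# Inference: a feedforward pipeline — each chip only hands its
# activations to the next stage, point to point.
stages = [torch.nn.Linear(1024, 1024) for _ in range(num_chips)]
x = torch.randn(1, 1024)
for stage in stages:          # chip i -> chip i+1, nothing broadcast
    x = stage(x)
```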

For now, Nvidia's advantages are overwhelming. Thanks to the versatility CUDA gives the GPU, Nvidia hardware can perform every operation that training and inference require. And over the past decade, Nvidia has not only built chips highly optimized for machine learning workloads, it has also optimized the entire memory and network architecture to support large-scale training and inference.

Each chip carries a large amount of HBM, letting Nvidia hardware cache the gradient updates of every training step easily and efficiently. Combined with scale-up technologies such as NVLink and scale-out technologies such as InfiniBand, Nvidia hardware can provide the fully interconnected network needed to globally update the weights of an entire large neural network after every training step. Inference-only chips like Groq's and d-Matrix's cannot compete with Nvidia on memory or networking.

And Nvidia's training advantage turns out to be more than HBM and networking. It has invested heavily in low-precision training, and the top AI labs have done a great deal of hyperparameter tuning to fit the intricacies of Nvidia's low-precision training hardware. Moving training from Nvidia to another chip means porting extremely sensitive training code to a brand-new hardware platform and dealing with a whole new set of pitfalls. For a model on the scale of GPT-4, that migration is extremely costly and risky.
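
In framework terms, low-precision training is usually expressed through mixed-precision APIs such as PyTorch's autocast and gradient scaling. The sketch below shows that standard pattern on generic CUDA hardware; it assumes a CUDA device is available, and the model, data and hyperparameters are placeholders rather than anything specific to Nvidia's kernel-level behavior.

```python
# Standard mixed-precision (FP16) training step in PyTorch.
# Generic pattern only; assumes a CUDA device, placeholder model and data.
import torch
import torch.nn as nn

device = "cuda"   # assumption: a CUDA-capable accelerator is present
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(16, 1024, device=device)
target = torch.randn(16, 1024, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(x), target)   # forward runs in FP16
scaler.scale(loss).backward()   # scale loss to avoid FP16 gradient underflow
scaler.step(optimizer)          # unscale, skip step if inf/nan gradients
scaler.update()
```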

AI inference is nothing new, but as more and more chip companies turn to embrace it, it represents not just a market trend but a strategic shift. In the inference market, the winner can be a small team that understands user needs, or a startup focused on edge computing.

Future competition among AI chips will no longer revolve only around floating-point performance and TOPS; it is entering a stage closer to the "real world," an era that prizes cost, deployment, and maintainability. For AI chip companies, the move from training to inference is not an abandonment of technical ideals but a step toward industrial reality.
