- Cerebras Systems introduces the fastest AI inference solution with unprecedented speeds.
- Llama 3.1 8B and 70B models achieve 1,800 and 450 tokens per second, respectively.
- 20x faster than GPU-based solutions, with 100x higher price-performance.
- Maintains state-of-the-art accuracy using 16-bit precision throughout inference.
- The Wafer Scale Engine 3 delivers 7,000x more memory bandwidth than leading GPUs.
- Cerebras Inference offers three pricing tiers: Free, Developer, and Enterprise.
Main AI News:
Cerebras Systems has announced what it claims to be the fastest AI inference solution on the market. The Cerebras Inference platform reportedly achieves 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, performing 20 times faster than traditional GPU-based solutions in hyperscale cloud environments.
With pricing starting at just 10 cents per million tokens, Cerebras Inference offers a compelling alternative to GPUs, delivering 100x higher price-performance for AI workloads. Unlike approaches that trade accuracy for speed, Cerebras maintains state-of-the-art accuracy by operating entirely in the 16-bit domain throughout the inference process.
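As a quick sanity check on those figures, here is a back-of-the-envelope sketch using only the rate and price quoted above; the 1,000-token response length is an illustrative assumption, and real latency would also include network and queuing overhead:

```python
# Illustrative cost/latency estimate built only from the article's quoted figures.
PRICE_PER_MILLION_TOKENS = 0.10  # USD, quoted Developer Tier starting price
TOKENS_PER_SECOND = 1800         # quoted Llama 3.1 8B throughput

output_tokens = 1_000            # a typical long-form response (assumption)
latency_s = output_tokens / TOKENS_PER_SECOND
cost_usd = output_tokens / 1e6 * PRICE_PER_MILLION_TOKENS

print(f"~{latency_s:.2f} s to generate, ~${cost_usd:.4f}")  # ~0.56 s, ~$0.0001
```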
Powered by the Cerebras CS-3 system and the groundbreaking Wafer Scale Engine 3, which boasts 7,000 times the memory bandwidth of competing GPUs, Cerebras Inference addresses the fundamental challenge of memory bandwidth in generative AI.
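To see why memory bandwidth is the binding constraint, note that autoregressive decoding must read every model weight once per generated token. A rough, illustrative calculation follows; the 16-bit weight size is stated in the article, but the single-stream, weights-read-per-token model and the omission of KV-cache traffic are simplifying assumptions:

```python
# Back-of-the-envelope: weight-read bandwidth implied by single-stream decoding.
# Assumptions: 2-byte (16-bit) weights, every weight read once per token,
# KV-cache and activation traffic ignored.
BYTES_PER_PARAM = 2

for name, params, tok_per_s in [
    ("Llama 3.1 8B", 8e9, 1800),   # throughputs quoted in the article
    ("Llama 3.1 70B", 70e9, 450),
]:
    bandwidth = params * BYTES_PER_PARAM * tok_per_s  # bytes/second of weight reads
    print(f"{name}: ~{bandwidth / 1e12:.0f} TB/s of weight reads")

# Llama 3.1 8B:  ~29 TB/s
# Llama 3.1 70B: ~63 TB/s
```

Both figures are well beyond the memory bandwidth of any single contemporary GPU, which is the bottleneck Cerebras targets with wafer-scale on-chip memory.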
Micah Hill-Smith, co-founder and CEO of Artificial Analysis, highlighted Cerebras’ achievement in AI inference benchmarks, noting that the company has set a new performance standard, particularly for Meta’s Llama 3.1 8B and 70B models. Artificial Analysis has verified that these models running on Cerebras Inference deliver speeds above 1,800 and 446 tokens per second, respectively, while producing quality results in line with Meta’s official 16-bit precision versions.
As AI inference rapidly becomes a significant segment of the AI hardware market, accounting for roughly 40% of the total, the emergence of such high-speed capability has been likened to the advent of broadband internet. Andrew Ng, founder of DeepLearning.AI, praised the Cerebras Inference platform’s ability to support complex agentic workflows that require repeated LLM prompting.
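Ng’s point is easiest to see with arithmetic: in a sequential agent loop, per-call latency compounds. A hedged illustration, where the chain length and tokens per step are assumptions and the GPU baseline is simply the article’s own 20x claim applied to the 1,800 tokens-per-second figure:

```python
# Wall-clock time for a sequential agent chain at different decode speeds.
# The 90 tok/s GPU baseline is derived from the article's 20x claim;
# chain length and tokens per step are illustrative assumptions.
STEPS = 10              # chained LLM calls in one agentic task
TOKENS_PER_STEP = 300   # generated tokens per call

for name, tok_per_s in [("Cerebras (quoted)", 1800), ("GPU baseline (1800/20)", 90)]:
    total_s = STEPS * TOKENS_PER_STEP / tok_per_s
    print(f"{name}: ~{total_s:.1f} s for the full chain")

# Cerebras: ~1.7 s; GPU baseline: ~33.3 s
```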
Denis Yarats, CTO and co-founder of Perplexity, emphasized the potential impact of ultra-fast inference speeds on user interaction, particularly in intelligent search engines.
Cerebras offers three pricing tiers: a Free Tier with API access, a Developer Tier for serverless deployment starting at 10 cents per million tokens, and an Enterprise Tier with fine-tuned models and dedicated support, available through a Cerebras-managed private cloud or on-premise deployment.
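The article gives no API details, but since the Free and Developer tiers are accessed via API, a minimal client sketch may help. This assumes the OpenAI-compatible chat-completions pattern common among inference providers; the endpoint URL and model identifier below are assumptions, not confirmed by the article, and should be checked against Cerebras’ documentation:

```python
# Hypothetical client sketch; the base URL and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint; verify in the docs
    api_key="YOUR_CEREBRAS_API_KEY",
)

resp = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize wafer-scale inference in one sentence."}],
)
print(resp.choices[0].message.content)
```

The Enterprise Tier’s private-cloud and on-premise deployments would presumably expose the same interface behind a customer-controlled endpoint.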
Conclusion:
Cerebras Systems’ breakthrough in AI inference represents a significant shift in the market. With speeds vastly ahead of traditional GPU-based solutions and a pricing model that dramatically reduces costs, Cerebras is poised to disrupt the AI hardware market. By attacking the memory bandwidth bottleneck at the heart of generative AI, the company sets a new standard for performance and cost-efficiency. As inference claims a growing share of AI hardware spending, this advance could compel competitors to rethink their strategies and accelerate the adoption of AI across industries that demand real-time or high-volume processing.