
Why Nvidia won’t be worried by Google’s AI supercomputer breakthrough

2023-04-10


Google claimed a breakthrough in processor speed this week when it released research showing that an AI supercomputer powered by its in-house tensor processing units (TPUs) delivers better performance and energy efficiency than an equivalent machine running on Nvidia A100 GPUs. Nvidia has cashed in on the generative AI boom, with demand going through the roof for the A100, the chip used to train large language models like OpenAI’s GPT-4. But with a new GPU, the H100, ready to hit the market, Nvidia is unlikely to be worried by Google’s achievement.

Google’s supercomputer in Oklahoma, powered by TPU v4. (Photo by Google Cloud)


The research paper, published on Tuesday, shows that Google has strung together 4,000 of its fourth-generation TPUs to make a supercomputer. It says the machine is 1.7 times faster than an equivalent machine running on Nvidia A100 GPUs, and 1.9 times more efficient.

Why Google’s TPUs are more efficient than Nvidia A100

In the scientific paper, Google’s researchers explain how they connected the 4,000 TPUs using optical circuit switches developed in-house. Google has been using TPU v4 internally since 2020, and made the chips available to customers of its Google Cloud platform last year. The company’s biggest LLM, PaLM, was trained using two 4,000-TPU supercomputers.





“Circuit switching makes it easy to route around failed components,” Google fellow Norm Jouppi and Google distinguished engineer David Patterson explained in a blog post about the system. “This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of a machine learning model.”


This switching system was key to helping Google achieve a performance bump, says Mike Orme, who covers the semiconductor market for GlobalData. “Although each TPU didn’t match the processing speed of the best Nvidia AI chips, Google’s optical switching technology for connecting the chips and passing the data between them made up the performance difference and more,” he explains.

Nvidia’s technology has become the gold standard for training AI models, with Big Tech companies buying thousands of A100s as they attempt to outdo each other in the AI arms race. The OpenAI supercomputer used to train GPT-4 features 10,000 of the Nvidia GPUs, which retail at $10,000 each.

But the A100 is about to be usurped by the company’s latest model, the H100. The recently launched chip topped the pile for power and efficiency in MLPerf inference benchmark results released today by MLCommons, the open AI engineering consortium that tracks processor performance. Inference is the process of running a trained AI model to carry out a task; the benchmarks measure how quickly and efficiently a system does so.

“Nvidia claims [the H100] is nine times faster than the A100s involved in the Google comparison,” Orme says. “That speed premium would eliminate the edge Google’s optical interconnect technology provides. The battle of odious comparisons intensifies.”
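To make Orme’s point concrete, the two headline figures can be put on a common baseline. The short Python sketch below is a back-of-the-envelope illustration only: it assumes, purely for the sake of argument, that Google’s 1.7x claim and Nvidia’s 9x claim are both measured against the same A100 reference, which is unlikely to hold in practice because the figures come from different vendors and different workloads. The constant and function names are hypothetical.

```python
# Figures quoted in the article, each expressed relative to the Nvidia A100.
TPU_V4_SPEEDUP_VS_A100 = 1.7  # Google's claim for its TPU v4 supercomputer
H100_SPEEDUP_VS_A100 = 9.0    # Nvidia's claim for the H100, as cited by Orme


def implied_ratio(h100_vs_a100: float, tpu_vs_a100: float) -> float:
    """Return how many times faster the H100 would be than the TPU v4 system.

    Assumption (not from the article): both claims share one A100 baseline,
    so the two ratios can simply be divided.
    """
    return h100_vs_a100 / tpu_vs_a100


ratio = implied_ratio(H100_SPEEDUP_VS_A100, TPU_V4_SPEEDUP_VS_A100)
print(f"Implied H100 advantage over TPU v4: about {ratio:.1f}x")
```

Under that simplifying assumption, the claimed H100 speed-up would more than cancel the advantage Google reports, which is the sense in which the speed premium "would eliminate the edge".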







What are Google’s ambitions in AI chips?

Google says it uses TPUs for 90% of its AI work, but despite the capabilities of the chips, Orme does not expect the tech giant to market them to third parties.





“There is no ambition on Google’s part to compete with Nvidia chips in the merchant market for AI chips,” he says. “The proprietary TPUs will not make it out of the Google data centre or its AI supercomputers if they were ever intended to do so.”

He adds that very few people outside the company will get to utilise the technology, as Google Cloud is a relatively minor player in the public cloud market. It holds 11% of the market according to figures from Synergy Research Group, trailing in the wake of its hyperscaler rivals Amazon’s AWS and Microsoft Azure, which have 34% and 21% shares respectively.

Google has also done a deal with Nvidia which will see the H100 made available to Google Cloud customers, and Orme says this reflects the fact that Nvidia’s place as the market leader will remain secure for some time to come.

“Nvidia is likely to remain the AI chip kingpin in a market that will reflect the feverish excitement surrounding generative AI as spending soars on training and inference capacity,” he adds.




