Google’s new LLM PaLM 2 uses nearly five times more training data than its predecessor

TL;DR:

  • Google’s new large language model, PaLM 2, uses nearly five times more training data than its predecessor.
  • PaLM 2 is trained on a massive 3.6 trillion tokens, teaching the model to predict the next word in a sequence.
  • Google and OpenAI, creators of ChatGPT, have not disclosed specific details about the size and composition of their training data due to competitive reasons.
  • PaLM 2 is smaller than previous models but more efficient, accomplishing sophisticated tasks with fewer parameters.
  • Google’s PaLM 2 is trained in 100 languages and powers 25 features and products, including the chatbot Bard.
  • Facebook’s LLaMA model is trained on 1.4 trillion tokens, while OpenAI’s GPT-3 was trained on 300 billion tokens.
  • Transparency in AI technology is increasingly demanded by the research community.
  • Controversies surrounding AI have emerged, prompting discussions about the need for a new framework to govern its usage.
  • El Mahdi El Mhamdi, a senior Google Research scientist, resigned over the company’s lack of transparency.
  • OpenAI CEO Sam Altman emphasized the responsibility of companies in developing AI tools during a Senate Judiciary subcommittee hearing.

Main AI News:

Google’s cutting-edge large language model (LLM), unveiled at Google I/O, has revolutionized the realm of artificial intelligence. This advanced model, known as PaLM 2, boasts a training data size nearly five times larger than that of its predecessor from 2022. Internal documentation reviewed by CNBC reveals that PaLM 2 has been trained on an impressive 3.6 trillion tokens. Tokens, the chunks of text (words or pieces of words) that make up a training corpus, play a pivotal role in training LLMs, as they teach the model to predict the next word in a sequence.
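To make the token idea concrete, here is a minimal illustrative sketch (not Google’s actual pipeline; the whitespace tokenizer and toy sentence are assumptions for demonstration) of how text is turned into next-token prediction examples:

```python
# Illustrative only: a toy whitespace "tokenizer" and next-token training pairs.
# Real LLMs such as PaLM 2 use subword tokenizers and trillions of such examples.
text = "the model learns to predict the next token"
tokens = text.split()  # toy tokenization; production systems use subword units

# Each training example pairs a context with the token that follows it.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(f"context={' '.join(context)!r} -> next token={target!r}")
```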

The earlier version of PaLM, standing for Pathways Language Model, was introduced by Google in 2022 and trained on 780 billion tokens. Despite Google’s inclination to showcase the prowess of its AI technology and its integration into various applications like search engines, emails, and document editing tools, the company has chosen not to disclose specific details about the size and composition of its training data. Similarly, OpenAI, the Microsoft-supported creator of ChatGPT, has kept its latest LLM, GPT-4, under wraps.

Both Google and OpenAI attribute their silence to the competitive nature of the business landscape. With an AI arms race in full swing, these organizations strive to captivate users who seek conversational chatbots for information retrieval rather than relying on traditional search engines. Nonetheless, the research community has grown increasingly insistent on transparency.

Following the introduction of PaLM 2, Google has asserted that this new model is smaller in scale than its predecessors. This development is significant, as it indicates that Google’s technology has become more efficient even as it tackles more complex tasks. Internal documents suggest that PaLM 2 has roughly 340 billion parameters, a measure of the model’s complexity. The initial PaLM, in contrast, had 540 billion parameters.

Google has refrained from providing immediate comments on this matter. However, in a blog post concerning PaLM 2, the company shared details about a novel technique known as “compute-optimal scaling” employed by the model. This innovation enhances the LLM’s overall performance, leading to faster inference, fewer parameters to serve, and reduced serving costs.
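As a rough back-of-the-envelope check using only the figures cited in this article (which remain unconfirmed by Google), the sketch below compares each model’s tokens-per-parameter ratio; compute-optimal scaling research broadly suggests that training data should grow alongside model size, so a higher ratio points to a more data-heavy recipe:

```python
# Rough arithmetic with the figures cited in this article (unconfirmed by Google).
# A higher tokens-per-parameter ratio indicates a smaller model trained on
# relatively more data, which is the core idea behind compute-optimal scaling.
models = {
    "PaLM (2022)": {"tokens": 780e9, "params": 540e9},
    "PaLM 2": {"tokens": 3.6e12, "params": 340e9},
}
for name, m in models.items():
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} tokens per parameter")
# PaLM: ~1.4 tokens per parameter; PaLM 2: ~10.6 tokens per parameter.
```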

Google’s announcement of PaLM 2 has corroborated CNBC’s previous reports, confirming that the model is trained in a hundred languages and exhibits versatility across a wide array of tasks. Presently, PaLM 2 fuels 25 features and products, including Google’s experimental chatbot, Bard. The model is available in four sizes, ranging from the smallest to the largest: Gecko, Otter, Bison, and Unicorn.

Public disclosures indicate that PaLM 2 surpasses existing models in terms of power. For instance, Facebook’s LLM, LLaMA, announced in February, was trained on 1.4 trillion tokens. The last training data size OpenAI disclosed for the models behind ChatGPT was for GPT-3, which was trained on 300 billion tokens. OpenAI subsequently launched GPT-4 in March, proclaiming its “human-level performance” on numerous professional tests.

As AI applications continue to permeate mainstream usage, discussions surrounding the underlying technology have grown increasingly impassioned. This development has ignited controversies, with more stakeholders demanding transparency. In February, El Mahdi El Mhamdi, a senior Google Research scientist, resigned due to the company’s lack of transparency.

Moreover, during a hearing of the Senate Judiciary Subcommittee on Privacy and Technology, OpenAI CEO Sam Altman concurred with lawmakers that a new framework is necessary to address the challenges posed by AI. Altman emphasized that companies like OpenAI bear significant responsibility for the tools they introduce into the world.

Conclusion:

The introduction of Google’s PaLM 2 and its substantial training data size marks a significant advancement in the field of artificial intelligence. With the ability to perform more complex tasks and predict words with greater accuracy, PaLM 2 sets a new benchmark for large language models. The competitive nature of the industry, demonstrated by Google and OpenAI’s secrecy, highlights the race to attract users who seek conversational chatbots for information retrieval.

However, as the AI arms race intensifies, demands for transparency from the research community are growing louder. This evolving landscape presents both challenges and opportunities for businesses operating in the market, requiring them to adapt to the rapid advancements in AI technology and the need for ethical frameworks to govern its application.

Source