"How old are you always??" (How old are you)
This is the probing question netizens have lately been putting to Noam Shazeer, one of the "eight authors" of the Transformer paper.
In particular, after Meta FAIR researcher Zeyuan Allen-Zhu shared new progress on his team's "Physics of Language Models" project, some netizens noticed that the 3-token causal convolution it mentions had a counterpart in research from three years earlier.
Yes, it was him again.
Because if you trace his work history, it is not hard to find his name behind breakthrough after breakthrough in the AI industry.
"Not a cult of personality, but why is it always Noam Shazeer?"
Allen-Zhu himself chimed in, saying Shazeer's results arrived ahead of their time:
"I also think Shazeer might be a time traveler. I didn't believe in gated MLPs when writing Part 3.3 (gated MLPs made training unstable), but now I'm convinced (in Part 4.1, after adding the Canon layers, we compared MLPs and gated MLPs)."
So, who exactly is Noam Shazeer?
Among the eight authors of the Transformer paper, he is widely regarded as the biggest contributor. He is also the one who left midway to found Character.AI, only to be "bought back" by Google.
He is not a star scientist at OpenAI, nor does he get the constant exposure of DeepMind's founder, but look at the core technology of today's LLMs and his foundational contributions are hidden throughout.
From "Attention is all you need" with more than 170,000 citations, to Google's early research on introducing MoE to LLM, to Adafactor algorithm, multi-query attention, and gated linear layer (GLU) for Transformer...
Some people sigh that we are actually living in the " Noam Shazeer era ".
Because the evolution of mainstream model architecture today is to continue to advance on the basis it has laid.
So, what did he do?
"Attention Is All You Need" is just one of them
The AI field has many short-lived innovators, but few people who keep defining the technological paradigm.
Shazeer belongs to the latter. His work not only laid the foundations of today's large language models, it has also repeatedly delivered key breakthroughs when technical bottlenecks appeared.
His most influential work is the 2017 paper "Attention Is All You Need".
One day in 2017, Shazeer, by then a Google veteran of several years, happened to overhear a conversation among Lukasz Kaiser, Niki Parmar, Ashish Vaswani and others in a corridor of the office building.
They were talking excitedly about their self-attention idea. Shazeer was drawn in on the spot; he felt this was a group of interesting, smart people doing promising work.
He was persuaded to join the seven-person team, becoming its eighth and final member.
But this last arrival rewrote the entire project codebase in his own way within just a few weeks, taking the system to a new level and kicking the Transformer project into its final sprint.
Shazeer himself seemed unaware of how decisive he had been; he was a little surprised to find himself listed as first author in the paper's draft.
After some discussion, the eight authors decided to break with the academic convention of ranked authorship: the names were ordered randomly, each marked with an asterisk, and a footnote stated that all were equal contributors.
But everyone knew how important Shazeer's joining had been. The paper "Attention Is All You Need" went on to cause a sensation.
What makes Shazeer uncanny is that he always seems to see technology trends several years ahead of the industry, and not just with the Transformer.
Around the time of "Attention Is All You Need", Shazeer also teamed up with Turing Award winner Geoffrey Hinton (one of deep learning's "three giants"), Google veteran and employee No. 20 Jeff Dean, and others to publish another landmark work:
"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".
Back then, it planted the seeds of Mixture of Experts (MoE), the paradigm that is red-hot today.
That work introduced the Sparsely-Gated Mixture-of-Experts layer, applied MoE to language modeling and machine translation, and proposed a new architecture in which a MoE layer with up to 137 billion parameters is applied convolutionally between stacked LSTM layers.
A scale that would count as super-sized even today.
Although the idea of MoE dates back to the early 1990s, notably "Adaptive Mixtures of Local Experts" by Michael I. Jordan, Geoffrey Hinton and colleagues, the research Shazeer took part in showed that dynamically activating sub-networks lets a model scale to far larger parameter counts, and it inspired many subsequent MoE-based improvements and innovations.
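To make the "dynamic activation of sub-networks" concrete, here is a minimal sketch of sparsely-gated top-k routing. The layer sizes, names, and the bare top-k gate are illustrative assumptions, not the paper's exact recipe (the original also adds noise to the gate and load-balancing losses).

```python
# Minimal sketch of a sparsely-gated MoE layer (illustrative; not the paper's exact recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)   # learned router
        self.top_k = top_k

    def forward(self, x):                           # x: [tokens, d_model]
        scores = self.gate(x)                       # [tokens, n_experts]
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_val, dim=-1)        # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token only runs through its k selected experts ("dynamic activation"),
        # so total parameters grow with n_experts while per-token compute stays flat.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(16, 512))                       # 16 tokens in, 16 tokens out
```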
And Shazeer's exploration of MoE did not stop there.
In 2020, Google's "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" proposed GShard.
It offers an elegant way to express a variety of parallel computation patterns with only small changes to existing model code.
Through automatic sharding, GShard scaled a multilingual neural machine translation Transformer with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters.
The following year, the Switch Transformers work combined expert parallelism, model parallelism and data parallelism, simplified the MoE routing algorithm, and introduced a Switch Transformer model with as many as 1.6 trillion parameters.
It not only pushed language models to a new scale, it also pre-trained up to 4 times faster than the T5-XXL model of the time.
On the one hand, growing model scale opened up new territory for natural language processing; on the other, it ran into obstacles: instability during training and uncertain quality at the fine-tuning stage.
In 2022, research addressing exactly these problems, "ST-MoE: Designing Stable and Transferable Sparse Expert Models", was released.
The work scaled a sparse ST-MoE-32B model to 269 billion parameters, at a computational cost comparable to a dense 32-billion-parameter encoder-decoder Transformer.
Shazeer's name is on the author list of every one of these key advances.
Time has proven his bets right.
Today's mainstream combinations of MoE and the Transformer architecture have all developed along the lines laid out in this series of work.
And these are not the only places where Shazeer has had his finger on the era's pulse.
To get around the memory limits of training large models, Shazeer also co-proposed the Adafactor optimizer, which early Google models such as PaLM depended on.
Multi-Query Attention (MQA), which speeds up large-scale inference, is also his work.
MQA was first proposed in Shazeer's 2019 solo paper "Fast Transformer Decoding: One Write-Head is All You Need", which targeted the inefficiency of the Transformer's incremental (autoregressive) decoding stage.
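The core trick can be shown in a few lines: all query heads share a single key/value head, so the per-token KV cache that incremental decoding has to keep around shrinks by a factor of the head count. The sketch below is illustrative only; the shapes, weight names, and single-step formulation are assumptions, not the paper's exact presentation.

```python
# Minimal sketch of one Multi-Query Attention decoding step (illustrative).
import torch
import torch.nn.functional as F

def mqa_decode_step(x, k_cache, v_cache, Wq, Wk, Wv, n_heads):
    # x: [batch, d_model], the newest token's hidden state.
    # Wq: [d_model, d_model]; Wk, Wv: [d_model, d_head] -- note: single K/V head.
    b, d_model = x.shape
    d_head = d_model // n_heads

    q = (x @ Wq).view(b, n_heads, d_head)        # per-head queries
    k_new = (x @ Wk).view(b, 1, d_head)          # ONE shared key head
    v_new = (x @ Wv).view(b, 1, d_head)          # ONE shared value head

    # Append to the cache; in standard multi-head attention this cache
    # would be n_heads times larger.
    k_cache = torch.cat([k_cache, k_new], dim=1)  # [batch, seq, d_head]
    v_cache = torch.cat([v_cache, v_new], dim=1)

    # All query heads attend to the same shared K/V.
    scores = torch.einsum('bhd,bsd->bhs', q, k_cache) / d_head ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = torch.einsum('bhs,bsd->bhd', attn, v_cache).reshape(b, d_model)
    return out, k_cache, v_cache
```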
In addition, the gated linear unit (GLU) variants now widely used across Transformer models also trace back to his work.
GLU brought a significant improvement to the Transformer architecture. Through its gating mechanism, it can dynamically modulate how much information passes through based on the input, better capturing complex patterns and dependencies in the data and improving the model's expressiveness.
This dynamic gating also helps the model handle long sequences and make effective use of context.
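Here is a minimal sketch of such a gated feed-forward block, in the spirit of Shazeer's "GLU Variants Improve Transformer"; the SwiGLU-like activation choice and the hidden size are assumptions for illustration.

```python
# Minimal sketch of a gated feed-forward (GLU-style) layer (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=1365):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate path
        self.w_up   = nn.Linear(d_model, d_hidden, bias=False)  # content path
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # The gate is computed from the input itself, so the layer decides,
        # element by element, how much of the content path to let through.
        gate = F.silu(self.w_gate(x))
        return self.w_down(gate * self.w_up(x))

ffn = GatedFFN()
y = ffn(torch.randn(16, 512))
```

Because both paths come from the same input, the gating is what gives the layer the "dynamic adjustment" described above, at roughly the same cost as a plain two-layer MLP.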
In netizens' words, the research Shazeer takes part in tends to be blunt and direct, spelling out the technical details plainly; people may not fully grasp its significance at the time, only to find it extremely useful later.
Self-taught arithmetic at 3, a perfect IMO score in 1994
Shazeer's technical instincts come from an almost legendary personal trajectory.
Born in the United States in 1974, he taught himself arithmetic at the age of 3.
In 1994 he competed in the International Mathematical Olympiad (IMO) and, after the nine-hour exam, earned a perfect score; his five US teammates did the same, the first time in the event's 35-year history that an entire team achieved full marks.
That same year, Shazeer entered Duke University to study mathematics and computer science.
As a member of Duke's team he won honors in several math competitions, placing 6th and 10th in the Putnam Mathematical Competition in 1994 and 1996, respectively.
After graduating, he went to UC Berkeley for graduate school but did not finish (his LinkedIn now lists only his undergraduate education).
Then came the millennium: Shazeer joined Google as roughly its 200th employee, rising from software engineer to chief software engineer.
In 2001, the improved spelling correction for Google Search that he worked on went live, an important early achievement.
He then developed PHIL, an advertising system that decides which ads to display on a given page while avoiding inappropriate or irrelevant content; it became the core of Google's ad stack.
In 2005 he became technical director of Google's ads text-ranking team; in 2006 he built Google's first machine-learning system for spam detection; in 2008 he developed a machine-learning system for ranking news articles...
The list goes on, and it is no exaggeration to say his time at Google was full of wins.
Apart from a brief departure from 2009 to 2012, he had spent some 18 years at Google by 2021, when he left to found Character.AI.
After returning to Google and joining Google Brain in 2012, Shazeer hit full stride:
He shifted his research to deep learning and neural networks, and helped ship neural machine translation (NMT) in 2016, markedly improving translation quality; then, in 2017, came "Attention Is All You Need".
In August last year, Shazeer said goodbye to the startup track and returned to Google as VP of engineering and co-technical lead of Gemini. He has now been back at Google for almost a year.
A true Googler, Google to the core.
Which is fair to say, because even his startup was founded arm in arm with a Google colleague.
How dramatic was it?
Rewind to 2021. Because Google would not publicly release the chatbot Meena, developed by his colleague Daniel De Freitas, or its successor project LaMDA, Shazeer and De Freitas turned around and waved goodbye to their old employer.
The two decided to pursue more personalized superintelligence, and so a company called Character.AI came into the world.
After a bit more than two years, Character.AI and its "all kinds of AI characters" had accumulated more than 20 million users.
In March 2023, Character.AI closed a US$150 million round at a US$1 billion valuation, led by a16z, with former GitHub CEO Nat Friedman, Elad Gil, A Capital and SV Angel participating.
After that, however, the star AI unicorn ran into trouble and a new funding round stalled. On July 4 last year, Character.AI was reported to be weighing a sale to Google or Meta.
By August it was settled: Google brought Character.AI's technology in-house for $2.7 billion and invited Shazeer back to co-lead the Gemini project.
One More Thing
A story that may not be widely known: in OpenAI's early days, Shazeer was one of its advisors.
He strongly recommended that Sam Altman serve as OpenAI's CEO.
And one more thing worth mentioning:
In 2020, after Google's Meena chatbot was unveiled, Shazeer sent an internal memo titled "Meena Eats the World".
Its key conclusion: language models will work their way into our lives in more and more ways, and will come to dominate the world's computing power.
"Special statement: The content of the above works (including videos, pictures or audio) is uploaded and published by users of the "Dafenghao" self-media platform under Phoenix.com. This platform only provides information storage space services.
Notice: The content above (including the videos, pictures and audios if any) is uploaded and posted by the user of Dafeng Hao, which is a social media platform and merely provide information storage space services."
[Editor in charge: Huang Mengyu PT136]
Comment