The AI Revolution Part 2: How Attention Revolutionized the Revolution

Remember when we wrote code in procedures? It's been a while; I was in college in the mid-90s when it fell out of vogue. Procedures were landlocked subroutines inside applications. They usually Did A Thing and were then done. They did not return a value: whatever you sent to the procedure was lost at the end of its execution. If you had a task that took ten input values, compared them, and then sent a result back to the application for future use, you could not do it in a procedure; you had to do it in your main code. This resulted in INCREDIBLY long code, which meant more opportunities for compile issues, flaws, vulnerabilities, etc.

By contrast, functions can Do A Thing and are designed to return a value. Need to compare ten inputs over and over and over again? One function. Functions: procedures == meteors: dinosaurs.

In language modeling, attention and transformation did the same thing to transitional probability indexing.

I attended a presentation at CyberConVA on February 1. It's a fantastic event that I can't wait to see grow and evolve over time. The focus this year was (rather predictably) AI, and one of the early presenters on the day was a Microsoft Data Scientist and PhD. Bro knows data. He gave the most fantastic overview of how all this works and described attention through the lens of translating text between English and French.

Remember our transitional probability index? We defined those transitions as occurring between current states and future states. The current states can also be called tokens; they don't necessarily have to be individual letters. They can be words, phrases, whatever.
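
To make that concrete, here is a minimal sketch of a transitional probability index built over word tokens instead of letters. The function name `build_transition_index` and the tiny corpus are made up for illustration; a real index would be built from vastly more text.

```python
from collections import Counter, defaultdict

def build_transition_index(tokens):
    """Map each token to the probabilities of the tokens that follow it."""
    counts = defaultdict(Counter)
    # Count every (current token, next token) pair in the sequence.
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    # Turn the raw counts into probabilities that sum to 1 per current token.
    index = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        index[current] = {tok: n / total for tok, n in followers.items()}
    return index

# Hypothetical toy corpus, already split into word tokens.
corpus = "i am driving my blue car today i am washing my blue car".split()
index = build_transition_index(corpus)

print(index["am"])    # {'driving': 0.5, 'washing': 0.5}
print(index["blue"])  # {'car': 1.0}
```

In this toy corpus "blue" is always followed by "car", so the index is certain about that transition, while "am" is split evenly between two possible next words.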

Those tokens, if they are words, can have both definite and implied meanings. They can be metaphorical or literal. They can even be homographs, have different grammatical functions within a given sentence, or have other behaviors entirely across dialects. All of that information about the token is what we would call metadata, but a language model encodes it as "vectors" or "hidden states", and these help define how the model will translate a set of words in a sentence.

So we have our observable current state (the word) and its hidden states (all possible meanings and uses) working together to figure out which word is statistically most likely to come next, given the context.

Translating text to French benefits particularly from this behavior because of where French places adjectives relative to nouns (we have an implied adjective order in English, too, but it's not really taught).

"Today, I am driving my blue car." In English, we understand that generally, the adjective comes before the noun. But in French, the order is (again, typically) reversed. If we based our translation solely on words in their specified order, we would get pidgin French: "Aujourd'hui [today] je [I] suis [am] conduisant [driving] ma [my] bleu [blue] voiture [car]."

But because of attention, transformation, vectors, and the hidden states of meaning and structure, we know the French don't directly say "I am driving" any more than they say "blue car." Thus, the model can look at the tokens and vectors and determine the intent was to say: "Aujourd'hui, je conduis ma voiture bleue."
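
For contrast, the word-for-word approach that attention replaces can be sketched as a plain dictionary lookup. The `GLOSSARY` table and the `translate_word_by_word` helper are hypothetical, with entries taken only from the example sentence.

```python
# Hypothetical word-for-word glossary covering only the example sentence.
GLOSSARY = {
    "today": "aujourd'hui",
    "i": "je",
    "am": "suis",
    "driving": "conduisant",
    "my": "ma",
    "blue": "bleu",
    "car": "voiture",
}

def translate_word_by_word(sentence):
    """Swap each English word for its glossary entry, keeping English word order."""
    words = sentence.lower().replace(",", "").rstrip(".").split()
    # Words missing from the glossary are passed through unchanged.
    return " ".join(GLOSSARY.get(word, word) for word in words)

naive = translate_word_by_word("Today, I am driving my blue car.")
print(naive)  # aujourd'hui je suis conduisant ma bleu voiture
```

Every word is "correct" on its own, but the lookup never considers its neighbors, so it cannot move the adjective after the noun, make "bleu" agree with "voiture", or collapse "am driving" into "conduis".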

It's this attention and transformation that make natural language modeling possible. It allows us to ask for responses in specified tones and for the models to learn our tones. Nothing is conscious about it: it's just tokens and vectors and a lot of math.
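
For the curious, some of that math fits in a few lines. This is a toy sketch of scaled dot-product attention, the core operation in transformer models. The two-number vectors and the query below are made up for illustration; real models learn their vectors, and those typically have hundreds or thousands of dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Score each key by how well it matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to the attention weights.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output

tokens = ["my", "blue", "car"]
# Made-up 2-number vectors: [noun-ness, adjective-ness].
keys = [[0.1, 0.1], [0.0, 1.0], [1.0, 0.0]]
values = keys
# Hypothetical query for "car": which token describes me?
query = [0.0, 4.0]

weights, output = attention(query, keys, values)
most_attended = tokens[weights.index(max(weights))]
print(most_attended)  # blue
```

Here the query for "car" lines up best with the key for "blue", so "blue" gets the largest weight. That is all "paying attention" means: a weighted blend of the other tokens' vectors, with the weights computed from how well they match.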

And then a couple of profound Somethings happened in rapid succession. The first was a groundbreaking paper that declared, in the title, that "Attention Is All You Need." There is no need for transitional probability: layer in tons and tons and tons of attention and skip the first bits entirely.

Remember that in coding, we reduced our application footprint by moving from procedures to functions, but in doing so, we also asked more of our computers. We put more content into memory, and therefore, we needed more. So, too, with the advancement of language modeling.

Early models, even those with nothing but attention, had a predictable failure rate. But then someone had the bold idea of increasing the input dataset and the compute power by…SEVERAL ORDERS OF MAGNITUDE.

I like to think this happened late one night, after an engineer kept running into the same error and finally said, "Fine, you know what: HERE'S ALL THE DATA EVER. OVERLOAD ON IT."

And…the model didn't. It produced better results. So, they increased by orders of magnitude repeatedly, and the results improved. Almost by accident, data scientists discovered that they had the inverse of a scaling problem. And when I say it improved, it wasn't by 5 or 10%. The improvements were on the order of a 60% reduction in failures, and they scaled linearly with each increase.

Suddenly, we just needed more and more data to get better results. Remember how weather forecasting improves over time because there's just so much historical data to sample? It's the same deal, only (much) bigger. Have you noticed how quickly ChatGPT's answers and DALL·E's images have improved? It's not because they're nearing self-awareness; it's because ChatGPT took only two months to reach 100M users, and all of those users are constantly feeding it new data. The input scaling issue is solved by just letting the world use it.

Obviously, this is a simplified view of how we got here. The work that went into making natural language modeling possible is phenomenal and overwhelming, and we're only in the early stages of its development. Every university in America either has or is developing graduate and undergraduate programs in data science, and the field continues to grow: the industry is not under threat from its own tools.

If your organization wants to improve productivity by using Microsoft Copilot, Synergy Technical can help. Our Microsoft Copilot for Microsoft 365 Readiness Assessment will validate your organization’s readiness for Copilot and provide recommendations for configuration changes prior to implementation. We’ll help you make sure that your team’s data is safe, secure, and ready for your Copilot deployment.
