Three Easy Pieces: How AI Learns to Talk and Reason
Have you ever wondered how a machine learns to talk like a human?
Today we are going to explore language models like OpenAI's ChatGPT, Meta's Llama, Google's Gemini and others. We're going to dissect them and examine them closely to understand the magic going on under the hood. Our goal is to understand fundamentally how they produce text, keeping things non-technical and easy to process.
Who this is for: Anyone curious about how AI actually works
Prerequisite: Just curiosity and 15 minutes
Who this is not for: Someone looking for a rigorous technical guide or architectural underpinnings (the few short code sketches below are optional illustrations; feel free to skim past them).
We will break it into 3 core components:
- Language Understanding
- Attention
- Modeling Head
The moment you explain a magic trick, it loses all its charm.
We are going to reveal the trick, but I promise its charm will only increase.
Don't worry about the math; we'll keep everything in plain English. When technical terms pop up, I'll break them down simply.
Buckle up for a journey behind the curtain.
1. Language Understanding
The first component we are going to discuss is language, because that's how we first interact with our language models, isn't it?
Language is so fundamental to these models that they are called language models.
Human language is intricate, complex, and deeply tied to what it means to be human. It's a way to communicate that uses symbols (spoken sounds, written marks) to convey meaning. It's a medium through which we connect and share, as I am sharing this piece of writing with you.
If language is that intricate, how do we translate its richness into numerical form, plain numbers, that mathematical models can handle?
To work with mathematical models, we need a mathematical language.
Vocabulary Building
We are going to look at a way to convert natural language into numbers.
Let's think from first principles: we want to build a language model that can learn to talk like humans, but such models are purely mathematical in their essence. So we have to convert our language into some numerical representation, simply into numbers, for these models to work with.
The most straightforward approach that comes to mind is to assign each word a unique number, indexing them from 1 to N. This simple mapping system allows us to represent our whole language numerically.
For example, suppose we have 3 words in our vocabulary:
Word | ID |
---|---|
Pizza | 1 |
Dog | 2 |
Cat | 3 |
You may know that the number 3 represents cat, but you don't know anything else about cat. This method gives us a numerical representation of our language, but we can't capture the depth of human language just by indexing our vocabulary. Worse, indexing introduces an unintended ordering (is dog, at 2, somehow "greater than" pizza, at 1?) that can mislead mathematical models, and it captures no meaningful relationships between words.
Despite this limitation, indexing remains a fundamental building block; we still need to catalog our vocabulary, and this approach gives us that foundation to build upon.
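If you're curious what this cataloging looks like in code, here is a minimal Python sketch using the toy vocabulary above (the `encode` helper is just for illustration):

```python
# A toy vocabulary: each word gets a unique ID, nothing more.
vocab = {"pizza": 1, "dog": 2, "cat": 3}

def encode(words):
    """Turn a list of words into a list of IDs."""
    return [vocab[w] for w in words]

print(encode(["cat", "dog", "pizza"]))  # [3, 2, 1]
# Note: 3 > 2 here, but "cat" is in no sense "greater than" "dog".
# The numbers are labels, not quantities; that's the limitation.
```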
Vector Embeddings
Labeling words with numbers can't capture the richness of our human language. We may understand which number represents which word but nothing more than that.
We need to capture our language in numbers in a way that encapsulates its intricacies, complexities, and semantic meanings as well.
This challenge brought us to a fundamental principle in linguistics:
Words that occur in similar contexts tend to have similar meanings. (Linguists call this the distributional hypothesis.)
In simple words: we can understand a word from the company it keeps. This insight is transformative: instead of treating each word as an isolated entity, we should infer its meaning from the surrounding words, and build its numerical representation from the contexts it appears in, capturing the depth of human language.
Now, instead of mapping each word to a single index, we create a fixed-length vector for each word. A vector is just a fancy word for a list of numbers.
Let's see a dummy example:
Word | Vector |
---|---|
cat | [0.25, 0.75, 0.10] |
dog | [0.30, 0.70, 0.15] |
apple | [0.80, 0.10, 0.20] |
banana | [0.78, 0.15, 0.22] |
If you take a close look at the table, the words cat and dog have similar vectors, while banana and apple have similar vectors. Cat and dog are close because they are animals (four legs, pet food, etc.), and banana and apple are similar and often appear in fruit contexts (grocery aisles, smoothies, desserts).
But that was a toy example, just to explain the concept. In the real world, a vector can contain 256, 512, 1024, or even more entries. Longer vectors help encode the richness, complexity, and interconnected relationships of human language.
Thanks to embeddings, the model can capture semantic meaning as well. For instance, it recognizes that 'vehicle' and 'car' are similar concepts because they tend to occur in similar contexts, so their vector representations are similar. This enables the model to understand word relationships and meanings more effectively.
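To make "similar vectors" concrete, here is a small Python sketch using the toy vectors from the table above. Cosine similarity, one common way to compare embeddings, measures the angle between two vectors:

```python
import math

# Toy embeddings from the table above (real ones have hundreds of entries).
embeddings = {
    "cat":    [0.25, 0.75, 0.10],
    "dog":    [0.30, 0.70, 0.15],
    "apple":  [0.80, 0.10, 0.20],
    "banana": [0.78, 0.15, 0.22],
}

def cosine_similarity(a, b):
    """Close to 1.0 means very similar; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high (~0.99)
print(cosine_similarity(embeddings["cat"], embeddings["apple"]))  # lower (~0.45)
```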
Imagine a giant semantic map
Picture a vast, starry night sky where every star is a word. Words glowing near each other share meaning, context, or purpose, like constellations telling stories of language.
Examples
Words | Why They're Close |
---|---|
Car & Vehicle | Same constellation: modes of transport (practically overlapping!). |
Lahore & 🇵🇰 Pakistan | Like London & UK: cities sit close to the countries they belong to.
Dolphin & Shark | Swimming together in the oceanic creatures cluster (but desert is far off!). |
- Synonyms nestle side-by-side → car ≈ automobile.
- Themes create micro-clusters → democracy, election, vote form a governance group.
One of the most interesting properties of these vectors is vector arithmetic. If you take the vector for king, subtract the vector for man, and add the vector for woman, you get approximately the vector for queen: King − Man + Woman ≈ Queen.
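Here is a tiny sketch of that arithmetic. The four-dimensional vectors below are invented so the analogy works out exactly; real word vectors learn relationships like this from data, and the match is only approximate:

```python
# Invented 4-dimensional vectors, chosen so the famous analogy holds.
king  = [0.9, 0.8, 0.1, 0.6]
man   = [0.1, 0.8, 0.1, 0.1]
woman = [0.1, 0.1, 0.8, 0.1]
queen = [0.9, 0.1, 0.8, 0.6]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # approximately [0.9, 0.1, 0.8, 0.6], the vector for "queen"
```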
Now we have successfully turned our language into a numerical representation, so we can manipulate it and our model can make use of it.
2. Attention
While embeddings capture the general meaning of words, they don't handle context. What I mean is this: bat (the animal) and bat (the cricket bat) get exactly the same vector, because each word has one unique vector. So how would we know which bat we are referring to? Human language is deeply context-specific; words take their meaning from the context around them. We need embeddings that change dynamically with that surrounding context.
Attention is often called the heart of today's modern language models.
Attention is about deciding where to attend. It establishes relationships between words so the model knows where to focus and which words matter most in the current context.
When we talk with someone, not every word carries equal weight; our brains have learned where to attend. Let's look at some examples to understand this better.
- The chickens didn't cross the road, because it was wide. (Example 1)
- The chickens didn't cross the road, because it was tired. (Example 2)
Now, with the static embeddings from the previous section, it gets the same numerical representation in both sentences. Static embeddings may encode that it can refer to animals and inanimate things, but in context it carries a much richer meaning.
So, if we put both examples into numerical form, we need the meaning of it to be associated with the road in the first example and with the chickens in the second. The meaning of it is sensitive to context, so it must have a different representation in each sentence.
If that clicked, you now understand what it means to attend, and also how static embeddings fail to capture context. We need dynamic embeddings: vectors that change based on the context in which they are used.
So how do we create dynamic embeddings that capture context? Don't worry; math is here to save us.
Example - Processing Bank
Let's say our model is reading this sentence word by word:
I walked to the bank to deposit money.
The model processes the sentence word by word from left to right, and can't look ahead at the next words. To get the complete picture, think of each word in our example as a static embedding vector as we discussed above. When the model reaches the word bank, it can only look at the previous words: I walked to the
Here's how attention helps:
Complete word-by-word processing:
Word 1: I
- Model reads I
- No previous words to compare with
- Creates initial representation
Word 2: walked
- Model reads walked
- Compares walked with I → medium score (I is the one walking)
- Now walked has context about who is walking
Word 3: to
- Model reads to
- Compares to with I → low score
- Compares to with walked → high score (walked TO somewhere)
- Model understands this is about direction/movement
Word 4: the
- Model reads the
- Compares with all previous words → mostly low scores
- Just a determiner, but signals something specific is coming
Word 5: bank
- Model reads bank (the confusing moment!)
- Compares bank with I → low score
- Compares bank with walked → medium score (people walk to places)
- Compares bank with to → medium score (destination)
- Compares bank with the → low score
- Still ambiguous - could be river bank or financial bank
Word 6: to (second occurrence)
- Similar process, helps with sentence structure
Word 7: deposit
- Model reads deposit
- Game changer! Compares with all previous words:
- deposit with bank → very high score! (financial context)
- Now bank gets an updated representation - definitely financial!
Word 8: money
- Model reads money
- Compares with bank → extremely high score!
- Compares with deposit → very high score!
- Triple confirmation - this is definitely about financial banking
This is how context accumulates: bank receives more weight/attention from the financial terms around it, helping the model understand we're talking about a financial institution, not a river bank.
The model builds understanding gradually, and attention scores help it figure out which earlier words become more important as it reads more context.
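To make this concrete, here is a heavily simplified Python sketch of the core attention computation: score bank against every word in the sentence, turn the scores into weights, and blend the vectors accordingly. The vectors are invented for illustration, and real attention adds learned query/key/value transformations that we're skipping here:

```python
import math

# Invented static vectors for the words in our sentence.
# In a real model these come from the embedding layer.
vectors = {
    "I":       [0.1, 0.0, 0.2],
    "walked":  [0.3, 0.1, 0.0],
    "to":      [0.0, 0.1, 0.1],
    "the":     [0.0, 0.0, 0.1],
    "bank":    [0.2, 0.9, 0.3],
    "deposit": [0.1, 0.8, 0.4],
    "money":   [0.2, 0.9, 0.5],
}

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word, context):
    """Re-represent `word` as a blend of the context vectors,
    weighted by how similar each context word is to it."""
    q = vectors[word]
    scores = [sum(a * b for a, b in zip(q, vectors[c])) for c in context]
    weights = softmax(scores)
    for c, w in zip(context, weights):
        print(f"{word} -> {c}: weight {w:.2f}")
    # The new, context-aware vector: a weighted average.
    return [sum(w * vectors[c][i] for c, w in zip(context, weights))
            for i in range(len(q))]

# By the end of the sentence, "bank" leans most on "money" and "deposit".
sentence = ["I", "walked", "to", "the", "bank", "to", "deposit", "money"]
contextual_bank = attend("bank", sentence)
```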
3. Modeling Head
This is the final component of our model. Until now, we have done the heavy lifting of understanding language and context. Now our model understands both the language and context.
First, we converted our language into numerical form to capture all its richness. Then, we learned to understand context through attention. Now comes the exciting part: it's time to generate language! We need our model to pick the next word.
Have you ever rolled a die? Take a fair 6-sided die: you can't predict the outcome, because each face has a 1/6 probability and each roll is independent of the ones before it. Fortunately, human language isn't random like die rolls. Our language is highly contextual and predictable. If someone says The weather today is really... you can probably guess they'll say nice, hot, cold, or beautiful, not elephant or mathematics.
The Probability Engine
Here's the fascinating part: suppose our model was trained with a 100,000-word vocabulary. Every time it needs to predict the next word, it generates a probability score for every one of those 100,000 words. Yes, you read that right!
For The weather today is really... it might calculate:
- nice - 35% probability
- hot - 25% probability
- cold - 20% probability
- beautiful - 15% probability
- elephant - 0.001% probability
- 99,995 other words with tiny probabilities
We don't always pick the single most probable word, because then the responses would be very predictable; instead, we sample from among the highly probable words.
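Here is a minimal Python sketch of that sampling step, using the made-up probabilities from the list above:

```python
import random

# Made-up next-word probabilities for "The weather today is really..."
candidates = ["nice", "hot", "cold", "beautiful"]
probs      = [0.35, 0.25, 0.20, 0.15]  # the remaining ~5% is spread
                                       # across the rest of the vocabulary

# Sampling instead of always taking the top word keeps responses varied.
next_word = random.choices(candidates, weights=probs, k=1)[0]
print("The weather today is really", next_word)
```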
This is why these models are computationally expensive: there's a massive amount of math happening under the hood just to predict each single word.
But one question remains: how do these models get so good? They sound remarkably intelligent.
Let's look at how they're trained; it adds a little length to this section, but it's worth it.
Imagine you're visiting an alien planet; nobody speaks your language. A friendly alien teacher decides to help you learn their language, which has 50,000 different words.
Day 1 of Learning:
- The teacher shows you the sentence: Zorp glims to shkola daily
- The teacher points to Zorp and asks: What comes next?
- You have no idea, so you randomly guess: banana
- The teacher shakes their head: No, it's 'glims'
- You think: Okay, after 'Zorp' comes 'glims'
The Process Continues:
- Now you have Zorp glims and the teacher asks: What's next?
- You guess: elephant
- The teacher corrects: No, no, it's 'to'
- This continues, word by word, mistake by mistake
After making millions of mistakes and corrections across thousands of sentences, you start recognizing the patterns. You learn that Zorp (which means I) is often followed by action words, and shkola (meaning school) often comes after to.
The same thing happens with our models. Today's language models are trained on Wikipedia articles, books, and a huge share of the written text of human civilization; the training scales to trillions of words. Through this process, a model doesn't just learn to talk like humans: it learns to imitate human intelligence, problem-solving, creativity, and reasoning, simply by encoding the statistical patterns hidden in our natural language.
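Here is a toy version of that learn-by-predicting loop in Python. It uses simple word counts rather than a neural network, so it's only a sketch, but the objective, predicting the next word from examples, is the same one real models train on:

```python
from collections import defaultdict, Counter

# A toy corpus; real models train on trillions of words.
corpus = "I walked to the bank to deposit money . I walked to the store".split()

# The simplest possible "model": count which word follows which.
# Real models adjust millions of parameters instead of raw counts.
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def predict_next(word):
    """Guess the most likely next word seen during training."""
    return counts[word].most_common(1)[0][0]

print(predict_next("walked"))   # "to"
print(predict_next("deposit"))  # "money"
```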
Conclusion
We started our discussion by asking a very simple question: How does a machine learn to talk like a human? Now, after peeling back the layers, we see it's not magic at all; it's an elegant dance of mathematics, pattern recognition, and computational power.
We turned language into a numerical representation while encoding its richness: words with similar meanings, appearing in similar contexts, cluster together and get similar vector embeddings.
Then we explored attention, the heart of our model. We learned how to build dynamic embeddings that capture context. We saw how bank transforms from an ambiguous word into a clear financial institution through attention to deposit and money.
Finally, we explored the Modeling Head, the final piece of our puzzle: the probability engine that generates each word by scoring the entire vocabulary and then sampling one of the high-probability words.
Next time ChatGPT writes a poem, remember: it's not magic, just math, attention, and a very sophisticated game of word dice.
The trick has been revealed, but the magic, if anything, has only grown stronger.
This is my first ever publication, and I'd love to hear from you, whether it's appreciation, constructive criticism, questions, or just thoughts on how AI actually works.