To Know Words By The Company They Keep
Much of modern NLP is a statistics-driven understanding of language, as opposed to a Chomsky-esque approach of understanding language as syntactic trees. For many years Google Translate relied on syntactic-tree-based translation, but as with many tree-based methods, it was rigid, inflexible, and not robust enough. When Word2Vec and Recurrent Neural Networks came on the scene, Google Translate switched to these statistics-driven methods to great success, leading all the way up to present-day transformers. The ideas behind statistics-driven NLP are captured well in the much-cited quote by John Rupert Firth: "You shall know a word by the company it keeps" (link).
Continuous Bag of Words (CBOW)
The earliest forms of NLP focused on capturing statistics of various words within matrices. Two matrices in particular, the Term Document Matrix (TDM) and the Co-occurrence Matrix (CoM), were used to capture statistics of words across documents. The term "document" refers to some separate file, e.g. website A could be one document and website B could be another. How you choose to split data into documents is up to you, but the key idea is that there are statistics of words within a document and statistics of words across documents. The TDM and CoM are combined together to produce the CBOW matrix.
Term Document Matrix
This matrix has dimensions terms x documents. Each value represents the frequency of a term within a document.
This can be combined with what's called the Inverse Document Frequency (IDF) to form TF-IDF, which multiplies Term Frequency (TF) by Inverse Document Frequency (IDF):
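A common formulation (the exact variant differs between implementations; this is a typical textbook form) is:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing $t$.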
All this yields a matrix of the same dimensionality, terms x documents. What each entry in the TF-IDF (Term Frequency-Inverse Document Frequency) matrix tells us is: given a specific term and a specific document, how likely are we to see this term? Or in other words, how surprised should we be to see this term in some other document?
Co-Occurrence Matrix
This uses a concept called a sliding window. You pick a center word, then pick a sliding window context around that word. Say you had "My favorite color is yellow" with a sliding window size of 1. If we chose "color" as our center word, we would then get a full context window of "favorite is". We exclude the center word. Then we count up the co-occurrence of each word with each word in its sliding window context and add it to a co-occurrence matrix. The matrix will end up looking something like this, where p(t_i, t_j) is a normalized probability: the number of times tokens t_i and t_j co-occur divided by the total co-occurrences of all tokens with token t_j.
t_i and t_j above are tokens i and j. It’s important to note here that while Term Document Matrices are counting FREQUENCIES across DOCUMENTS, Co-Occurrence Matrices are collecting PROBABILITIES WITHIN a SINGLE DOCUMENT.
The size of this matrix is tokens x tokens.
You can use a Co-Occurrence Matrix to find the tokens most similar to a given token by taking the cosine similarity (normalized dot product) between that token's row (or column) of probabilities and the other rows of the CoM. The idea is that two words which are similar to one another will also have similar co-occurrences (to know words by the company they keep).
In practice, because a terms x terms matrix is huge, we use a matrix factorization technique called Singular Value Decomposition (SVD) to shrink it. If you don't truncate, the output represents the exact same matrix as three matrices multiplied with one another. In practice, we reduce the dimensionality of the output so that we get a matrix which is very close to the original matrix but in much smaller dimensionality.
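As a rough sketch of this dimensionality reduction step (the toy co-occurrence counts here are made up for illustration), truncated SVD with numpy might look like:

```python
import numpy as np

# Toy co-occurrence counts for a 5-token vocabulary (made-up numbers).
cooc = np.array([
    [0, 2, 1, 0, 1],
    [2, 0, 3, 1, 0],
    [1, 3, 0, 2, 1],
    [0, 1, 2, 0, 2],
    [1, 0, 1, 2, 0],
], dtype=float)

# Full SVD: cooc == U @ np.diag(S) @ Vt (up to floating point error).
U, S, Vt = np.linalg.svd(cooc)

# Truncate to the top-k singular values to get low-dimensional embeddings.
k = 2
embeddings = U[:, :k] * S[:k]                    # one k-dimensional vector per token
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # low-rank approximation of cooc

print(embeddings.shape)  # (5, 2)
```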
Continuous Bag of Words (CBoW)
This is a way to combine the TDM (or TF-IDF) and CoM into a single matrix which captures both sets of statistics together in one place. That's why it's called a continuous bag of words: it essentially mixes up the two statistics matrices in some way which we think preserves meaningful information. You end up with a documents x reduced-dimensions matrix. This is a nice, concise matrix which captures a lot of statistics without any dimension being token-sized (i.e. vocabulary-sized). Storing a matrix which contains a row or column for every token would be huge, so this is a very useful compact representation of the two matrices above.
Note — you can change the quality of these statistics by altering the sliding window size used in the CoM, changing gamma weighting (i.e. weighting words further from the center word in a sliding window context higher or lower), changing the amount you reduce dimensionality, whether you included spaces in the tokenized input, etc.
Word2Vec
Word2Vec is a technique developed by Tomas Mikolov at Google in 2013. The core idea is to find meaningful vector representations of words which can be used in language modeling. It uses an encoding matrix and a decoding matrix. You multiply your input word (or words, if you have multiple words as context) by an encoder matrix. That result is then multiplied by a decoder matrix. You take the softmax of the decoder output to turn the scores into probabilities and select the highest-probability word as the next predicted word. You can also use it to simply find the most similar words to the context you are putting into the system. How you choose to use the output of the decoder is up to you, but the key concept here is the idea of an encoder-decoder system.
Notes:
|w| means vocabulary size
1-Hot Encoded vector means a vector with all values 0 except for a single location, which is 1
CBOW Variant
The CBOW variant of Word2Vec's objective is to predict a center token given your context tokens. This gives you a good model of high-frequency tokens.
The general steps of CBOW variant of Word2Vec are:
1. Create 1-hot encoded vectors for the context tokens
2. Use the 1-hot encoded vectors to select the embedding vectors from V relevant to the context. E.g. <1,0,0> could be "Ram", <0,1,0> could be "is", and <0,0,1> could be "awesome"; the embedding-length x 1 vector in the row of matrix V associated with "Ram" would be the vector encoding of "Ram".
3. Average the embedding vectors to get $\hat{v}$ (if the input is multiple words); NOTE: this averaging will later be the crucial area where transformers improve things
4. Generate a score vector by multiplying $\hat{v}$ by the decoder matrix U: $z = U\hat{v}$
5. Turn the score vector into probabilities using softmax (or the contrastive loss/negative sampling optimization of softmax): $\hat{y} = \text{softmax}(z)$
6. Use cross-entropy loss to measure how close $\hat{y}$ is to the actual center word
7. Optimize U and V using gradient descent
This is the training algorithm for Word2Vec, where the training is self-supervised. In other words, it uses its own language data to predict what the center word of a context should be given the words around it (what we saw in the Co-Occurrence Matrix). The closer it gets to guessing the center word (or token) correctly, the better the model is. This whole process ensures learning of good vector embeddings (the encoder matrix V). Later we will see that the decoder U can be abstracted into neural networks or made more complicated, but this basic encoder-matrix/decoder-matrix idea is used a lot.
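Here's a minimal numpy sketch of these steps for a single training example (the toy vocabulary, dimensions, and learning rate are made up for illustration; a real implementation would loop over a corpus and typically use negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["ram", "is", "awesome"]
V_size, dim = len(vocab), 8

V = rng.normal(scale=0.1, size=(V_size, dim))  # encoder (embedding) matrix
U = rng.normal(scale=0.1, size=(V_size, dim))  # decoder matrix

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Toy example: context ["ram", "awesome"] -> predict center "is".
context_ids, center_id = [0, 2], 1

# Steps 1-3: select embeddings for the context and average them.
v_hat = np.mean([V[i] for i in context_ids], axis=0)

# Step 4: score vector; Step 5: softmax into probabilities.
scores = U @ v_hat
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Step 6: cross-entropy loss against the true center word.
loss = -np.log(probs[center_id])

# Step 7: one gradient-descent step on U and V.
lr = 0.1
d_scores = probs - one_hot(center_id, V_size)   # dLoss/dScores for softmax + cross-entropy
dU = np.outer(d_scores, v_hat)
dv_hat = U.T @ d_scores
U -= lr * dU
for i in context_ids:
    V[i] -= lr * dv_hat / len(context_ids)

print(f"loss before update: {loss:.3f}")
```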
Side Note on Softmax
The classic softmax equation is quite computationally expensive, as it requires a summation over all terms in a row in the denominator, for every element of the row. We still want to keep the power of the softmax denominator, which is that it compares each element of a row to every other element in the row. But we can cut back on the computation by using something called a logistic sigmoid function, which maps each element $x$ to $\frac{1}{1 + e^{-x}}$. This gates the output to between 0 and 1 (maintaining a probability like we want) but gets rid of the massive summation in the denominator of softmax. To get back the comparison across all elements in a row that softmax gave us, we can choose some samples which are "negative samples". In other words, choose center tokens which are the wrong token and add the probability of not predicting them to our loss function. So the loss function becomes a combination of the cross entropy of the correct prediction and the cross entropy of (1 - wrong prediction).
There's some math associated with that which I won't lay out in full here; the key takeaway is that we can make training efficient by using a negative-sampling or contrastive loss based on the logistic sigmoid function rather than raw cross-entropy with the softmax activation function.
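For reference, one common form of the negative-sampling objective (written in the usual word2vec notation rather than anything specific to this article, with $v_c$ the center embedding, $u_o$ the decoder vector of the true target word, and $\mathcal{N}$ a handful of randomly sampled wrong words) is:

$$-\log \sigma(u_o^\top v_c) \;-\; \sum_{k \in \mathcal{N}} \log \sigma(-u_k^\top v_c), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$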
Skip-Gram Word2Vec Variant
While the CBOW variant was trained to predict a center word based on neighboring words, Skip-Gram's training objective is to predict neighboring words given a center word. The result is that skip-gram gives you a better model of low-frequency tokens, while the CBOW variant gives you a good model of high-frequency tokens.
The algorithm steps of Skip-Gram are:
1. Represent the center token via the encoding matrix V as $v_c$ (the row of V for that token)
2. Decode the center token to get a score vector: $z = U v_c$
3. Pass the score vector through an activation function (softmax, or the contrastive-loss approximation) to transform it into probabilities: $\hat{y} = \text{softmax}(z)$
4. Evaluate the loss using cross entropy summed over all the context token targets
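A brief numpy sketch of the skip-gram forward pass and loss for one center/context example (the indices and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, dim = 5, 8
V = rng.normal(scale=0.1, size=(V_size, dim))  # encoder (embedding) matrix
U = rng.normal(scale=0.1, size=(V_size, dim))  # decoder matrix

center_id, context_ids = 2, [0, 1, 3, 4]       # toy token indices

# Step 1: look up the center token's embedding.
v_c = V[center_id]

# Step 2: decode into a score over the vocabulary.
scores = U @ v_c

# Step 3: softmax into probabilities.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Step 4: cross-entropy summed over every context target.
loss = -sum(np.log(probs[j]) for j in context_ids)
print(f"skip-gram loss: {loss:.3f}")
```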
Sub-Words/FastText Algorithm
The idea of a word is loosely defined. Some languages don't put spaces between words the way English does (e.g. Japanese). As we saw with the CBoW, keeping matrices with vocabulary-sized dimensions is huge and untenable. We also have an issue where, if we see a word which is out of vocabulary, we don't have a good way of handling it; we default to an out-of-vocabulary embedding row. It quickly becomes infeasible to create embeddings for every possible word, as the calculation and the matrix size needed to store them are huge.
The key idea is that we can use subword phrases, similar to phonemes like "ca" or "de", to cut down on vocabulary size. The English alphabet has only 26 letters. As you combine these letters into words and phrases, the number of possibilities rises exponentially; conversely, the closer you get to representing things as individual letters, the smaller your vocabulary has to be. This also helps solve the out-of-vocabulary problem: new words can still be recognized because we have existing sub-word phrases which are contained in that word.
The tradeoff is that a subword model can be slower to train, since tokenization (splitting input data into separate tokens) takes longer, and inference takes longer. We'll see later that transformers address the inference-time issue by having the encoder portion of the network deal with subword encodings while the decoder portion deals with whole-word encodings. Another tradeoff is that training a model on subwords requires a deeper neural network than if your model assumed all inputs were words.
N-Grams
In order to split tokens into subword pairings, we use an idea called n-grams. Each word creates a gram set $G_w$. We select a range of n, e.g. $n \in \{3,4,5,6\}$. We also prepend and append angle brackets to help the model distinguish between two words.
For example, if we had the word “where” with n-gram range defined above, we’d get:
"<wh"
"whe"
"her"
"ere"
"re>"
"<whe"
"wher"
"here"
"ere>"
"<wher"
"where"
"here>"
"<where"
"where>"
"<where>"
We can compare this to the n-grams of "her", which would be:
"<he"
"her"
"er>"
"<her"
"her>"
"<her>"
We can see that the angle brackets help distinguish the "her" inside "where" from the standalone word "her", because the model can see that in the case of "where" the gram "her" appears not at the beginning of the word but in the middle of the word.
The full words also get their own separate embeddings, so the full words "<where>" and "<her>" each have their own embedding separate from any sub-word embedding. The angle brackets are maintained to help distinguish a full-word embedding from a sub-word which may otherwise be identical to a full word.
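A small sketch of generating these bracketed character n-grams (the function name and defaults are just for illustration):

```python
def char_ngrams(word, n_values=(3, 4, 5, 6)):
    """Return the bracketed character n-grams of a word, plus the full bracketed word."""
    token = f"<{word}>"
    grams = []
    for n in n_values:
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    if token not in grams:
        grams.append(token)  # the whole word gets its own embedding entry too
    return grams

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ...]
print(char_ngrams("her"))    # ['<he', 'her', 'er>', '<her', 'her>', '<her>']
```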
The idea of how to use sub-words is to do essentially the same thing as the skip-gram variant of Word2Vec, but change the loss function slightly.
Skip-Gram Loss Reviewed
In the skip-gram variant of Word2Vec we had, in general form, a loss function which looked to maximize the log-likelihood (i.e. minimize the negative log-likelihood) of the output probability.
As a refresher, log-likelihood is an objective function which says: given a bunch of probabilities, give me the joint probability of all of them. More formally, if you have a bunch of observations {x_1, x_2, …, x_n} and you want to write the joint probability of all of these observations, you'd write p(x_1)*p(x_2)*…*p(x_n). We can write this succinctly as a product over all these probabilities:

$$\prod_{i=1}^{n} p(x_i)$$
We can rewrite this joint probability by taking the log of the whole thing. The log product rule means we can turn it into a summation:

$$\log \prod_{i=1}^{n} p(x_i) = \sum_{i=1}^{n} \log p(x_i)$$
This naturally leads to our skip-gram/FastText objective function. In the CBOW variant of Word2Vec we used a cross-entropy objective function to score how far off we were from the correct prediction. If you work through the math, you can simplify the cross-entropy function to $-\log p(w_{center}|w_{context})$. In other words, the loss function becomes a negative log-likelihood. This means that the general form of our skip-gram loss function is: given a center token, what is the log loss over all of its context tokens? In other words, what is the joint probability of all the context tokens given a center token:

$$-\sum_{c \in C_t} \log p(w_c \mid w_t)$$

C_t is the set of all tokens in the context window centered around w_t.
The noise-contrastive version of this general-form loss function has the same shape, but instead of a single log probability for just the positive sample, you add a second log-probability term for the negative samples.
FastText
The idea of FastText is to replace the probability inside the log with a custom scoring function. We can use our n-grams from earlier to create this scoring function.
We have the vector representations of the n-grams defined as $z_g$ and the vector representation of the whole word as $v_j$. The idea is to represent our probability not as the logistic sigmoid activation of the result of the encoder-decoder architecture, but instead as the dot product between all the n-gram vector embeddings and the word embedding.
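In notation close to the FastText paper (and to the symbols above, with $G_w$ the n-gram set of word $w$), the score between a word $w$ and another word $j$ is:

$$s(w, j) = \sum_{g \in G_w} z_g^{\top} v_j$$

and this score replaces the raw encoder-decoder dot product inside the sigmoid/log of the loss above.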
This means that we’ve drastically simplified the training process as instead of having to multiply by a big decoder matrix for every word, we simply train directly on the encoder matrix such that encoder embeddings can self-supervise the training. So it can take less than 10 minutes to train on 1 billion words.
There are some other ways of forming subword pairs, such as Byte-Pair Encoding. I won't go into it in depth here, but the basic idea is to collapse the most frequent pair of symbols into a new symbol and repeat this until no pair is repeated more than once.
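Here's a minimal sketch of that byte-pair merging idea (the toy corpus and the stopping rule are simplified for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Greedily merge the most frequent adjacent symbol pair, up to num_merges times."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break  # no pair repeats any more
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Rewrite the corpus with the chosen pair collapsed into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "low", "low"])
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```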
The tradeoff to remember is that pre-processing data for n-gram or Byte-Pair Encoding takes much longer although training time is greatly reduced.
Attention
The idea of attention revolves around finding ways to focus on the relative importance of various tokens in the input to the model. For example if you had the sentence “Dog bites man” that’s not a newspaper headline, but “Man bites dog” surely is! So we want the model to pay attention not just to the words in the context but to their order and relationships.
There was a long period of time when Recurrent Neural Networks (RNNs) were the state of the art at these types of tasks. RNNs are neural networks with a loop in them. They look something like this in their simplest form:
The issues with RNNs are twofold. One is that they generally need to be quite deep to produce good outputs, which increases computation time for training and inference. The consequence of that depth, plus the fact that there is a loop, is that the models are susceptible to the vanishing and exploding gradient problem: the backpropagated gradient signal blows up or collapses to 0 because repeated gradient multiplications make a very small or very large gradient shrink or grow exponentially with more layers (or with increasing t).
Another problem with these systems was that they heavily weight the most recent contribution to the model, since that is fed directly into the model while all the previous outputs are packed into one massive term, which makes the contribution of early words lose their impact.
So the key question became: how do we make language models pay attention to the right things in their input? And can we find a better way than RNNs?
Transformers
Let's back up to what attention could mean at a very high level. It is essentially some weighting assigned to each word in your context. If you remember, our simple Word2Vec model took an average of all the input context word embeddings; instead of a simple average, we want a weighted average.
For example, for "dog bites man" we can assign the weight 0.4 to "dog", 0.3 to "bites", and 0.4 to "man". Then when "man" and "dog" are flipped, the model maintains high attention on those words. This gets to a subtle point though. The famous paper "Attention is All You Need" that set off the transformer revolution is a little misleading. Attention is NOT all you need: a key part of that paper is the use of what are called positional embeddings. If we return to our "dog bites man" example, just knowing the weights of how much to pay attention to each word doesn't distinguish "dog bites man" from "man bites dog". But if you also knew the positions of "man" and "dog", then you can clearly see that there's a difference between those two sentences.
But for now, let's stick with the idea of attention. We can write attention formally as something of the form:

$$\bar{V}_g = \psi(V_g W_a)\, V_g$$

where:
- $W_a$ is an attention weight matrix
- $V_g$ is the n-gram embedding matrix
- $\bar{V}_g$ is the n-gram embedding matrix with attention weights applied
- $\psi$ is an activation function (e.g. softmax)
The basic idea is that we learn a weight matrix to apply to the input and then multiply that back against the input.
The problem with the above is that it will always weight two words in the same way regardless of where in the sentence they appear. The “Dog bites man” vs “Man bites dog” problem.
Before getting to positional embeddings, let's try modifying how we are applying the attention weights.
First, set up a simple equation where we get the n-gram matrix to pay attention to itself:

$$\bar{V}_g = \psi(V_g V_g^T)\, V_g$$

$V_g V_g^T$ here is a self-attention block, but because there are no weight matrices, there will be no learning. So let's add in weight matrices for attention:

$$\bar{V}_g = \psi\big((V_g W_q)(V_g W_k)^T\big)\, V_g$$

Adding the weight matrices allows us to learn both how to contextualize our input AND the attention we pay to the input.
While there's no strict standard on which matrix plays the role of the key and which the query, a common interpretation is that the query represents what each token is looking for, while the key represents what each token has to offer; their dot product measures how relevant one token is to another.
In other words, the key and query combine together to behave like a neural network whose weights change depending on the input it sees! This is a very cool idea which has a lot of general use cases outside of just transformers and NLP which is why you see transformers used everywhere and not just in NLP tasks.
The last piece of the self-attention weighted input puzzle is dividing by the square root of the weight matrix dimension size; this is just done to stabilize the gradient as the system learns. So the full self-attention equation looks like:

$$\bar{V}_g = \psi\!\left(\frac{(V_g W_q)(V_g W_k)^T}{\sqrt{d}}\right) V_g$$
Finally, we can project the result of this into some hidden dimension by multiplying the unweighted input $V_g$ by another weight matrix. This is basically the same thing as passing the result of the self-attention block through a fully connected layer that keeps the input in a reduced or modified embedding space.
Our final equation will look like this:

$$\bar{V}_g = \psi\!\left(\frac{(V_g W_q)(V_g W_k)^T}{\sqrt{d}}\right) V_g W_v$$
The three terms you'll commonly see for this equation are query, key, and value. We've gone over query and key above. The value is the term outside the $\psi$ (i.e. the activation function). Here that is the $V_g W_v$ term.
We can write everything above in simplified algebra by setting values as follows:

$$Q = V_g W_q, \qquad K = V_g W_k, \qquad V = V_g W_v$$

which gives the familiar form:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

The final dimension of the output is the sequence length by the dimension of the value projection. Normally the W_v weight matrix is of dimension d, which is the same dimension as the key and query matrices.
Above is what is known as a single attention head. We can combine multiple heads together and learn multiple key, query, value matrices.
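Here's a minimal numpy sketch of a single attention head (the dimensions and random weights are made up; in a real model W_q, W_k, and W_v are learned and several heads run in parallel):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, d = 3, 8, 4   # e.g. "dog bites man" -> 3 token embeddings

V_g = rng.normal(size=(seq_len, embed_dim))   # input embeddings
W_q = rng.normal(size=(embed_dim, d))         # query weights
W_k = rng.normal(size=(embed_dim, d))         # key weights
W_v = rng.normal(size=(embed_dim, d))         # value weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = V_g @ W_q, V_g @ W_k, V_g @ W_v

# Scaled dot-product self-attention for one head.
weights = softmax(Q @ K.T / np.sqrt(d))   # (seq_len, seq_len) attention weights
output = weights @ V                      # (seq_len, d) contextualized vectors

print(weights.round(2))
print(output.shape)  # (3, 4)
```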
Now we can begin describing the full transformer architecture as outlined in the paper “Attention is All You Need”. Here’s a diagram from the paper describing the full architecture:
Here’s the encoder portion of the transformer blown up and diagrammed:
Here’s the decoder portion of the transformer blown up and diagrammed:
You’ll note a few key things in the diagrams:
1. The Key, Value, and Query matrices we talked about above are split up into multiple heads. The original paper splits them into 8, as diagrammed above. They are also concatenated together for faster processing.
2. There is the idea of a "residual connection", also called a "skip connection" or "highway connection". These are the arrows in the original transformer diagram which take the original input and add it back to the output of an attention block, or which run from the output of an attention block's add-and-norm to the output of a feed-forward layer. These connections basically make it so that, during training, enough information about the original input is preserved that the model can actually learn well at the beginning. As training progresses, the residual connection doesn't contribute much more to the learning; it is mainly there to help guide early learning of the system (a small sketch of this add-and-norm pattern appears after this list).
3. There is an idea called “positional encoding”. The idea here is that the relative position of a word is passed through something like a sine function. So if a word is in position 1 it has a different value added to it than if it’s at position 3. These values help the model learn the positions of words explicitly which helps with the “man bites dog” problem.
Here’s an example of what positional embedding might look like:
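For reference, the sinusoidal encoding used in "Attention is All You Need" is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

so each position gets a unique pattern of values across the embedding dimensions, and nearby positions get similar patterns.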
4. There is a split between the decoder and the encoder. This is an important note. Some LLMs will be decoder-only, others will be encoder-only, and others will be encoder-decoder models. Most sequence-to-sequence models today, like ChatGPT or Llama, are decoder-only models, but because they are used for sequence-to-sequence tasks (e.g. translating a sentence from one language to another), they can also be built as encoder-decoder models. The idea of an encoder and a decoder is just our original idea from Word2Vec blown up in complexity with a transformer architecture.
5. The decoder has a special layer called a masked layer. This is just a way to train the model on a single input sequence all at once. For example, if your input sequence was "I like cheese", the masking would separate it out into a 3×3 matrix that looks like:
| I | -infinity | -infinity |
| — — | — — | — — |
| I | like | -infinity |
| I | like | cheese |
This lets the model predict the word it expects after "I", then after "I like", without being able to "cheat" and see the next word already. And it lets us train the model to predict every next word in an input sequence all at once. The decoder also takes in a batch of sequences (the batch size), which lets us do this same thing on a bunch of different sentences simultaneously. That speeds up training of the model. A small sketch of this masking is shown below.
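As a minimal numpy sketch of how the causal mask is applied in practice (adding negative infinity to the attention scores above the diagonal before the softmax; the scores here are random stand-ins):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 3                                            # e.g. "I like cheese"
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Upper-triangular mask: position i may not attend to positions > i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = softmax(scores)
print(weights.round(2))
# Row 0 attends only to "I", row 1 to "I like", row 2 to "I like cheese".
```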
6. There are two forms of attention. Self-attention is when the input of the transformer block is used to multiply against the query, key, and value attention matrices. Cross-attention is when the query comes from the decoder but the key and value matrices come from the encoder. In the decoder case we want the decoder to take its decoded attention weight matrix and use that to weight the encoder's attention matrices. This allows the decoder to pay attention to the encoder outputs directly in its downstream task (e.g. language translation).
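Returning to point 2, here's a tiny numpy sketch of the add-and-norm pattern around a sublayer (the sublayer here is just a random stand-in, not a real attention block or feed-forward layer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual (skip) connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

x = np.random.default_rng(0).normal(size=(3, 8))   # 3 tokens, 8-dim embeddings
sublayer_out = np.tanh(x @ np.random.default_rng(1).normal(size=(8, 8)))  # stand-in sublayer
print(add_and_norm(x, sublayer_out).shape)  # (3, 8)
```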
3 Classifications of Transformers:
I mentioned in point 4 above that there are three types of transformer models. Here is a rule-of-thumb understanding of what they are and when they are used. You will often see models on Hugging Face described in one of these three categories.
1. Autoregressive
A stack of decoder blocks. Pretrained on causal language modeling. Typically used for text generation.
2. Autoencoding
A stack of encoder blocks. Pretrained on correcting corrupted sentences. Typically used for token-level and sequence-level classification tasks.
3. Sequence to Sequence
Full encoder-decoder architecture. Typically pretrained for the desired downstream task like translation, summarization, or question-answering.