Intuition and mathematics behind NLP and latest architectures

Maharshi Yadav
13 min readMay 8, 2020

Serving up the foundations of NLP and the different architectures you can learn, with all the math (probability models) needed.

Table of contents:

  1. Bag of words
  2. Markov assumption
  3. Tf-Idf
  4. POS tagger, lemma
  5. Hidden markov model
  6. Word Embedding
  7. CBOW and Skip gram
  8. Attention and Transformers

Okay, let’s give it a thought! Text is a sequence of symbols that carries meaning for humans, who evolved conversing in dialogue that had nothing to do with math. In the NLP domain we have found different ways to interpret language and to map words or characters (or groups of them) to numerical representations, i.e. vectors in n-dimensional space, that carry that information. For example, here is a fun little experiment using gensim.

The dataset consists of movie dialogues, taken from:

Now we aim to create vectors for all words by training on the movie-dialogue dataset. I have already preprocessed the dataset and put the dialogues in chronological order.

# creating word embeddings by training with gensim
# "King - Man + Woman = Queen"
import gensim

documents = []  # list of lists; each inner list is the tokens of one dialogue line
with open("encoder_inputs.txt", 'r') as F:  # this file is already preprocessed
    for line in F:
        documents.append(line.split())

# Model training (gensim 3.x parameter names: size/iter; newer versions use vector_size/epochs)
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10, iter=10)

Now comes the fun part! We can play with the vectorized form of words and also look at the top-k most similar words (of course the results can be a bit biased due to the data and the movie-scene context).

import numpy as np
from scipy import spatial
model.wv.similarity("king", "queen") # 0.65954125
model.wv.similarity("drive", "king") # -0.083335534
model.wv.similarity("man", "woman") # 0.5871115
model.wv.similarity("man", "king") # 0.40426093
king = np.array(model.wv.word_vec("king")) # actual embedded vector
queen = np.array(model.wv.word_vec("queen"))
man = np.array(model.wv.word_vec("man"))
woman = np.array(model.wv.word_vec("woman"))
lhs = np.add(np.subtract(king, man), woman) # K-M+W

result = 1 - spatial.distance.cosine(lhs, queen) # 0.4159473478794098
# cosine similarity of ~0.42 between (King - Man + Woman) and Queen -- a decent indicator given the small corpus
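The same analogy can also be queried directly through gensim's most_similar API, which is handy for eyeballing the neighbourhood of a word. A minimal sketch, assuming the `model` trained above (the exact neighbours depend on the dialogue data):

# top-k neighbours of a word
print(model.wv.most_similar("king", topn=5))
# the classic analogy query: king - man + woman ~ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))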

So the n-D vector for the word “King” should relate to the vector for “Man” in the same way that “Queen” relates to “Woman”: when we change a characteristic mathematically (subtract “Man”, add “Woman”), the resulting n-D vector should land near what we expect linguistically/logically.

Now the question remains: how do we translate language into vectors? One idea that comes to mind is a probability distribution over words; another is to one-hot encode all the words somehow and then train the values. Seems fair, so let’s dig into this more.

We cannot know all the words beforehand, so we have to start with some corpus (the set of documents we use for reference). Let’s take an example of 3 lines.

S1 := “i am doing great as i was before”
S2 := “this is great”
S3 := “playing sport is a good habit”

Taking all words and their occurrences into consideration, we get:

Bag of words

Just by looking at this table we get some sense of how words are distributed across the sentences in our corpus. For example, we could use these counts to classify a test text into one of the classes in our existing corpus, or treat them as features and fit a regression model on top of this baseline.

This approach is called the “Bag of Words” model.
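Here is a minimal sketch of the bag-of-words counts for the three sentences above, using scikit-learn's CountVectorizer (the column ordering is whatever the vectorizer produces):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["i am doing great as i was before",
          "this is great",
          "playing sport is a good habit"]

vectorizer = CountVectorizer()          # default settings: unigram counts
X = vectorizer.fit_transform(corpus)    # sparse document-term matrix

print(vectorizer.get_feature_names())   # vocabulary (columns)
print(X.toarray())                      # per-sentence word counts
# note: the default tokenizer drops single-character tokens such as "i" and "a"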

But these are just individual words. What about capturing context, or extracting more information? For example, “very good” and “good” are similar, but the emphasis is only conveyed by the first phrase. So let’s take pairs of consecutive tokens as our features and see what we get.

2-Grams approach

As you can see, by combining every 2 consecutive words we can extract more information from the documents. We can extend this to n consecutive words, i.e. the “N-grams” approach (a natural extension of the BoW model). A sentence is then mapped to features (the columns in the diagram), and the number of such features is L - N + 1, where L is the number of tokens and N the n-gram order.
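The same CountVectorizer can be switched to 2-grams via ngram_range; a minimal sketch reusing the `corpus` list from the bag-of-words snippet above:

from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))  # only 2-grams as features
X2 = bigram_vectorizer.fit_transform(corpus)             # corpus defined in the BoW snippet

print(bigram_vectorizer.get_feature_names())  # e.g. 'doing great', 'is great', ...
print(X2.toarray())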

Training N-grams:

We are dealing with word frequencies, i.e. a distribution over words, so we can turn this model into a predictor of the next word given the (N - 1) words that occurred before it in the sequence.

Predicting probability of sequence of words: (w = (w1 w2 w3 w4 w5 … wk))

P(w) = P(w1).P(w2|w1).P(w3|w2,w1)… P(wk | w1, w2, … , wk-1)

This is too complicated as computing so many conditional probabilities is a pain!

Markov Assumption:

Instead of conditioning on all the previous words, let’s condition only on the last few; the equation becomes:

P(wi | w1, w2, …, wi-1) ≈ P(wi | wi-n+1, …, wi-1)

Bi-Gram language model:

Since we condition on only one previous word, the model gives the probability of the next word given just the word before it.

P(w) = P(w1).P(w2 | w1).P(w3 | w2)…. P(wk | wk-1)

S1 := "This is a bar"
S2 := "This is not a challenge"
S3 := "Assume this was a challenge"
P(this is) = P(this) . P(is | this) = (2/14) . (2/3) = 2/21P(this is a challenge) = P(this).P(is|this).P(a|is).P(challenge|a)
= (2/14). (2/3) . (1/2) . (2/3) = 2/63
Probabilities in the 2-gram model and in the N-gram model, where the condition is the last n-1 terms instead of just wi-1

Essentially we are doing nothing but counting, so we can also express these sub-probabilities in terms of counts of groups of words.

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

For N-grams we extend the same model: instead of conditioning on one prior word, we condition on the n-1 prior words (n comes from the Markov assumption and can be varied).
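As a quick illustration of the counting, here is a minimal sketch that estimates bigram probabilities from the three example sentences above (lower-cased, no smoothing, no start/end tokens):

from collections import Counter

sentences = ["This is a bar", "This is not a challenge", "Assume this was a challenge"]
tokens = [s.lower().split() for s in sentences]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("this", "is"))        # 2/3
print(bigram_prob("a", "challenge"))    # 2/3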

Likelihood Maximization:

Since we train on the corpus with probabilities, we maximize their logarithm (the log-likelihood), and the maximum-likelihood estimates of the parameters are exactly the individual conditional probabilities, which are nothing but counts as discussed before.

Perplexity (an inverse measure: lower is better) tells you how well a probability model predicts a sample; you can read about it in detail here: https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3

One corner case is an out-of-vocabulary word in the test set: its probability is 0, which drives the perplexity to infinity, so in practice we handle such terms with smoothing (discussed later).
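As a minimal sketch (using the usual definition, not tied to any particular library), perplexity is the exponent of the average negative log-probability the model assigns to the test tokens:

import numpy as np

def perplexity(token_probs):
    # token_probs: probability the model assigned to each test token
    # perplexity = exp(-(1/N) * sum(log p_i)); lower is better
    token_probs = np.asarray(token_probs)
    return float(np.exp(-np.mean(np.log(token_probs))))

print(perplexity([0.25, 0.5, 0.125, 0.25]))  # 4.0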

Problems with N-grams:

  1. For every new word (or set of words) the vocabulary grows, and with it the length of the vectors.
  2. Our feature matrix will be sparse as there will be many zeros.
  3. No retention of information on grammar, ordering, context, etc.

TF-IDF

Term Frequency - Inverse Document Frequency indicates how concentrated a word is in one document compared to all documents in the corpus. In the example below, the term frequency of “this” in the first sentence is 1/4, and it occurs in 3 of the 4 documents.

TF: Number of times word appears in doc / total words in that document

IDF: log (total #documents in corpus / #documents having this term)

From the formula, a word occurring in all documents should have IDF = 0, yet we get 1. That is because scikit-learn adds 1 to the final IDF value so that such ever-present terms are not entirely ignored during training. Also notice that I pass min_df=2, which drops tokens that appear in fewer than 2 documents of the corpus.

TF-IDF automatically down-weights tokens that are not useful for discrimination because they occur in too many documents. This approach gives very good accuracy in practice; I encountered this and mentioned it in a previous post. We can use its accuracy as a benchmark for other methods to beat.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange", "This is very nice",
          "This is really good", "My name is sam"]

vectorizer = TfidfVectorizer(min_df=2, smooth_idf=False)  # idf = ln(N/df) + 1
X = vectorizer.fit_transform(corpus)

idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))
# Output: {'is': 1.0, 'this': 1.2876820724517808, 'very': 1.6931471805599454}
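A quick sanity check of those idf values by hand, using the idf = ln(N/df) + 1 convention from the snippet above (N = 4 documents):

import numpy as np

N = 4                                   # total documents in the corpus
df = {"is": 4, "this": 3, "very": 2}    # documents containing each surviving term

for term, d in df.items():
    print(term, np.log(N / d) + 1)
# is   1.0
# this 1.2876820724517808
# very 1.6931471805599454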

Basic taggers, stemmers, lemmatizers

A well-known POS tagger can be invoked to get information about each word and its role in the sentence.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "I want to go to Mumbai. The cost for flight is 100$. \
Indigo flight is at 18:30 sharp"
doc = nlp(text)
for token in doc:
    print(token, '\t', token.pos_, '\t', token.lemma_)

You can see the part-of-speech tag for every token in the sentence. The lemma is the root form of a token: e.g. “running” and “ran” both have the lemma “run”, and in the example above the lemma of “is” is “be”. Stemming, in contrast, chops prefixes and suffixes off tokens to bring them closer to a base form. One of the best-known stemmers is the Porter stemmer; e.g. “possesses” is cut down to “possess” by its rules.
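A minimal sketch of stemming with NLTK's Porter stemmer, to contrast with the spaCy lemmas above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "ran", "possesses", "flights"]:
    print(word, '->', stemmer.stem(word))
# running -> run, ran -> ran, possesses -> possess, flights -> flight

Note how the stemmer leaves “ran” unchanged, while a lemmatizer maps it to “run”.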

Hidden Markov models (HMM) for POS tagging

Let the xi’s be the sequence of words and the yi’s the sequence of tags; we want the most probable tag sequence given the sentence, i.e. for every word, the probability of it carrying a specific tag. We are maximizing P(y | x), but since y is the hidden variable we instead maximize the joint probability P(x, y) via the conditional expansion.

Markov assumption:

The probability factorizes into separate pieces: the probability of each tag given the previous tag, with a common start token for the first position, just as in the bi-gram model. Secondly, the full conditional probability of the words is too complicated to compute, so we make the output-independence assumption: each word depends only on its own tag.

Markov assumption: P(y) = P(y1 | start) . P(y2 | y1) … P(yk | yk-1)

Output independence: P(x | y) = P(x1 | y1) . P(x2 | y2) … P(xk | yk)

Thus P(x, y) = P(x|y).P(y) is what we model. These assumptions feed the maximum-likelihood parameter updates of the Baum-Welch (EM) algorithm, which I am not going into here; it essentially fine-tunes the parameters to maximize the likelihood, with the E-step computing posterior probabilities of the hidden variables and the M-step performing the maximum-likelihood parameter updates.

We also perform smoothing on the probability parameter.

K-Smoothing:

By adding k to every count we avoid zero probabilities and smooth the estimate across the entire vocabulary of size V:

P(wi | wi-1) = ( count(wi-1, wi) + k ) / ( count(wi-1) + k.V )
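A minimal sketch of add-k smoothing on top of the bigram counts from the earlier snippet (Counter objects return 0 for unseen pairs, which is exactly what the smoothing handles):

def smoothed_bigram_prob(prev, word, k=1.0, V=None):
    # add-k smoothing: (count(prev, word) + k) / (count(prev) + k * V)
    if V is None:
        V = len(unigrams)               # vocabulary size, from the earlier counting snippet
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(smoothed_bigram_prob("this", "is"))       # seen pair: relatively high probability
print(smoothed_bigram_prob("this", "drive"))    # unseen pair: small but non-zero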

Word Embedding

“ You know a word by the company it keeps! ”

By this we mean that words that are similar, or that depict similar things, should lie close to each other in the n-D space. One natural idea is to look at co-occurrences of words, which is measured by Pointwise Mutual Information (PMI):

PMI = log( P(u,v) / P(u).P(v) )

Intuitively, if the two words were independent random variables then P(u, v) = P(u).P(v) by the independence principle, making the ratio 1 and hence PMI = log(1) = 0.

To avoid negative and undefined values (an unseen pair gives log 0 = -infinity), we use the positive part: PPMI = max(0, PMI).
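A minimal sketch of PMI from co-occurrence counts within a fixed window; the counts here are made-up toy numbers just to show the arithmetic:

import numpy as np

total_pairs = 10000        # toy numbers: total co-occurrence pairs observed
count_u = 200              # occurrences of word u in any pair
count_v = 300              # occurrences of word v in any pair
count_uv = 50              # co-occurrences of (u, v)

p_u, p_v, p_uv = count_u / total_pairs, count_v / total_pairs, count_uv / total_pairs
pmi = np.log(p_uv / (p_u * p_v))
ppmi = max(0.0, pmi)
print(pmi, ppmi)           # ~2.12 for both: u and v co-occur far more often than chance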

SOFTMAX

Before going further I want to clarify the SOFTMAX function, which we will be using often to turn raw scores into a smooth probability distribution. It is essentially the plain frequency-normalization approach, except each value is first exponentiated (raised as a power of e).

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
print(a / np.sum(a))                      # plain normalization
print(np.exp(a) / np.sum(np.exp(a)))      # softmax
# Output:
# [0.0625 0.125  0.1875 0.25   0.0625 0.125  0.1875]
# [0.02364054 0.06426166 0.1746813  0.474833   0.02364054 0.06426166 0.1746813 ]

Continuous Bag Of Words (CBOW) model

Whenever we want to predict words we have to train a model on some corpus and fine-tune it, so we use one or more hidden layers for this computation (I am not going into how many layers to use, etc.). In CBOW we predict the target word given its surrounding context words as input (all one-hot encoded).

CBOW architecture

We usually set N (the hidden/embedding dimension) to around 300, as it is known to give good results. To dig into what actually happens during prediction, see below.

Probability distribution: P(wout | w1, w2, …, wC)

Hidden layer: h = (1/C) . W . (w1 + w2 + … + wC), i.e. the average of the projected input word vectors

Cost function (the negative log-likelihood discussed before): -log( P(wout | w1, w2, …, wC) )

Skip-gram model

The skip-gram model does the exact opposite of CBOW: it is trained to predict the surrounding words given one word, so we simply reverse the flow of the CBOW architecture.

given wt, we predict surrounding words for all t

e.g. our training sentence is: “a very good morning”

Then our training samples for window size c = 1 include: (very, a), (very, good); (good, very), (good, morning). The output of the model is a SOFTMAX over the output layer. There are more adaptations in this space, e.g. GloVe and negative sampling, on which you can read this article: https://towardsdatascience.com/nlp-101-negative-sampling-and-glove-936c88f3bc68
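In gensim the choice between the two architectures is a single flag; a minimal sketch, assuming the `documents` list prepared at the start of the post:

import gensim

# sg=0 -> CBOW (default), sg=1 -> skip-gram (gensim 3.x parameter names)
cbow_model = gensim.models.Word2Vec(documents, sg=0, size=150, window=5, min_count=2)
skipgram_model = gensim.models.Word2Vec(documents, sg=1, size=150, window=5, min_count=2)

print(cbow_model.wv.most_similar("king", topn=3))
print(skipgram_model.wv.most_similar("king", topn=3))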

Using Recurrent Neural Networks

We still have the problems of long-term dependencies, of maintaining context throughout a document, and of respecting the order of words. RNNs help us solve this. I won’t describe them in depth here; I recommend this article, which explains RNN and LSTM architectures far better: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Attention mechanism

Long-term dependencies are hard to train, and the plain encoder-decoder mechanism does not look back at the whole input context while generating each translated token. A better approach is to weight every input vector when determining the next decoder state.

Training:

The training data is a set of paired sequences (sentences in one language and their translations, or question-answer pairs for a chatbot): the input embeddings go into the encoder and the outputs into the decoder. The weights are trained by gradient descent (backpropagation) and then evaluated on the test set. A SOFTMAX is applied at each output step so that we get a probability distribution over the vocabulary.

Bidirectional attention mechanism

The context vector feeding decoder state sj is a weighted sum over all the encoder states:

cj = sum over i of ( aij . hi )

The weights aij are a SOFTMAX over similarity scores between each encoder state hi and the previous decoder state sj-1, so the previous decoder state and all the encoder states together determine the next decoder state, which makes sense!
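A minimal numpy sketch of that weighted sum, using dot-product similarity as the scoring function (the model in the figure uses a small learned network for the score, so treat this purely as an illustration of the mechanics):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract max for numerical stability
    return e / e.sum()

H = np.random.randn(6, 8)                # 6 encoder states, dimension 8
s_prev = np.random.randn(8)              # previous decoder state s_{j-1}

scores = H @ s_prev                      # similarity of each encoder state with s_{j-1}
weights = softmax(scores)                # attention weights a_ij, sum to 1
context = weights @ H                    # context vector c_j, a weighted sum of encoder states
print(weights, context.shape)            # (8,)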

Attention mechanism, source: https://labs.eleks.com/2019/06/neural-machine-translation-attention-mechanism.html

The encoder states carrying the information stay fixed; only the attention weights change at every decoding step, so each step gets its own context vector, which drives the next decoder state.

Live example of attention. Remember the vectors used here! This is shown to make the transition to the transformer model easier.

Transformer (Attention is all you need!!!)

All the models we have seen so far are sequential and therefore slow to train, so we need some parallelization that computes everything at once, making training on a very large corpus feasible.

So we turn to approaches that take the whole input as a matrix and encode it in one go!

Attention mechanism with parallelization using matrices #gif

Now that we have parallelism, the model lacks sequential information, so we add positional vectors to the corresponding embeddings. As a short example we could use binary vectors: in “I study at school” we could assign I: [0,0,1], study: [0,1,0], at: [0,1,1] and school: [1,0,0]. You can see these in grey in the later animations.

We can also pass these encodings through a softmax and add them to the embedding vectors. Combining all these ideas gives us the architecture.
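A minimal sketch of the binary positional vectors described above (the actual “Attention Is All You Need” paper uses sinusoidal positional encodings, but the binary version is enough to see the idea of adding position information to each embedding):

import numpy as np

def binary_position_vectors(seq_len, bits=3):
    # position i (1-based) encoded as its binary representation, e.g. 3 -> [0, 1, 1]
    return np.array([[int(b) for b in format(i, '0{}b'.format(bits))]
                     for i in range(1, seq_len + 1)])

embeddings = np.random.randn(4, 3)       # toy embeddings for "I study at school"
pos = binary_position_vectors(4)         # [[0,0,1],[0,1,0],[0,1,1],[1,0,0]]
print(pos)
print(embeddings + pos)                  # position-aware embeddings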

Self attention:

The weight matrices here are ordinary deep-learning parameter matrices, optimized during forward and backward propagation. Three types of matrices are used: Query, Key and Value. The attention score says how much a query word should focus on a key word: each query is compared against every key, the scores go through a SOFTMAX, and the result weights the values.

Encoder layer = Multi-head attention + Feed forward !!!

In practice we use 8 heads, i.e. 8 sets of Wq, Wk and Wv in parallel. The heads talk to the rest of the network by concatenating their outputs and projecting them back to a matrix with the same dimension as the original encoding.
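A minimal numpy sketch of the scaled dot-product attention at the heart of each head, softmax(Q.K^T / sqrt(dk)).V as in the paper; the mask argument is included only so the same helper can be reused for the decoder below:

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                     # how much each query attends to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                 # weighted sum of the values

n, dk, dv = 5, 64, 64                                  # 5 tokens, key/value dims as in the paper's heads
Q, K, V = (np.random.randn(n, d) for d in (dk, dk, dv))
print(scaled_dot_product_attention(Q, K, V).shape)     # (5, 64)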

By stacking such encoding layers, we have our final encoder architecture!

Full encoder architecture

Decoder mechanism

The only new part in the decoder is the masked multi-head attention, which prevents future words from taking part in the attention. Now, how do these modular parts work together? Everything is driven by Q, K and V.

The second multi-head attention block takes its Keys and Values from the encoder output and its Query from the previous decoder sub-layer, just like classic attention! The leftmost matrix is different at every decoding step, as it stacks up the embeddings of the words decoded so far.
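A minimal sketch of that causal mask, reusing the scaled_dot_product_attention helper sketched above: position i may only attend to positions up to and including i.

import numpy as np

n = 5
causal_mask = np.tril(np.ones((n, n), dtype=bool))   # lower-triangular: no peeking ahead
print(causal_mask.astype(int))

Q = K = V = np.random.randn(n, 8)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)   # masked self-attention
print(out.shape)                                                # (5, 8)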

Stacking up all the decoder layers completes the architecture! Remember, the decoder always starts from a <start> token before doing its amazing job.

Finally we arrive at the overall architecture of the transformer as laid out in the research paper (referenced at the end).

That concludes this blog post. I know there are models I haven’t covered here, like BERT, but I think this sets a good benchmark for what we will hopefully discuss soon!

References:

Attention Is All You Need (transformer paper): https://arxiv.org/abs/1706.03762

Transformer video: https://www.youtube.com/watch?v=z1xs9jdZnuY&t=556s

