Comprehensive Linguistics: From Unigrams to N-grams
- Yeshwanth G
- Jun 29, 2023
- 3 min read
Updated: Jul 10, 2023


Alright, so the N-gram model is a classic model where we use conditional probability to predict a certain word as the next word, given the previous words of the sentence. More precisely, it is the probability of a word appearing after the previous one or more words, estimated from the dataset that we provide. You can think of an N-gram as a sequence of words, and notice how I said one or more words. This is exactly what the N in "N-gram" stands for.
If N=1, it's called a Unigram.
If N=2, it's called a Bigram.
If N=3, it's called a Trigram.
And so on…
We as the trainers get to choose the value of N when training the model. For most text-processing tasks, N=3 or 4 is considered ideal, but you can always try different values of N and see which one gives you the desired response.
Let’s understand it better with an example,
Consider the sentence "She ate her dogs homework".
Unigrams would be: "She", "ate", "her", "dogs", "homework".
Bigrams would be: “She ate”, “ate her”, “her dogs”, “dogs homework”.
Trigrams would be: “She ate her”, “ate her dogs”, “her dogs homework”.
So you get the idea…
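If you want to see this in code, here is a minimal sketch in plain Python (the ngrams helper and the simple whitespace tokenization are just my illustrative choices, not something from the linked notebook):

```python
# A minimal sketch of extracting N-grams from a sentence in plain Python.
# Tokenization here is a simple whitespace split, chosen for illustration.

def ngrams(sentence, n):
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "She ate her dogs homework"

print(ngrams(sentence, 1))  # ['She', 'ate', 'her', 'dogs', 'homework']
print(ngrams(sentence, 2))  # ['She ate', 'ate her', 'her dogs', 'dogs homework']
print(ngrams(sentence, 3))  # ['She ate her', 'ate her dogs', 'her dogs homework']
```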
Now you might be wondering how exactly the conditional probability works. If you aren't, well, you should be hehe.
Let's go back to our good old sentence.
The joint probability of the sentence is decomposed as
P(She ate her dogs homework) = P(She) * P(ate | She) * P(her | She ate) * P(dogs | She ate her) * P(homework | She ate her dogs (hopefully not lol))
So this gives us the probability of the sequence occurring as a whole in a given context or language.
In this decomposition, each conditional probability represents the likelihood of a word given its preceding context.
For example, P("ate" | "She") represents the probability of the word "ate" given the preceding word "She". Similarly, P("her" | "She ate") represents the probability of the word "her" given the preceding context "She ate", and so on.
By decomposing the sentence in this manner, we break down the joint probability of the sentence into a sequence of conditional probabilities, allowing us to model and estimate the probabilities of individual words based on their preceding context.
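To make the decomposition concrete, here is a tiny sketch (the chain_rule_factors helper is just a name I made up) that lists the conditional factors for any sentence, without computing any actual probabilities:

```python
# Prints the chain-rule factors for a sentence, so you can see which
# conditional terms the model would need. No probabilities are estimated here.

def chain_rule_factors(sentence):
    tokens = sentence.split()
    factors = []
    for i, word in enumerate(tokens):
        context = " ".join(tokens[:i])
        factors.append(f"P({word} | {context})" if context else f"P({word})")
    return factors

print(" * ".join(chain_rule_factors("She ate her dogs homework")))
# P(She) * P(ate | She) * P(her | She ate) * P(dogs | She ate her) * P(homework | She ate her dogs)
```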
But like everything, there are shortcomings in this approach as well, i.e.,
it is computationally expensive and not very feasible when you consider decomposing every sentence in a huge dataset with millions of sentences.
So instead, we make use of the Markov assumption, i.e., the probability of a word depends only on a limited number of previous words (a small window size). The best example is the bigram model.
Bigram
In the bigram model, we approximate the probability of a word given all the previous words by using only the conditional probability of the one preceding word.
Decomposing the sentence as P(ate | she), P(her | ate), P(dogs | her), P(homework | dogs).
Intuitively, P(ate | she) means: what is the probability of the word "ate" following the word "she" across the entire dataset? That is computed by traditional counting, i.e., P(ate | she) = count("she ate") / count("she") over the entire dataset containing millions of sentences like these. This knowledge is used to learn context and predict the next word. It's called a bigram since we use 2 words at a time (window size = 1).
Generally,
P(w_n | w_(n-1)) = count(w_(n-1) w_n) / count(w_(n-1))
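Here is a hedged sketch of what that counting looks like in plain Python. The toy corpus and the bigram_prob helper are made up purely for illustration; the linked implementation works on the actual dataset.

```python
# Estimating bigram probabilities by counting, following the formula above:
# P(w_n | w_(n-1)) = count(w_(n-1) w_n) / count(w_(n-1)).

from collections import Counter

# Toy corpus, made up for illustration only.
corpus = [
    "she ate her dogs homework",
    "she ate an apple",
    "the dog ate her homework",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    # count("prev word") / count("prev"); zero if the context was never seen
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("she", "ate"))  # count("she ate") / count("she") = 2/2 = 1.0
print(bigram_prob("ate", "her"))  # count("ate her") / count("ate") = 2/3 ≈ 0.67
```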
IMPLEMENTATION:
Along with this blog, I have attached a link to my implementation of N-grams to predict sentiment by analyzing financial news, with reasonably good accuracy and a step-by-step explanation of the code.
DATASET:
To give y'all a quick heads-up about the dataset:
-> It has a total of about 4845 entries with two columns, where the first column is the sentiment type and the other column contains news headlines.
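If you want to peek at a dataset like this yourself, a sketch with pandas could look like the following. The file name financial_news.csv and the column names sentiment/headline are assumptions on my part; the linked implementation uses the actual ones.

```python
import pandas as pd

# Hypothetical file and column names; check the linked notebook for the real ones.
df = pd.read_csv("financial_news.csv", names=["sentiment", "headline"])

print(len(df))                         # roughly 4845 rows
print(df["sentiment"].value_counts())  # how the sentiment labels are distributed
print(df.head())
```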
After this, we shall be building a word2vec model, which is a little more advanced than this one. So stay tuned!!