An overview of N-gram language modeling (2025)

The goal of language modeling is to assign probabilities to sequences of words, such as a phrase or a sentence; in the simplest case, a language model may also operate at the level of individual words. With a language model, we can ask questions like: what is the probability of the sentence “all is well” or the phrase “once in a blue moon”? In this blog article, I’ll talk about the N-gram model as a probabilistic language model.

To answer the question of why we need to assign probabilities to word sequences, let me present some real-world applications of language modeling.

Speech Recognition

One good example application of language modeling is speech recognition. In speech recognition, the problem is to transcribe an audio signal as a sequence of words, and the objective is to find the most likely word sequence given that audio signal. For example, given two candidate transcriptions, we know from experience that “I ate a cherry” is far more probable than “Eye eight a Jerry.” By assigning probabilities to sequences of words, we can identify which transcription of the audio signal is more accurate.

Spelling Correction

Another application of language modeling is spelling correction and, to some extent, grammatical error correction. The goal in this use case is to find and correct errors in writing. For example, a writer might type “Their” where “There” was intended, or write “improve” where the past participle form, “improved,” is required. By assigning probabilities to sequences of words, we can determine which sequence of words is more probable, which in turn helps us detect and correct these errors.

Machine Translation

Assigning probabilities to word sequences is also critical for machine translation. Suppose we are translating a sentence written in Chinese. We cannot simply translate the text word for word, as the translated sentence would most likely make no sense. Thus, in machine translation, we need to find the most probable and fluent translation, and we can do so automatically only if we have already assigned probabilities to word sequences.

Autocompletion

Apart from assigning probabilities to entire word sequences, language models can also assign a probability to a given word following a sequence of words. Thus, autocompletion is another application of language modeling. This function is used in search engines such as Google and YouTube, where it enables the quick generation of query recommendations.

The most intuitive method for estimating the probability of sequences of words is to use relative frequency. For instance, if the task is to determine the probability that the next word after the phrase “This is the” will be “house,” a simple solution is to count how often “this is the” is followed by “house” in the training corpus and divide by the number of times “this is the” occurs. In the example I used, the relative frequency counts from the training corpus give a probability of 0.25 that “house” is the next word after “This is the.”
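
To make this concrete, here is a minimal sketch of relative frequency estimation. The toy corpus, the function name, and the resulting numbers are illustrative assumptions of mine, not the counts from the article’s original example.

    def next_word_probability(history, word, sentences):
        """Estimate P(word | history) as Count(history + word) / Count(history)."""
        history_tokens = history.lower().split()
        n = len(history_tokens)
        history_count = 0
        continuation_count = 0
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == history_tokens:
                    history_count += 1
                    if i + n < len(tokens) and tokens[i + n] == word:
                        continuation_count += 1
        return continuation_count / history_count if history_count else 0.0

    # A tiny toy corpus; in practice this would be millions of sentences.
    corpus = [
        "this is the house that jack built",
        "this is the malt that lay in the house",
        "this is the end",
        "this is the house of cards",
    ]

    # Count("this is the house") / Count("this is the") = 2 / 4 = 0.5 on this toy corpus.
    print(next_word_probability("this is the", "house", corpus))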

Hence, given any sequence of words, we can theoretically calculate the joint probability of the sequence by using the chain rule. Basically, the chain rule allows us to compute the joint probability of an entire sequence of words by multiplying together conditional probabilities.
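
For reference, the chain rule described above can be written as follows (this is the standard textbook form, since the article’s original figure is not reproduced here):

    P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})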

This method of estimating probability, however, is impractical because it is impossible to keep track of all possible histories for all words. Thus, rather than computing the probability of a word given its whole history, we can approximate it by using only the most recent few words. The assumption that the probability of a word depends only on a few preceding words is referred to as a Markov assumption. Markov models are a class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

In a bigram model, for example, we estimate the probability of a word solely on the basis of the preceding word. Thus, to calculate the probability of a sequence of words using the bigram model, we can simply multiply the bigram probabilities.
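
To make the bigram case concrete, here is a minimal sketch of such a model; the toy corpus, the sentence-boundary markers <s> and </s>, and the function names are my own illustrative choices, not something specified in the article.

    from collections import defaultdict

    def train_bigram_model(sentences):
        """Collect unigram and bigram counts, with <s> and </s> boundary markers."""
        unigram_counts = defaultdict(int)
        bigram_counts = defaultdict(int)
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            for i in range(len(tokens) - 1):
                unigram_counts[tokens[i]] += 1
                bigram_counts[(tokens[i], tokens[i + 1])] += 1
        return unigram_counts, bigram_counts

    def sentence_probability(sentence, unigram_counts, bigram_counts):
        """P(sentence) ≈ product over i of C(w_{i-1} w_i) / C(w_{i-1})."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        prob = 1.0
        for i in range(len(tokens) - 1):
            prev, word = tokens[i], tokens[i + 1]
            if unigram_counts[prev] == 0:
                return 0.0
            prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
        return prob

    unigrams, bigrams = train_bigram_model(["all is well", "all is lost", "well is all"])
    print(sentence_probability("all is well", unigrams, bigrams))  # ≈ 0.074 on this toy corpus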

We can generalize from bigrams to higher-order N-grams such as trigrams, 4-grams, 5-grams, and so on. The simplest case is the unigram, or bag-of-words, model, in which N is just equal to 1. The general equation for N-gram models is given below: the lowercase n refers to the total number of words in the sequence, while the uppercase N is the N-gram order that you set.
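
Since the article’s original figure is not reproduced here, the general equation can be written in its standard textbook form as

    P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})

with each conditional probability estimated by maximum likelihood from corpus counts:

    P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{C(w_{i-N+1} \ldots w_{i-1}\, w_i)}{C(w_{i-N+1} \ldots w_{i-1})}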

Extrinsic evaluation

The best way to evaluate the performance of a language model is to put the models into an application and evaluate their performance there. This type of evaluation is called extrinsic evaluation. The basic idea is to compare how much the different models improve the task at hand. For example, for a spelling corrector, we can compare the performance of two language models by comparing how many misspelled words were corrected properly. In a speech recognition application, you can count the number of words that were transcribed correctly. Unfortunately, extrinsic evaluation of language models is time-consuming and expensive to perform.

Intrinsic evaluation

Intrinsic evaluation is another method for evaluating language models. Intrinsic evaluation measures the model’s quality independently of any application. The idea behind an intrinsic evaluation of a language model is to evaluate the model on unseen data called the test set. This is just your standard test set from a typical machine learning workflow; the distinction in language modeling, however, is that there are no labels against which your predictions can be easily compared. So how are you going to compare the performance of two language models then? The answer is simple: the better model is the one that assigns a higher probability to the test data.

In practice, however, it is more common to use perplexity instead of raw probabilities to compare models. The perplexity of a language model with respect to a test set is the inverse probability of the test set, normalized by the number of words. Perplexity decreases as the probability of the word sequence increases; hence, minimizing perplexity is the same as maximizing the probability of the test set.
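
Written out in standard notation (the article’s original figure is not reproduced here), the perplexity of a test set W = w_1 w_2 … w_n is

    PP(W) = P(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1, w_2, \ldots, w_n)}}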

Now, let me present some of the practical issues with N-gram models.

Long distance dependencies

Bigrams, 4-grams, and 5-grams are usually insufficient because of the long-distance dependencies in language. For instance, consider the following sentence: “the computer which I had just put into the machine room on the fifth floor crashed.” A low-order N-gram model would miss the crucial context that the thing that crashed was the computer and not the floor. Thus, to achieve a lower perplexity, we would expect to need a higher-order N-gram model, which in turn implies the need for a larger corpus to train on.

Numerical underflow

Another practical issue concerns the computation of probabilities: the more probabilities we multiply together, the smaller the product gets. This can result in numerical underflow, in which the value becomes too small to be represented as a floating-point number. To circumvent this, we can use log probabilities instead of raw probabilities, since adding in log space is the same as multiplying in linear space. As a result, the numbers we work with are not as small, and we can simply exponentiate the log probability to convert it back into a probability when needed.
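
A minimal sketch of this trick (the probability values below are made up for illustration): sum the log probabilities instead of multiplying the raw ones, and exponentiate only at the end if a raw probability is needed.

    import math

    def sentence_log_probability(word_probs):
        """Sum log probabilities instead of multiplying raw probabilities."""
        return sum(math.log(p) for p in word_probs)

    # Toy per-word probabilities for a sentence; the direct product would be tiny (2.4e-07).
    probs = [0.2, 0.05, 0.1, 0.003, 0.08]
    log_prob = sentence_log_probability(probs)
    print(log_prob)            # ≈ -15.24
    print(math.exp(log_prob))  # ≈ 2.4e-07, recovering the raw probability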

Zero probability

One last practical consideration in implementing an N-gram model is the assignment of zero probability to events that were unseen during training. This occurs when the test corpus contains words or N-grams that are not present in the training corpus. It presents an issue for maximum likelihood estimation, because if the probability of any word in the test set is zero, the probability of the entire test set is also zero. Furthermore, if we use perplexity as the evaluation metric, we cannot obtain a valid perplexity value because we cannot divide by zero. To prevent a language model from assigning zero probability to these unseen events, the basic idea is to remove a small amount of probability mass from more frequent events and reassign it to the unseen events. This technique is referred to as smoothing or discounting. For additional information on the different smoothing techniques, I highly recommend reading Jurafsky and Martin’s book chapter.
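
As one concrete example, here is a hedged sketch of add-one (Laplace) smoothing, the simplest of the smoothing techniques Jurafsky and Martin describe, applied to a bigram estimate; the counts and vocabulary size below are made up for illustration.

    def laplace_bigram_probability(prev, word, bigram_counts, unigram_counts, vocab_size):
        """Add-one smoothed estimate: (C(prev word) + 1) / (C(prev) + V)."""
        return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)

    # With C("blue") = 10 and a vocabulary of 1,000 words, an unseen bigram
    # such as ("blue", "moon") gets probability 1/1010 instead of 0.
    print(laplace_bigram_probability("blue", "moon", {}, {"blue": 10}, 1000))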

We first discussed how language models assign a probability to a sequence of words and how that probability can be used to recommend the next word given the previous words. We also talked about N-grams as Markov models, in which we do not need the entire history of words: we estimate the probability of a sequence using only a fixed window of previous words. I also introduced the ideas of extrinsic and intrinsic evaluation. In extrinsic evaluation, models are ranked according to how much they improve a given task, whereas in intrinsic evaluation, models are ranked according to which assigns a higher probability to the test set. Finally, I discussed some practical issues with N-gram modeling and how to address them.

References

  1. Jurafsky, D., & Martin, J. H. (2021). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Third Edition draft, 29–55.
  2. Potapenko, A. (n.d.). Episode 1: Count! N-gram language models [MOOC lecture]. Natural Language Processing: Language Modeling and Sequence Tagging, National Research University. Coursera. https://www.coursera.org/lecture/language-processing/count-n-gram-language-models-IdJFl