Evaluating Text Quality with BLEU Score in Python Using NLTK

Mansoor Aldosari
Oct 22, 2023

When it comes to assessing the quality of text generation or translation, one of the most widely used metrics is the BLEU (Bilingual Evaluation Understudy) score. BLEU provides a quantitative measure of how well a machine-generated text aligns with human-generated references. In this article, we’ll explore how to calculate BLEU scores using Python and the NLTK library.

What is BLEU Score?

The BLEU score was introduced in a 2002 paper by Kishore Papineni and his colleagues. It addresses the difficulty of evaluating machine-generated text, such as translations, in an automated and quantitative way. The core idea is to measure the similarity between a candidate text (the text generated by the machine) and one or more reference texts (human-written translations or ground truth) by examining the n-grams, or contiguous sequences of n words, that they share.

BLEU operates based on the following principles:

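1. Precision: BLEU measures how many of the n-grams in the candidate text also appear in the reference texts, counting each n-gram no more often than it occurs in a reference, so repeating a matching word does not inflate the score. It calculates a precision score for each n-gram size (typically 1 to 4) and then combines these precisions with a geometric mean.

2. Brevity Penalty: Precision alone would favor very short candidates, so BLEU penalizes them. If the candidate text is shorter than the reference length, its score is scaled down; otherwise no penalty is applied.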
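
The two principles above can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the helper names ngrams, modified_precision, brevity_penalty, and simple_bleu are hypothetical, and the sketch simply returns 0 when any n-gram precision is zero instead of applying smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(references, candidate):
    """1.0 if the candidate is at least as long as the closest reference,
    otherwise exp(1 - r / c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(references, candidate, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(references, candidate) * geo_mean


reference = "this is a simple example".split()
candidate = "this is an example".split()

print(modified_precision([reference], candidate, 1))           # 0.75 (3 of 4 unigrams match)
print(round(simple_bleu([reference], candidate, max_n=2), 3))  # 0.389
print(simple_bleu([reference], candidate))                     # 0.0 (no shared 3-grams)
```

With unigrams and bigrams only, the pair scores about 0.389: the geometric mean of 3/4 and 1/3 is 0.5, scaled by a brevity penalty of exp(1 - 5/4). With the usual four n-gram sizes the score drops to zero, because the candidate shares no 3-grams with the reference.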

Calculating BLEU Score with NLTK

Now, let’s calculate BLEU scores for a sample sentence. Assume we have the reference sentence “this is a simple example” and the candidate sentence “this is an example”. We’ll calculate both the sentence-level and the corpus-level BLEU score.

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

# Reference and candidate sentences, tokenized on whitespace
reference = "this is a simple example".split()
candidate = "this is an example".split()

# Calculate BLEU score for a single sentence
# (sentence_bleu expects a list of references, so wrap the single reference)
bleu_score = sentence_bleu([reference], candidate)
print("BLEU Score (sentence):", bleu_score)

# Calculate BLEU score for a corpus (list of candidate sentences)
# Each candidate gets its own list of references, hence the extra nesting
corpus_reference = [["this is a simple example".split()]]
corpus_candidate = ["this is an example".split()]

corpus_bleu_score = corpus_bleu(corpus_reference, corpus_candidate)
print("BLEU Score (corpus):", corpus_bleu_score)

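In this code, we calculate the BLEU score for a single sentence using sentence_bleu and for a corpus of sentences using corpus_bleu. Both functions take tokenized input (lists of words) and accept several references per candidate. By default they weight 1-gram to 4-gram precision equally; because this short candidate shares no 3-grams or 4-grams with the reference, recent NLTK versions print a warning and report a score very close to zero. You can replace the reference and candidate sentences with your own data for evaluation.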
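
For short sentences like this one, the default 4-gram setting tends to produce a near-zero score, since a single missing n-gram order zeroes out the geometric mean. The sketch below shows two common ways around that, using the weights and smoothing_function parameters of sentence_bleu together with NLTK's SmoothingFunction class. The choice of method1 and of equal unigram/bigram weights is an assumption made for illustration, not a recommendation from the original example.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "this is a simple example".split()
candidate = "this is an example".split()

# BLEU-2: weight only unigram and bigram precision, equally
bleu_2 = sentence_bleu([reference], candidate, weights=(0.5, 0.5))

# Default 4-gram BLEU, with smoothing applied to the zero-count
# 3-gram and 4-gram precisions (method1 is one of several options)
smoother = SmoothingFunction().method1
bleu_4_smoothed = sentence_bleu([reference], candidate,
                                smoothing_function=smoother)

print("BLEU-2:", round(bleu_2, 3))
print("BLEU-4 (smoothed):", round(bleu_4_smoothed, 3))
```

Scores computed with different weights or smoothing methods are not directly comparable, so it helps to report which settings were used alongside the number.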


The BLEU score is a valuable tool for evaluating the quality of machine-generated text in a quantifiable manner. By comparing n-grams in candidate texts to reference texts, BLEU provides a meaningful metric for assessing text quality. With Python and NLTK, calculating BLEU scores is straightforward and can be applied to a wide range of natural language processing tasks, including machine translation, text generation, and more.