If you’ve ever worked with text data, you know that it can be messy. Words can be written in different forms, tenses, or even languages. And when you’re trying to extract meaning from this data using machine learning, accuracy is everything. That’s where lemmatization comes in.
What is Lemmatization?
Lemmatization is the process of reducing a word to its base form, or lemma. This is done by considering the word’s context and morphological analysis. Essentially, lemmatization looks at a word and determines its dictionary form, accounting for its part of speech and tense.
```
cats    -> cat
cat     -> cat
study   -> study
studies -> study
run     -> run
runs    -> run
```
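Because the lemma depends on the part of speech, the same surface form can map to different base forms. Here is a minimal sketch with NLTK's WordNetLemmatizer, where the pos argument is what disambiguates:

```python
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# "meeting" is its own lemma as a noun, but lemmatizes to "meet" as a verb
print(lemmatizer.lemmatize("meeting", pos="n"))  # meeting
print(lemmatizer.lemmatize("meeting", pos="v"))  # meet
```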
Why is Lemmatization Important?
By reducing words to their base form, lemmatization eliminates redundancy and ensures consistency in language processing. For example, consider the words “walk,” “walking,” and “walked.” Although they look different, they all share the base form “walk.” Without lemmatization, a machine learning model treats them as three unrelated tokens, splitting their statistical signal across separate features and weakening the analysis.
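Here is a quick sketch of that collapse in code, using NLTK's WordNetLemmatizer with the verb part of speech:

```python
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
forms = ["walk", "walking", "walked", "walks"]
# All four inflected forms reduce to a single lemma
print({lemmatizer.lemmatize(w, pos="v") for w in forms})  # {'walk'}
```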
Using Lemmatization in Natural Language Processing
In natural language processing, lemmatization is a crucial step in pre-processing text data. By lemmatizing words before analyzing them, machine learning models can better understand the meaning behind the words and accurately classify them.
As the experts at NLTK.org explain, “A major goal of natural language processing is to transform input text into an abstract representation that captures the meaning conveyed by the input.” Lemmatization is an essential tool in achieving this goal.
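In practice, lemmatization usually sits inside a pre-processing function alongside tokenization and stop-word removal. A sketch with spaCy, assuming the en_core_web_sm model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Tokenize, drop stop words and punctuation, keep lowercased lemmas
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]

print(preprocess("The cats were chasing the mice"))  # ['cat', 'chase', 'mouse']
```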
Lemmatization vs. Stemming
While lemmatization and stemming both reduce words to a base form, they are not the same. Stemming simply chops suffixes off words according to fixed rules, without consulting a dictionary. This is fast and useful in some contexts, but it can produce tokens that are not real words, leading to inaccuracies in language processing.
For example, consider the word “intelligent.” The Porter stemmer truncates it to “intellig,” which is not a word in the English language. A lemmatizer, by contrast, returns “intelligent,” the correct dictionary form.
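You can see the difference by running a stemmer and a lemmatizer side by side; a small sketch with NLTK's PorterStemmer and WordNetLemmatizer:

```python
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["intelligent", "studies", "running"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# intelligent -> intellig | intelligent
# studies -> studi | study
# running -> run | run
```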
Useful Python Libraries for Lemmatization
- nltk: WordNetLemmatizer (plus stemmers such as LancasterStemmer, useful for comparison)
- spaCy: Lemmatizer (exposed through each token's lemma_ attribute)
- gensim: lemmatize (deprecated; removed in Gensim 4.0)
Below are examples of how to perform lemmatization in Python with NLTK, spaCy, and Gensim.
Simple Lemmatization
```python
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Define some example words
words = ['cats', 'cat', 'study', 'studies', 'run', 'runs']

# Lemmatize each word and print the result
for word in words:
    lemma = lemmatizer.lemmatize(word)
    print(f"{word} -> {lemma}")
```
Lemmatization in NLTK
```python
import nltk
from nltk.stem import WordNetLemmatizer, LancasterStemmer
from nltk.corpus import wordnet

# Download the required resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize the lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = LancasterStemmer()

# Define an example sentence
sentence = "The cats are chasing the mice"

# Tokenize the sentence and tag each token with its part of speech
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Map Penn Treebank tag prefixes to WordNet POS constants
tag_map = {'N': wordnet.NOUN, 'V': wordnet.VERB, 'J': wordnet.ADJ, 'R': wordnet.ADV}

# Lemmatize and stem each token based on its part of speech
lemmatized_tokens = []
stemmed_tokens = []
for token, tag in pos_tags:
    pos = tag_map.get(tag[0])
    if pos is not None:
        lemmatized_tokens.append(lemmatizer.lemmatize(token, pos))
        stemmed_tokens.append(stemmer.stem(token))
    else:
        # Leave tokens with other POS tags untouched
        lemmatized_tokens.append(token)
        stemmed_tokens.append(token)

# Print the original, lemmatized, and stemmed versions of the sentence
print("Original sentence:", sentence)
print("Lemmatized sentence:", " ".join(lemmatized_tokens))
print("Stemmed sentence:", " ".join(stemmed_tokens))
```
This code uses NLTK’s WordNetLemmatizer and LancasterStemmer to lemmatize and stem each token in a sentence. It first downloads the required resources, then tokenizes the sentence and tags each token with its part of speech.
For each noun, verb, adjective, or adverb, the code applies both lemmatization (with the matching WordNet POS) and stemming; other tokens pass through unchanged. Finally, it prints the original, lemmatized, and stemmed versions of the sentence.
```
Original sentence: The cats are chasing the mice
Lemmatized sentence: The cat be chase the mouse
Stemmed sentence: The cat ar chas the mic
```
Lemmatization in spaCy
```python
import spacy
from spacy import displacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Define the text to be lemmatized
text = "I am running in the park"

# Process the text with spaCy
doc = nlp(text)

# Print the lemmas of each token in the processed text
for token in doc:
    print(token.text, token.lemma_)

# Define the options for the visualization
options = {"compact": True, "color": "blue"}

# Visualize the dependency parse (renders inline in a notebook;
# use displacy.serve(doc, ...) when running as a standalone script)
displacy.render(doc, style="dep", options=options)
```
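With a spaCy v3 model, the token/lemma loop prints something like:

```
I I
am be
running run
in in
the the
park park
```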
As you can see, spaCy lemmatizes “am” and “running” to their base forms “be” and “run”, respectively. (In older spaCy v2 releases, pronouns such as “I” were lemmatized to the placeholder “-PRON-”; since v3, the lemma of “I” is simply “I”.)

Lemmatization in Gensim
Gensim’s lemmatize utility is archaic and not the best option: it was removed in Gensim 4.0, and in older versions it only works when the optional pattern package is installed. Prefer spaCy or NLTK for new code.
```
pip install pattern
```
```python
# Works on Gensim 3.x only; gensim.utils.lemmatize was removed in 4.0
from gensim.utils import lemmatize

text = "The cats are playing with the mice"
processed_text = lemmatize(text)
print(processed_text)
```
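On Gensim 3.x with pattern installed, this prints POS-tagged byte strings, something like [b'cat/NN', b'be/VB', b'play/VB', b'mouse/NN']; function words like “the” and “with” are dropped because only nouns, verbs, adjectives, and adverbs are kept by default.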
Useful Dataset for Lemmatization
NLTK’s Brown Corpus
```python
import nltk
nltk.download('brown')
from nltk.corpus import brown

# Access the first sentence from the corpus
sentence = brown.sents()[0]
sentence
```
```
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
 'produced', '``', 'no', 'evidence', "''", 'that', 'any',
 'irregularities', 'took', 'place', '.']
```
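As a quick exercise, you can run the lemmatizer from earlier over this sentence (a sketch assuming the wordnet resource from the previous examples is already downloaded):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word.lower()) for word in sentence])
```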
To Know Before You Learn Lemmatization
- Basic knowledge of Python programming
- Understanding of Natural Language Processing (NLP) concepts
- Knowledge of tokenization, stemming, and Part-of-Speech (POS) tagging
- Familiarity with Python libraries like NLTK, spaCy, and gensim
- Knowledge of how to preprocess text data
- Understanding of different types of text data and their applications
Important Concepts in Lemmatization
- Tokenization
- Part-of-Speech (POS) Tagging
- Morphology
- Stemming
- Word Embeddings
- Language Modeling
- Stop Words
- Corpus
What’s Next?
- Named Entity Recognition (NER)
- Text Classification
- Topic Modeling
- Sentiment Analysis
- Text Summarization
- Dependency Parsing
- Machine Translation
- Question Answering
Relevant Entities
| Entity | Description |
| --- | --- |
| Lemmatization | The process of reducing a word to its base form, or lemma |
| Word | A sequence of characters that has a meaning and can be spoken or written |
| Lemma | The base form of a word |
| Part of Speech | A category assigned to a word based on its grammatical function within a sentence |
| Morphological Analysis | The study of the structure of words and the rules for combining them into larger units |
| Stemming | The process of reducing a word to its root form by removing suffixes or prefixes |
Frequently Asked Questions
What is lemmatization?
Lemmatization reduces a word to its base form, or lemma.
Why is lemmatization important?
It ensures consistency and accuracy in language processing.
When is lemmatization used?
It is a crucial step in pre-processing text data for NLP.
How does lemmatization differ from stemming?
Stemming chops off the ends of words, while lemmatization accounts for part of speech and returns a dictionary form.
What are the benefits of lemmatization?
It helps eliminate redundancy and improves language processing accuracy.
How do I lemmatize text?
Use lemmatization tools or libraries like NLTK or spaCy.
In Conclusion
In today’s world, where we rely on technology to process vast amounts of text data, lemmatization is more important than ever. By accurately reducing words to their base form, we can improve the accuracy of natural language processing and extract meaningful insights from text data. So, if you’re working with text data and haven’t yet incorporated lemmatization into your workflow, it’s time to give it a try!
Sources
Popular online resources on lemmatization in machine learning include:
- Official documentation of the NLTK library
- lemmatization">Official documentation of the spaCy library
- lemmatization-and-stemming-3789be1c55bc">Blog post on Towards Data Science explaining the difference between lemmatization and stemming
- Analytics Vidhya article on text preprocessing in NLP
- GitHub repository of the Natasha library for Russian language processing
- YouTube video on lemmatization by Krish Naik