What is Stemming in NLP (with Python Examples)

In this article, we will explore the concept of stemming in Natural Language Processing, its importance, and how it is used in machine learning, along with Python examples.

Text preprocessing is an important step in Natural Language Processing (NLP). It involves transforming raw text data into a form that is more easily processed and analyzed by machine learning models. One of the most commonly used techniques in text preprocessing is stemming.

What is Stemming?

Stemming is the process of reducing a word to its root or base form, also known as a stem. This is usually done by stripping suffixes (and, in some algorithms, prefixes) according to a set of rules. For example, the word “running” can be stemmed to its base form “run”. The resulting stem does not have to be a dictionary word: the Porter stemmer, for instance, reduces “trouble” to “troubl”. Stemming is used to group together words with the same root, which can help in tasks such as document classification, sentiment analysis, and information retrieval.

Example in Python


import nltk
from nltk.stem import PorterStemmer

# create an instance of the Porter Stemmer
stemmer = PorterStemmer()

# define some words to stem
words = ["cats", "trouble", "troubling", "troubled", "having", "branded", "religiously", "studies"]

# stem each word and print the results
for word in words:
    print(word, "=>", stemmer.stem(word))

Output:

cats => cat
trouble => troubl
troubling => troubl
troubled => troubl
having => have
branded => brand
religiously => religi
studies => studi

Why is Stemming Important?

Stemming is important because it reduces the size of the vocabulary used in text analysis: inflected variants such as “run”, “runs”, and “running” collapse into a single term. Consider the example of a search engine.

If a user searches for “running shoes”, a search engine that matches exact words will only retrieve documents that contain the words “running” and “shoes”, and will miss documents that mention “runs” or “shoe”.

However, if stemming is applied to both the documents and the query, the search engine matches on the stems “run” and “shoe”, so documents containing any variant of those words are retrieved. The index also has fewer distinct terms to store and look up, which is especially useful for large datasets where processing time can be a limiting factor.
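
The search scenario above can be sketched in a few lines. This is a minimal illustration, not how a production search engine works: the three documents, the helper names build_index and search, and the whitespace tokenization are all assumptions made for the example. It stems both the documents and the query with NLTK's PorterStemmer and returns every document that shares at least one stem with the query.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus, keyed by document id
documents = {
    1: "best running shoes for beginners",
    2: "she runs every morning",
    3: "leather shoe repair",
}

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stemmer.stem(word)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that share at least one stem with the query."""
    matches = set()
    for word in query.split():
        matches |= index.get(stemmer.stem(word), set())
    return sorted(matches)

index = build_index(documents)
print(search(index, "running shoes"))  # → [1, 2, 3]
```

An index built on the exact words would match only document 1 for this query, because “runs” and “shoe” are different strings from “running” and “shoes”.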

Types of Stemming

There are two main types of stemming:

  • rule-based stemming
  • statistical stemming.

Rule-based stemming

Rule-based stemming involves applying a set of pre-defined rules to remove suffixes and prefixes from a word. This approach is fast and simple but can result in errors due to irregularities in the language.
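
To make the idea of pre-defined rules concrete, here is a deliberately tiny suffix-stripping stemmer. It is a toy sketch, not the Porter algorithm or any published stemmer: the rule list, the toy_stem name, and the min_stem_len threshold are all invented for illustration.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the remaining stem is too short."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word

for word in ["cats", "studies", "jumped", "running", "news"]:
    print(word, "=>", toy_stem(word))
# cats => cat
# studies => study
# jumped => jump
# running => runn
# news => new
```

The last two results show the kind of error that fixed rules produce: “running” keeps its doubled consonant, and “news” loses an “s” that is not a plural ending. Real rule-based stemmers add many more rules and conditions to handle such cases.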

Statistical stemming

Statistical stemming, on the other hand, uses machine learning algorithms to learn the patterns of word formation and inflection in a language from a corpus. This approach can handle irregular forms better than fixed rules, but it needs training data and more computational resources.

Main Stemming algorithms

Comparing Each Stemming Algorithm with NLTK


from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
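from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's Punkt tokenizer data, which can be
# fetched once with nltk.download('punkt') (newer NLTK releases name
# the resource 'punkt_tab')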

# Text to be stemmed
text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

# Tokenize the text
tokens = word_tokenize(text)

# Perform stemming using different algorithms
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

porter_stemmed = [porter.stem(token) for token in tokens]
snowball_stemmed = [snowball.stem(token) for token in tokens]
lancaster_stemmed = [lancaster.stem(token) for token in tokens]

# Print the stemmed tokens
print("Original Text: ", text)
print("Porter Stemmer: ", porter_stemmed)
print("Snowball Stemmer: ", snowball_stemmed)
print("Lancaster Stemmer: ", lancaster_stemmed)

In this example, we import the PorterStemmer, SnowballStemmer, and LancasterStemmer classes from the nltk.stem module. We also import the word_tokenize function from the nltk.tokenize module to tokenize the text.

We then define a variable text that contains the text to be stemmed, and tokenize it into individual words using the word_tokenize function.

We perform stemming using each of the three algorithms and store the results in separate lists (porter_stemmed, snowball_stemmed, and lancaster_stemmed).

Finally, we print out the original text, followed by the stemmed tokens for each of the algorithms.

Original Text:  It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.
Porter Stemmer:  ['it', 'is', 'import', 'to', 'be', 'veri', 'pythonli', 'while', 'you', 'are', 'python', 'with', 'python', '.', 'all', 'python', 'have', 'python', 'poorli', 'at', 'least', 'onc', '.']
Snowball Stemmer:  ['it', 'is', 'import', 'to', 'be', 'veri', 'python', 'while', 'you', 'are', 'python', 'with', 'python', '.', 'all', 'python', 'have', 'python', 'poor', 'at', 'least', 'onc', '.']
Lancaster Stemmer:  ['it', 'is', 'import', 'to', 'be', 'very', 'python', 'whil', 'you', 'ar', 'python', 'with', 'python', '.', 'al', 'python', 'hav', 'python', 'poor', 'at', 'least', 'ont', '.']
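
One rough way to compare how aggressively each algorithm merges word forms is to count the distinct stems it produces for the same word list. The sketch below does that; the word list and the unique_stems helper are assumptions made for illustration, and no output is shown because the exact stems can vary between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Hypothetical word list chosen for illustration
words = ["python", "pythoning", "pythoned", "pythonly",
         "poor", "poorly", "have", "having"]

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

def unique_stems(stemmer, tokens):
    """Return the set of distinct stems a stemmer produces for the tokens."""
    return {stemmer.stem(token) for token in tokens}

print("Distinct words:", len(set(words)))
for name, stemmer in stemmers.items():
    stems = unique_stems(stemmer, words)
    print(f"{name}: {len(stems)} distinct stems -> {sorted(stems)}")
```

A lower count means the stemmer merged more word forms into shared stems, which shrinks the vocabulary further but also raises the risk of merging unrelated words.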

Other Stemming Algorithms

Applications of Stemming

Stemming is widely used in various applications of NLP. Some of the common applications are:

  • Search engines: As mentioned earlier, stemming lets a search engine match a query against documents that use different forms of the same word, and it reduces the size of the vocabulary that has to be indexed.
  • Sentiment analysis: Stemming is used to group together words with the same root, which can help in identifying the sentiment of a sentence or document.
  • Language translation: Stemming can be used to reduce the number of distinct word forms in a text, which may make it easier for some translation systems to handle rare inflections.

Tools for Stemming

There are many tools available for stemming and related word normalization in different programming languages. Some of the popular ones are:

  • NLTK: The Natural Language Toolkit is a popular library for NLP tasks in Python. It includes several stemmers, among them implementations of the Porter, Snowball, and Lancaster algorithms.
  • SpaCy: SpaCy is another popular NLP library for Python. It does not ship a stemmer; instead it provides lemmatization, which serves a similar normalization purpose.
  • Stanford CoreNLP: The Stanford CoreNLP toolkit is a Java NLP toolkit whose standard pipeline normalizes words through lemmatization rather than stemming.

Stemming with Python Examples

Stemming in NLTK

from nltk.stem import PorterStemmer

# Create a Porter stemmer object
stemmer = PorterStemmer()

# Define a function to perform stemming on a text string
def stem_text(text):
    # Split on whitespace; punctuation stays attached to the words
    words = text.split()
    # PorterStemmer.stem lowercases each word by default
    stemmed_words = [stemmer.stem(word) for word in words]
    stemmed_text = " ".join(stemmed_words)
    return stemmed_text

# Example text to stem
text = "I am running in the park with my dogs"

# Stem the text using the function
stemmed_text = stem_text(text)

# Print the stemmed text
print(stemmed_text)

In this example, we first import the PorterStemmer class from the nltk.stem module. We then create a PorterStemmer object, which we will use to perform stemming.

We then define a function called stem_text that takes a text string as input and returns the stemmed version of that text. To perform stemming, we split the text into individual words using the split method, and then apply the stem method of the PorterStemmer object to each word. We then join the stemmed words back into a string using the join method.

We then define an example text to stem, which in this case is “I am running in the park with my dogs”. We call the stem_text function on this text, which returns the stemmed version of the text, “i am run in the park with my dog”. The “I” becomes lowercase because NLTK's PorterStemmer lowercases its input by default.

Finally, we print the stemmed text using the print function.

i am run in the park with my dog
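
Because split only breaks on whitespace, punctuation stays attached to words, so a token such as “dogs.” would not be stemmed like “dogs”. One possible workaround, sketched below, is to extract alphabetic words with a regular expression before stemming. The function name stem_text_no_punct and the letters-only rule are assumptions for illustration; a proper tokenizer is usually a better choice for real text.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text_no_punct(text):
    """Stem only the alphabetic words in the text, ignoring punctuation."""
    # Hypothetical tokenization rule: keep runs of ASCII letters only
    words = re.findall(r"[A-Za-z]+", text)
    return " ".join(stemmer.stem(word) for word in words)

print(stem_text_no_punct("I am running in the park, with my dogs."))
# → i am run in the park with my dog
```

Note that this simple rule also splits contractions such as “don't” into two tokens, which is one reason to prefer a dedicated tokenizer.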

Stemming in spaCy


import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

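# Define a function to normalize a text string.
# Note: spaCy has no stemmer, so this uses lemmatization instead.
def stem_text(text):
    doc = nlp(text)
    # token.lemma_ is the dictionary form (lemma) of each token
    lemmatized_text = " ".join([token.lemma_ for token in doc])
    return lemmatized_text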

# Example text to stem
text = "I am running in the park with my dogs"

# Stem the text using the function
stemmed_text = stem_text(text)

# Print the stemmed text
print(stemmed_text)

In this example, we first load the small English model of SpaCy. SpaCy does not include a stemmer, so the stem_text function here uses lemmatization as the closest equivalent: it takes a text string as input and returns the text with each word replaced by its lemma, which we read from the lemma_ attribute of each token.

We then define an example text, which in this case is “I am running in the park with my dogs”. We call the stem_text function on this text, which returns “I be run in the park with my dog”. Note that “am” becomes “be”, a change that a suffix-stripping stemmer would not make.

Finally, we print the stemmed text using the print function.

I be run in the park with my dog
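
The difference between a stem and a lemma can be seen side by side without loading a spaCy model. In the sketch below, the stems come from NLTK's PorterStemmer, while the lemmas come from a small hand-written table that is purely hypothetical and only stands in for what a real lemmatizer would look up.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical hand-written lemma table, for illustration only;
# a real lemmatizer derives these from a vocabulary and grammar rules
LEMMAS = {
    "am": "be",
    "running": "run",
    "studies": "study",
    "troubling": "trouble",
}

def compare(word):
    """Return (stem, lemma) for a word; the lemma falls back to the word itself."""
    return stemmer.stem(word), LEMMAS.get(word, word)

for word in LEMMAS:
    stem, lemma = compare(word)
    print(f"{word}: stem={stem}, lemma={lemma}")
```

The stems for “studies” and “troubling” are the truncated forms “studi” and “troubl”, whereas the lemmas are ordinary dictionary words.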

Relevant Entities

Entity | Properties
Stemming | reduces words to their base or root form
Stem | the base or root form of a word
Porter Stemming Algorithm | the most widely used stemming algorithm
Snowball Stemming Algorithm | a more advanced and multilingual version of the Porter algorithm
Lancaster Stemming Algorithm | a more aggressive stemming algorithm than the Porter algorithm

Frequently Asked Questions

What is stemming in NLP?

Reducing words to their base/root form.

Why is stemming important in NLP?

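It normalizes text by removing inflectional endings, so that different forms of a word are treated as the same term. This reduces vocabulary size and can improve matching in tasks such as search.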

What is the difference between stemming and lemmatization?

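Stemming strips word endings using heuristic rules and may produce a stem that is not a real word (“studies” becomes “studi”), while lemmatization uses a vocabulary and morphological analysis to return the dictionary form of the word (“studies” becomes “study”).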

What are some common stemming algorithms?

Porter Stemming Algorithm, Snowball Stemming Algorithm, and Lancaster Stemming Algorithm.

Can stemming lead to inaccurate results?

Yes, as it may incorrectly reduce words that have different meanings to the same stem (over-stemming), or fail to merge words that share a root (under-stemming).
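
A commonly cited case of this with the Porter algorithm is the pair “universe” and “university”, which have different meanings but end up with the same stem. The short check below is a sketch using NLTK's PorterStemmer; the same_stem helper is a name invented for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def same_stem(first, second):
    """Return True if both words are reduced to the same stem."""
    return stemmer.stem(first) == stemmer.stem(second)

for first, second in [("universe", "university"), ("cat", "dog")]:
    print(first, second, "->", stemmer.stem(first), stemmer.stem(second),
          "| same stem:", same_stem(first, second))
```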

Is stemming useful for sentiment analysis?

It can be: grouping word variants under one stem reduces the number of distinct features a model has to learn, although the effect on accuracy depends on the dataset.

Conclusion

Stemming is an important technique in text preprocessing that helps to reduce the size of the vocabulary used in text analysis. It is widely used in various applications of NLP, including search engines, sentiment analysis, and language translation. Stemmers are available in many programming languages; in Python, NLTK provides the Porter, Snowball, and Lancaster stemmers, while libraries such as SpaCy offer lemmatization as an alternative. Used with care, stemming can improve the efficiency of text analysis in machine learning, and often its accuracy as well.
