Stopwords Removal Using NLTK in Python

Stopwords are common words in a language that are often removed in natural language processing (NLP) tasks because they add little semantic value to text data. Words like "is", "the", "and", or "to" are typically filtered out after tokenization and before further processing such as stemming, feature extraction, or machine learning.

The Natural Language Toolkit (NLTK) is a powerful Python library that provides tools for text processing. One of its most widely used features is access to built-in lists of stopwords for various languages. This article will demonstrate how to use NLTK to remove stopwords from text, and how to customize the list for specific applications.

Step 1: Installing NLTK and Downloading Stopwords

Before you begin, make sure NLTK is installed (for example with pip install nltk), then download the stopwords corpus and import it:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

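The call to nltk.download('stopwords') fetches the stopword corpus and stores it locally, so it only needs to succeed once per machine. The import then gives you access to the per-language lists through the stopwords object.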
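If the script runs repeatedly, you may prefer to skip the download when the corpus is already present. The following is a minimal sketch of that idea, not part of the original walkthrough: ensure_stopwords is a hypothetical helper name, and it relies on nltk.data.find raising LookupError when a resource is missing.

```python
def ensure_stopwords():
    """Download the NLTK stopwords corpus only if it is not already available."""
    import nltk  # imported here so the helper can be defined before NLTK is needed

    try:
        # Raises LookupError when the corpus cannot be found locally
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # quiet=True suppresses the progress messages printed by the downloader
        nltk.download("stopwords", quiet=True)
```

Call ensure_stopwords() once at startup, before the first call to stopwords.words(...).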

Step 2: Accessing English Stopwords

To load the English stopword list:

en_stopwords = stopwords.words('english')
print(en_stopwords[:10])  # Show first 10 stopwords

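These are the words that will be filtered out of your text. Every entry in the list is lowercase, so lowercase your own text before comparing against it.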

Step 3: Removing Stopwords from a Sentence

Let’s take a sentence and remove all stopwords from it:

sentence = "it was too far to go to the shop and he did not want her to walk"
# Split on whitespace and keep only the words that are not in the stopword list
filtered_sentence = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(filtered_sentence)

Output: far go shop want walk

The filtered output keeps only the content-bearing words. Note that "not" was dropped as well, so the negation in the original sentence is lost.
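The split-based filter above only matches lowercase, whitespace-separated words, so tokens such as "The" or "to," would pass through unfiltered. A rough sketch of a more forgiving variant follows. The remove_stopwords helper and the small sample_stopwords list are illustrative assumptions, not part of NLTK; in practice you would pass stopwords.words('english') instead of the sample list.

```python
import re


def remove_stopwords(text, stopword_list):
    """Return text with stopwords removed, ignoring case and punctuation."""
    # A set gives constant-time membership checks
    stopword_set = {word.lower() for word in stopword_list}
    # Keep runs of letters, digits, and apostrophes so "don't" stays one token
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return " ".join(tok for tok in tokens if tok.lower() not in stopword_set)


# Hypothetical stand-in for stopwords.words('english')
sample_stopwords = ["it", "was", "too", "to", "the", "and", "he", "did", "not", "her"]

result = remove_stopwords(
    "It was too far to go to the shop, and he did not want her to walk.",
    sample_stopwords,
)
print(result)  # far go shop want walk
```

The regular expression is a deliberately simple tokenizer; for real text a dedicated tokenizer will usually handle edge cases better.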

Step 4: Customizing Stopword Lists

Sometimes, the default list doesn't fit your needs. You might want to keep some words or add domain-specific stopwords:

# Keep "did" and "not" so the negation survives filtering
en_stopwords.remove('did')
en_stopwords.remove('not')
# Treat "go" as an additional stopword
en_stopwords.append('go')

custom_filtered = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(custom_filtered)

Output: far shop did not want walk

Here, "did" and "not" were retained because they were taken out of the stopword list, and "go" was filtered out because it was added to it.
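Calling remove and append changes the list in place, and remove raises a ValueError if the word is not in the list. One possible alternative, sketched here under the assumption that you want to leave the base list untouched, builds a new set with set operations. The customize_stopwords helper and the sample_stopwords list are hypothetical names used only for illustration; the sample stands in for stopwords.words('english').

```python
def customize_stopwords(base, keep=(), add=()):
    """Return a new stopword set: everything in base except `keep`, plus `add`."""
    return (set(base) - set(keep)) | set(add)


# Hypothetical stand-in for stopwords.words('english')
sample_stopwords = ["it", "was", "too", "to", "the", "and", "he", "did", "not", "her"]

custom = customize_stopwords(sample_stopwords, keep={"did", "not"}, add={"go"})

sentence = "it was too far to go to the shop and he did not want her to walk"
custom_filtered = " ".join(word for word in sentence.split() if word not in custom)
print(custom_filtered)  # far shop did not want walk
```

Because the base list is never modified, the same defaults can be reused to build several task-specific variants.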

Stopword removal is a common early step in text preprocessing. With NLTK, you can apply a ready-made list in a few lines and adjust it when the defaults remove words your task depends on, such as negations.
