Welcome to the amazing world of Natural Language Processing (NLP), where computers are getting better and better at understanding what we say.
Have you ever thought about how computers understand human language? The answer lies in a special secret: NLP.
NLP models can converse with us fluently largely thanks to careful text refinement. In this article, we're going to walk through the crucial text cleaning and preprocessing steps that make this possible.
We’ll talk about things like punctuation, special characters, stopwords, stemming, and even lemmatization. From breaking down sentences to fixing spelling mistakes – we’re here to make these ideas easy to understand.
But that’s not all – we’re also diving into more complex topics like dealing with numbers, making contractions clear, and handling tricky HTML tags. Each part of our article will show you the hidden work that makes NLP shine. Get ready to learn how text cleaning and preprocessing make NLP work its wonders in ways you never imagined.
What is NLP?
Natural Language Processing (NLP) combines computer science, human language, and artificial intelligence. It's about building instructions and tools so that computers can understand, interpret, and generate human language.
NLP uses computer science and AI methods to help computers figure out what words and sentences mean, just like we do. This helps computers talk to us better and feel more like real conversations. This special area has led to cool things like translating languages, helpful chatbots, and understanding feelings in text. It’s where technology and language meet.
For example, NLP is useful in:
- Question Answering Chatbots
- Voice-based Mobile/Computer Unlocking
- Speech Recognition
- Spam Detection in Email
- Summarization
There are two key aspects of Natural Language Processing (NLP):
- Natural Language Understanding (NLU): NLU focuses on teaching computers to comprehend and interpret human language. NLU algorithms help computers “understand” what humans are saying or writing. NLU helps computers grasp the meaning, context, and structure of language, allowing them to respond accurately and effectively.
- Natural Language Generation (NLG): NLG is all about getting computers to produce human-like text. NLG algorithms take structured data or information and convert it into readable, coherent sentences or paragraphs. It’s like teaching computers to write! NLG uses patterns and rules to generate human-friendly content that feels natural and understandable.
Basic Terminology in NLP
- Corpus: A corpus is a large collection of text documents, such as news articles, tweets, and other information, that computers can learn from. Ex- A corpus consists of documents (Document-1, Document-2, Document-3, …).
- Document: A document is a single text within the corpus and consists of different paragraphs.
- Token and Tokenization: Tokens are smaller parts of sentences, and tokenization is the process of splitting sentences into tokens. Ex- Suppose this is a sentence: “My Name is Sanket”. After tokenization, we get [“My”, “Name”, “is”, “Sanket”].
- Morpheme (Base Word): The smallest meaningful unit of a word, without a prefix or suffix. Ex- “Uncomfortable” = “Un” + “Comfort” + “Able”; here, the base word is “Comfort”.
Now, Let’s explore each of the steps and techniques in text cleaning!
Text Cleaning Techniques
Noise Entity Removal
Noise entity removal is a crucial step in NLP, where irrelevant or meaningless entities are identified and removed from text data.
By eliminating entities like generic terms, symbols, and unrelated names, the text becomes more focused and accurate, enhancing the quality of NLP tasks such as sentiment analysis and named entity recognition. This process ensures that analyses are based on meaningful content, free from distracting or inconsequential elements.
a] Removal of Special Characters and Punctuation:
One part of this process is cleaning out special characters and punctuation. These are the non-standard letters and symbols, like emojis or characters from other scripts, along with the dots, commas, and marks that keep sentences organized.
Here's why this matters: these characters and marks can make the words harder for machines to read. They can confuse models that try to understand words or figure out what they mean. So we take them out to ensure the machines understand the words better and can do their jobs, like working out what words mean or detecting how people feel in writing.
Some examples of special characters and punctuation that are often removed are:
| # ’ ” , . % & ^ * ! @ ( ) _ + = – [ ] $ > \ { } ` ~ ; : / ? <
Here's a Python snippet that shows how you can remove punctuation from text:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt') # Download the punkt tokenizer
# Define function
def remove_punctuation(text):
    words = word_tokenize(text)  # Tokenize the text into words
    words_without_punct = [word for word in words if word.isalnum()]  # Keep only alphanumeric words
    clean_text = ' '.join(words_without_punct)  # Join the words back into a cleaned text
    return clean_text
# Example input text with punctuation
input_text = "Hello, Everyone! How are you?"
# Remove punctuation using nltk
cleaned_text = remove_punctuation(input_text)
print("Input Text:", input_text)
print("Cleaned Text:", cleaned_text)
Output:
Input Text: Hello, Everyone! How are you?
Cleaned Text: Hello Everyone How are you
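If you don't need tokenization, Python's standard library can do this on its own. Here's a minimal alternative sketch using string.punctuation and str.translate (note that it only strips ASCII punctuation characters):
import string
# Sample text with punctuation
text = "Hello, Everyone! How are you?"
# Build a translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)
cleaned = text.translate(translator)
print("Cleaned Text:", cleaned)
Output:
Cleaned Text: Hello Everyone How are you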
b] Stopwords removal:
Stopwords are common words such as “the”, “and”, “is”, and “in” that appear frequently in a language but often carry little meaning on their own. Since NLP aims to uncover meaningful insights from text data, removing stopwords is a crucial step. Left in place, they can affect tasks like text classification, sentiment analysis, and topic modeling by introducing unnecessary complexity and potentially skewing results.
To enhance the effectiveness of NLP tasks, various methods are used to identify and remove stopwords. One common approach involves utilizing predefined lists of stopwords that are specific to a language. These lists contain words that are generally considered to have little semantic value. NLP libraries, like NLTK or spaCy in Python, offer built-in stopword lists that can be employed for this purpose.
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
# Sample sentence with more stopwords
sentence = 'There are so many movies in the world, but few of them were able to win Oscar'
# Tokenize the sentence
words = nltk.word_tokenize(sentence)
# Load English stopwords
stop_words = set(stopwords.words('english'))
# Remove stopwords
filtered_words = [w for w in words if w.lower() not in stop_words]
print("Original Sentence:", sentence)
print("After Stopword Removal:", " ".join(filtered_words))
Output:
Original Sentence: There are so many movies in the world, but few of them were able to win Oscar
After Stopword Removal: many movies world , able win Oscar
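As mentioned above, spaCy also ships with a built-in stopword list. Here's a rough sketch of the same idea with spaCy; this assumes the small English model en_core_web_sm has already been downloaded with `python -m spacy download en_core_web_sm`, and the exact words removed may differ slightly from NLTK's list:
import spacy
# Load the small English pipeline (assumed to be installed beforehand)
nlp = spacy.load("en_core_web_sm")
sentence = "There are so many movies in the world, but few of them were able to win Oscar"
doc = nlp(sentence)
# Keep tokens that spaCy does not mark as stopwords or punctuation
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print("After Stopword Removal:", " ".join(filtered_words))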
Let’s move forward towards Spelling correction!
c] Spell Checking And Correction:
Spell checking plays a crucial role in text cleaning by identifying and correcting spelling mistakes in written content. Inaccurate spelling can create confusion and negatively impact the credibility of the text. Automated spell-checking ensures that text is error-free and communicates the intended message effectively. In various fields, including communication, content creation, and data analysis, accurate spelling is essential for maintaining professionalism and ensuring accurate analysis.
Several techniques are employed for automatic spell correction. One common approach is using pre-built dictionaries or language models that contain a list of correctly spelled words.
from spellchecker import SpellChecker
# Create a SpellChecker object
spell = SpellChecker()
# Sample text with spelling errors
text = "I Havv a gret idea To improov the Efficiensy of our procEss."
# Tokenize the text
words = text.split()
# Find and correct misspelled words
corrected_words = [spell.correction(word) or word for word in words]  # Fall back to the original word if no correction is found
# Join corrected words to form corrected text
corrected_text = " ".join(corrected_words)
print("Original Text:", text)
print("Corrected Text:", corrected_text)
Output:
Original Text: I Havv a gret idea To improov the Efficiensy of our procEss.
Corrected Text: I have a great idea to improve the efficiency of our process.
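TextBlob is another option for quick spell correction. A minimal sketch, assuming the textblob package and its corpora are installed (via `python -m textblob.download_corpora`); its results may differ slightly from pyspellchecker's:
from textblob import TextBlob
# Sample text with spelling errors
text = "I Havv a gret idea To improov the Efficiensy of our procEss."
# correct() returns a new TextBlob containing TextBlob's best guess for each word
corrected_text = TextBlob(text).correct()
print("Corrected Text:", corrected_text)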
Now, we are going to understand how to handle numerical values in NLP:
d] Handling Numerical Values and Named Entities:
Numerical and date information are common types of data that appear in natural language texts. They can consist of important information such as quantities, measurements, prices, dates, times, durations, and so on.
For example, the number 1000 can be written as “one thousand”, “1,000”, “1K”, “10³”, or “千” in different contexts.
To overcome these challenges, NLP systems need strategies for encoding or normalizing numerical and date data. Encoding means transforming the data into a representation that is suitable for processing or analysis. Normalizing means converting the data into a standard or consistent form that is easier to compare or manipulate.
Strategies to overcome these challenges:
- Tokenization: Dividing text into smaller units like words, numbers, symbols, and punctuation. It aids in recognizing numerical and date data within text (as mentioned previously).
- Parsing: Analyzing text structure and meaning using grammar and logic. It clarifies the context of numerical and date data, resolving ambiguity.
- Conversion: Altering numerical and date formats for consistency. It standardizes information using a common system.
- Extraction: Identifying and isolating numerical and date data through patterns or rules. It captures relevant information for analysis or processing.
This code extracts numerical values using regular expressions and uses the dateutil library to extract dates:
import re
from dateutil.parser import parse
# Sample text containing numerical and date information
text = "Sales for the year 2023 reached $100,000 on 2023-08-31."
# Extract numerical information using regular expressions
numerical_entities = [int(match.replace(',', '')) for match in re.findall(r'\d[\d,]*', text)]  # Capture digit groups (allowing thousands separators) and convert to integers
# Extract date information using dateutil library
date_entities = parse(text, fuzzy_with_tokens=True)[0]
print("Numerical Entities:", numerical_entities)
print("Date Entities:", date_entities)
Output:
Numerical Entities: [2023, 100000, 2023, 8, 31]
Date Entities: 2023-08-31 00:00:00
In this way, we can extract numerical entities from our text.
e] Handling Contractions and Abbreviations:
When we shrink words, like “I’m” for “I am”, we get contractions; shortened forms like “NLP” for “Natural Language Processing” are abbreviations. Expanding them back is important for NLP systems: it cleans up the text, making it easier to understand and avoiding mix-ups or confusion. Expanding contractions and abbreviations can help to:
- Enhance consistency and NLP model performance.
- Reduce memory usage and computational load.
- Clarify ambiguity and preserve sentiment.
Various methods, including dictionaries, rules, and machine learning, can be used to expand contractions and abbreviations, each with its own strengths and considerations.
Approaches for Handling Contractions and Abbreviations:
- Dictionary-based
- Grammar-based
- Machine learning-based
Simple dictionary-based approach:
This code expands contractions and abbreviations.
# Sample text with contractions and abbreviations
text = "I'm happy to see you and can't wait to discuss about NLP."
# Contraction dictionary with meaning
contraction_dict = {
    "I'm": "I am",
    "can't": "cannot"
    # Add more contractions and their expansions as needed
}
# Function to expand contractions
def expand_contractions(text, contraction_dict):
    words = text.split()
    expanded_words = [contraction_dict.get(word, word) for word in words]
    expanded_text = " ".join(expanded_words)
    return expanded_text
expanded_text = expand_contractions(text, contraction_dict)
print("Original Text:", text)
print("Expanded Text:", expanded_text)
Output:
Original Text: I'm happy to see you and can't wait to discuss about NLP.
Expanded Text: I am happy to see you and cannot wait to discuss about NLP.
We can add or remove abbreviations and contractions as per our requirements.
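If you'd rather not maintain the dictionary yourself, the third-party contractions package is one possible shortcut (assuming it is installed via pip); it ships with a large built-in lookup of common English contractions:
import contractions
# Sample text with contractions
text = "I'm happy to see you and can't wait to discuss about NLP."
# contractions.fix() expands contractions using the package's built-in dictionary
expanded_text = contractions.fix(text)
print("Expanded Text:", expanded_text)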
f] Dealing with HTML Tags and Markup:
HTML tags are codes that format web pages and influence how content is displayed. They are written inside angle brackets (< and >) and usually come in pairs, such as <p> and </p> for a paragraph. In text analysis, they can introduce noise and alter structure, but they also offer semantic cues that help with tasks like summarization.
To deal with HTML tags in text data, we can use some techniques for stripping HTML tags and preserving text content. Stripping HTML tags means removing or replacing the HTML tags from the text data, while preserving text content means keeping or getting the plain text from the text data.
Techniques for managing HTML tags:
- Regular Expressions: Quickly remove HTML tags but can be inaccurate if the tags are complex.
- HTML Parser: Accurately understands tags but might be slower and more intricate.
- Web Scraper: Easily fetches plain text from web pages, but availability can be limited.
Here’s the code to accomplish this using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import re
# URL of the web page to extract text from
url = "https://www.example.com" # Replace with the actual URL
# Fetching HTML content from the web page
response = requests.get(url)
html_content = response.content
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')
# soup.get_text() already strips the HTML tags; the regex is a safeguard against any tag-like fragments left in the text
cleaned_text = re.sub(r'<.*?>', '', soup.get_text())
print("Original HTML:")
print(html_content)
print("\nCleaned Text (without HTML tags):")
print(cleaned_text)
To run the above code, you have to replace “https://www.example.com” with an actual link. With the help of BeautifulSoup, you will fetch the content from that website. After web scraping, any leftover HTML tags are removed using regular expressions.
Now, let’s move toward Text Preprocessing.
Text Preprocessing
In any NLP project, the initial task is text preprocessing. Preprocessing involves organizing input text into a consistent and analyzable format. This step is essential for creating a remarkable NLP application.
There are different ways to preprocess text:
- Tokenization
- Standardization
- Normalization
Out of these, one of the most important steps is tokenization. Tokenization involves dividing a sequence of text data into words, terms, sentences, symbols, or other meaningful components known as tokens. There are many open-source tools available to carry out the tokenization process.
Tokenization
Tokenization serves as the initial phase in any NLP process and significantly influences the entire pipeline. By employing a tokenizer, unstructured data and natural language text are segmented into manageable fragments. These fragments, known as tokens, can be treated as distinct components. In a document, the frequency of tokens can be harnessed to create a vector that represents the document.
This swift transformation converts a raw, unstructured string (text document) into a numerical structure suitable for machine learning. Tokens possess the potential to directly instruct computers to initiate valuable actions and responses. Alternatively, they can function as attributes in a machine learning sequence, sparking more intricate decisions or behaviors.
Tokenization involves the division of text into sentences, words, characters, or subwords. When we segment the text into sentences, it’s referred to as sentence tokenization. On the other hand, if we break it down into words, it’s known as word tokenization.
There are different types of tokenization. First, here are quick examples of sentence and word tokenization:
Example of Sentence Tokenization:
sent_tokenize(“My favorite movie is free guy”)
–> [“My favorite movie is free guy”]
Example of Word Tokenization:
word_tokenize(“Elon Musk is a Businessman”)
–> [“Elon”, “Musk”, “is”, “a”, “Businessman”]
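A quick, runnable sketch of both using NLTK's built-in tokenizers:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
text = "My favorite movie is Free Guy. Elon Musk is a Businessman."
print("Sentence Tokens:", sent_tokenize(text))
print("Word Tokens:", word_tokenize(text))
Output:
Sentence Tokens: ['My favorite movie is Free Guy.', 'Elon Musk is a Businessman.']
Word Tokens: ['My', 'favorite', 'movie', 'is', 'Free', 'Guy', '.', 'Elon', 'Musk', 'is', 'a', 'Businessman', '.']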
a] Whitespace Tokenization
The simplest form of tokenization involves splitting text wherever there’s a space or whitespace. It’s like cutting a sentence into pieces wherever there’s a gap. While this approach is straightforward, it might not handle punctuation or special cases effectively.
Example of WhiteSpace Tokenization:
“Natural language processing is amazing!”
–> [“Natural”, “language”, “processing”, “is”, “amazing!”].
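In Python, this is simply the built-in str.split():
text = "Natural language processing is amazing!"
# split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)
Output:
['Natural', 'language', 'processing', 'is', 'amazing!']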
b] Regular Expression Tokenization
Regular expression tokenization involves using patterns to define where to split text into tokens. This allows for more precise tokenization, handling punctuation, and special cases better than simple whitespace splitting.
Example of Regular Expression Tokenization:
“Email me at jack.sparrow@blackpearl.com.”
–> [“Email”, “me”, “at”, “jack”, “.”, “sparrow”, “@”, “blackpearl”, “.”, “com”, “.”]
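NLTK's RegexpTokenizer lets you supply the splitting pattern yourself. Here's a small sketch that keeps runs of word characters together and treats every other non-space character as its own token (this particular pattern is just one reasonable choice); it reproduces the token list shown above:
from nltk.tokenize import RegexpTokenizer
# \w+ matches runs of letters/digits/underscores; [^\w\s] matches any single symbol or punctuation mark
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]')
print(tokenizer.tokenize("Email me at jack.sparrow@blackpearl.com."))
Output:
['Email', 'me', 'at', 'jack', '.', 'sparrow', '@', 'blackpearl', '.', 'com', '.']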
c] Word and Subword Tokenization
This approach focuses on preserving punctuation marks as separate tokens. It’s particularly useful when maintaining the meaning of punctuation is crucial, such as in sentiment analysis.
Subword tokenization breaks words into smaller meaningful units, such as syllables or parts of words. It’s especially helpful for languages with complex word structures.
Example of Word Tokenization:
“Wow! This is incredible.”
–> [“Wow”, “!”, “This”, “is”, “incredible”, “.”]
Example of Subword Tokenization:
“unbelievable”
–> [“un”, “believable”]
d] Byte-Pair Encoding (BPE) Tokenization
Byte-Pair Encoding is a subword tokenization technique that divides words into smaller units based on their frequency of occurrence. This is particularly useful for handling rare or out-of-vocabulary words.
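As one concrete illustration, the pretrained GPT-2 tokenizer in the Hugging Face transformers library uses byte-level BPE. The sketch below assumes transformers is installed and the model files can be downloaded; the exact subword splits depend on the vocabulary GPT-2 learned, so treat the results as illustrative:
from transformers import GPT2Tokenizer
# Downloads the pretrained GPT-2 vocabulary and merge rules on first use
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Rare or long words are broken into smaller, frequently occurring subword units
print(tokenizer.tokenize("unbelievable"))
print(tokenizer.tokenize("tokenization"))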
e] Treebank Tokenization
Treebank tokenization employs predefined rules based on linguistic conventions to tokenize text. It considers factors like contractions and hyphenated words.
Example: “I can’t believe it’s August 23rd!”
–> [“I”, “ca”, “n’t”, “believe”, “it”, “’s”, “August”, “23rd”, “!”]
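NLTK implements these conventions in TreebankWordTokenizer; a minimal sketch that reproduces the split shown above:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I can't believe it's August 23rd!"))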
So, these are the types of Tokenization. Now, we will move toward Standardizing text.
Standardizing Text
Lowercasing and Capitalization
Standardizing text case means converting all the text to the same case, usually lowercase, to make it easier to process and analyze. This is useful in language analysis, where computers compare words and sentences; putting everything in one case is like tidying up the text so that computers can understand it better.
Doing this helps to make the text more consistent and removes confusion. For instance, if we count how often words appear, we want “Apple” and “apple” to be seen as the same thing. We don’t want the computer to treat them as different words just because of the capital letter. It’s like being fair to all the words!
But there are times when we don't follow this rule. Sometimes capital letters matter: certain names and acronyms must stay the same, like “UK” or “NASA”, and if someone writes in BIG letters, like “STOP!” or “SOS”, the meaning is different or they might be expressing strong emotion.
Another thing to know is that different writing styles exist. Sometimes, in computer code, words are joinedLikeThis or separated_by_underscores.
In short, standardizing text case is like giving text a nice haircut so computers can understand it. But remember, there are special cases when we break the rule for good reasons.
This code checks if two words, “Apple” and “apple,” are the same by converting words to lowercase. If they are, it’ll print they’re the same; otherwise, it prints they’re different.
# Standardizing text case
word1 = "Apple"
word2 = "apple"
# Convert words to lowercase
word1_lower = word1.lower()
word2_lower = word2.lower()
# Comparing standardized words
if word1_lower == word2_lower:
    print("The words are the same when case is ignored.")
else:
    print("The words are different even when case is ignored.")
Output:
The words are the same when case is ignored.
Let’s move toward normalization.
Normalization
Normalization is the process of converting tokens into their base form. In normalization, the inflection is removed from a word to obtain its base form.
The aim of normalization is to reduce variations in the text that don’t carry significant meaning but can affect the accuracy of NLP tasks. Different forms of normalization are used to address specific challenges in text processing.
For Examples,
am, are, is => be
cat, cats, cat’s, cats’ => cat
Let’s apply mapping to the below sentence:
All of the Don’s cats are different colors => All of the don cat be different color
There are two popular methods for normalization in NLP:
- Stemming
- Lemmatization
a] Stemming
Before diving into stemming, let’s get familiar with the term “stem.” Think of word stems as the basic form of a word. When we add extra parts to them, it’s called inflection, and that’s how we make new words.
Stem words are the words that remain after removing prefixes and suffixes. Stemming can produce words that are not in the dictionary or carry no meaning, which is why it is generally less accurate than lemmatization for many tasks.
Ex- “Quickly”
Stemmed word- “Quickli” (not a dictionary word)
Ex – “Frogs are dancing and dogs are singing.”
Stemmed Tokens: ['frog', 'are', 'danc', 'and', 'dog', 'are', 'sing']
Over-stemming: Too many characters are removed, so more than one unrelated word maps to the same stem and the original meanings are lost.
Ex- “Computer”, “Compute”, and “Computation” –> “Comput”
Under-stemming: Too few characters are removed, so related words fail to reduce to the same stem. Check this example:
Ex- “Jumping”, “Jumped”, and “Jumps” –> “Jumping”, “Jumped”, and “Jumps”.
Here the three words should have been stemmed to the base word ‘jump’, but the algorithm has failed to capture it.
Types of Stemmers in NLTK:
- PorterStemmer,
- LancasterStemmer,
- SnowballStemmer, etc.
We are using all three stemmers (PorterStemmer, LancasterStemmer, and SnowballStemmer) for stemming. The code breaks the sentence into individual words, applies each stemmer to every token, and prints the stemmed tokens for comparison.
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# Sample sentence
sentence = "frogs are dancing and dogs are singing."
tokens = word_tokenize(sentence)
# Stemmers
stemmers = [PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")]
for stemmer in stemmers:
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    print(f"{stemmer.__class__.__name__} Stemmed Tokens:", stemmed_tokens)
Output:
PorterStemmer Stemmed Tokens: ['frog', 'are', 'danc', 'and', 'dog', 'are', 'sing', '.']
LancasterStemmer Stemmed Tokens: ['frog', 'ar', 'dant', 'and', 'dog', 'ar', 'sing', '.']
SnowballStemmer Stemmed Tokens: ['frog', 'are', 'danc', 'and', 'dog', 'are', 'sing', '.']
There are many other libraries to do the same. Now, let’s try to understand what is lemmatization:
b] Lemmatization
Lemmatization is similar to stemming, but with an important difference: it converts words into meaningful base forms. Lemmatization obtains the base form of a word, with a proper meaning, according to vocabulary and grammatical relations.
That base form is called the lemma. For example, the lemma of “cats” is “cat”, and the lemma of “running” is “run”.
Lemmatization needs to know the correct part of speech and meaning of a word in the sentence, as well as the wider context around that sentence. Unlike simply cutting off word endings, lemmatization tries to choose the right base form depending on that context.
You can choose any of the below lemmatizers as per your need:
- Wordnet Lemmatizer
- Spacy Lemmatizer
- TextBlob
- CLiPS Pattern
- Stanford CoreNLP
- Gensim Lemmatizer
- TreeTagger
We are going to use the NLTK library to perform lemmatization with WordNetLemmatizer. The code first tokenizes the sentence, then lemmatizes each token to find a meaningful base word from the vocabulary, and finally joins the tokens back together to print the lemmatized sentence.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()
# Sample sentence
sentence = "boys are running and mosquitos are flying."
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Lemmatize the tokens
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
# Join the lemmatized tokens to form a sentence
lemmatized_sentence = " ".join(lemmatized_tokens)
print("Original Sentence:", sentence)
print("Lemmatized Sentence:", lemmatized_sentence)
Output:
Original Sentence: boys are running and mosquitos are flying.
Lemmatized Sentence: boy are running and mosquito are flying .
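One thing to note: by default, WordNetLemmatizer treats every token as a noun, which is why verbs such as “running” and “flying” stay unchanged above. Supplying a part-of-speech tag fixes this. Below is a minimal sketch using NLTK's POS tagger; the helper get_wordnet_pos is our own convenience function (not an NLTK API) that maps Penn Treebank tags to WordNet POS constants:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
lemmatizer = WordNetLemmatizer()
# Map Penn Treebank POS tags (from nltk.pos_tag) to WordNet POS constants
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN
sentence = "boys are running and mosquitos are flying."
tagged_tokens = nltk.pos_tag(word_tokenize(sentence))
lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]
print("POS-aware Lemmatized Sentence:", " ".join(lemmatized_tokens))
With verb tags supplied, “are”, “running”, and “flying” should now come out as “be”, “run”, and “fly”.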
Now, time to wrap up!
Conclusion
In this article, we’ve explored the basics of Natural Language Processing (NLP). We learned about NLP’s importance in various areas and began our journey by understanding how to clean and prepare text. We covered removing special characters, tokenizing text, and more. We also understood concepts like normalization, stemming, lemmatization, and handling numbers and dates. Additionally, we got a glimpse of dealing with HTML content.
However, our journey continues. There's a whole world of advanced text-cleaning methods to discover. We'll explore Part-of-Speech tagging, look at different tools and libraries, and work through exciting NLP projects. This article is just the start; be ready for Part 2, where more knowledge awaits you.
If you would like to learn Natural Language Processing, here are some of the best NLP courses.