Developing a Paraphrasing Tool Using NLP (Natural Language Processing) Model in Python

NLP Tutorial Using Python NLTK (Simple Examples)

This article explains natural language processing (NLP) in Python. The tutorial uses the Python NLTK library, a popular Python library for NLP.

So, what is NLP? And what are the benefits of learning NLP?

Natural language processing (NLP) is about developing applications and services that are able to understand human (natural) languages.

Outlined here are practical examples of natural language processing (NLP), such as speech recognition, speech translation, understanding complete sentences, understanding synonyms of matching words, and writing complete, grammatically correct sentences and paragraphs.

Benefits of NLP

Every day, millions of gigabytes of data are generated by blogs, social websites, and web pages.

Many software companies gather all of this data to better understand users and their interests, and to make changes accordingly.

This data could show that people in Brazil are happy with product A, while people in the US are happier with product B. With NLP, this knowledge can be surfaced instantly (i.e., as a real-time result). Search engines, for example, are a kind of NLP application that gives the appropriate results to the right people at the right time.

But search engines are not the only implementation of natural language processing (NLP). There are a lot of even more awesome implementations out there.

NLP Implementations

Some successful implementations of natural language processing (NLP) are:

Search engines like Google, Yahoo, etc. Google's search engine understands that you are interested in technology, so it shows you results related to that.

Social website feeds like the Facebook news feed. The news feed algorithm understands your interests using natural language processing and ranks related ads and posts higher than others.

Speech engines like Apple Siri.

Spam filters like Google spam filters. It's not just about the usual spam filtering; now, spam filters understand what is inside the email content and check whether it is spam.

NLP Libraries

Some open-source natural language processing (NLP) libraries include:

Natural language toolkit (NLTK)

Apache OpenNLP

Stanford NLP suite

Gate NLP library

Natural language toolkit (NLTK) is the most popular library for natural language processing (NLP). It is written in Python and has a big community behind it.

In this NLP tutorial, we will use the Python NLTK library.

Install NLTK

If you are using Windows, Linux, or Mac, you can install NLTK using pip: pip install nltk.

At the time of writing this post, NLTK supports Python 2.7, 3.4, and 3.5. Alternatively, you can install it from the source distribution.

To check whether NLTK installed correctly, open your Python terminal and type: import nltk. If no error appears, you've successfully installed the NLTK library.

Once you’ve installed NLTK, you should install the NLTK packages by running the following code:

import nltk
nltk.download()

This will show the NLTK downloader to choose what packages need to be installed.

You can install all of the packages, since they are small. Now, let's start the show!

OK! Now let's talk about text paraphrasing in Python.

Discussing the Steps, Tools, and Examples

The basic steps of text preprocessing are introduced below, together with text paraphrasing tools. These are the steps needed to transform text from human language into a machine-readable format for further processing.

After a text is obtained, we start with text normalization. Text normalization includes:

  • converting all letters to lower or upper case
  • converting numbers into words or removing numbers
  • removing punctuation, accent marks, and other diacritics
  • removing white spaces
  • expanding abbreviations
  • removing stop words, sparse terms, and particular words
  • text canonicalization

We will describe the text normalization steps in detail below.

Preprocessing is the text technology used in grammar checkers and paraphrasing tools. Let's work through the processes utilized in paraphrasing and grammar-checking tools.

Convert Text to Lowercase

Python code:

input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."

input_str = input_str.lower()

print(input_str)

Output:

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

Removing Numbers

When paraphrasing, it is important to remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

Python Code:

import re

input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'

result = re.sub(r'\d+', '', input_str)

print(result)

Output:

Box A contains  red and  white balls, while Box B contains  red and  blue balls.

(Note that removing a digit leaves its surrounding spaces behind; whitespace cleanup is covered below.)

Removing Punctuations

The following code removes this set of symbols (Python's string.punctuation): !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Punctuation removal

Python code:

import string

input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # sample string

result = input_str.translate(str.maketrans("", "", string.punctuation))  # Python 3; string.maketrans was Python 2

print(result)

Output:

This is an example of string with punctuation

Removing Whitespaces

To remove leading and ending spaces, you can use the strip() function:

White spaces removal

Python code:

input_str = " \t a string example\t "

input_str = input_str.strip()

input_str

Output:

'a string example'
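The cleaning steps shown so far can be combined into one helper function. Here is a minimal sketch using only the Python standard library (the function name normalize_text is just an illustration):

```python
import re
import string

def normalize_text(text):
    """Apply the basic normalization steps in one pass (illustrative only)."""
    text = text.lower()                                               # convert to lowercase
    text = re.sub(r'\d+', '', text)                                   # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = ' '.join(text.split())                                     # collapse and strip whitespace
    return text

print(normalize_text("  The 3 Boxes contain 12 RED balls!  "))
# the boxes contain red balls
```

Each line mirrors one of the normalization steps described above; expanding abbreviations and removing stop words need word lists, so they are left out of this sketch.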

Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens. In this table (“Tokenization” sheet) several tools for implementing tokenization are described.

Tokenize Text Using Pure Python

First, we will grab some web page content. Then, we will analyze the text to see what the page is about. We will use the urllib module to crawl the web page:

import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)

As you can see from the printed output, the result contains a lot of HTML tags that need to be cleaned. We can use BeautifulSoup to clean the grabbed text like this:

from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

Now, we have clean text from the crawled web page.

Finally, let’s convert that text into tokens by splitting the text like this:

from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
print(tokens)

Tokenize Text Using NLTK

We just saw how to split the text into tokens using the split function. Now, we will see how to tokenize the text using NLTK. Tokenizing text is important since text can’t be processed without tokenization. Tokenization process means splitting bigger parts to small parts.

You can tokenize paragraphs to sentences and tokenize sentences to words according to your needs. NLTK is shipped with a sentence tokenizer and a word tokenizer.

Let’s assume that we have a sample text like the following:

Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.

To tokenize this text to sentences, we will use sentence tokenizer:

from nltk.tokenize import sent_tokenize
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output is the following:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

You may say, "This is an easy job. I don't need to use NLTK tokenization; I can split sentences using regular expressions, since every sentence is preceded by punctuation and a space."

Well, take a look at the following text:

Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.

The word Mr. is a single token, even though it ends with a period, so a naive regular expression would split the sentence in the wrong place. OK, let's try NLTK:

from nltk.tokenize import sent_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output looks like this:

['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

Great! It works like a charm. Let’s try the word tokenizer to see how it will work:

from nltk.tokenize import word_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))

The output is:

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']

The word Mr. is kept as one token, as expected. NLTK uses the PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module. This tokenizer is trained to work well with many languages.

Tokenize Non-English Languages Text

To tokenize other languages, you can specify the language like this:

from nltk.tokenize import sent_tokenize
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))

The result will be like this:

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]

Removing Stop words

"Stop words" are the most common words in a language, such as "the", "a", "on", "is", and "all".

At this point, we start to see features unique to paraphrasing at work.

These words do not carry important meaning and are usually removed from texts. It is possible to remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

Stop Words Removal Process

Code:

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

input_str = "NLTK is a leading platform for building Python programs to work with human language data."

stop_words = set(stopwords.words('english'))

tokens = word_tokenize(input_str)

result = [i for i in tokens if i not in stop_words]

print(result)

Output:

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

The scikit-learn library also provides a stop words list:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

It’s also possible to use spaCy, a free open-source library:

from spacy.lang.en.stop_words import STOP_WORDS

Removing Sparse Terms and Words

In the case of paraphrasing tools, it is essential to remove sparse terms and words. This task can be done using stop words removal techniques considering that any group of words can be chosen as the stop words.
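Since any list of words can act as the stop list, the same removal technique handles sparse terms: count token frequencies and filter out the rare ones. A small sketch with a made-up token list and an arbitrary threshold of 2:

```python
from collections import Counter

tokens = ["nlp", "is", "fun", "nlp", "is", "useful", "zygote"]

freq = Counter(tokens)

# Treat tokens occurring fewer than 2 times as sparse terms
sparse_terms = {t for t, c in freq.items() if c < 2}

result = [t for t in tokens if t not in sparse_terms]

print(result)  # ['nlp', 'is', 'nlp', 'is']
```

The threshold and token list here are illustrative; in practice the cutoff depends on the size of your corpus.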

Stemming

Stemming is a process of reducing words to their word stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflectional endings from words [14]) and the Lancaster stemming algorithm (a more aggressive stemming algorithm). In the "Stemming" sheet of the table, some stemmers are described.

Stemming using NLTK:

Code:

from nltk.stem import PorterStemmer

from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

input_str = "There are several types of stemming algorithms."

input_str = word_tokenize(input_str)

for word in input_str:

    print(stemmer.stem(word))

Output:

there
are
sever
type
of
stem
algorithm
.

(Note that NLTK's Porter stemmer prints one stem per line and lowercases its input.)
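To compare the aggressiveness of the two algorithms mentioned above, you can run Porter and Lancaster side by side. A short sketch, assuming NLTK is installed (the word list is made up for illustration):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each word with its Porter and Lancaster stems for comparison
for word in ["stemming", "algorithms", "maximum", "crying"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```

Lancaster often produces shorter, less readable stems than Porter, which is why Porter is the more common default.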

Lemmatization

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.

Lemmatization tools are available in the libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

Lemmatization using NLTK:

Code:

from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

input_str = "been had done languages cities mice"

input_str = word_tokenize(input_str)

for word in input_str:

    print(lemmatizer.lemmatize(word))

Output:

been
had
done
language
city
mouse

Count Word Frequency

To create an effective paraphrasing tool, you also need to count word frequencies in a body of text. Let's calculate the frequency distribution of the tokens we crawled earlier using Python NLTK. NLTK has a function called FreqDist() that does the job:

from bs4 import BeautifulSoup
import urllib.request
import nltk
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

If you search the output, you’ll find that the most frequent token is PHP.

You can plot a graph for those tokens using the plot() function, like this: freq.plot(20, cumulative=False).

From the graph, you can be sure that this article is talking about PHP. However, the output also contains words like "the," "of," "a," and "an." These are stop words. Generally, stop words should be removed to prevent them from affecting the results.

Part of Speech Tagging (POS)

Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context, which is the foundation of paraphrasing.

There are many tools containing POS taggers including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

Part-of-speech tagging using TextBlob:

Code:

input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"

from textblob import TextBlob

result = TextBlob(input_str)

print(result.tags)

Output:

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

Get Synonyms from WordNet

Do I even need to state how important synonyms are to a paraphrasing tool? After all, the basic purpose of the tool is to replace words with similar words.

If you remember, we installed NLTK packages using nltk.download(). One of those packages was WordNet. WordNet is a database built for natural language processing that includes groups of synonyms and brief definitions.

You can get these definitions and examples for a given word like this:

from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

The result is:

a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']

WordNet includes a lot of definitions:

from nltk.corpus import wordnet
syn = wordnet.synsets("NLP")
print(syn[0].definition())
syn = wordnet.synsets("Python")
print(syn[0].definition())

The result is:

the branch of information science that deals with natural language information
large Old-World boas

You can use WordNet to get synonymous words like this:

from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

The output is:

['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']

Get Antonyms from WordNet

You can get the antonyms of words the same way. All you have to do is check each lemma before adding it to the list, to see whether it has an antonym.

from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

The output is:

['large', 'big', 'big']

This is the power of NLTK in natural language processing. 

Chunking (Shallow Parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

Chunking using NLTK:

The first step is to determine the part of speech for each word:

Code:

input_str = "A black television and a white stove were bought for the new apartment of John."

from textblob import TextBlob

result = TextBlob(input_str)

print(result.tags)

Output:

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is chunking:

Code:

import nltk

reg_exp = "NP: {<DT>?<JJ>*<NN>}"

rp = nltk.RegexpParser(reg_exp)

result = rp.parse(result.tags)

print(result)

Output:

(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

It's also possible to draw the sentence tree structure by calling result.draw().

Named Entity Recognition

This feature is essential for any well-built paraphrasing tool; imagine building a paraphrasing or grammar-checking tool that changes the names of people or places. As implied, named-entity recognition (NER) aims to find named entities in text and classify them into predefined categories (names of persons, locations, organizations, times, etc.).

Named-entity recognition tools: NLTK, spaCy, General Architecture for Text Engineering (GATE) — ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson Natural Language Understanding, TextRazor, FreeLing are described in the “NER” sheet of the table.

Named-entity recognition using NLTK:

Code:

from nltk import word_tokenize, pos_tag, ne_chunk

input_str = "Bill works for Apple, so he went to Boston for a conference."

print(ne_chunk(pos_tag(word_tokenize(input_str))))

Output:

(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN./.)
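One way a paraphrasing tool can use this output is to collect every word inside a named-entity subtree into a set of protected words that must never be replaced. A sketch of the idea, with the chunk tree built by hand so the example needs no trained models (the tree mirrors the output above):

```python
from nltk.tree import Tree

# A hand-built chunk tree mirroring the ne_chunk output above
tree = Tree('S', [
    Tree('PERSON', [('Bill', 'NNP')]),
    ('works', 'VBZ'), ('for', 'IN'), ('Apple', 'NNP'), ('so', 'IN'),
    ('he', 'PRP'), ('went', 'VBD'), ('to', 'TO'),
    Tree('GPE', [('Boston', 'NNP')]),
    ('for', 'IN'), ('a', 'DT'), ('conference', 'NN'), ('.', '.'),
])

# Collect every word inside a named-entity subtree; these are "protected"
protected = {word for subtree in tree
             if isinstance(subtree, Tree)
             for word, tag in subtree.leaves()}

print(protected)  # e.g. {'Bill', 'Boston'}
```

In a real tool you would pass the output of ne_chunk() directly instead of a hand-built tree; the walking logic stays the same.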

Coreference Resolution (Anaphora Resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity.

For example, in the sentence "Andrew said he would buy a car," the pronoun "he" refers to the same person, namely "Andrew."

Coreference resolution tools: Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP are described in the “Coreference resolution” sheet of the table.

Collocation Extraction

This is something you add to give your paraphrasing tool some style. Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.

Collocation Extraction using ICE [51]

Code:

input = ["he and Chazz duel with all keys on the line."]

from ICE import CollocationExtractor

extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)

print(extractor.get_collocations_of_length(input, length=3))

Output:

["on the line"]
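NLTK also ships its own collocation finders, which can be used without the ICE library. A small sketch that ranks bigrams by frequency over a made-up token list built from the collocation examples above:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token stream built from the collocation examples above
tokens = ("keep in mind free time keep in mind draw a conclusion "
          "free time keep in mind").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Rank bigrams by how often they occur together
print(finder.nbest(measures.raw_freq, 3))
```

On real corpora, an association measure such as PMI (measures.pmi) usually gives better collocations than raw frequency, which favors common word pairs.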

Relationship Extraction

Relationship extraction allows obtaining structured information from unstructured sources such as raw text. Strictly speaking, it identifies relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence "Matthew and Emily married yesterday," we can extract the information that Matthew is Emily's husband.
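Full relationship extraction requires trained models, but the idea can be illustrated with a toy pattern over the marriage example. The regular expression and the spouse_of label here are illustrative assumptions, not a real extraction method:

```python
import re

def extract_marriage(sentence):
    """Toy pattern: '<X> and <Y> married' -> ('spouse_of', X, Y)."""
    m = re.search(r'([A-Z][a-z]+) and ([A-Z][a-z]+) married', sentence)
    if m:
        return ('spouse_of', m.group(1), m.group(2))
    return None

print(extract_marriage("Matthew and Emily married yesterday."))
# ('spouse_of', 'Matthew', 'Emily')
```

Real systems replace the hand-written pattern with NER plus a learned relation classifier, but the output shape (a labeled tuple of entities) is the same.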

In Conclusion

In this post, text preprocessing was described, along with its main steps: normalization, tokenization, stemming, lemmatization, chunking, part-of-speech tagging, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. Text preprocessing tools and examples were also discussed. Once text preprocessing is done, the result may be used for more complicated NLP tasks, such as machine translation or natural language generation.