15 Mar 2019

Natural Language Processing (ML)

What is Natural Language Processing?

Natural Language Processing (NLP) refers to the ability of machines to understand and interpret the way humans write and speak.

It is widely used in information retrieval, sentiment analysis, information extraction, machine translation, and more.

Operations of NLP include:


Tokenization

Tokenization is the process of splitting a sentence or phrase into substrings such as individual words and punctuation marks.

Code Example:

from nltk.tokenize import word_tokenize, sent_tokenize

string = "The science of today is the technology of tomorrow."

# Split the sentence into words
tokenized_words = word_tokenize(string)
print("Word Tokenization: ", tokenized_words)

Output:

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow', '.']
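
The sent_tokenize import shown above works the same way at the sentence level; a minimal sketch (the two-sentence string here is made up for illustration):

from nltk.tokenize import sent_tokenize

text = "The science of today is the technology of tomorrow. Science never stops."

# Split the text into sentences
tokenized_sentences = sent_tokenize(text)
print("Sentence Tokenization: ", tokenized_sentences)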


Stopword Removal

Stopword removal involves filtering out words that do not carry significant information. These are mostly connective words such as "as", "the", "be", and "are".

Code Example:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

string = "The science of today is the technology of tomorrow."
tokenized_words = word_tokenize(string)

# Load stop words
stop_words = stopwords.words('english')

# Remove stop words
nonstop_words = []
for word in tokenized_words:
    if word not in stop_words:
        nonstop_words.append(word)
print('Nonstop Words: ', nonstop_words)

Output:

['The', 'science', 'today', 'technology', 'tomorrow', '.']

Stemming

Stemming reduces inflected words to their root forms by stripping suffixes or prefixes. As a result, stemming a word or sentence may produce tokens that are not actual words.

Code Example:

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

string = "The science of today is the technology of tomorrow."
tokenized_words = word_tokenize(string)

# Create stemmer
porter = PorterStemmer()

# Apply stemmer to each word
print([porter.stem(word) for word in tokenized_words])

Output:

['the', 'scienc', 'of', 'today', 'is', 'the', 'technolog', 'of', 'tomorrow', '.']

Lemmatization

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. For example, runs, running, and ran are all forms of the word run, so run is the lemma (or dictionary form) of all these words.

Code Example:

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

string = "The sciences of today are tomorrow's technologies."
tokenized_words = word_tokenize(string)

# Create lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatizer to each word
print([lemmatizer.lemmatize(word) for word in tokenized_words])

Output:

['The', 'science', 'of', 'today', 'are', 'tomorrow', "'s", 'technology', '.']

Part-of-Speech Tagging (POS-tagging)

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging (POS-tagging). Building on POS tags, chunking is a set of techniques for entity detection in text processing that segments and labels multi-token sequences. For example, noun phrase chunking (NP-chunking) searches for chunks corresponding to individual noun phrases; a sketch of this follows the tagging example below.

Code Example:

from nltk import pos_tag
from nltk import word_tokenize

string = "The science of today is the technology of tomorrow."

# Apply speech tagging
text_tagged = pos_tag(word_tokenize(string))

print(text_tagged)

Output:

[('The', 'DT'), ('science', 'NN'), ('of', 'IN'), ('today', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('technology', 'NN'), ('of', 'IN'), ('tomorrow', 'NN'), ('.', '.')]
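
Chunking itself can be sketched with NLTK's RegexpParser on top of these tags. The grammar below is a minimal illustrative noun-phrase pattern (an optional determiner, any number of adjectives, then a noun), not part of the original example:

from nltk import pos_tag, word_tokenize, RegexpParser

string = "The science of today is the technology of tomorrow."

# A simple NP grammar: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)

# Group the POS-tagged tokens into noun-phrase chunks
tree = chunk_parser.parse(pos_tag(word_tokenize(string)))
print(tree)

With this grammar, phrases such as "The science" and "the technology" should come out as NP chunks.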

Information Retrieval

Information retrieval is the process of extracting relevant information from a source.

One of the main goals of text analysis is to convert text into numeric form so that we can use machine learning on it.


Bag of Words

One of the primary methods of building simple language models is the “Bag of Words” (BOW) model, which extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix.

Using the BOW approach, every document in a corpus of texts is represented as a vector whose length is equal to the size of the corpus vocabulary. The computation is simplified by sorting the token positions of the vector into alphabetical order.
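
As a minimal sketch of this idea (the two-sentence corpus below is made up for illustration), scikit-learn's CountVectorizer builds such a document-term matrix with one row per document and one column per vocabulary word:

from sklearn.feature_extraction.text import CountVectorizer

# A tiny illustrative corpus of two documents
corpus = ["The science of today is the technology of tomorrow.",
          "Science is the study of the natural world."]

vectorizer = CountVectorizer()
document_term_matrix = vectorizer.fit_transform(corpus)

# Vocabulary is sorted alphabetically; each row counts one document's words
print(vectorizer.get_feature_names())
print(document_term_matrix.toarray())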


Vector Encoding Methods

Frequency Vectors

The simplest vector encoding model fills in the vector with the frequency of each word as it appears in the document.

Code Example:

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

string = "The science of today is the technology of tomorrow."

# Each token is passed in as its own "document" here,
# so the matrix below has one row per token
vectorizer = CountVectorizer()
document_term_matrix = vectorizer.fit_transform(word_tokenize(string))

bag_of_words = np.array(vectorizer.get_feature_names())
print("\nBagOfWords:")
print(bag_of_words)
print("\nDocument term matrix:")
print(document_term_matrix.toarray())

Output:

BagOfWords:
['is' 'of' 'science' 'technology' 'the' 'today' 'tomorrow']

Document term matrix:
[[0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0]]

One-Hot Encoding

One-hot encoding is another vector encoding method that marks a particular vector index with a value of true (1) if the token exists in the document and false (0) if it does not. This method is effective for very small documents (sentences, tweets) that do not contain many repeated elements.

Code Example:

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer
import numpy as np


string = "The science of today is the technology of tomorrow."
tokenized_word = word_tokenize(string)

vectorizer = CountVectorizer()
document_term_matrix = vectorizer.fit_transform(tokenized_word)
onehot = Binarizer()
document_term_matrix = onehot.fit_transform(document_term_matrix.toarray())

bag_of_words = np.array(vectorizer.get_feature_names())
print("\nBagOfWords:")
print(bag_of_words)
print("\nDocument term matrix:")
print(document_term_matrix)

Output:

BagOfWords:
['is' 'of' 'science' 'technology' 'the' 'today' 'tomorrow']

Document term matrix:
[[0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0]]

Term Frequency–Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) encoding normalizes the frequency of tokens in a document with respect to the rest of the corpus. This encoding approach emphasizes terms that are more relevant to a particular document.

Code Example:

from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


string = "The science of today is the technology of tomorrow."
tokenized_word = word_tokenize(string)

tfidf = TfidfVectorizer()
document_term_matrix = tfidf.fit_transform(tokenized_word)


bag_of_words = np.array(tfidf.get_feature_names())
print("\nBagOfWords:")
print(bag_of_words)
print("\nDocument term matrix:")
print(document_term_matrix.toarray())

Output:

BagOfWords:
['is' 'of' 'science' 'technology' 'the' 'today' 'tomorrow']

Document term matrix:
[[0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0.]]
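
Because the example above feeds each token in as its own one-word document, every non-zero weight is simply 1. A rough sketch on a small multi-document corpus (the sentences are made up for illustration) shows how term weights are normalized against the rest of the corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus of three documents
corpus = ["The science of today is the technology of tomorrow.",
          "Science is the study of the natural world.",
          "Technology changes the world."]

tfidf = TfidfVectorizer()
document_term_matrix = tfidf.fit_transform(corpus)

# Each row is the tf-idf vector of one document; terms that appear in
# every document (such as 'the') get a lower idf than terms unique to one
print(tfidf.get_feature_names())
print(document_term_matrix.toarray().round(2))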

Word Embeddings

Word embeddings represent words in the form of vectors such that words with similar meanings are close to each other, while dissimilar words end up farther apart in the vector space.

Word2vec by Google is one of several existing models for constructing word-embedding representations, trained to predict a target word from the context of neighboring words. This method is referred to as Continuous Bag Of Words (CBOW).

Going in the other direction, given a set of sentences (a corpus), Word2vec analyzes the words of each sentence and uses the current word to predict its neighbors, a method called Skip-Gram.

Code Example:

from gensim.models import Word2Vec

# A toy corpus of tokenized sentences
sentences = [['data', 'science'],
             ['science', 'data', 'analytics'],
             ['machine', 'learning'],
             ['Woodbury', 'computer', 'science'],
             ['deep', 'learning']]

# Train the model on the corpus (min_count=1 keeps every word)
model = Word2Vec(sentences, min_count=1)
print(model.wv.similarity('data', 'science'))

Output:

-0.09619948
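
As a side note, gensim's Word2Vec trains with the CBOW method by default; passing sg=1 switches training to Skip-Gram. A rough sketch with the same toy corpus:

from gensim.models import Word2Vec

sentences = [['data', 'science'],
             ['science', 'data', 'analytics'],
             ['machine', 'learning'],
             ['Woodbury', 'computer', 'science'],
             ['deep', 'learning']]

# sg=0 (default) uses CBOW; sg=1 uses Skip-Gram
skipgram_model = Word2Vec(sentences, min_count=1, sg=1)
print(skipgram_model.wv.similarity('data', 'science'))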

End –Cheng Gu

