NLP: Feature Engineering

Here we are going to look at a few ways to engineer and visualize your text data. I like to get straight to the point, so apologies if I’m moving too fast. Feel free to contact me with any questions. We are going to look at removing stopwords, counting word frequencies, creating n-grams, and drawing wordclouds. We will be working with tweets from https://data.world/crowdflower/brands-and-product-emotions. So let’s get started.

raw data

Before starting I did a bit of cleaning using regular expressions to get rid of URLs, punctuation, hashtags, and any other weird text. Depending on your business case, it’s up to you to figure out what needs to go. I stored all the clean data in a column called ‘clean_tweet’. You can see the difference below.
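The exact patterns are not shown here, so the following is only a minimal sketch of this kind of cleaning. The function name, the regular expressions, and the choice to drop whole hashtags and @mentions are my assumptions for illustration, not the rules used for this dataset.

```python
import re

# Hypothetical patterns; adjust them to your own data and business case.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # also strips digits and apostrophes


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(clean_tweet("@mention Loving the new #iPad at the Apple store!! http://t.co/abc123"))
# prints: loving the new at the apple store
```

With pandas, something like df["clean_tweet"] = df["tweet_text"].apply(clean_tweet) would fill the column, where "tweet_text" is a placeholder for whatever your raw text column is called. If you would rather keep the hashtag word and only drop the "#", remove the HASHTAG_RE step and let the punctuation pattern handle it.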

Stopwords:

Looking at text data, it’s impossible to create a grammatically structured sentence without stopwords like ‘the’, ‘is’, ‘at’, ‘a’, or ‘I’. However, when you are cleaning your data for modeling, these types of words provide little to no value. So we have to get these words out of our data. Make sure you install NLTK and download its stopwords corpus to get the full list.

Here you can see the list of stopwords

This makes it easy to find the stopwords in the text and remove them. After we tokenize our data with the word_tokenize() function, we can just apply a lambda function to the tokens to keep all the words that are not in the stopword list. Easy peasy.
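Here is a rough sketch of that step. The helper names are mine, and the demo at the bottom uses a tiny hand-written stopword set so it runs without downloading anything; in practice you would pass in the full NLTK list returned by the first function.

```python
def load_english_stopwords():
    """Return NLTK's English stopword list as a set."""
    # Imported here so the filtering helper below works without NLTK installed.
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time corpus download
    return set(stopwords.words("english"))


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in stop_words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Offline demo with a small hand-written subset of stopwords.
demo_stop_words = {"the", "is", "at", "a", "i"}
tokens = ["i", "love", "the", "line", "at", "the", "apple", "store"]
print(remove_stopwords(tokens, demo_stop_words))
# prints: ['love', 'line', 'apple', 'store']
```

On a DataFrame this could look like df["clean_tweet"].apply(word_tokenize).apply(lambda toks: remove_stopwords(toks, stop_words)), with stop_words = load_english_stopwords(). Note that word_tokenize needs its own tokenizer data from nltk.download (the "punkt" package, which I believe is named "punkt_tab" in recent NLTK releases). Using a set rather than a list keeps the membership check fast on a large corpus.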

Remove stopwords from text

Counter

Moving on, after tokenizing and removing stopwords, we look at the most frequent words in the data. There are many ways to do this, but let’s focus on one. Counter is part of Python’s collections module. It stores each element as a dictionary key and its count as the dictionary value. It has a handy method, most_common(n), that returns a list of the n most common elements (words) along with their counts, as seen below:
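As a small sketch of the idea, here is Counter on three made-up, already cleaned tweets. The sample text is invented for illustration, and str.split() stands in for word_tokenize() so the snippet runs with the standard library alone.

```python
from collections import Counter

# Made-up sample data; in practice this would be the cleaned tweet column.
tweets = [
    "apple store line huge",
    "new apple store popup downtown",
    "google launching new social network",
]

# Flatten every tweet into one long list of tokens.
tokens = [word for tweet in tweets for word in tweet.split()]

word_counts = Counter(tokens)

# most_common(n) returns (word, count) pairs, highest counts first.
# Here 'apple', 'store' and 'new' each appear twice.
print(word_counts.most_common(3))
```

Because Counter is a dictionary subclass, word_counts["apple"] gives you a single count directly, and a word that never appeared returns 0 instead of raising an error.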

turn text to a list, tokenize, and feed into Counter()

Ngrams

This one is just as easy. NLTK is a library full of helpful resources for dealing with text data. Collocations are two or more words that tend to appear together frequently. We can find bigrams and trigrams with NLTK’s ‘BigramCollocationFinder’ and ‘TrigramCollocationFinder’. Then we use the matching ‘BigramAssocMeasures’ or ‘TrigramAssocMeasures’ to score how strongly those words are associated. See below:
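A minimal sketch of the bigram version, run on a handful of made-up tokens rather than the real tweets. The function name and its parameters are my own, and I score by raw frequency here, which may not be the measure used for the results discussed next.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def top_bigrams(tokens, n=10, min_freq=1):
    """Return the n highest-scoring (bigram, score) pairs by raw frequency."""
    finder = BigramCollocationFinder.from_words(tokens)
    # Ignore bigrams that occur fewer than min_freq times.
    finder.apply_freq_filter(min_freq)
    measures = BigramAssocMeasures()
    # score_ngrams returns pairs sorted from highest to lowest score.
    return finder.score_ngrams(measures.raw_freq)[:n]


# Made-up tokens for illustration only.
tokens = (
    "apple store line apple store popup "
    "social network apple store social network"
).split()

for bigram, score in top_bigrams(tokens, n=2):
    print(bigram, round(score, 3))
```

Swapping measures.raw_freq for measures.pmi ranks pairs by pointwise mutual information instead. PMI tends to favor rare pairs, so it usually helps to raise min_freq when you use it.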

As you can see, the first couple make sense. Since we are dealing with Apple and Google brands and products, ‘apple store’ shows up again and again. The same goes for ‘social network’, seeing that’s where we got our data. You can use the same code for trigrams by replacing the bigram classes with their trigram counterparts.

WordClouds

Working with Natural Language Processing can be a bit of a hassle, but it can be fun visually. My favorite way to visualize text is using WordClouds. It’s a technique that displays the most frequent words in the text. So let’s get started.

The first thing you want to do is make sure you install wordcloud into your notebook with pip install wordcloud. Once that is done, import WordCloud from the wordcloud library. Personally, I like to clean and tokenize my data to get rid of stopwords and crazy text that’s irrelevant. Again, it depends on your business problem. After cleaning, I use the ‘Counter()’ from above to get a list of the most common words and plug them into my wordcloud like below:
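Here is a rough sketch of that workflow. The helper names, image size, and colors are my own choices, and the drawing call is left commented out so the counting part runs even where wordcloud and matplotlib are not installed.

```python
from collections import Counter


def top_word_frequencies(tokens, n=100):
    """Return the n most common tokens as a {word: count} dictionary."""
    return dict(Counter(tokens).most_common(n))


def draw_wordcloud(frequencies, width=800, height=400):
    """Render a word cloud from a {word: count} dictionary."""
    # Imported here so the counting helper above works without these packages.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=width, height=height, background_color="white")
    cloud.generate_from_frequencies(frequencies)

    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")  # hide the axes, they mean nothing for a word cloud
    plt.show()


# Made-up tokens for illustration only.
tokens = "apple store ipad apple google store apple launch".split()
frequencies = top_word_frequencies(tokens, n=2)
print(frequencies)
# prints: {'apple': 3, 'store': 2}

# draw_wordcloud(frequencies)  # uncomment in a notebook to see the image
```

Feeding the counts in through generate_from_frequencies means the word sizes reflect exactly the numbers you computed, rather than letting WordCloud re-tokenize raw text with its own defaults.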

As you can see, the bigger and bolder the word, the more frequent it is. Just like plotting, you can change the look or size. All up to you. You can even go here https://amueller.github.io/word_cloud/ for some fancier-looking wordclouds, if you dare.

So there you go. Another quick encounter with data science. This is only a smidge of what you can accomplish with these libraries. So go out there and science the heck out of the data files, and don’t forget to share your discoveries. It’s much more fun when you have someone to share something with.
