Here we are going to look at a few ways to engineer and visualize your data. I like to get straight to the point, so apologies if I’m moving too fast. Feel free to contact me with any questions. We are going to look at removing stopwords, counting word frequencies, creating n-grams, and using word clouds. We will be working with tweets from … So let’s get started.
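To make the stopword, word-frequency, and n-gram steps concrete, here is a minimal sketch in plain Python. The sample tweets and the tiny stopword list are made up for illustration, and the helper names (`tokenize`, `remove_stopwords`, `ngrams`) are hypothetical, not taken from the original article.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real projects use a fuller one.
STOPWORDS = {"the", "a", "is", "and", "to", "of", "in", "it", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up example tweets.
tweets = [
    "The data is in and the data is clean",
    "This is a clean start to the week",
]

word_counts = Counter()
bigram_counts = Counter()
for tweet in tweets:
    tokens = remove_stopwords(tokenize(tweet))
    word_counts.update(tokens)
    # Build bigrams per tweet so pairs never span two different tweets.
    bigram_counts.update(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

The frequency table in `word_counts` is also the kind of input a word cloud is typically drawn from: the more often a word appears, the larger it is rendered.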

raw data

Before starting, I did a bit of cleaning using regular expressions to get rid of URLs, punctuation, hashtags, and any other stray text. Depending on your business case it’s up to you…
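As a rough idea of what that kind of regex cleaning can look like, here is a small sketch. The exact patterns are assumptions rather than the ones used in the original article, and `clean_tweet` is a hypothetical helper name.

```python
import re


def clean_tweet(text):
    """Strip URLs, hashtags, and punctuation, then collapse extra whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs and links
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation (splits "don't" into "don t")
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("Loving the new release!!! https://example.com #launch"))
# -> Loving the new release
```

Whether you drop hashtags entirely or keep the word after the `#` depends on your use case, so treat the second pattern as optional.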

Starting off, this is my first experience with ternary classification, and I have to say that I didn’t like it. I wasn’t sure how to tackle it, especially having only been taught how to use a confusion matrix on binary classification. Of course, to practice this I had to choose a large dataset, so we’re going to be looking at the Tanzania Water Pump data, which has about 59k data points. So let’s go.

What is a Confusion Matrix?

After cleaning and preprocessing your data, you’re going to want to feed it to a model and get your probabilities. Before that we have…
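For a three-class problem the confusion matrix simply grows from 2×2 to 3×3. Here is a minimal sketch in plain Python; the class names and the toy predictions are assumptions for illustration, not results from the water pump data.

```python
from collections import Counter

# Class names assumed for illustration.
LABELS = ["functional", "needs repair", "non functional"]


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are actual classes and columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels] for actual in labels]


def per_class_recall(matrix):
    """For each class, divide the diagonal entry by its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels, made up for this example.
y_true = ["functional", "functional", "needs repair",
          "non functional", "non functional", "functional"]
y_pred = ["functional", "non functional", "needs repair",
          "non functional", "functional", "functional"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
for label, row in zip(LABELS, matrix):
    print(f"{label:>15}: {row}")
print(per_class_recall(matrix))
```

The diagonal holds the correct predictions for each class, and everything off the diagonal tells you which class a mistake was confused with, which is the part a single accuracy number hides.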

When trying to build a regression model with multiple predictors, you may start to notice that some don’t belong with the others. These are categorical variables, which you have to deal with before modeling. So what would a categorical variable look like? Glad I asked.

So let’s look at the King County House Prices data below. Glancing at the data and where it comes from, price is the dependent variable, so we want to know how the other columns (predictors) in the dataset affect price.

*doesn’t show all columns
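One common way to deal with a categorical variable before regression is one-hot (dummy) encoding. The sketch below uses plain Python with made-up rows; the `condition` column with text values and the `one_hot` helper are hypothetical, chosen only to show the idea.

```python
def one_hot(rows, column, drop_first=True):
    """Replace a categorical column with 0/1 indicator columns.

    Dropping the first category avoids perfectly collinear dummies
    in a regression that also has an intercept.
    """
    categories = sorted({row[column] for row in rows})
    kept = categories[1:] if drop_first else categories
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in kept:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


# Made-up rows, not taken from the real dataset.
houses = [
    {"price": 300000, "sqft_living": 1200, "condition": "average"},
    {"price": 450000, "sqft_living": 2500, "condition": "good"},
    {"price": 250000, "sqft_living": 800, "condition": "fair"},
]

encoded = one_hot(houses, "condition")
for row in encoded:
    print(row)
```

With `drop_first=True`, the dropped category ("average" here, the first alphabetically) becomes the baseline that the other indicator coefficients are compared against.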

Data Cleaning

I believe that data cleaning is an essential part of being a data scientist. One of the challenges I’ve faced is dealing with unnecessary data: duplicates, columns not needed for analysis, and datatype issues. Let’s look at how to handle some of them.

Unnecessary columns:

Working with movie data, I combined three datasets that together had everything I needed, plus a few extras. Here’s a look at the original data frame versus the new data frame:

Before the cleaning
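The three cleaning steps mentioned above (unneeded columns, duplicates, datatype issues) can be sketched in a few lines. This version uses plain Python so it runs without extra libraries; the column names, the sample rows, and the `clean_records` helper are all hypothetical.

```python
def clean_records(records, keep_columns, numeric_columns):
    """Keep only the wanted columns, convert numeric text to floats, and drop duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Drop columns that are not needed for the analysis.
        row = {col: record[col] for col in keep_columns}
        # Fix datatype issues: "$1,000,000" becomes 1000000.0.
        for col in numeric_columns:
            row[col] = float(str(row[col]).replace("$", "").replace(",", ""))
        # Skip rows that repeat an earlier row exactly.
        key = tuple(row[col] for col in keep_columns)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


# Made-up rows for illustration.
movies = [
    {"title": "Movie A", "budget": "$1,000,000", "homepage": "n/a"},
    {"title": "Movie A", "budget": "$1,000,000", "homepage": "n/a"},
    {"title": "Movie B", "budget": "2500000", "homepage": "n/a"},
]

cleaned = clean_records(movies, keep_columns=["title", "budget"], numeric_columns=["budget"])
for row in cleaned:
    print(row)
```

If you are working in pandas, the same steps map roughly onto `DataFrame.drop`, `DataFrame.drop_duplicates`, and `Series.astype`.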

Chaquayla Halmon

Going after that Data
