Categorical Data, Deal With It!

Chaquayla Halmon
4 min read · Jan 25, 2021

When you're building a regression model with multiple predictors, you'll start to notice that some don't belong with the others. These are categorical variables, and you have to deal with them before modeling. Well, what would a categorical variable look like? Glad I asked.

So let's look at the data from King County House Prices below. By glancing at the data and where it comes from, price is the dependent variable. So we want to know how the other columns (the predictors) in the dataset affect price.

Dependent (‘price’) and its independents *doesn’t show all columns
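If you want to follow along, loading the data might look something like this. The file name is an assumption on my part; the Kaggle copy usually ships as kc_house_data.csv:

```python
import pandas as pd

# Load the King County house sales data -- the file name here is an
# assumption; adjust it to wherever your copy lives
df = pd.read_csv('kc_house_data.csv')

# Peek at the dependent variable ('price') and its predictors
df.head()
```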

First things first: make sure all the data is cleaned and ready to go. Next we have to determine what's categorical. Usually the dtype would give it away by saying 'object', but this data wanted to go the hard way. LET'S GO!
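Here's that first check; the whole point is that it won't be much help with this dataset, since the sneaky columns come through as plain numbers:

```python
# If the categorical columns were typed as 'object', this would flag
# them for us. Here they hide behind int64/float64 dtypes instead.
df.dtypes
```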

Here are some ways to tell if your column is categorical. You can take a look at the unique values. For example, let’s look at ‘grade’.
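Something like this would do it (the exact values depend on your copy of the data; in this one they run 3 through 13):

```python
# A short, tidy list of integers is a strong hint that we're looking
# at categories rather than a true numeric measurement
sorted(df['grade'].unique())
# -> [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```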

The values go from 3 to 13, so grade would be considered categorical. Not specifically because of the numbers, but because of what they represent. Categorical variables represent, well, categories instead of numerical features. See, that simple. With ‘grade’, the number represents the quality of materials used to build said house, where 3 is an inferior structure and 13 is a custom-designed beauty.

Another great way to sniff out those sneaky variables is my crazy pal: the scatterplot. Certain categorical data produces a signature look. You already know ‘grade’ is categorical, so look for similarities and see if you can point out the others. Let's see below:
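Here's a rough sketch of how a wall of plots like this could be generated. The grid size is an assumption and depends on how many columns survived your cleaning:

```python
import matplotlib.pyplot as plt

# Plot price against every predictor. Categorical columns give
# themselves away: the points stack up in a few vertical bands.
predictors = df.drop(columns='price')

fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(16, 18))
for ax, col in zip(axes.flatten(), predictors.columns):
    ax.scatter(df[col], df['price'], alpha=0.2)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```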

Phew! That's a lot to throw at you. Sorry, not sorry. We're here to learn. So what do you think? What's categorical? Grade? Of course, that's what we started with. Looking at grade, ‘condition’ just to the left looks similar, and so do ‘has_basement’, ‘month’, ‘view’, ‘floors’… you get the gist. Now some data, like zip codes, takes numeric values but is not quantitative, so it's considered categorical too. Think of it this way: the sum of two zip codes is not meaningful.

Now after sussing out the cats (short for categoricals), we need to prepare them for use in the regression models. There are many ways, but let's focus on my favorite: dummy variables. When transforming cats we're going to use one-hot encoding: convert each category into a new column, then assign a 1 or 0 in that column. Let's look at ‘condition’:

First, let's just focus on condition as its own DataFrame, then change the data type to categorical. Good job! The values range from 1 to 5, and we're going to make each one of those its own column. If the value is True for that row you'll see a 1, and a 0 for False.
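In pandas that whole dance might look something like this (pd.get_dummies does the one-hot encoding; passing dtype=int is just to get 1s and 0s instead of True/False):

```python
# Pull 'condition' out as its own DataFrame and mark it as categorical
condition = df[['condition']].astype('category')

# One-hot encode: each value 1-5 becomes its own 0/1 column
condition_dummies = pd.get_dummies(condition, columns=['condition'],
                                   dtype=int)
condition_dummies.head()
```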

TADA!

Now we have to worry about the dummy variable trap. Still a little confusing to me, but here we go. When creating dummy variables, we make it possible to predict any single column using the info in the others. If we look at row 1, we can predict column 1 by subtracting the sum of the other 4 columns from 1. Hopefully this sounds familiar: it's multicollinearity. Everyone knows that can be a problem for regression models. We don't really want that. So to avoid it, we have to drop one of the dummy variables, usually the first one.
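In pandas that's a one-flag fix:

```python
# drop_first=True throws out 'condition_1', so the remaining columns
# can no longer be perfectly reconstructed from each other -- trap avoided
condition_dummies = pd.get_dummies(condition, columns=['condition'],
                                   drop_first=True, dtype=int)
condition_dummies.head()
```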

So now you can't use your superpowers of prediction, because there's no longer enough information. Column 1 has been eliminated, and that's about all of the violence I can take right now. Hopefully you learned something new today. Remember to always look out for the categorical data. Don't be a… dummy. Had to!
