Ames Housing Prices Reconsidered Part 1: Simple Encoding

The Ames, Iowa housing prices dataset (Ames data) offers a great opportunity to explore the tools available to us as data scientists. We’re going to explore the encoding of categorical data and, later, how we can work with all of those categories using dimension reduction, specifically Principal Component Analysis. Our target is the sale price of a home, and the other features describe the characteristics, location, and condition of the home. Before modeling our data and making predictions, we need to process our categorical features (variables).

Categorical variables can be of two types:
- Nominal: Categories that distinguish observations but have no inherent order.
- Ordinal: Categories that follow a scale that ranks them.

The Ames data gives us 82 features to work with and 46 of them are categorical. We’re going to go through the basic ways that these categorical features can be encoded. Conveniently, half of the categorical features are nominal and the other half ordinal.

Encoding Nominal Features:
After reading in our data and performing preliminary cleaning, we can identify and list the categorical features. We made this list by hand following guidance from the Data Dictionary provided by the Ames data’s publisher.

import pandas as pd

# read data
df = pd.read_csv('./datasets/ames.csv')
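
As a rough first pass before listing features by hand, the non-numeric columns can be pulled out automatically. The sketch below uses a small made-up DataFrame in place of the Ames data, so the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Ames data; names and values are illustrative only.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "Foundation": ["PConc", "CBlock", "PConc"],
    "Exter Qual": ["Gd", "TA", "Gd"],
    "MS SubClass": [60, 20, 60],  # numeric codes can still be categorical
})

# Columns that are not numeric are a first guess at the categorical features.
candidate_cats = toy.select_dtypes(exclude="number").columns.tolist()
print(candidate_cats)
```

A dtype check like this misses categorical features that are stored as numeric codes, which is one reason to confirm the list against the Data Dictionary.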

With the nominal features listed (in nom_features), we have our basis for one-hot encoding, which we can perform with pandas.

df = pd.get_dummies(data=df, columns=nom_features, drop_first=True)

This is our “one-hot encoding” or “dummifying” of these features. We need it for modeling later because most models expect numeric inputs. For each category, encoding creates a new feature that holds a 1 if the observation (row in the DataFrame) belongs to that category and a 0 otherwise. We drop the first category of each feature because it is redundant: a row with 0 in all of the remaining columns must belong to the dropped category, which becomes the baseline. Keeping it would make the dummy columns perfectly collinear. Below is an example of what this encoding looks like in the DataFrame.

Each new column is named after the original feature, here Foundation, followed by an underscore and the category it represents.
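
A minimal sketch of the naming and the dropped baseline, using a toy Foundation column (the four rows are made up for illustration):

```python
import pandas as pd

# Toy column with three Foundation categories; values are illustrative.
toy = pd.DataFrame({"Foundation": ["PConc", "CBlock", "BrkTil", "PConc"]})

# drop_first=True removes the first category in sorted order (BrkTil here),
# which then serves as the baseline.
dummies = pd.get_dummies(data=toy, columns=["Foundation"], drop_first=True)
print(dummies.columns.tolist())
print(dummies)
```

The BrkTil row has a 0 in both remaining columns, which is how the baseline category stays recoverable after it is dropped.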

Encoding Ordinal Features:
For this type of encoding we assign integers in a sequence (e.g. 0, 1, 2, 3, 4). This is our ranking of the categories for each ordinal feature. To do this successfully with the Ames data, some additional processing was required. We made a new DataFrame of the ordinal features and standardized their categories on a scale from “worse” to “better”. This was done by matching each row to one of the category labels ‘Ex’, ‘Gd’, ‘TA’, ‘FA’, ‘Po’ and ‘NA’. The following was done by hand because the original ordinal categories had slightly different naming conventions for the scale of quality.

# Ordinal Features list
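
One way this kind of standardization could look is a hand-written dictionary per feature that maps its own labels onto the shared scale. The sketch below is hypothetical: the column names, the toy rows, and especially the choices in exposure_map are assumptions for illustration, not the mapping used for this article.

```python
import pandas as pd

# Toy ordinal DataFrame: one column already uses the shared quality labels,
# the other uses its own labels for a similar scale.
ord_df = pd.DataFrame({
    "Kitchen Qual": ["Ex", "TA", "Gd"],
    "Bsmt Exposure": ["Gd", "Av", "No"],
})

# Hypothetical hand-made mapping onto the shared scale.
exposure_map = {"Gd": "Gd", "Av": "TA", "Mn": "FA", "No": "Po"}
ord_df["Bsmt Exposure"] = ord_df["Bsmt Exposure"].map(exposure_map)
print(ord_df)
```

Series.map turns any label missing from the dictionary into a null, so it is worth checking for unexpected nulls after a step like this.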

After this processing we can use Scikit-learn’s OrdinalEncoder transformer to assign the ranked integers on our ordinal DataFrame.

from sklearn.preprocessing import OrdinalEncoder

OrdinalEncoder did not pick up on the intended ordering of the string labels, so the numbers were not ordered properly. By default it numbers the categories in sorted (alphabetical) order rather than by quality. We mapped the correct sequence of numbers with the following adjustment.

# Map for correct number ordering
reverse_cats = {0: 5, 2: 4, 4: 3, 1: 2, 3: 1}
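
To see why the remap is needed, the sketch below encodes the five quality labels on a toy column and then corrects the result. The column name Qual is made up, nulls are left out, and applying the map with Series.map is an assumption, since how the map was applied is not shown here. The second option passes the ranking to OrdinalEncoder through its categories parameter, which avoids the remap altogether.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column holding the five quality labels; "Qual" is a made-up name.
quality = pd.DataFrame({"Qual": ["Ex", "Gd", "TA", "FA", "Po"]})

# Default behaviour: categories are numbered in sorted (alphabetical) order.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(quality).ravel()
print(encoder.categories_[0].tolist())
print(codes.tolist())

# Option 1 (assumed usage of the map): remap the codes to quality ranks.
reverse_cats = {0: 5, 2: 4, 4: 3, 1: 2, 3: 1}
remapped = pd.Series(codes).astype(int).map(reverse_cats)
print(remapped.tolist())

# Option 2: give the encoder the ranking up front, from worst to best.
ranked = OrdinalEncoder(categories=[["Po", "FA", "TA", "Gd", "Ex"]])
direct = ranked.fit_transform(quality).ravel()
print(direct.tolist())
```

The two options put the labels in the same order but on different scales: the remap gives 1 to 5, while the explicit categories list gives 0 to 4.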

These final steps were taken to get our data ready for modeling with all categorical features now encoded.

# drop earlier unencoded ordinal features
df = df.drop(columns=ord_features).copy()

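Note: In this encoding, null values are effectively treated as 0. With its default settings, pd.get_dummies gives a row with a null value a 0 in every dummy column for that feature. OrdinalEncoder does not convert nulls to 0 on its own, so for the ordinal features the nulls have to be handled explicitly. Treating nulls as 0 is acceptable for the Ames data because the null values occur almost exclusively when a given house did not have a certain feature or characteristic.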
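
The behavior of pd.get_dummies with nulls can be checked on a toy column. The column name and values below are made up for illustration; dummy_na is the pandas option that gives nulls a column of their own instead.

```python
import numpy as np
import pandas as pd

# Toy column where the middle house has no garage (null value).
toy = pd.DataFrame({"Garage Type": ["Attchd", np.nan, "Detchd"]})

# Default: the null row gets a 0 in every dummy column.
default = pd.get_dummies(toy, columns=["Garage Type"])
print(default)

# dummy_na=True adds an explicit indicator column for nulls.
with_na = pd.get_dummies(toy, columns=["Garage Type"], dummy_na=True)
print(with_na)
```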

State of our Data & Next Steps:
Our encoding more than doubled the number of features in the Ames data, from 82 originally to 196! There are many choices we can make for modeling, but such a large number of features suggests we should see what we can discover with dimension reduction, in particular Principal Component Analysis. We will proceed with this analysis and explore our model results in Part 2.
