Ames Housing Prices Reconsidered Part 1: Simple Encoding

Aaron Hume
4 min read · Dec 5, 2020

The Ames, Iowa housing prices dataset (Ames data) offers a great opportunity to explore the tools available to us as Data Scientists. We’re going to explore the encoding of categorical data and, later, how we can work with all of those categories using dimensionality reduction and Principal Component Analysis. Our target feature is the sale price of a home; the other features describe the characteristics, location, and condition of the home. Before modeling our data and making predictions, we need to process our categorical features, or variables.

Categorical variables can be of two types:
- Nominal: categories with no inherent order that simply distinguish groups (e.g. Neighborhood).
- Ordinal: categories that follow a ranked scale (e.g. Kitchen Qual, from poor to excellent).

The Ames data gives us 82 features to work with and 46 of them are categorical. We’re going to go through the basic ways that these categorical features can be encoded. Conveniently, half of the categorical features are nominal and the other half ordinal.
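As a quick sanity check, categorical columns can be identified by dtype. A toy sketch (hypothetical rows standing in for the Ames frame, not the actual data):

```python
import pandas as pd

# Toy frame standing in for the Ames data: object columns are the categoricals
toy = pd.DataFrame({
    'SalePrice': [189900, 236500],
    'Neighborhood': ['NAmes', 'Gilbert'],   # nominal
    'Kitchen Qual': ['TA', 'Gd'],           # ordinal
})

# String-typed columns are the categorical candidates
categorical_cols = toy.select_dtypes(include='object').columns.tolist()
print(categorical_cols)
```

This only flags string-typed columns, so features like Overall Qual (stored as integers but ordinal in meaning) still have to be caught by hand with the Data Dictionary.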

Encoding Nominal Features:
After reading in our data and performing preliminary cleaning, we can identify and list the categorical features. We made this list by hand following guidance from the Data Dictionary provided by the Ames data’s publisher.

import pandas as pd

# read data
df = pd.read_csv('./datasets/ames.csv')
# List of nominal features for encoding.
# Features selected with guidance from the original Data Dictionary.
nom_features = ['MS Zoning','Street','Alley','Lot Config','Land Contour','Neighborhood','Condition 1','Condition 2','Bldg Type','House Style','Roof Style','Roof Matl','Exterior 1st','Exterior 2nd','Mas Vnr Type','Foundation','Heating','Central Air','Garage Type','Sale Type','Misc Feature']

We now have our basis for one hot encoding that we can perform with pandas.

df = pd.get_dummies(data=df,columns=nom_features,drop_first=True)

This is our “one hot encoding” or “dummifying” of these features, and it is necessary before we can model with this data later. Encoding creates a new column for each category and assigns a 1 or a 0 depending on whether an observation (row in the DataFrame) has that category present. We drop the first category in each feature because it is redundant: when all of a feature’s remaining dummy columns are 0, the dropped baseline category is implied, and keeping it would introduce perfect multicollinearity in a linear model. Below is an example of what this encoding looks like in the DataFrame.

The exact category for the Foundation feature is labeled following the underscore.
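A toy sketch of this behavior (hypothetical rows, not the actual Ames data) using a single Foundation-style column:

```python
import pandas as pd

# Toy example: four rows with three distinct Foundation categories
toy = pd.DataFrame({'Foundation': ['PConc', 'CBlock', 'BrkTil', 'PConc']})

# drop_first=True drops the alphabetically first category ('BrkTil');
# a row of all zeros in the remaining columns implies that baseline.
dummies = pd.get_dummies(toy, columns=['Foundation'], drop_first=True)
print(dummies)
```

The 'BrkTil' row comes out as all zeros: the baseline category is recoverable from the absence of the others, which is exactly why keeping its own column would be redundant.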

Encoding Ordinal Features:
For this type of encoding we assign integers in a ranked sequence (e.g. 0, 1, 2, 3, 4) to the categories of each ordinal feature. To do this successfully with the Ames data, some additional processing was required. We made a new DataFrame of the ordinal features and standardized their categories on a scale from “worse” to “better”. This was done by matching category labels of ‘Ex’, ‘Gd’, ‘TA’, ‘Fa’, ‘Po’, and ‘NA’ for each row, and it was done by hand because the original ordinal categories had slightly different naming conventions for the scale of quality.

# List of ordinal features
ord_features = ['Lot Shape','Utilities','Land Slope','Exter Qual','Exter Cond','Bsmt Qual','Bsmt Cond','Bsmt Exposure','BsmtFin Type 1','BsmtFin Type 2','Heating QC','Electrical','Kitchen Qual','Functional','Fireplace Qu','Garage Finish','Garage Qual','Garage Cond','Paved Drive','Pool QC','Fence','Overall Qual','Overall Cond']
# Create ordinal DataFrame
ordinals = df[ord_features].fillna('NA')
# Create dictionaries for mapping standardized category labels
# Some features already had these labels and were not mapped
order_lotshape = {'Reg':'Ex','IR1':'Gd','IR2':'TA',
'IR3':'Fa','NA':'NA'}
order_util = {'AllPub':'Ex','NoSewr':'Gd',
'NoSeWa':'TA','ELO':'Fa','NA':'NA'}
order_landslope = {'Gtl':'Ex','Mod':'Gd',
'Sev':'TA','NA':'NA'}
order_bsFin = {'GLQ':'Ex','ALQ':'Gd','BLQ':'TA',
'Rec':'Fa','LwQ':'Po','Unf':'Po','NA':'NA'}
order_elec = {'SBrkr':'Ex','FuseA':'Gd','FuseF':'TA',
'FuseP':'Fa','Mix':'Po','NA':'NA'}
order_func = {'Typ':'Ex','Min1':'Gd','Min2':'TA','Mod':'Fa',
'Maj1':'Po','Maj2':'Po','Sev':'Po','Sal':'Po','NA':'NA'}
order_garage_fin = {'Fin':'Ex','RFn':'Gd','Unf':'TA','NA':'NA'}
order_paved = {'Y':'Ex','P':'Gd','N':'TA','NA':'NA'}
order_fence = {'GdPrv':'Ex','MnPrv':'Gd','GdWo':'TA',
'MnWw':'Fa','NA':'NA'}
order_map_Qual = {10:'Ex',9:'Ex',8:'Gd',7:'Gd',6:'Gd',5:'TA',
4:'TA',3:'Fa',2:'Po',1:'Po','NA':'NA'}
# Map standardized labels to features
ordinals['Lot Shape'] = ordinals['Lot Shape'].map(order_lotshape)
ordinals['Utilities'] = ordinals['Utilities'].map(order_util)
ordinals['Land Slope'] = ordinals['Land Slope'].map(order_landslope)
ordinals['BsmtFin Type 1'] = ordinals['BsmtFin Type 1'].map(order_bsFin)
ordinals['BsmtFin Type 2'] = ordinals['BsmtFin Type 2'].map(order_bsFin)
ordinals['Electrical'] = ordinals['Electrical'].map(order_elec)
ordinals['Functional'] = ordinals['Functional'].map(order_func)
ordinals['Garage Finish'] = ordinals['Garage Finish'].map(order_garage_fin)
ordinals['Paved Drive'] = ordinals['Paved Drive'].map(order_paved)
ordinals['Fence'] = ordinals['Fence'].map(order_fence)
ordinals['Overall Qual'] = ordinals['Overall Qual'].map(order_map_Qual)
ordinals['Overall Cond'] = ordinals['Overall Cond'].map(order_map_Qual)
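The repeated .map calls above could also be collapsed into a single loop over a dictionary of per-column mappings. A minimal sketch with toy data (two of the features, not the full set):

```python
import pandas as pd

# Toy ordinal frame (not the real Ames rows)
ordinals = pd.DataFrame({
    'Lot Shape': ['Reg', 'IR1', 'NA'],
    'Paved Drive': ['Y', 'P', 'N'],
})

# One dictionary of label mappings per column
label_maps = {
    'Lot Shape': {'Reg': 'Ex', 'IR1': 'Gd', 'IR2': 'TA', 'IR3': 'Fa', 'NA': 'NA'},
    'Paved Drive': {'Y': 'Ex', 'P': 'Gd', 'N': 'TA', 'NA': 'NA'},
}

# Apply every mapping in one pass instead of one .map line per feature
for col, mapping in label_maps.items():
    ordinals[col] = ordinals[col].map(mapping)

print(ordinals)
```

Columns that already use the standardized labels simply get no entry in the dictionary and are left alone.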

After this processing we can use Scikit-learn’s OrdinalEncoder transformer to assign the ranked integers on our ordinal DataFrame.

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
ordinals_encoded = encoder.fit_transform(ordinals)
# Use original feature names for encoded ordinals
ordinal_df = pd.DataFrame(ordinals_encoded,columns=ord_features)

OrdinalEncoder has no notion of our quality scale: by default it assigns integers to a column’s string labels in alphabetical order (‘Ex’=0, ‘Fa’=1, ‘Gd’=2, ‘NA’=3, ‘Po’=4, ‘TA’=5), so the resulting numbers did not follow the ranking. We remapped them to the correct sequence with the following adjustment.

# Map the alphabetical codes back to the quality ranking:
# alphabetical gives 'Ex'=0, 'Fa'=1, 'Gd'=2, 'NA'=3, 'Po'=4, 'TA'=5;
# we want 'NA'=0, 'Po'=1, 'Fa'=2, 'TA'=3, 'Gd'=4, 'Ex'=5.
# (Assumes all six labels occur in each column, since OrdinalEncoder
# assigns its codes per column.)
reverse_cats = {0: 5, 1: 2, 2: 4, 3: 0, 4: 1, 5: 3}
for col in ordinal_df.columns:
    ordinal_df[col] = ordinal_df[col].map(reverse_cats)
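The remapping step can be avoided entirely by handing OrdinalEncoder the ranking up front through its categories parameter. A sketch of this alternative (toy data, not the article’s original approach):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal frame using the standardized labels
ordinals = pd.DataFrame({
    'Exter Qual': ['Ex', 'TA', 'Gd', 'NA'],
    'Kitchen Qual': ['Gd', 'Fa', 'Ex', 'Po'],
})

# One explicit worst-to-best order, reused for every column, so
# 'NA'=0 ... 'Ex'=5 regardless of which labels appear in a column
quality_order = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
encoder = OrdinalEncoder(categories=[quality_order] * ordinals.shape[1])

encoded = pd.DataFrame(encoder.fit_transform(ordinals),
                       columns=ordinals.columns)
print(encoded)
```

Because the order is fixed per column, this also sidesteps the subtle problem that alphabetical codes shift when a column happens to be missing some of the six labels.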

With these final steps, our data is ready for modeling: every categorical feature is now encoded.

# drop earlier unencoded ordinal features
df = df.drop(columns=ord_features).copy()
# Combine encoded ordinals
df = pd.concat([df,ordinal_df],axis=1)
# fill remaining null values
df = df.fillna(0).copy()

Note: with this encoding, null values effectively become 0 — get_dummies creates no dummy column for missing values by default (leaving all of a feature’s dummies at 0), and any remaining nulls are filled with 0. This is acceptable for the Ames data because nulls almost exclusively indicate that a given house simply lacks the feature in question (no alley access, no pool, and so on).
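A toy demonstration of that dummy behavior with a missing value (hypothetical rows, not the real Ames data):

```python
import numpy as np
import pandas as pd

# With the default dummy_na=False, get_dummies creates no column for NaN,
# so a missing value becomes all zeros across that feature's dummies.
toy = pd.DataFrame({'Alley': ['Grvl', 'Pave', np.nan]})
dummies = pd.get_dummies(toy, columns=['Alley'])
print(dummies)
```

The NaN row ends up as zeros in both Alley columns, which matches the “house has no alley” interpretation of the missing value.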

State of our Data & Next Steps:
Our encoding more than doubled the number of features in the Ames data: we now have 196, up from the original 82! There are many choices we can make for modeling, but our large number of features suggests we should see what we can discover using dimensionality reduction and Principal Component Analysis. We will proceed with this analysis and explore our model results in Part 2.
