Can we predict the Dream House price in Ames, Iowa using Machine Learning?

Neeraj Somani
8 min read · Apr 6, 2021

Every human being, at least once in their lifetime, dreams about a house they can call their dream home. Every house is different, just like we human beings. The question is: how can we draw on the data we have about a house to predict the price of our dream home? In this project, we try to predict house prices in Ames, Iowa, with the hope that it can help future buyers find their dream home. The Ames, Iowa dataset comes from a Kaggle competition and was originally compiled by Dean De Cock for use in data science education. The technology stack is based on Python; the main libraries used are pandas, NumPy, matplotlib, seaborn, and scikit-learn.

The key to solving a problem is to take a strategic, structured approach to working through a solution. So the first thing I did was prepare a plan. Here are the high-level steps I performed:

Data Exploration and Analysis

  • To check the behavior of each feature with respect to the target variable
  • To identify any outliers in the dataset
  • To understand multicollinearity in the dataset
  • To understand the skewness and kurtosis of individual features within the dataset

Data Cleaning

  • Analyze the data type of each feature and change it if needed
  • Handle missing values / imputation / dummification

Data Filtering

  • Removing any outliers, if needed
  • Manually dropping any features that don’t make sense in the dataset

Data Transformation

  • New features can be created by understanding the behavior of existing ones
  • Multiple features can be combined into one

Feature Engineering / Feature selection / Feature scaling

  • Run a few models (Lasso, Ridge, PCA) to perform feature selection on both the original features and the transformed features

Evaluate different ML models on the training set

Perform CV (cross-validation) and Parameter optimization on models

Utilize best parameters for each model to predict the output (House Sale Price)

Let’s understand each step one by one.

Data Exploration and Analysis

Which feature should I analyze first? How should I select it? These were the questions that came to my mind first. I decided to start the analysis with the target variable. Coming from an engineering background, I approached it with a reverse-engineering mindset. I found that the average sale price in Ames, Iowa was around 180K USD. The dataset covers the years 2006 to 2010, which includes the 2008 recession, so not surprisingly there was a price drop in 2008 and 2009 compared with the other years. I also found that the sale price data is right-skewed, which fits that picture. To deal with the right skewness, we can apply a log transformation to bring the distribution closer to normal. This normalization is important for the ML models to work properly.
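A minimal sketch of that transformation, assuming the Kaggle file name train.csv and the column name SalePrice, might look like this:

```python
import numpy as np
import pandas as pd

# Load the Kaggle training data (file name assumed to be train.csv)
train = pd.read_csv("train.csv")

# log1p reduces the right skew of the sale price distribution
train["SalePrice_log"] = np.log1p(train["SalePrice"])

# Skewness should drop noticeably after the transformation
print(train["SalePrice"].skew(), train["SalePrice_log"].skew())
```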

The next step is to analyze all the predictor variables and their impact on the sale price. There are various ways to achieve this; I performed the following steps. First, I plotted a heat map, which shows how strongly each variable is correlated with the target variable. Although this doesn’t give a complete picture of the predictors, it gives us enough detail to get started. Second, I plotted scatter plots and histograms for a few important numerical features against the target variable, which reveals any outliers or skewness in the data.
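A correlation heatmap sketch along those lines (reusing the train DataFrame from above) could be:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations between the numeric features (including SalePrice)
corr = train.select_dtypes(include="number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heatmap of numeric features")
plt.show()

# Features most correlated with the target
print(corr["SalePrice"].sort_values(ascending=False).head(10))
```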

Now let’s create scatter plots for these highly correlated features to understand any outliers or unusual behavior.
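A sketch of those scatter plots, using a few columns from the Kaggle data dictionary that typically correlate strongly with the sale price, might be:

```python
import matplotlib.pyplot as plt

# A few features that correlate strongly with SalePrice
top_feats = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), top_feats):
    ax.scatter(train[col], train["SalePrice"], alpha=0.4)
    ax.set_xlabel(col)
    ax.set_ylabel("SalePrice")
plt.tight_layout()
plt.show()
```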

Data Cleaning and missing value imputation

The next big thing is data cleaning. That means analyzing missing values and finding the best way to impute them without introducing bias or variance into the dataset.

In order to understand the behavior of each feature, I first analyzed the data and its data types, then divided the features into numerical and categorical groups.

There are many categorical features, like PoolQC and Fence, where a missing value has a meaning: the house has no pool or no fence, respectively. Hence, we manually hard-coded missing values to the categorical value “None” and later converted these categorical features into numerical ones by assigning an ordinal number to each value. An example is shown below.

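A hedged sketch of that imputation (the exact column list and ordinal mapping used in the original code are not shown in the article) might be:

```python
# Categorical features where a missing value means "the house doesn't have this"
none_cols = ["PoolQC", "Fence", "Alley", "FireplaceQu", "MiscFeature"]
for col in none_cols:
    train[col] = train[col].fillna("None")

# Assumed ordinal mapping for quality-style features
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
train["PoolQC"] = train["PoolQC"].map(qual_map)
```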

For many of the numerical features we imputed missing values with the mean or the mode, depending on the analysis of each feature. An example is shown below.

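One possible sketch of that step, using two columns that do have missing values in this dataset (the exact columns handled in the original code are not shown), is:

```python
# Continuous feature: fill with the mean
train["MasVnrArea"] = train["MasVnrArea"].fillna(train["MasVnrArea"].mean())

# Discrete feature: fill with the mode (most frequent value)
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(train["GarageYrBlt"].mode()[0])
```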

Once I completed the imputation of missing values, the next big task was feature encoding. It’s important because ML models work best with numerical values. Encoding also lets us drop redundant information from the dataset, which leads to more accurate model results.

Below are the different ways in which I performed this task (a short encoding sketch follows the list):

  • Ordinal categorical features: encode manually or with sklearn’s LabelEncoder
  • Nominal categorical features: best done with sklearn’s one-hot encoding or pandas’ get_dummies function
  • Discrete numerical features: usually no encoding required
  • Continuous numerical features: usually no encoding required
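A small sketch of the encoding step (column names from the Kaggle data dictionary; the exact columns encoded each way in the original code are not shown) might be:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Ordinal categorical feature: a manual mapping (like qual_map above) keeps the
# intended order, while LabelEncoder simply assigns labels alphabetically
train["ExterQual"] = LabelEncoder().fit_transform(train["ExterQual"])

# Nominal categorical features: one-hot encoding via get_dummies
train = pd.get_dummies(train, columns=["Neighborhood", "SaleType"], drop_first=True)
```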

Data Filtering

While doing a thorough analysis of the dataset, I realized that a few features either don’t make sense or don’t impact the prediction of the target variable. Hence, I dropped the “Utilities” feature, since around 95% of the observations have the same value, and the “LotFrontage” feature, since it has many missing values and is hard to impute correctly.
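In code, this filtering step is short:

```python
# Nearly all houses share a single Utilities value, so it adds little signal
print(train["Utilities"].value_counts(normalize=True))

# Drop the features judged uninformative or too hard to impute reliably
train = train.drop(columns=["Utilities", "LotFrontage"])
```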

Feature Engineering / Feature selection / Feature scaling

In this EDA approach, I didn’t apply any data transformations, because I wanted to see how the models perform on the original dataset. I did, however, apply a few feature selection and feature scaling techniques to improve model performance.

Feature scaling is the next big task. sklearn offers various methods, such as MinMaxScaler, minmax_scale, MaxAbsScaler, StandardScaler, RobustScaler, Normalizer, QuantileTransformer, and PowerTransformer. See the documentation for more.

Rescaling usually takes one of two forms: normalization and standardization. Normalization scales all numeric variables into the range [0, 1], so information about outliers may be lost. Standardization, on the other hand, transforms the data to have zero mean and unit variance.

Feature scaling helps gradient descent converge faster, thus reducing training time. It’s not necessary to standardize the target variable. Because of the outliers present in this dataset, I used sklearn’s RobustScaler, which is much less affected by outliers.
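A sketch of the scaling step, assuming all remaining predictor columns are numeric after the encoding above and reusing the log-transformed target:

```python
from sklearn.preprocessing import RobustScaler

# Separate the predictors from the (log-transformed) target
X = train.drop(columns=["SalePrice", "SalePrice_log"])
y = train["SalePrice_log"]

# RobustScaler centres on the median and scales by the IQR,
# so outliers influence it far less than StandardScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```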

In this specific EDA I didn’t perform feature selection with Lasso or Ridge as a separate step. Instead, these models perform feature selection implicitly through regularization while they are being fit. So, let’s see how I implemented and evaluated these models.

Evaluating different ML models on processed train dataset

While studying machine learning I learned about a lot of models, and every model processes the dataset in a different way. The first thing I did was import a few models from sklearn, namely Lasso, Ridge, ElasticNet and GradientBoostingRegressor, plus XGBRegressor from the XGBoost package, and initialize each of them with its default parameters.
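A minimal sketch of that initialization (the exact settings in the original screenshot are not shown in the article) might look like:

```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# Baseline models with default parameters
models = {
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elasticnet": ElasticNet(),
    "gbr": GradientBoostingRegressor(),
    "xgb": XGBRegressor(),
}
```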

Now comes the most important task: running these ML models and checking their performance. That’s when all the work we have done so far pays off. Here, as per the Kaggle standard, the score is calculated using RMSE (Root Mean Squared Error). To calculate RMSE we need data to train on and data to evaluate on, and we can’t use the test dataset yet because our models are not finalized. So we split the training dataset into two parts for model tuning. I used sklearn’s train_test_split to split the data in a 70–30 ratio, fit each model on the 70% training portion, predicted on the remaining 30%, and then calculated the RMSE as below.
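A sketch of this evaluation loop, reusing X_scaled, y, and the models dictionary from above (since y is the log of the sale price, the RMSE here is on the log scale):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 70-30 split of the processed training data
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(f"{name}: validation RMSE = {rmse:.4f}")
```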

These models show substantial variation in their RMSE scores, and none of them is fine-tuned yet. Let’s use cross-validation to get a more reliable picture of model performance. K-fold cross-validation (CV) divides the data into k folds and ensures that each fold is used as the validation set exactly once, which gives a more stable estimate of how each model performs.
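A cross-validation sketch along those lines:

```python
from sklearn.model_selection import cross_val_score

# 5-fold CV; sklearn reports RMSE as a negative score, so flip the sign
for name, model in models.items():
    scores = cross_val_score(
        model, X_scaled, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: mean CV RMSE = {-scores.mean():.4f}")
```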

There is one more technique we can use to improve model performance: hyperparameter tuning. So far we have only used the default parameters, so the next step is to search for better ones for each model.
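One way to do this is a grid search. Here is a sketch for the Ridge model only; the grids actually searched for the other models are not given in the article, so the values below are purely illustrative:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Illustrative grid for Ridge regression
param_grid = {"alpha": [0.01, 0.1, 1, 10, 50, 100]}

grid = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
grid.fit(X_scaled, y)

print("best params:", grid.best_params_, "best CV RMSE:", -grid.best_score_)
best_model = grid.best_estimator_
```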

The last thing we need to do is take these best-tuned models and run them against the test dataset provided by Kaggle.
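A sketch of this final step, assuming the Kaggle test file is test.csv and that preprocess is a hypothetical helper standing in for the same cleaning, encoding, and scaling pipeline applied to the training data:

```python
test = pd.read_csv("test.csv")

# `preprocess` is a hypothetical helper for the full pipeline described above
X_test_processed = preprocess(test)

# Predictions are on the log scale, so invert the log1p transform
preds = np.expm1(best_model.predict(X_test_processed))

# Kaggle expects an Id column and the predicted SalePrice
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
```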

The resulting prediction CSV has one row per house in the test set, with its Id and predicted SalePrice.

Finally, let’s plot a histogram of the predicted sale prices for the test dataset.
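A plotting sketch for this:

```python
import matplotlib.pyplot as plt

# Distribution of the predicted sale prices on the Kaggle test set
plt.hist(preds, bins=50)
plt.xlabel("Predicted SalePrice")
plt.ylabel("Number of houses")
plt.title("Predicted sale prices for the test set")
plt.show()
```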

Future work and learnings

This was my first ever ML project, and it gave me a good understanding of how to tackle the various challenges in a project. I have thought of a few improvements I can still make. For example, I can apply more data transformation techniques, since many features still bring multicollinearity into the models. Another area I can work on is data imputation and the removal of outliers. Given the current economic downturn in the USA, we could also include a few additional features and see their overall impact. I was also not able to select the best hyperparameters for XGBoost and Gradient Boosting due to computational limitations. As we all know, a project is never truly complete; there is always room for improvement. That’s how an engineering mindset works, and I will keep improving it.


Neeraj Somani

Data Analytics Engineer, crossing paths in Data Science, Data Engineering and DevOps. Bringing you lots of exciting projects in the form of stories. Enjoy-Love.