Zillow is one of the most widely used real estate websites among people interested in buying or selling houses. Its large database allows it to publish housing market predictions. However, these predictions are not very accurate because they do not capture local social, economic, and spatial characteristics.

Inaccurate housing predictions can mislead both homeowners who want to sell and people who want to buy. An inaccurate price can cause significant economic losses to individuals or families. To improve the accuracy of housing value prediction, we partnered with Zillow to develop a better predictive model.

Building the model is difficult because we do not know in advance which factors influence house values. Feature engineering is also challenging: a variable can be transformed in several ways (e.g., expressed as a percentage or split into categories), and it is not clear beforehand which choice leads to a more successful model.
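To make the two feature-engineering options mentioned above concrete, here is a minimal Python sketch. The field names (`residents`, `degree_holders`, `year_built`), the cut points, and the numbers are hypothetical and are not taken from our dataset; the sketch only shows what "take a percentage" and "split into categories" look like in practice.

```python
def as_percentage(part, total):
    """Express a raw count as a percentage of a total (0.0 when total is 0)."""
    return 100.0 * part / total if total else 0.0


def as_category(value, cut_points, labels):
    """Bin a numeric value; labels needs one more entry than cut_points."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]


# Hypothetical records; field names and numbers are made up for illustration.
blocks = [
    {"residents": 1200, "degree_holders": 300, "year_built": 1925},
    {"residents": 800, "degree_holders": 600, "year_built": 1995},
]

for block in blocks:
    # Option 1: turn a raw count into a share of the block population.
    block["pct_degree"] = as_percentage(block["degree_holders"], block["residents"])
    # Option 2: split a continuous variable into a few categories.
    block["era"] = as_category(
        block["year_built"], [1940, 1980], ["pre-1941", "1941-1980", "post-1980"]
    )
```

Which of the two forms works better for a given variable can only be judged by fitting the model with each and comparing the results.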

Our final model explains 71% of the variation in sale price. When tested with the training set, the model has a mean absolute percentage error (MAPE) of 22.47%. We conducted a cross-validation test to check that the model is generalizable. We also examined Moran’s I, which suggests that the model has no strong spatial autocorrelation; that is, the spatial pattern is generally not clustered.
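For readers unfamiliar with the two diagnostics named above, the following Python sketch shows how MAPE and the global Moran's I are defined. It is an illustration under stated assumptions, not the code used for this project: the prices, the four-home "street", and the binary neighbor matrix `chain` are all made up, and Moran's I is commonly computed on model errors, which may differ from how it was applied here.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be nonzero."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def morans_i(values, weights):
    """Global Moran's I for values under a spatial weight matrix weights[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of locations.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Four hypothetical homes along a street; each neighbors the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = morans_i([1, 2, 3, 4], chain)        # similar values sit together
alternating = morans_i([1, -1, 1, -1], chain)  # neighbors always differ
```

Values of Moran's I well above zero indicate that similar values cluster in space, values well below zero indicate that neighbors tend to differ, and values near zero indicate little spatial autocorrelation.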


The primary dataset for this model is “San Francisco home sale”, which contains information about the structure and sale price of each house. Other data were gathered from the San Francisco Open Data Portal or searched and recorded manually (e.g., locations of Whole Foods stores, hospitals, headquarters, etc.).

The model we built is a predictive model based on Ordinary Least Squares (OLS) regression. The dependent variable is the sale price of a house, and we shortlisted 29 independent variables to develop the model. These are factors likely to influence housing prices, and they are divided into four categories, namely:

  • Internal characteristics: Housing structure and characteristics
  • Amenities & public services: Distance to, or number of, amenities/services close to the property
  • Demographic characteristics: Block-level demographic characteristics
  • Spatial Structures: Average sale price and price per square foot of neighboring properties
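As a rough illustration of how a spatial-structure variable can feed an OLS model, the sketch below computes the mean sale price of each home's nearest neighbors and fits OLS coefficients by solving the normal equations. Several things here are assumptions and not details of our model: "neighboring" is taken to mean the k nearest homes by straight-line distance, the functions `knn_mean_price`, `solve`, and `fit_ols` are hypothetical helpers, and all coordinates and prices are made up. Only two predictors are used instead of the 29 in the actual model.

```python
import math


def knn_mean_price(homes, index, k=2):
    """Mean sale price of the k nearest other homes (a spatial-lag feature)."""
    origin = homes[index]["xy"]
    others = [
        (math.dist(origin, h["xy"]), h["price"])
        for i, h in enumerate(homes)
        if i != index  # a home must not count as its own neighbor
    ]
    others.sort(key=lambda pair: pair[0])
    nearest = others[:k]
    return sum(price for _, price in nearest) / len(nearest)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """OLS coefficients [intercept, b1, ...] from the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical homes: planar coordinates and sale prices (in $1,000s).
homes = [
    {"xy": (0, 0), "price": 100},
    {"xy": (1, 0), "price": 200},
    {"xy": (5, 0), "price": 900},
    {"xy": (0, 1), "price": 300},
]
lag_for_first_home = knn_mean_price(homes, 0, k=2)

# Synthetic data generated exactly as y = 1 + 2*x1 + 3*x2, to check the fit.
coefficients = fit_ols([[1, 2], [2, 1], [3, 5], [4, 3]], [9, 8, 22, 18])
```

Solving the normal equations directly keeps the example dependency-free, but it is numerically fragile when predictors are highly correlated; a statistics library would normally be used for a model with many variables.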