1.Introduction

Zillow is one of the most often-used real estate websites for people who are interested in buying or selling houses. It has a large database that allows them to publish housing market predictions. However, their predictions aren’t very accurate because they do not capture local social, economic and spatial characteristics.

Inaccurate prediction for housing can be very misleading for homeowners who want to sell their houses and people who want to buy a house. The inaccurate price can cause significant economic losses to individuals or families. In order to improve the accuracy of housing value prediction, we partnered with Zillow to develop a better predictive model.

Building the model is a difficult process because we are uncertain about which factors are influential to house values. Also, how to engineer the feature(e.g. take a percentage or split them into categories) is also a tough task because there are multiple ways to do it, and we are not sure which one could lead to a more successful model.

Our final model could explain 71% of variations of the sale price. When testing with the training set, our model has a MAPE of 22.47%. A cross-validation test has been conducted to ensure that the model is generalizable. We also took a look at Moran’s I, which suggests that our model has no strong spatial autocorrelation and the spatial phenomenon is generally not clustering together.

2.Data

The primary dataset for this model is “San Francisco home sale”, which consists of information about the housing structure and sale prices of the houses. Other data was gathered from San Francisco Open Data Portal or searched and recorded manually (e.g. location of wholefoods, hospitals, headquarters, etc.).

The model we built is a predictive model based on Ordinary Least Squares (OLS) regression. The dependent variable is the sale price of houses and we shortlisted 29 independent variables to develop this model. They are factors that are likely to influence housing prices and have been divided into four categories, namely:

  • Internal characteristics: Housing structure and characteristics
  • Amenities & public services: Distance to or numbers of amenities/ services close to the property
  • Demographic characteristics: Block-level demographic characteristics
  • Spatial Structures: Average sale price and price per square foot of neighboring properties

2.1 Data Overview

Summary statistics of all the independent variables is presented in the table below

2.2 Correlation Matrix

Map theme, plot theme, and function set-up.

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
} 

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
} 
paletteY <- c("#F9F871","#FFD364","#FFAF6D","#FF8F80","#F87895", "D16BA5")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

A correlation matrix can help us to understand if factors have strong correlation with one another and are likely to have similar power or effect on predicting housing values. If two factors are strongly correlated, one of the variables should be removed from the model, because it will not significantly improve the predictivity of the model.

According to the correlation matrix below, there were some factors are strongly correlated with each other. For example, thefts and assaults. In this case, we took thefts out of the model, because we tried to put two variables individually, and we observed that the number of assaults has stronger predicting power for sale price than thefts. Some other variables were also removed because of either colinearity or insignificant predicting power.

SANF.cor <- subset(SANF, select = -c(FID, Shape, ParcelID, Address, PropClassC, SaleDate, LandUse,ZoningCode, X, Y, ConstType, SaleYr, holdOut, name, ID, GEOID, NAME, EduTOT, EduBach, EduMas, EduPro, EduDoc, RaceTOT, RaceWhite, HUtot, HUvac, StruTOT, Stru1d, Stru1a, Unit_single, White, IncLev, waterfront,LotArea.y, Stories.y, BuiltYear.y, PropArea.y, Baths.y, Rooms.y, bedroom, Units, SalePrice, PricePerSq, Stories.x))

SANF.cor <- 
  SANF.cor %>% 
  st_set_geometry(NULL)

cormatrix <- cor(SANF.cor, use = "pairwise.complete.obs")

ggcorrplot(cor(SANF.cor, use = "pairwise.complete.obs"),
           colors = c("#9F71C7","white", "#D16BA5"), 
           title="Correlation Matrix (Correlogram) of Continuous Variables",
           ggtheme=mapTheme)

2.3 Selected independent variables VS sale prices

Finally, we selected 29 independent variables to be included in our prediction model, which are indicated in the correlation scatterplots and histograms below.

The 4 scatterplots below show the relationship between the sale price and our 4 continuous independent variables. According to the plots, the distance to intersections does not have a very strong correlation with the housing price. Both distances to parks and the number of transit stops close to the property have slight relationships with sales price. The sale price will be higher when the property is located at a place has shorter distance to parks and has more transit stops within walking distance. Lag Price has strongest correlation with sale prices. If neighboring houses are expensive, the property will likely also have high sale price.

as.data.frame(SANF.new) %>% 
  dplyr::select(SalePrice, lagPrice, transit, disInt1, disPark1) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(colour = "#FFAF6D", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#D16BA5") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of selected continuous variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

Correlation scatterplots of the rest continuous variables.

as.data.frame(SANF.new) %>%
  filter(Rooms.x < 30) %>%
  filter(Baths.x < 25) %>%
  filter(PropArea.x < 10000) %>%
  dplyr::select(SalePrice, LotArea.x, PropArea.x, Rooms.x, Baths.x, houseage, Beds, 
                disInt1, discol, disHosp, dissch, disMix, disVisit, disPark1, disOsp, disWhole,
                disComp, assault, transit, preS, MedHHinc, VacPct, WhitePct, AboveBach,lagPrice2) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(colour = "#FFAF6D", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#D16BA5") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Price as a function of selected continuous variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

The 4 scatterplots below show the relationship between the sale price and our 4 categorical independent variables. The sale year plot suggests that houses sold in 2015 tend to have higher prices than houses sold earlier. This also implies there was an appreciation in the housing market. As for housing stories, properties with 4 stories seems are more popular and have a higher sale price. Single or multiple unit housing structures also played a role in predicting sale price. Multiple units housing structures are likely to have higher sale prices. Lastly, waterfront location played a minimal role in boosting sale price. The sale prices of houses located at waterfront are only slightly higher than the houses are not at waterfront.

as.data.frame(SANF.new) %>% 
  dplyr::select(SalePrice, saleyr20, Stories.x, waterfront, Unit_single) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_bar(position = "dodge", stat = "summary", fun.y = "mean", fill = "#FFAF6D" ) +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Price as a function of categorical variables", y = "Mean_Price") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

The map below illustrates the home sale price in San Francisco. According to the map, we can see houses located in the north and middle of San Francisco are likely to have higher housing prices.

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = SANF, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = paletteY, 
                      labels = qBr(SANF, "SalePrice"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Home Sale Price (holdout = 0), San Francisco") + 
  mapTheme()

Property Area does not have such strong spatial clustering as home sale prices. Big and small houses distribute relatively equally across San Francisco.

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = SANF, aes(colour = q5(PropArea.x)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = paletteY, 
                      labels = qBr(SANF, "PropArea.x"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Property area (sqft)") + 
  mapTheme()

Below are three variables for open data sources. The map on the left is the location on the amenities, and the map on the right shows the number of amenities near the properties or the distance from the property to its closest amenities.

Houses located in the middle and east of San Francisco seems to have better accessibility to public transit because they have more transit stops within walking distance.

grid.arrange(
  ggplot() +
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = transit, shape = 18, colour = "#D16BA5", size = 1.5) + 
    labs(title="Transit stops in San Francisco") +
    mapTheme(),
  
  ggplot() + 
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = SANF, aes(colour = q5(transit)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = paletteY, 
                        labels = qBr(SANF, "transit"),
                        name = "Quantile\nBreaks") + 
    labs(title = "Number of transit stops within 1/2 mile buffer zone") + 
    mapTheme(),
  ncol = 2)

According to the Assault in 2012 map, assault incidents were denser in the downtown of San Francisco (Northeast). This is also reflected in the map on the right, where property located on the east has significantly a higher number of assaults at surrounding.

grid.arrange(
  ggplot() +
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = assault, shape = 18, colour = "#D16BA5", size = 1) + 
  labs(title="Assault in 2012, San Francisco") +
  mapTheme(),
  
  ggplot() + 
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = SANF, aes(colour = q5(assault)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = paletteY, 
                        labels = qBr(SANF, "assault"),
                        name = "Quantile\nBreaks") + 
    labs(title = "Number of assaults within 1/2 mile buffer zone in 2012") + 
    mapTheme(),
  ncol = 2)

This map shows the distance from the property to its closest parks. Properties located at the center of San Francisco seems to have higher accessibility to parks as there are more parks around the area.

grid.arrange(
  ggplot() +
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = park, shape = 15, colour = "#A6C365", size = 2.5) + 
    labs(title="Parks in San Francisco") +
    mapTheme(),
  
  ggplot() + 
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = SANF, aes(colour = q5(disPark1)), show.legend = "point", size = .75) + 
    scale_colour_manual(values = paletteY, 
                        labels = qBr(SANF, "disPark1"),
                        name = "Quantile\nBreaks") + 
    labs(title = "Distance to the nearest park") + 
    mapTheme(),
  ncol = 2)

3. Method

The first step for building the model is data gathering and wrangling using GIS and R. The purpose of this step is to clean all the data and assign values of dependent variables. The resulting spreadsheet consists both of dependent variable and independent variables. The second step is building an OLS regression model. Different combinations of independent variables were being put in the model. Additionally, outliers were being removed to ensure extreme cases will not influence the predictivity of the model. The model with the strongest predicting power was selected as our final model. Finally, we test our model with 100-fold training sets to get an overview of the model’s performance. Furthermore, Moran’s I was calculated to see if there is spatial autocorrelation in our model and test our model can be applied in other areas or neighborhoods.

4. Results

4.1 Insample model result

As mentioned before, we included 29 independent variables in our model as predictors. The result of the model is shown in the tables below. Our model has an R squared of 70.85%, meaning that 70.85% of the variance in the sale price can be explained by our predictors.

inTrain <- createDataPartition(
  y = paste(SANF.new$stories.x, SANF.new$PropClassC, SANF.new$name), 
  p = .60, list = FALSE)
SF.training <- SANF.new[inTrain,] 
SF.test <- SANF.new[-inTrain,]  

reg5 <- lm(SalePrice ~ ., data = as.data.frame(SF.training) %>% 
             dplyr::select(SalePrice, LotArea.x, PropArea.x, Stories.x, Rooms.x, saleyr20,
                           Baths.x, houseage, name, Beds, discol, disHosp, dissch, disMix, 
                           disInt1, disVisit, waterfront, disPark1, disOsp, disWhole,disComp, 
                           assault, transit, preS, MedHHinc, VacPct, WhitePct, AboveBach, Unit_single,
                           lagPrice2))
summary(reg5)

4.2 R squared, Mean Absolute Error and Mean Absolute Percent Error

A good predictive model will able to predict datasets it has not encountered before. To test our model, we split our data set into training and test data. The training set is 60% of the randomly selected data from the original dataset, and the test set is the remaining 40% of the data. The model was built based on the training set and was tested against the test set. The result was shown as below:

SF.test2 <-
  SF.test %>%
  mutate(SalePrice.Predict=predict(reg5, SF.test),
         SalePrice.Error=SalePrice - SalePrice.Predict,
         SalePrice.AbsError=abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice) %>%
  filter(SalePrice < 5000000)
  
st_set_geometry(SF.test2, NULL) %>%
  summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
  pull()
  
st_set_geometry(SF.test2, NULL) %>%
  summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
  round(2) %>%
  paste("%")

test.Rsquare <- 1- (sum((SF.test2$SalePrice.Predict - SF.test2$SalePrice)^2)/sum((SF.test2$SalePrice - mean(SF.test2$SalePrice))^2))

The result suggests that the model can explain 74% of the variance in the sale price, suggesting it is a relatively good mode. The mean absolute error (MAE) for the prediction is $231,359.4, and the mean absolute percentage error (MAPE) is around 22.5%.

4.3 100 K Fold Prediction

To ensure the generalizability of our model, we also conducted 100-fold cross-validation to test our model. Cross-validation is a resampling process, which shuffles data randomly. The dataset will be firstly split into 100 groups, and one group will be taken as a holdout, and the remaining will become a training dataset. Then, the model is fitted in the training set and will be tested against the test set. The process was being repeated, and the performance of the model is recorded.

The result of cross-validation is shown below:

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg3.cv <- train(SalePrice ~ ., data = as.data.frame(SANF.new) %>% 
                   dplyr::select(SalePrice, LotArea.x, PropArea.x, Stories.x, Rooms.x, Beds, saleyr20, Baths.x, houseage, name, discol, disHosp, dissch, disMix, disInt1, disVisit, waterfront, disPark1, disOsp, disWhole,disComp,assault, transit, preS, MedHHinc, VacPct, WhitePct, AboveBach, Unit_single, lagPrice2),
                 method = "lm", trControl = fitControl, na.action = na.pass)

reg3.cv
sd(reg3.cv$resample[,3])

The Root Mean Squared Error (RMSE) is 375,487, which is the standard deviation of the absolute error. The distribution of the standard deviation of MAE can also be reflected in the histogram. The histogram and RMSE both suggest that there is still considerable variation in MAEs. The R squared number implies that on average, around 71% of the variance in the sale price can be explained by the model.

ggplot(as.data.frame(reg3.cv$resample), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#FFD365") +
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

4.4 Predicted Prices

This scatter plot shows the relationship between the predicted price and the actual sale price. The plots suggest that our model is likely to under predicts houses have sale prices over 1,000,000 and will over predict houses with sales price under 1,000,000.

ggplot(SF.test2, aes(SalePrice.Predict, SalePrice)) +
  geom_point(size = 0.75, colour = "black") +
  stat_smooth(data=SF.test2, aes(SalePrice, SalePrice),
              method = "lm", se = FALSE, size = 1, colour="#FFD364") +
  stat_smooth(data=SF.test2, aes(SalePrice, SalePrice.Predict),
              method = "lm", se = FALSE, size = 1, colour="#D16BA5") +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Pink line represents prediction") +
  xlim(0, 5000000)+
  plotTheme()

Our model predicts sale price relatively well because the distribution of sale price observations and predictions below suggests our prediction largely overlaps with the actual sale prices. The difference between the two peaks of the histogram implies that the model does not predict enough houses with sale prices between $400,000 to $1,000,000.

as.data.frame(SF.test2) %>%
  dplyr::select(SalePrice,SalePrice.Predict) %>%
  gather(Variable, Value) %>%
  ggplot(aes(Value, fill = Variable)) + 
  geom_density(alpha = 0.5) + 
  scale_fill_manual(values = c("#FFD364", "#D16BA5")) +
  labs(title="Distribution of sale price & sale price predictions", 
       x = "Price/Prediction", y = "Density of observations") +
  xlim(0, 5000000)+
  plotTheme()

The histogram below illustrates that the model predicts the sale price quite well as most of the errors is quite small. And only a few cases have very big errors in prediction.

ggplot(filter(SF.test2, SalePrice.AbsError <= 1000000), aes(SalePrice.AbsError)) +
  geom_histogram(bins = 100, colour = "white", fill="#FFD365") +
  scale_x_continuous(breaks = seq(0, 1000000, by = 100000)) +
  labs(title="Distribution of prediction errors", subtitle = "Errors > $1m are omitted") +
  plotTheme()

The map below shows the predicted sale prices for all the properties in San Francisco.

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = dat, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = paletteY, 
                      labels = qBr(dat, "SalePrice.Predict"),
                      name = "Quantile\nBreaks") + 
  labs(title = "Predicted property values for the entire dataset") + 
  mapTheme()

4.5 Spatial Auto-correlation

Moran’s I is a correlation coefficient that reflects whether an object is similar to its neighbors. The index can be used to determine spatial clustering or dispersion. Moran’s I of 1 means there is very strong spatial auto-correlation (clustering), and the value of -1 suggests there is perfect dispersion.

The maps of observed sale prices and residuals are shown as below:

grid.arrange(
  ggplot() +
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = SF.test2, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) +
    scale_colour_manual(values = paletteY,
                        labels=qBr(SF.test2,"SalePrice"),
                        name="Quintile\nBreaks") +
    labs(title="Test set sale prices") +
    mapTheme(),

  ggplot() +
    geom_sf(data = nhoods, fill = "azure", size = 0.4) +
    geom_sf(data = SF.test2, aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = .75) +
    scale_colour_manual(values = paletteY,
                        labels=qBr(SF.test2,"SalePrice.AbsError"),
                        name="Quintile\nBreaks") +
    labs(title="Absolute sale price errors on the test set") +
    mapTheme(),
  ncol = 2)

We calculated the Moran’s I for absolute errors to see if our model performs better or worse in certain areas. Our model has already taken neighborhood effect into account; therefore, it has a relatively small Moran’s I (see table below). The index suggests that spatial phenomenon (absolute error) has random distribution and the model is generalizable across different places.

coords.test <- 
  filter(SF.test2, !is.na(SalePrice.Error)) %>% 
  st_coordinates() 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

moranTest <- moran.mc(filter(SF.test2, !is.na(SalePrice.Error))$SalePrice.Error, 
                      spatialWeights.test , nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FFD364",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in red",
       x="Moran's I",
       y="Count"
      # caption="Public Policy Analytics, Figure 6.8"
      ) +
  plotTheme()

4.6 Neighborhood Effect

It is quite common that certain neighborhoods in a city have higher or lower home sale prices, based on the neighborhood’s social, economic or physical characteristics. In this case, the sale price may be clustered based on neighborhoods. To avoid the spatial auto-correlation, it is important to include neighborhood as a categorical factor in the model. By controlling the neighborhood effect, we can improve the accuracy of our model.

The map and the scatterplot below suggest that the MAPE is larger in a certain neighborhood, especially neighborhoods located in the northeast and has relatively low sale prices.

nhoods.MAPE <-
  SF.test2 %>%
  group_by(name) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
            mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  left_join(nhoods) %>%
  st_sf()

ggplot() + 
  geom_sf(data = nhoods, fill = "azure", size = 0.4) +
  geom_sf(data = nhoods.MAPE, aes(fill = q5(mean.MAPE))) +
  geom_sf(data = SF.test2, colour = "black", size = .5) +
 # facet_wrap(~Regression) +
  scale_fill_manual(values = paletteY,
                    labels = qBr(nhoods.MAPE, "mean.MAPE", rnd = F),
                    name="Quintile\nBreaks") +
  labs(title = "Mean test set MAPE by neighborhood") +
  mapTheme()

ggplot(nhoods.MAPE, aes(mean.MAPE, mean.SalePrice)) +
  geom_point(size = 2, colour = "#FFD364") +
  labs(title="MAPE by neighborhood as a function of mean price by neighborhood") +
  xlim(0, 1.5)+
  plotTheme()

4.7 Generalizability

To test our model’s generalizability, we established two contexts. The first context is regarding to income level, in which we split the city into three groups, namely, high income (top 25% median household income of all block groups), middle income (between 25% and 75% of the income level), and low income (bottom 25% income level).

The spatial distrbution of the income context is shown as the map below.

ggplot() + 
    geom_sf(data = blockgroup.clean, aes(fill = IncLev)) +
    scale_fill_manual(values = c("#FFD364", "#FF8F80", "#D16BA5"),
                      name="Income Context",
                      labels = c("Low Income", "Middle Income", "High Income")) +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="right")

The boxplots below indicate the distributions of continuous variables across blockgroups with different income levels. It can be observed that blocks with higher income level generally have greater percentage of residents holding bachelor’s degree or above, shorter distance to well-known companies, hospitals, open spaces, and Wholefoods, and greater percentage of white residents.

WhiteorNot <- 
  st_set_geometry(SANF.new, NULL) %>%
  dplyr::select(White, SalePrice, LotArea.x, PropArea.x, Rooms.x, Baths.x, houseage, Beds, 
                disInt1, discol, disHosp, dissch, disMix, disVisit, disPark1, disOsp, disWhole,
                disComp, assault, transit, preS, MedHHinc, VacPct, WhitePct, AboveBach,lagPrice2) %>%
  gather(variable, value, SalePrice:lagPrice2)

ggplot(data = WhiteorNot, aes(White, value)) +
  geom_boxplot(aes(fill=White)) +  
  facet_wrap(~variable,scales="free",ncol=5) +
  scale_fill_manual(values = c("#FFD364", "#D16BA5"),
                    name = "Majority Race",
                    labels=c("Non White","White")) +
  labs(title="Variable distributions across White-dominant and Non-White-dominant census tracts",
       x="Race context",
       y="Value") +
  theme(plot.title = element_text(size = 18,colour = "black"))

The table below shows the MAPE of the predicted sale price across different income contexts. The MAPE difference between low-income places and middle-and-high income places suggests that our model cannot predict very well in low-income area. Perhaps, there are some other factors that low-income residents care more about but were neglected by our model

Split.income <-
  SF.test2%>%
  group_by(IncLev) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(IncLev, mean.MAPE)

The second context we developed is about race, in which we split the city into two groups - white-dominated (more than 50% residents are white) and nonwhite-dominated.

ggplot() + 
    geom_sf(data = blockgroup.clean, aes(fill = White)) +
    scale_fill_manual(values = c("#FFD364", "#D16BA5"),
                      name="Race Context",
                      labels = c("Majority Non-White", "Majority White")) +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="right")

The boxplots below indicate the distributions of continuous variables across blockgroups with different race contexts. Similarly, it can be observed that blocks where white people dominated generally have greater percentage of residents holding bachelor’s degree or above, shorter distance to colleges, well-known companies, hospitals, open spaces, parks, tourist spots and Wholefoods, and greater median household income.

IncomeLevel <- 
  st_set_geometry(SANF.new, NULL) %>%
  dplyr::select(IncLev, SalePrice, LotArea.x, PropArea.x, Rooms.x, Baths.x, houseage, Beds, 
                disInt1, discol, disHosp, dissch, disMix, disVisit, disPark1, disOsp, disWhole,
                disComp, assault, transit, preS, MedHHinc, VacPct, WhitePct, AboveBach,lagPrice2) %>%
  gather(variable, value, SalePrice:lagPrice2)

ggplot(data = IncomeLevel, aes(IncLev, value)) +
  geom_boxplot(aes(fill=IncLev)) +  
  facet_wrap(~variable,scales="free",ncol=5) +
  scale_fill_manual(values = c("#FFD364", "#FF8F80", "#D16BA5"),
                    name = "Income Level",
                    labels=c("Low Income", "Middle Income", "High Income")) +
  labs(title="Variable distributions across census tracts with different income levels",
       x="Income Level",
       y="Value") +
  theme(plot.title = element_text(size = 18,colour = "black"))

The table below shows the MAPE of the predicted sale price across different race contexts. The slight MAPE difference between white-dominant places and nonwhite-dominant places suggests that our model predicts pretty well across different race contexts.

Split.race <-
  SF.test2%>%
  group_by(White) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(White, mean.MAPE)
  kable(caption = "Mean Absolute Error of test set sales by neighborhood racial context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(color = "black", background = "#25CB10")

5. Discussion

Our model is a good model overall. It can explain 71% of variations of the sale prices in San Francisco and its mean absolute percent error is only 22.47%. In addition, the 100 k fold cross-validation test has proved that our model is pretty generalizable. The Moran’s I index suggests our model has no strong spatial auto-correlation, and the spatial phenomenon is generally randomly distributed after we included the neighborhood effect.

Internal characteristics of the property are very important predictors in the model because they describe the quality of houses. We also found that some variables are interesting and also effective predictors for housing prices in the service and amenities category, such as distance to hospitals and distance to well-known headquarters. This suggests that the proximity to amenities and service is something people really value when they make investment decisions in the housing market.

Demographic characteristics like median household income, percentage of the white population, percentage of people with bachelor’s degrees and above are all crucial predictors in the model because they describe social and economic characteristics of place that the property located in.

It is worth to mention that lag price, which is the average price of the property’s nearest neighbors, is also a very powerful predictor. It reflects that housing with similar prices are likely to cluster together. This characteristic can also be captured in the neighborhood effect, as a neighborhood usually shares similar socio-economic characteristics. When people purchase a house, they are also buying the neighborhood environment.

The limitation of our model mainly lies in its generalizability. The performance of this prediction model varies across different contexts. While it predicts pretty well in middle- and high-income places as well as places with different race contexts, its predicting power dims in poor places. The greater MAPE in low-income places suggests that there might be some factors that low-income residents care more about but neglected by this model. To improve this model in the future, more factors related to property sale prices should be included. A larger dataset with more observations would also help enhance this model

6. Conclusion

In conlusion, we recommend that Zillow should adopt this prediction model since it is considerably effective and generalizable. To further enhance this model’s effectiveness, a larger dataset could be used. To further improve its generaliazibility, more factors related to property sale prices should be included so that the model could have greater predicting power across different contexts and even different cities.