1. Introduction

The mismatch between parking demand and parking space availability in big cities can easily frustrate citizens. In Seattle, the business-as-usual approach to allocating street parking space follows the minimum requirements regulated by Seattle's Land Use Code. However, these minimum requirements do not necessarily correspond to parking demand, and both over- and undersupply of parking spaces can lead to inefficient curbside management.

It is not rare for a car to be parked in a bike lane because the driver cannot find a space to park, which can be dangerous for both drivers and cyclists. It is also a waste of public space if the city oversupplies street parking, because those spaces could be dedicated to bike lanes or public realm improvements. To help Seattle minimize the gap between parking demand and supply and manage its curbside more efficiently, we developed this model to predict parking demand in Seattle's CBD area.

Developing such a model is not easy because parking is very dynamic. The demand for parking can change drastically with time and location: weekdays, holidays, events, and weather can all affect people's travel patterns and thus parking demand. In this model, we try to capture as many spatial and time components as we can. Socio-economic factors were also taken into consideration because different neighborhoods may have different travel patterns.

Our final model explains about 79% of the variation in parking demand. When tested on the test set, the model has a mean absolute error of 1.6, which is relatively small. K-fold cross-validation was conducted to measure the model's performance and to make sure the model is generalizable. Lastly, we also designed an application based on this model, and we hope the use case we developed could help Seattle allocate street space more efficiently.

A quick introduction video about this project can be watched here: https://www.youtube.com/watch?v=Cur6h3eYHUk&feature=youtu.be

Map theme, plot theme, and function set-up.

# ggplot2 theme for maps: hides axes, ticks, and gridlines
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
} 

# ggplot2 theme for charts: light gridlines with visible axis text
plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
} 
paletteY <- c("#F9F871","#FFD364","#FFAF6D","#FF8F80","#F87895","#D16BA5")
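Several of the maps below also reference a five-color palette, palette5, which is not defined in the snippet above. A plausible definition, assuming it draws from the same color family as paletteY, is:

palette5 <- c("#F9F871", "#FFAF6D", "#FF8F80", "#F87895", "#D16BA5")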

# Returns quantile break labels for map legends; rnd = FALSE keeps decimals
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

# Bins a continuous variable into quintiles (1-5) as a factor
q5 <- function(variable) {as.factor(ntile(variable, 5))}

2. Data

The primary dataset we use is the Annual Parking Study data, accessible through Seattle's Open Data Portal. After cleaning, the dataset contains about 110,000 rows. Because the dataset is non-spatial, we joined it with Seattle's blockface data to locate each street parking segment, as sketched below. The combined dataset includes information on time, parking spaces available, total vehicle count, and parking rate.
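A minimal sketch of this join, assuming the cleaned study records sit in a data frame named parking_study (a hypothetical name) keyed by ELMNTKEY:

library(dplyr)
library(sf)

# Attach blockface geometries to the non-spatial parking study records
parking_sf <- parking_study %>%
  left_join(blockface, by = "ELMNTKEY") %>%  # ELMNTKEY identifies each blockface
  st_sf()                                    # convert the result to an sf object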

The predictive model we build is a linear regression model. The dependent variable is the total vehicle count for a street parking segment, and we shortlisted 21 independent variables as candidate predictors. They are factors that may affect parking demand, and we divided them into four categories:

  • Street Parking Features: Specific details about the parking space at the time it was recorded.
  • Spatial Features: Distance to, or the number of, amenities/land use types close to the parking segment (see the sketch after this list).
  • Socio-economic Features: Demographics and commute pattern statistics at the census tract level.
  • Spatial and Time Structure: Spatial lag – the average occupancy rate of nearby road segments. Time lag – the vehicle counts at a similar or recent time period.
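The distance-based spatial features can be computed with sf. A minimal sketch for one such feature, assuming projected (metric) geometries and a hypothetical hospitals point layer:

library(sf)

# Distance (in meters) from each blockface centroid to its nearest hospital
dist_matrix <- st_distance(st_centroid(blockface), hospitals)
blockface$dishosp <- apply(dist_matrix, 1, min)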

2.1 Data Overview

2.1.1 Seattle Annual Parking Study

The animated plot below shows the average hourly occupancy rate of street parking. One can see that the hourly parking occupancy rate changes significantly throughout the day.

Note: Because the parking study is not perfectly consistent across times and locations, some street parking segments are missing data for some hours.

# Mean hourly occupancy per blockface in 2018, binned for mapping
occu.animation.element <-
  new6 %>% filter(year == 2018) %>%
  group_by(hour(`Date Time`), ELMNTKEY) %>%
  summarize(mean_occu = mean(occup, na.rm=T)) %>% 
  left_join(blockface,by="ELMNTKEY" ) %>%
  st_sf() %>%
  mutate(Occupancy = case_when(mean_occu == 0 ~ "0%",
                               mean_occu > 0 & mean_occu <= 60 ~ "< 60%",
                               mean_occu > 60 & mean_occu <= 85 ~ "60-85%",
                               mean_occu > 85 & mean_occu <=100 ~ "85-100%",
                               mean_occu > 100 ~ ">100% (over-parked)")) %>%
  mutate(Occupancy  = factor(Occupancy, levels=c("0%", "< 60%","60-85%","85-100%",">100% (over-parked)")))

occupancy_animation2 <-
  ggplot() +
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data=occu.animation.element, aes(colour=Occupancy), show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5, 
                      name = "Occupancy") +
  labs(title = "CBD Parking Mean Occupancy by hour in 2018, Seattle",
       subtitle = "One-hour intervals: {current_frame}") +
  transition_manual(`hour(\`Date Time\`)`) +
  xlim(385000, 389300) + ylim(67000, 73000)+
  mapTheme()

2.1.2 Census data

The four plots below show the socio-economic independent variables at the census tract level. Perhaps due to higher-density development in the city center, population density is highest in the CBD. The total number of housing units shares the same pattern. Median household income also varies across tracts, with some tracts a little more well off than others. Lastly, it is quite surprising that nearly all the census tracts in our study area have a very low percentage of households with two or more vehicles.

grid.arrange(
  ggplot() + 
  geom_sf(data = seattle_acs, aes(fill = q5(POPden)), color = "white") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data = SA,fill="white", colour= "#F39C12",alpha=0,size=1)+
  scale_fill_manual(values = paletteY,
                    labels = qBr(seattle_acs, "popden2", rnd = F),
                    name="Quintile\nBreaks") +
  xlim(382397.57, 389575.90) + ylim(61374.50, 74610.66)+
  labs(title = "Population density (per square KM) by census tract (2017)") +
  mapTheme(),
  
ggplot() + 
  geom_sf(data = seattle_acs, aes(fill = q5(HUtot)), color = "white") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data = SA,fill="white", colour= "#F39C12",alpha=0,size=1)+
  scale_fill_manual(values = paletteY,
                    labels = qBr(seattle_acs, "HUtot", rnd = F),
                    name="Quintile\nBreaks") +
  xlim(382397.57, 389575.90) + ylim(61374.50, 74610.66)+
  labs(title = "Total housing units by census tract (2017)") +
  mapTheme(),
  ncol = 2)

grid.arrange(
  ggplot() + 
  geom_sf(data = seattle_acs, aes(fill = q5(MedHHinc)), color = "white") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data = SA,fill="white", colour= "#F39C12",alpha=0,size=1)+
  scale_fill_manual(values = paletteY,
                    labels = qBr(seattle_acs, "MedHHinc", rnd = F),
                    name="Quintile\nBreaks") +
  xlim(382397.57, 389575.90) + ylim(61374.50, 74610.66)+
  labs(title = "Median Household Income by census tract (2017)") +
  mapTheme(),
  
ggplot() + 
  geom_sf(data = seattle_acs, aes(fill = q5(Veh_twomore)), color = "white") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data = SA,fill="white", colour= "#F39C12",alpha=0,size=1)+
  scale_fill_manual(values = paletteY,
                    labels = qBr(seattle_acs, "Veh_twomore", rnd = F),
                    name="Quintile\nBreaks") +
  xlim(382397.57, 389575.90) + ylim(61374.50, 74610.66)+
  labs(title = "Percentage of households with two or more vehicles\nby census tract (2017)") +
  mapTheme(),
  ncol = 2)

2.1.3 Seattle parcel data

The plot below shows land use at the parcel level in our study area. It is clear that the CBD is dominantly commercial, with some residential parcels scattered throughout, while the outer ring of our study area is mainly residential.

ggplot() + 
  geom_sf(data = SA_parcel %>% 
            filter(PROPTYPE=="C"|PROPTYPE=="K"|PROPTYPE=="R"), aes(color = PROPTYPE)) +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  scale_color_manual(values = paletteY) +
  xlim(382397.57, 389575.90) + ylim(61374.50, 74610.66)+
  labs(title = "Land Use at Parcel Level by census tract (2017)") +
  mapTheme()

2.1.4 Annual Traffic Flow

Traffic data for major roads was downloaded from Seattle's Open Data Portal. Average weekday traffic in 2018 is shown in the map below.
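The code for this map is not included here; a sketch consistent with the other maps, assuming a hypothetical traffic_flow sf layer with an average-weekday-traffic column named AWDT, might look like:

ggplot() +
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill = "#D5F8FC", colour = "#D5F8FC") +
  geom_sf(data = traffic_flow, aes(colour = q5(AWDT)), size = .75) +  # AWDT is assumed
  scale_colour_manual(values = paletteY, name = "Traffic\n(Quintile Breaks)") +
  xlim(385000, 389300) + ylim(67000, 73000) +
  labs(title = "Average Weekday Traffic in 2018, Seattle CBD") +
  mapTheme()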

2.1.5 Time lags & Spatial lags

Creating time lag variables adds nuance about parking demand during a given time period: the hours before and during that day. We first created a study panel with every unique date-time and every unique blockface, and used this panel to compute the parking demand of a specific blockface during the last hour, the last 2, 3, 4, and 12 hours, and the same time period one day and one week before.

# Full panel: every unique date-time crossed with every unique blockface
study.panel <- 
  expand.grid(`Date Time` = unique(SA.dat3$`Date Time`), 
              ELMNTKEY = unique(SA.dat3$ELMNTKEY)) 

SA.panel <- SA.dat3 %>%
  right_join(study.panel)

st_geometry(SA.panel) <- NULL
    
SA.panel <- SA.panel %>%     
  group_by(`Date Time`, ELMNTKEY) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%                                 
  mutate(week = week(`Date Time`),
         dotw = wday(`Date Time`, label = TRUE))

SA.panel <- SA.panel %>%
  mutate(year = year(`Date Time`),
         month = month(`Date Time`)) %>%
  arrange(ELMNTKEY, `Date Time`) %>%   # sort within each blockface before lagging
  group_by(ELMNTKEY) %>%               # lags must not cross blockface boundaries
  mutate(lagHour = dplyr::lag(Total_Vehicle_Count,1),
         lag2Hours = dplyr::lag(Total_Vehicle_Count,2),
         lag3Hours = dplyr::lag(Total_Vehicle_Count,3),
         lag4Hours = dplyr::lag(Total_Vehicle_Count,4),
         lag12Hours = dplyr::lag(Total_Vehicle_Count,12),
         lag1day = dplyr::lag(Total_Vehicle_Count,24),
         lag1week = dplyr::lag(Total_Vehicle_Count,168)) %>%
  ungroup()

We also created a spatial lag variable: the average parking occupancy of the five nearest blockfaces, for each blockface during each hour of each unique date. We then transformed this occupancy lag into a parking demand lag by multiplying the mean occupancy rate of the five nearest blockfaces by the total parking spaces of the blockface in question, which makes more sense because nearby blockfaces may have different parking capacities (see the sketch after the loop below).

# For each unique date-time, average the occupancy of the 5 nearest blockfaces
new2 <- data.frame()

for (i in unique(test$`Date Time`)) {
  trim_set <- test %>% filter(`Date Time` == i)
  coords <- st_coordinates(trim_set)
  nborList <- knn2nb(knearneigh(coords, 5))          # 5 nearest neighbors (spdep)
  spatialWeights <- nb2listw(nborList, style="W")    # row-standardized weights
  trim_set$lagOCU <- lag.listw(spatialWeights, trim_set$occup)
  new2 <- rbind.data.frame(new2, trim_set)
}
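The loop above yields only the occupancy lag (lagOCU); the conversion to a demand lag described earlier is not shown. A minimal sketch, assuming occupancy is recorded as a percentage and capacity is stored in Parking_Spaces:

# Scale the neighbors' mean occupancy rate by this blockface's own capacity
# to express the spatial lag as an expected vehicle count
new2 <- new2 %>%
  mutate(lagSpatial = lagOCU / 100 * Parking_Spaces)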

2.1.6 Overview

Summary statistics for all the independent variables are presented in the table below.
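The code producing that table is not included; one way to generate it, for example with the stargazer package (an assumption, since the original tool is not shown), would be:

library(stargazer)

# Drop the geometry column, then print a text summary table of the predictors
stats.df <- new6 %>% st_set_geometry(NULL)
stargazer(as.data.frame(stats.df), type = "text",
          title = "Summary Statistics of Independent Variables")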

2.2 Correlation Matrix

A correlation matrix helps us visualize whether any of our independent variables are strongly correlated with each other and therefore carry similar predictive power. If two variables are strongly correlated, one should consider removing one of them from the model.

According to the correlation matrix below, the percentage of the population that drives to work and the percentage of households with two or more cars are highly correlated, so we should consider including only one of them in the model. The time lag variables also have relatively strong correlations with each other, because the occupancy rate does not change drastically within a few hours.

new.cor <- subset(new6, select = c(Total_Vehicle_Count, Temperature,Precipitation,Wind_Speed,week,lagHour,lag2Hours,
                                    lag3Hours,lag4Hours,lag1day,lag1week,SHAPE_Leng,Parking_Spaces,
                                parking_rate, Resi,Retail,distour,discomm,dishosp,disoff,disrest,disschl,
                                    disterm,HUtot,POPden,MedHHinc,VacPct,Veh_one,
                                    Veh_twomore,Pct_drive,traffic,dis_road,lagSpatial ))
new.cor <- new.cor %>% 
  st_set_geometry(NULL)

cormatrix <- cor(new.cor, use = "pairwise.complete.obs")

ggcorrplot(cormatrix,
           colors = c("#DCE319FF","white","#287D8EFF"), 
           title="Correlation Matrix (Correlogram) of Continuous Variables",
           ggtheme=mapTheme)

2.3 Selected independent variables vs. total vehicle count

For exploratory purposes, we also made correlation plots for the selected independent variables. The first series of plots shows the relationship between each spatial independent variable and total vehicle count. According to the plots, all these variables have a relatively weak correlation with total vehicle count. Compared to the others, distance to commercial parcels, hospitals, offices, and tourist attractions have a relatively stronger correlation with the dependent variable.

as.data.frame(new6) %>%
  dplyr::select(Total_Vehicle_Count,distour,discomm,dishosp,disoff,disrest,disschl,
                disterm,dis_road) %>%
  gather(Variable, Value, -Total_Vehicle_Count) %>% 
  ggplot(aes(Value, Total_Vehicle_Count)) +
  geom_point(colour = "#B8DE29FF", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#20A387FF") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Park Count as a function of distance (continuous) variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

As for socio-economic variables, the factors more directly linked to cars or travel patterns have a stronger correlation with total vehicle count: for example, traffic volume, the percentage of the population that drives to work, and the percentage of households with two or more cars.

as.data.frame(new6) %>%
  dplyr::select(Total_Vehicle_Count,HUtot,POPden,MedHHinc,VacPct,Veh_one,
                Veh_twomore,Pct_drive,traffic) %>%
  gather(Variable, Value, -Total_Vehicle_Count) %>% 
  ggplot(aes(Value, Total_Vehicle_Count)) +
  geom_point(colour = "#B8DE29FF", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#20A387FF") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Park Count a function of social_economic (continuous) variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

It is easy to understand why the number of parking spaces and the blockface length are highly correlated with the total vehicle count: they define the existing parking capacity.

as.data.frame(new6) %>%
  dplyr::select(Total_Vehicle_Count,Temperature,Precipitation,Wind_Speed,SHAPE_Leng,Parking_Spaces,parking_rate,Resi,Retail) %>%
  gather(Variable, Value, -Total_Vehicle_Count) %>% 
  ggplot(aes(Value, Total_Vehicle_Count)) +
  geom_point(colour = "#B8DE29FF", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#20A387FF") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Park Count as a function of selected (continuous) variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

The time lag and spatial lag show impressive correlations with total vehicle count, which implies they can be very strong predictors for our model. The number of cars parked on a street is highly correlated with the occupancy rate of its neighboring street parking segments and with the number of cars parked on the same street in the past hour.

as.data.frame(new6) %>% 
  dplyr::select(Total_Vehicle_Count,lagSpatial,lagHour) %>%
  gather(Variable, Value, -Total_Vehicle_Count) %>% 
  ggplot(aes(Value, Total_Vehicle_Count)) +
  geom_point(colour = "#B8DE29FF", size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#20A387FF") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Park Count as a function of time and spatial lag variables") +
  theme(axis.text.x = element_blank()) +
  plotTheme()

3. Method

Data gathering and wrangling was our first step in building the model, completed using both R and GIS. In this step we calculated distances and performed spatial joins for all of our variables, and removed extreme outliers from the dataset (one possible rule is sketched below). The resulting table contains both our independent and dependent variables. Once the predictors were ready, we built our linear regression model, testing different combinations of independent variables to get the most desirable prediction result. Finally, we ran cross-validation to measure the performance of the model.
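The exact outlier rule is not recorded here; a minimal sketch of one plausible approach, dropping records above an extreme percentile of the dependent variable:

# Hypothetical outlier rule: drop records whose vehicle count exceeds the
# 99.9th percentile (the threshold actually used is an assumption)
cutoff <- quantile(new6$Total_Vehicle_Count, 0.999, na.rm = TRUE)
new6 <- new6 %>% filter(Total_Vehicle_Count <= cutoff)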

4. Results

4.1 In-sample model results

We randomly split our dataset into a 60% training set and a 40% test set. Our prediction model is trained on the former and tested on the latter.

# Stratified 60/40 split on property type, day of week, and paid-area flag
inTrain <- createDataPartition (
  y = paste(new6$PROPTYPE, new6$dotw, new6$PAIDAREA), 
  p = .60, list = FALSE)
new.training <- new6[inTrain,] 
new.test <- new6[-inTrain,] 

reg1 <- 
  lm(Total_Vehicle_Count ~  hour(`Date Time`) + dotw + Temperature + Precipitation + year + month
       + lagHour + SHAPE_Leng + parking_rate + Resi+ Retail
       + distour +discomm +dishosp +disoff+disrest+MedHHinc+VacPct+ Veh_twomore
       + traffic +lagSpatial,
     data=as.data.frame(new.training))

summary(reg1)

We shortlisted 21 independent variables as predictors in our linear regression model. As a result, our model has an R-squared of 78.74%, meaning that nearly 80% of the variance in parking demand can be explained by the predictors in our model. In addition, the MAE of our model is 1.60 while the MAPE is 40.65%. One reason for such a high MAPE is that our dependent variable is right-skewed (see the distribution histogram below): parking counts cluster below 10, and only a few segments have a parking capacity of over 15. A very small absolute error can therefore still make the percentage error high, since the denominator (the observed parking count) is small.

new.test2 <-
  new.test %>%
  mutate(Park_Count.Predict=predict(reg1, as.data.frame(new.test)),
         Park_Count.Error=Total_Vehicle_Count - Park_Count.Predict,
         Park_Count.AbsError=abs(Total_Vehicle_Count - Park_Count.Predict),
         Park_Count.APE = (abs(Total_Vehicle_Count - Park_Count.Predict)) / Total_Vehicle_Count) 

as.data.frame(new.test2) %>%
  summarize(mean(Park_Count.AbsError, na.rm = T)) %>%
  pull()

as.data.frame(new.test2) %>%
  summarize(mean(Park_Count.APE, na.rm = T) * 100) %>%
  round(2) %>%
  paste("%")

The detailed result of the model is shown in the summary table below. Among all the independent variables, some of the most important ones are time lags, length of the street segment, parking rate, number of retail stores within the 250m buffer zone, distance to the nearest commercial place, distance to the nearest hospital, distance to the nearest office building, percentage of residents with two or more vehicles in the census tract, traffic of the nearest major road, and spatial lag (occupancy of nearby parking places).

4.2 100-Fold Cross-Validation

To ensure the generalizability of our model, we also conducted 100-fold cross-validation. Cross-validation is a resampling process that first shuffles the data randomly, then splits it into 100 groups. One group is held out while the remaining groups form the training set; the model is fitted on the training set and tested against the holdout. The process repeats until every group has served as the holdout, and the model's performance is recorded each time.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- train(Total_Vehicle_Count ~  hour(`Date Time`) + dotw + Temperature + Precipitation + year + month
                + lagHour + SHAPE_Leng + parking_rate + Resi+ Retail
                + distour +discomm +dishosp +disoff+disrest+MedHHinc+VacPct+ Veh_twomore
                + traffic +lagSpatial, data= as.data.frame(new6),
                 method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
sd(reg.cv$resample[,3])

The result of cross-validation is shown below:

The Root Mean Squared Error (RMSE) is 2.38, which can be interpreted as the standard deviation of the prediction errors. The distribution of MAE across the 100 folds is shown in the histogram below. The histogram and RMSE both suggest that there is still considerable variation in the MAEs. The R-squared figure implies that, on average, around 78.6% of the variance in parking counts can be explained by the model.

ggplot(as.data.frame(reg.cv$resample), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#B8DE29FF") +
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

4.3 Predicted parking demand

The scatter plot below shows the relationship between the predicted and observed parking counts. It suggests that our model tends to underpredict for parking segments with counts over 7 and to slightly overpredict for segments with counts under 7.

ggplot(new.test2, aes(Park_Count.Predict, Total_Vehicle_Count)) +
  geom_point(size = 0.75, colour = "black") +
  stat_smooth(data=new.test2, aes(Total_Vehicle_Count, Total_Vehicle_Count),
              method = "lm", se = FALSE, size = 1, colour="#20A387FF") +
  stat_smooth(data=new.test2, aes(Total_Vehicle_Count, Park_Count.Predict),
              method = "lm", se = FALSE, size = 1, colour="#B8DE29FF") +
  labs(title="Predicted parjing count as a function of observed parking count",
       subtitle="Green line represents a perfect prediction; Lime line represents prediction") +
  xlim(0, 40)+
  ylim(0, 40)+
  plotTheme()

Our model predicts parking demand relatively well: the distribution plot below shows that our predictions largely overlap with the observed parking counts. The gap between the two peaks implies that the model does not predict enough parking segments with counts between 2 and 7.

as.data.frame(new.test2) %>%
  dplyr::select(Total_Vehicle_Count,Park_Count.Predict) %>%
  gather(Variable, Value) %>%
  ggplot(aes(Value, fill = Variable)) + 
  geom_density(alpha = 0.5) + 
  scale_fill_manual(values = c("#B8DE29FF", "#20A387FF")) +
  labs(title="Distribution of parking count & parking predictions", 
       x = "Parking Count/Prediction", y = "Density of observations") +
  xlim(0, 40)+
  plotTheme()

The histogram below illustrates that the model predicts parking demand quite well, as most of the errors are quite small and only a few cases have very large errors.

ggplot(filter(new.test2, Park_Count.AbsError <= 10), aes(Park_Count.AbsError)) +
  geom_histogram(bins = 100, colour = "white", fill="#B8DE29FF") +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  labs(title="Distribution of prediction errors", subtitle = "Errors > 10 are omitted") +
  plotTheme()

The time series plot below gives a more detailed view of our prediction model's performance, suggesting that it also predicts well over time.

new.test2 %>%
  filter(is.na(Park_Count.Predict) == FALSE) %>%
  mutate(Observed = Total_Vehicle_Count,
         Prediction = Park_Count.Predict) %>%
  dplyr::select(`Date Time`, Observed, Prediction) %>%
  unnest() %>%
  gather(Variable, Value, -`Date Time`) %>%
  group_by(Variable, `Date Time`) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(`Date Time`, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  labs(title = "Predicted/Observed parking counts time series", subtitle = "Seattle; test set as 40% of original data",  x = "Date Time", y= "Parking Count") +
  plotTheme()

The map below shows the overall spatial distribution of errors by parking segment. Our model predicts quite well for street segments with higher parking occupancy rates, especially where commercial activities concentrate, but the errors are greater in the periphery of the CBD, where there are more residential parcels and parking occupancy drops.

as.data.frame(new.test2) %>% 
  filter(is.na(Park_Count.Predict) == FALSE) %>%
  dplyr::select(ELMNTKEY, Park_Count.AbsError) %>%
  unnest() %>%
  group_by(ELMNTKEY) %>%
  summarize(MAE = mean(Park_Count.AbsError, na.rm = TRUE)) %>%
  left_join(blockface,by="ELMNTKEY" )%>%
  ggplot(.)+
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(aes(colour=q5(MAE)), show.legend = "point", size = .75)+
  scale_colour_manual(values = palette5, 
                      name = "Mean Absolute Error") +
  xlim(385000, 389300) + ylim(67000, 73000)+
  labs(title="Mean Abs Error, Test Set")+
  mapTheme()

Breaking down the errors by time period, the faceted map below shows that errors are greater in the afternoon, overnight, and on Sundays, although these differences are not very large.

as.data.frame(new.test2) %>% 
  filter(is.na(Park_Count.Predict) == FALSE) %>%
  dplyr::select(ELMNTKEY, Park_Count.AbsError, period, dotw) %>%
  unnest() %>%
  group_by(ELMNTKEY, period) %>%
  summarize(MAE = mean(Park_Count.AbsError, na.rm = TRUE)) %>%
  left_join(blockface,by="ELMNTKEY" )%>%
  ggplot(.)+
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(aes(colour=q5(MAE)), show.legend = "point", size = .75)+
  scale_colour_manual(values = palette5, 
                      name = "Mean Absolute Error") +
  facet_grid(~period)+
  xlim(385000, 389300) + ylim(67000, 73000)+
  labs(title="Mean Abs Error, Test Set")+
  mapTheme()

The two animations below compare the observed parking count by hour in 2018 with the predicted parking demand.

vehcount.animation.data <-
  new6 %>% filter(year == 2018) %>%
  group_by(hour(`Date Time`), ELMNTKEY) %>%
  summarize(mean.vehiclecount = mean(Total_Vehicle_Count, na.rm=T)) %>% 
  left_join(blockface,by="ELMNTKEY" ) %>%
  st_sf() %>%
  mutate(Veh_Count = case_when(mean.vehiclecount == 0 ~ "0",
                               mean.vehiclecount > 0 & mean.vehiclecount <= 3 ~ "< 3",
                               mean.vehiclecount > 3 & mean.vehiclecount <= 5 ~ "3-5",
                               mean.vehiclecount > 5 & mean.vehiclecount <=8 ~ "5-8",
                               mean.vehiclecount > 8 ~ "> 8")) %>%
  mutate(Veh_Count  = factor(Veh_Count, levels=c("0", "< 3","3-5","5-8","> 8")))

vehcount_animation <-
  ggplot() + 
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data=vehcount.animation.data, aes(colour=Veh_Count), show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5, 
                      name = "Observed Parking Count") +
  labs(title = "Observed Parking Count by hour in 2018, Seattle",
       subtitle = "One-hour intervals: {current_frame}") +
  transition_manual(`hour(\`Date Time\`)`) +
  xlim(385000, 389300) + ylim(67000, 73000)+
  mapTheme()

animate(vehcount_animation, duration=20)

prediction.animation.data <-
  new.test2 %>% filter(year == 2018) %>%  # predictions exist only for the test set
  group_by(hour(`Date Time`), ELMNTKEY) %>%
  summarize(mean.Park_Count.Predict = mean(Park_Count.Predict, na.rm=T)) %>% 
  left_join(blockface,by="ELMNTKEY" ) %>%
  st_sf() %>%
  mutate(ParkDemand = case_when(mean.Park_Count.Predict == 0 ~ "0",
                                mean.Park_Count.Predict > 0 & mean.Park_Count.Predict <= 3 ~ "< 3",
                                mean.Park_Count.Predict > 3 & mean.Park_Count.Predict <= 5 ~ "3-5",
                                mean.Park_Count.Predict > 5 & mean.Park_Count.Predict <=8 ~ "5-8",
                                mean.Park_Count.Predict > 8 ~ "> 8")) %>%
  mutate(ParkDemand  = factor(ParkDemand, levels=c("0", "< 3","3-5","5-8","> 8")))

prediction_animation <-
  ggplot() + 
  geom_sf(data = studyarea, color = "White") +
  geom_sf(data = water, fill ="#D5F8FC" , colour="#D5F8FC") +
  geom_sf(data=prediction.animation.data, aes(colour=ParkDemand), show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5, 
                      name = "Predicted Parking Count") +
  labs(title = "Predicted Parking Demand by hour in 2018, Seattle",
       subtitle = "One-hour intervals: {current_frame}") +
  transition_manual(`hour(\`Date Time\`)`) +
  xlim(385000, 389300) + ylim(67000, 73000)+
  mapTheme()

animate(prediction_animation, duration=20)

4.4 Generalizability

We did not conduct a neighborhood-effect analysis because the CBD area is small and fairly homogeneous, and more importantly, our model is a space-time prediction model. To test the model's generalizability, we made the four plots below to explore the relationship between errors and selected socio-economic factors.

It can be observed that places with greater median household income, a greater percentage of residents driving to work, a greater percentage of households with two or more vehicles, and lower population density are harder for the model to predict. However, the variation is not very large, suggesting that our model is, in general, generalizable across different contexts.

as.data.frame(new.test2) %>% 
  filter(is.na(Park_Count.Predict) == FALSE) %>%
  dplyr::select(ELMNTKEY, Park_Count.AbsError, POPden, MedHHinc, Pct_drive, Veh_twomore) %>%
  unnest() %>%
  group_by(ELMNTKEY, POPden, MedHHinc, Pct_drive, Veh_twomore) %>%
  summarize(MAE = mean(Park_Count.AbsError, na.rm = TRUE)) %>%
  left_join(blockface,by="ELMNTKEY" )%>%
  dplyr::select(ELMNTKEY, MAE, POPden, MedHHinc, Pct_drive, Veh_twomore) %>%
  gather(-ELMNTKEY, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, colour = "#B8DE29FF")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Parking Count)")+
  plotTheme()

5. Discussion

Our model performs well overall. It explains 79% of the variation in parking counts in Seattle's CBD area, and its mean absolute error is only 1.6. In addition, the 100-fold cross-validation test and the correlation of errors with socio-economic factors both suggest that our model is fairly generalizable, although not perfectly so.

Some of the most important independent variables affecting parking demand include the time lags, the length of the street segment, the parking rate, the number of retail stores within the 250m buffer zone, distance to the nearest commercial parcel, hospital, and office building, the percentage of residents with two or more vehicles in the census tract, the traffic of the nearest major road, and the spatial lag (occupancy of nearby parking places). It follows that when the built environment or road characteristics change, such as a traffic increase, a road expansion, new commercial or office development, or a parking rate increase, parking demand on nearby segments would be affected.

The limitation of our model mainly lies in its generalizability. Its performance varies slightly across contexts: while it predicts quite well in places with lower median household income, higher population density, a lower percentage of households with two or more cars, and a lower percentage of residents driving to work, its predictive power drops a little in places with the opposite features. The greater MAE in residential places suggests there may be factors, such as monthly parking permits, that this model neglects. It is also important to bear in mind that our model covers only Seattle's CBD area due to the limited data available. To improve this model in the future, more factors related to parking demand should be included, and a larger dataset with observations from more places would also help.

6. Conclusion

In conclusion, we recommend using our model to predict parking demand in Seattle's CBD area. We believe an app can be further developed upon our model to serve as a tool for government officials and urban planners to estimate parking demand, especially when new development happens or road characteristics change, helping them make better decisions and allocate resources more efficiently.

Three demonstration pages of this app wireframe are shown below. The welcome page includes two sections: a parking occupancy overview and a parking demand prediction.

On the parking overview page, users can specify which blockface they would like to look at, and the page will display both real-time and historical parking occupancy data. The parking rate can also be obtained by indicating the start and end time of the parking period.

In the parking demand prediction interface, a series of widgets allows users to input possible changes to a specified area, including land use change, building construction, road improvement, and parking rate change. Based on the new input, the parking demand of nearby blockfaces during different time periods will be calculated and displayed.
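A minimal sketch of how such an interface could be wired up with Shiny is shown below; the widget set and the way predictions would be refreshed from the fitted model are illustrative assumptions, not the final design.

library(shiny)

ui <- fluidPage(
  titlePanel("Parking Demand Prediction (sketch)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("landuse", "Land use change",
                  c("No change", "More residential", "More commercial")),
      numericInput("rate", "New parking rate ($/hour)", value = 3, min = 0),
      sliderInput("traffic", "Traffic change (%)", min = -50, max = 50, value = 0)
    ),
    mainPanel(textOutput("demand"))
  )
)

server <- function(input, output) {
  output$demand <- renderText({
    # A full app would update the predictor columns of the affected blockfaces
    # and re-run predict(reg1, ...) to refresh the displayed demand
    paste("Land use:", input$landuse, "| Rate:", input$rate,
          "| Traffic change:", input$traffic, "%")
  })
}

shinyApp(ui, server)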