Time/Space prediction - Bike Share Trips in New York City

1. Executive Summary

A successful bike-sharing system is one that allocates its resources, the bikes, efficiently. In other words, there should be neither excess bikes at stations where demand is low nor too few bikes at stations where demand is high. One major operational challenge therefore lies in predicting demand and re-balancing bikes accordingly.

This project develops a space-time predictive model that anticipates bike share trip demand in New York City across different time periods and places, aiming to give the bike share system manager a tool for allocating bikes more efficiently.

The model developed in this project performs reasonably well as measured by its mean absolute error (MAE). However, it does not generalize perfectly across contexts, places, or time periods: errors are greater during rush hours and for stations or areas with higher trip counts. Spatial lags could be added to improve its predictive power, on the assumption that bike share trip counts are not randomly distributed across space but tend to be correlated with trip counts at nearby stations.

2. Data

2.1 Citi Bike trip data

The bike share trips in New York City in July 2019 are loaded from the Citi Bike trip data (https://www.citibikenyc.com/system-data), which records the start time and date, stop time and date, duration, start station, end station, and station coordinates of each Citi Bike trip. There were two special events in July: the July 4th holiday on a Thursday, and the July 13th blackout in west Manhattan from 7PM to midnight on a Saturday.

The plot below shows the bike share trips aggregated by hour in July 2019, with each Monday marked by a grey line, July 4th by a red line, and the blackout by a blue line. More trips take place on weekdays than on weekends or on the holiday, suggesting that many people use Citi Bike primarily to commute.
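The hourly aggregation behind a plot like this can be sketched in base R as follows. This is a minimal illustration, not the project's own code: the data frame `trips`, its `starttime` column, the helper `floor_time`, and the four timestamps are all hypothetical. The name `interval60` matches the variable used in the model formula later in the report, while `interval15` is an assumed name for the 15-minute bins.

```r
# Hypothetical trip records; the real Citi Bike files have one row per trip.
trips <- data.frame(
  starttime = as.POSIXct(
    c("2019-07-04 08:05:00", "2019-07-04 08:40:00",
      "2019-07-04 09:10:00", "2019-07-04 17:59:59"),
    tz = "UTC"
  )
)

# Floor each timestamp to the start of its bin (3600 s = one hour).
floor_time <- function(x, seconds = 3600) {
  as.POSIXct(floor(as.numeric(x) / seconds) * seconds,
             origin = "1970-01-01", tz = "UTC")
}

trips$interval60 <- floor_time(trips$starttime)
trips$interval15 <- floor_time(trips$starttime, seconds = 900)

# Count the trips that start in each hourly bin.
hourly <- aggregate(list(Trip_Count = trips$starttime),
                    by = list(interval60 = trips$interval60),
                    FUN = length)
hourly
```

The same helper with `seconds = 900` gives 15-minute bins, so one function can serve both the hourly series and a finer-grained animation.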

The series of maps below visualizes the total number of bike share trips in each census tract for each hour of the day. Most trip origins are clustered in midtown and lower Manhattan, especially Chelsea and Greenwich Village, and around Central Park. Moreover, trips are concentrated between 7AM and 8PM, as indicated by the darker shades.

The animation below shows trip counts in each census tract for each 15-minute interval over one day in New York City. It shows not only which areas are popular trip origins in each interval but also how they change over time.

The faceted time-series map below shows the trip count of each Citi Bike station, instead of each census tract, during AM Rush (7AM – 10AM), Mid-Day (10AM – 3PM), Overnight (6PM – 7AM), and PM Rush (3PM – 6PM). PM Rush is the busiest period of the day, and the popular origin stations are those just south of Central Park and in midtown and lower Manhattan.
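The four time-of-day periods can be assigned from the hour of the trip start, as in the sketch below. The function name `time_of_day` is hypothetical, and treating each period as closed on the left and open on the right (so that 10AM falls in Mid-Day rather than AM Rush) is an assumption, since the report does not say how boundary hours are handled.

```r
# Map an hour of day (0-23) to one of the four periods used in the maps.
# Assumption: each period includes its starting hour and excludes its end.
time_of_day <- function(hour) {
  ifelse(hour >= 7 & hour < 10, "AM Rush",
  ifelse(hour >= 10 & hour < 15, "Mid-Day",
  ifelse(hour >= 15 & hour < 18, "PM Rush", "Overnight")))
}

time_of_day(c(6, 7, 12, 17, 18, 23))
# "Overnight" "AM Rush" "Mid-Day" "PM Rush" "Overnight" "Overnight"
```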

Breaking down bike share trips by day of the week, the plots below show the count of trips in each census tract from Sunday to Saturday. More trips take place from Monday to Wednesday, especially from Chelsea and Greenwich Village, as indicated by the generally darker shades, and the area south of Central Park (as seen in the maps above) is consistently a popular origin.

The two plots below show the trip counts by hour, broken down by day of the week and by weekday vs weekend. The morning and afternoon peaks on weekdays stand out clearly, and there are far more weekday trips than weekend trips.

2.2 Weather data

Weather data for New York City in July 2019 is imported from the Iowa Environmental Mesonet using the riem package. The plot below shows the hourly total precipitation, maximum wind speed, and maximum temperature in New York City in July 2019. There were two rainy days in the second half of the month and four significant temperature drops over the month.

The plots below show the trip count as a function of temperature. At first, temperature and trip counts appear to have a positive relationship, where a one-degree increase in temperature is associated with 59 more trips. However, once the hour of day is included as a fixed effect, the relationship becomes negative: a one-degree increase in temperature is associated with 12 fewer trips.
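The sign reversal can be reproduced on simulated data, which may help explain why it happens: both temperature and demand rise during the day, so a pooled regression credits temperature with the time-of-day effect. The sketch below is not the project's data or code. Every number in it is made up, and the variable names `sim`, `daylight`, `pooled`, and `with_hour` are hypothetical.

```r
set.seed(1)

# Simulated month of hourly observations (illustrative values only).
n_days <- 30
hour <- rep(0:23, times = n_days)

# A daytime curve: 0 at night, rising to 1 at noon.
daylight <- pmax(0, sin((hour - 6) / 24 * 2 * pi))

# Temperature follows the daytime curve plus hour-to-hour noise.
temp_noise <- rnorm(length(hour), sd = 3)
Temperature <- 70 + 15 * daylight + temp_noise

# Demand follows the daytime curve; within an hour, hotter means fewer trips.
Trip_Count <- 200 + 800 * daylight - 5 * temp_noise +
  rnorm(length(hour), sd = 20)

sim <- data.frame(hour, Temperature, Trip_Count)

# Pooled model: temperature absorbs the time-of-day effect.
pooled <- lm(Trip_Count ~ Temperature, data = sim)

# Hour-of-day fixed effects: the slope now reflects within-hour variation.
with_hour <- lm(Trip_Count ~ Temperature + factor(hour), data = sim)

coef(pooled)["Temperature"]     # positive in this simulation
coef(with_hour)["Temperature"]  # negative in this simulation
```

In this simulation the true within-hour effect of temperature is set to be negative, and only the model with `factor(hour)` recovers that sign.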

The bar charts below indicate that people generally ride less on rainy days, although week 31 (July 28th – July 31st) appears to be an outlier.

The plots of trip count against wind speed below do not tell a compelling story either, and including the hour of day as a fixed effect again reverses the correlation.

2.3 Census data

Census data, including socio-economic features of each census tract, is downloaded with the tidycensus package. Five variables are used: mean commute time, median age, median household income, percentage of residents taking public transportation, and percentage of white residents. Correlation plots are shown below. Because of the space/time nature of the data, the census variables are not appropriate as predictors in the model.

2.4 Time lags

Time lag variables are created to test whether trip demand during a given hour is correlated with demand in the previous hour, the previous few hours, or even the day before. Holiday lags are also created to test whether the holiday affects demand on the days before and after it. The correlation turns out to be strong: lagHour has a Pearson's R of 0.86. The correlation plots are shown below.
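A minimal base R sketch of how such lag variables could be built is given below. It is an illustration under stated assumptions rather than the project's code: the `panel` data frame, its values, and the helper `lag_k` are hypothetical, and rows are assumed to be sorted by time within each station. The column names `lagHour` and `lag2Hours` follow the model formula in the next section.

```r
# Shift a vector back by k positions, padding the start with NA.
lag_k <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Hypothetical panel: two stations with six hourly counts each,
# already sorted by time within station.
panel <- data.frame(
  station    = rep(c("A", "B"), each = 6),
  interval60 = rep(1:6, times = 2),
  Trip_Count = c(3, 5, 8, 6, 4, 2, 10, 12, 15, 11, 9, 7)
)

# Lag within each station so one station's counts never leak into another's.
panel$lagHour   <- ave(panel$Trip_Count, panel$station,
                       FUN = function(v) lag_k(v, 1))
panel$lag2Hours <- ave(panel$Trip_Count, panel$station,
                       FUN = function(v) lag_k(v, 2))

# Pearson correlation between demand and its one-hour lag,
# ignoring the rows where the lag is undefined.
r_lagHour <- cor(panel$Trip_Count, panel$lagHour, use = "complete.obs")
```

Longer lags such as `lag12Hours` or `lag1day` would use the same helper with `k = 12` or `k = 24`, provided every station has a row for every hour.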

3. Modeling

After splitting the data into a training set (the first half of the month, which includes the holiday and the blackout) and a test set (the second half of the month), four models with different combinations of variables are developed. All of the models share common variables, including hour of the day and day of the week. The first model adds the weather information and the origin station. The second model adds all of the time lags. The third model adds the origin station information, time lags, holiday factor, holiday lags, and the binary blackout variable. The final model includes all of the significant variables, as shown in the code below.

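# Final model: time, holiday, blackout, weather, time-lag, and station effects.
# hour() is assumed to come from the lubridate package.
reg5 <-
  lm(Trip_Count ~ hour(interval60) + dotw + holiday + holidayLag + blackout
     + Temperature + Wind_Speed
     + lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day
     + `start station name`,
     data = ride.Train)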

The mean absolute error (MAE) of each of the four models is shown in the plot below. The MAE of the space (origin stations) and weather model is much higher than that of the other three, all of which include time lags. The third model, with origin stations, holiday, and blackout effects, performs better than the second model, which has only time lags. The MAE of the final model with all the factors is slightly higher than that of the second and third models. The final model is selected for the following analysis.
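For reference, the MAE is the average absolute difference between observed and predicted trip counts. The sketch below computes it on four made-up values, along with the mean signed error, which is negative when a model under-predicts on average. The function names `mae` and `mean_error` and the numbers are illustrative only.

```r
# Mean absolute error: average size of the prediction error.
mae <- function(observed, predicted) {
  mean(abs(observed - predicted), na.rm = TRUE)
}

# Mean signed error: negative values indicate under-prediction on average.
mean_error <- function(observed, predicted) {
  mean(predicted - observed, na.rm = TRUE)
}

# Made-up observed and predicted trip counts for four station-hours.
observed  <- c(10, 0, 4, 25)
predicted <- c(8, 1, 4, 19)

mae(observed, predicted)         # (2 + 1 + 0 + 6) / 4 = 2.25
mean_error(observed, predicted)  # (-2 + 1 + 0 - 6) / 4 = -1.75
```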

The plot below gives a detailed performance analysis of each of the four models. There are some very inaccurate predictions in the first model, which includes weather factors, caused by the precipitation factor. Precipitation is therefore removed from the final model. In addition, all the models tend to under-predict peak demand during the morning and afternoon rush hours.

The plot below shows the observed and predicted trip counts for different periods of the day on weekdays and weekends. All the models under-predict in general. However, some of the errors, such as those overnight, are smaller than others, as indicated by the smaller gap between the red line and the black line.

The two maps below show the spatial distribution of errors by origin station and by neighborhood tabulation area. The model predicts well in Brooklyn and uptown Manhattan, where bike share activity is less intense. However, the errors are greater in midtown and lower Manhattan, where very large numbers of bike share trips take place.

Breaking down the errors by weekday vs weekend and by time period, the faceted time-series map below shows that the errors are greater during the AM and PM rush on weekdays, while the model predicts well for the rest of the day and on weekends.

The three plots below explore the relationship between errors and socio-economic factors. Places with shorter mean commute times, higher public transportation usage, and a greater percentage of white residents are harder for the model to predict. This model is therefore not perfectly generalizable across contexts, places, or time periods, since the errors vary greatly.

To improve the model’s predictive power, spatial lags can be added into the model. The hypothesis here is that bike share trip counts are not randomly distributed across places but tend to be correlated with trip counts of nearby stations.
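One possible way to construct such a spatial lag is to average the trip counts of each station's nearest neighbors, as in the sketch below. The report does not specify how the lag would be defined, so the choice of k nearest neighbors, the use of straight-line distance on projected coordinates, the function name `spatial_lag`, and the four example stations are all assumptions made for illustration.

```r
# Hypothetical stations with projected x/y coordinates and trip counts.
stations <- data.frame(
  name       = c("S1", "S2", "S3", "S4"),
  x          = c(0, 1, 2, 10),
  y          = c(0, 0, 0, 0),
  Trip_Count = c(10, 20, 30, 100)
)

# Spatial lag: mean value over each station's k nearest neighbors.
spatial_lag <- function(coords, values, k = 2) {
  d <- as.matrix(dist(coords))  # pairwise Euclidean distances
  diag(d) <- Inf                # a station is not its own neighbor
  vapply(seq_len(nrow(d)), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]
    mean(values[nearest])
  }, numeric(1))
}

stations$spatialLag <- spatial_lag(stations[, c("x", "y")],
                                   stations$Trip_Count, k = 2)
stations$spatialLag
# 25 20 15 25
```

In a space-time panel this calculation would be repeated for every time interval, and combining it with a time lag (neighbors' counts in the previous hour) would avoid using information that is not yet available at prediction time.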