1. Introduction

In 2018, the Chicago Police Department recorded 267,581 incidents of crime (with the exception of murders where data exists for each victim), which equates to 733 incidents per day on average. It is without doubt that crime prevention and reduction is essential for a city to thrive and to enhance the living experience of its residents since crimes can impose great cost on not only the society as a whole but also on individuals either suffered from them or living in fear of them.

This project aims to establish a Prediction Policing model which leverages environmental features of crimes observed in the past and tests the extent to which whose features can generalize to places where few crimes were reported but may at high risk of crime.

Different types of crime are associated with different levels of selection and reporting bias. For example, reporting rate of forcible-entry burglaries is much closer to 100% than that of rape, since rape victims might worry about social stigma attached to sexual assault. In turn, criminals might also choose to commit crimes in areas where people are less likely to report. To test how the prediction model performs for crimes with greater selection bias, this model targets on narcotics – crimes associated with drugs where reporting is highly biased. For example, people in low-income neighborhoods might be more cautious and suspicious of those that might possess or use drugs while people in high-end, white-dominated communities might trust their neighbors more and end up being careless and failing to notice the red flags.

Begin by loading related R packages and create a mapTheme function to standardize the following maps.

library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

setwd("/Users/ophelia/Desktop/MUSA507/HW7")

2. Narcotics Data

Data of Chicago narcotics in 2018 is downloaded from Chicago Open Data site using the RSocrata package. The spatial distribution of narcotics is shown below in point form. It can be observed that narcotics is highly concentrated in west of the city while the southern part also witnessed great amounts of narcotics as well.

#read in Chicago Boundary data
chicagoBoundary <- 
  st_read("./riskPrediction_data/chicagoBoundary.shp") %>%
  st_transform(crs=102271) 

#read in narcotics data
narcotics <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/2018-Crime/iecj-q4j4") %>% 
  filter(Primary.Type == "NARCOTICS") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct()

#map of points
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = narcotics, colour="purple4", size=0.1, show.legend = "point") +
  labs(title= "Narcotics, Chicago - 2018") +
  mapTheme()

Next, a fishnet is created, which divided the city into plenty of 500ft X 500ft grid cells. Narcotics are joined to each of the grid cell and the map below shows the count of narcotics falling into each cell. The light area in the west also suggests that narcotics are highly concentrated there.

#create the fishnet (size: 500ft x 500ft)
fishnet <- 
  st_make_grid(chicagoBoundary, cellsize = 500) %>%
  st_sf()

#remove extra grid cells
fishnet <- 
  fishnet[chicagoBoundary,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)

#join narcotics data to the fishnet
drug_net <- 
  narcotics %>% 
  dplyr::select() %>% 
  mutate(countNarcotics = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countNarcotics = ifelse(is.na(countNarcotics), 0, countNarcotics),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

#plot the fishnet map
ggplot() +
  geom_sf(data = drug_net, aes(fill = countNarcotics)) +
  scale_fill_viridis(option = "inferno") +
  labs(title = "Count of narcotics for the fishnet") +
  mapTheme()

3. Feature Engineering - Risk Factors

3.1 Count of risk factors

To test the hypothesis that narcotics are correlated to certain locational experiences, a set of environmental risk data is downloaded from Chicago Open Data. The plot below visualizes the count of each feature in the grid cells. It can be observed that abandoned buildings as well as affordable rental housing units are clustering in west Chicago and especially in south Chicago. Gambling crimes in 2018 mainly concentrated in west of the city and tobacco retails cluster around the loop area.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
  mutate(year = substr(date_service_request_was_received,1,4)) %>%
  filter(year == "2018") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

##license expires after 2018 Dec 31
liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Liquor-Retail/4py5-yxxu") %>%
#  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

##license expires after 2018 Dec 31
TobaccoRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Tobacco/98qj-ah7k") %>%
#  filter(BUSINESS.ACTIVITY == "Retail Sales of Tobacco") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Tobacco_Retail")

affordablehousing <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Affordable-Rental-Housing-Developments/s6ha-ppgi") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Affordable Housing")

policestation <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv/data") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Police Station")

gambling <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/2018-Crime/vk8s-aidq") %>% 
  filter(Primary.Type == "GAMBLING") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(102271) %>% 
  distinct() %>%
  mutate(Legend = "Gambling")

vars_net <- 
  rbind(abandonBuildings, abandonCars, 
        affordablehousing, graffiti, 
        streetLightsOut, sanitation, 
        liquorRetail, TobaccoRetail,
        gambling) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()

vars_net.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(name="", option = "inferno") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Risk Factors by Fishnet"))

3.2 Distance to nearby risk factors

It should be noted that, although some grid cells may have fewer counts of the risk factors, they can be surrounded by cells with much greater counts, meaning that they are at high risk of narcotics since they are exposed to adjacent risk factors. Therefore, another way to visualize the risk factors to the show the distance between each grid cell and its nearby risk factors. The plot below indicates the distance between each cell to its closest three risk factors. Here the color scale is reversed to make it consistent with the plot below – brighter color means greater count and closer distance.

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

#nearest three
vars_net$Abandoned_Buildings.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)

vars_net$Abandoned_Cars.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)

vars_net$Graffiti.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)

vars_net$Liquor_Retail.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)

vars_net$Street_Lights_Out.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)

vars_net$Sanitation.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)

vars_net$TobaccoRetail.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(TobaccoRetail), 3)

vars_net$affordablehousing.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(affordablehousing), 3)

#first nearest police station
vars_net$policestation.nn1 =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(policestation), 1)

vars_net$gambling.nn =
  nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(gambling), 3)

vars_net.long.nn <- 
  vars_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(option = "inferno", direction = -1) +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor risk Factors by Fishnet"))

3.3 Distance to the nearest police station and the loop

The plot below is made to visualize the distance between each cell to its closest police station and to the centroid of the loop, which is considered as the CBD area.

#distance to the loop's centorid (CBD area)
loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

grid.arrange(
  ggplot() +
  geom_sf(data=vars_net, aes(fill=policestation.nn1)) +
  scale_fill_viridis(option = "inferno", direction = -1) +
  labs(title="Distance to the nearest police station") +
  mapTheme(),
  
ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis(option = "inferno", direction = -1) +
  labs(title="Euclidean distance to The Loop") +
  mapTheme(),
ncol = 2
)

4. Spatial structures of narcotics

To understand the spatial structure of narcotics, Local Moran’s I is calculated for each grid cell to see if narcotic count at a given location is randomly distributed relative to its immediate neighbors. Maps below show the narcotics count, Local Moran’s I of each grid cell, the p-value of Local Moran’s I, and the significant hotspot areas where the p-value is less than 0.05. It can be observed that the hotspots are located in both west and south of the city while the western part have more clustered hotspots.

final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countNarcotics, final_net.weights)),
    as.data.frame(final_net, NULL)) %>% 
  st_sf() %>%
  dplyr::select(Narcotics_Count = countNarcotics, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z > 0)`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars <- unique(final_net.localMorans$Variable)
varList <- list()

grid.arrange(
  
  final_net.localMorans %>% 
    filter(Variable == "Narcotics_Count") %>%
    ggplot() + 
    geom_sf(aes(fill=Value)) +
    scale_fill_viridis(option = "inferno") +
    labs(title="Narcotics Count") +
    mapTheme(),
  
  final_net.localMorans %>% 
    filter(Variable == "Local_Morans_I") %>%
    ggplot() + 
    geom_sf(aes(fill=Value)) +
    scale_fill_viridis(option = "inferno") +
    labs(title="Local Moran's I") +
    mapTheme(),

  final_net.localMorans %>% 
    filter(Variable == "P_Value") %>%
    ggplot() + 
    geom_sf(aes(fill=Value)) +
    scale_fill_viridis(option = "inferno", direction = -1) +
    labs(title="P value") +
    mapTheme(),
  
  final_net.localMorans %>% 
    filter(Variable == "Significant_Hotspots") %>%
    ggplot() + 
    geom_sf(aes(fill=Value)) +
    scale_fill_viridis(option = "inferno") +
    labs(title="Significant Hotspots") +
    mapTheme(),
  
  ncol = 2,
  top = "Local Morans I statistics, Narcotics")

More significant hotspots are selected from cells with a p-value less than 0.0000001. The dummy variable narcotics.isSig is thus created where 1 refers to grid cells with a p-value less than 0.0000001. Distances of each cell to these hotspots (narcotics.isSig.dist) are visualized in the map below.

#distance to highly significant narcotics spots
final_net <-
  final_net %>% 
  mutate(narcotics.isSig = ifelse(localmoran(final_net$countNarcotics, 
                                            final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(narcotics.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                           st_coordinates(st_centroid(
                                             filter(final_net, narcotics.isSig == 1))), 1 ))

ggplot() + 
  geom_sf(data = final_net, aes(fill = narcotics.isSig.dist)) +
  scale_fill_viridis(direction = -1, option = "inferno") +
  labs(title = "Distance to highly significant local narcotics hotspots") +
  mapTheme()

5. Correlation tests

The set of scatterplots below shows the test result of the correlation between the independent variables (risk factors) and the dependent variable (countNarcotics). It is clear that the count of risk factors in each grid cell and the distance to nearby risk factors tell similar stories with different directions of association. For example, as the count of affordable housing or gambling crimes increases, so does the count of narcotics. As the distance to the nearest three affordable housing units or gambling crimes decrease, the count of narcotics increases.

correlation.long <-
  st_set_geometry(final_net, NULL) %>%
  dplyr::select(-uniqueID, -cvID, -name, -dist_num) %>%
  gather(Variable, Value, -countNarcotics)

correlation.cor <-
  correlation.long %>%
  group_by(Variable) %>%
  summarize(correlation = cor(Value, countNarcotics, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countNarcotics)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "gold") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Narcotics count as a function of risk factors")

6. Poisson Regression

6.1 Distribution of the dependent variable (narcotics)

The histogram below shows the distribution of countNarcotics. Instead of normally distributed, the count of narcotics is highly left-skewed with many grid cells having zero or very few narcotics. In this case, OLS regression is not appropriate since it requires the dependent variable to be more normally distribution. Given the distribution pattern of countNarcotics, Poisson regression is adopted into the predictive model.

ggplot(final_net, aes(countNarcotics)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of narcotics by grid cell")

6.2 Cross-validated Poisson Regression

Two sets of variables are developed for the regression. The first set reg.vars contains Just Risk Factors while the second set reg.ss.vars includes both the risk factors and the spatial structure (narcotics.isSig and narcotics.isSig.dist). Both k-fold and LOGO (Leave-one-group-out) cross validations are conducted for the model based on 100 random ids generated for the grid cells and the neighborhood areas, respectively.

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", "TobaccoRetail.nn", "loopDistance", "affordablehousing.nn","gambling.nn", "policestation.nn1")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", "TobaccoRetail.nn", "loopDistance", "affordablehousing.nn", "gambling.nn", "policestation.nn1", "narcotics.isSig", "narcotics.isSig.dist")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countNarcotics ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
  }
  return(st_sf(allPredictions))
}

The predicted and observed narcotics are visualized in maps below.

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countNarcotics",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countNarcotics, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countNarcotics",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countNarcotics, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countNarcotics",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countNarcotics, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countNarcotics",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countNarcotics, Prediction, geometry)

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countNarcotics,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv,        Error = Prediction - countNarcotics,
           Regression = "Random k-fold CV: Spatial Structure"),
    
    mutate(reg.spatialCV,    Error = Prediction - countNarcotics,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = Prediction - countNarcotics,
           Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
  st_sf() 

grid.arrange(
  reg.summary %>%
    ggplot() +
    geom_sf(aes(fill = Prediction)) +
    facet_wrap(~Regression) +
    scale_fill_viridis(option = "inferno") +
    labs(title = "Predicted narcotics by Regression") +
    mapTheme() + theme(legend.position="bottom"),
  
  filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
    ggplot() +
    geom_sf(aes(fill = countNarcotics)) +
    scale_fill_viridis(option = "inferno") +
    labs(title = "Observed narcotics\n") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

The four maps below show the errors (Predicted narcotics – observed narcotics) by each of the four regressions. It is clear that regressions with spatial structures have less errors while those without tend to under-predict the hotspots where narcotics clustered.

grid.arrange(
  reg.summary %>%
    ggplot() +
    geom_sf(aes(fill = Error)) +
    facet_wrap(~Regression) +
    scale_fill_viridis(option = "inferno") +
    labs(title = "Narcotic errors by Regression") +
    mapTheme() + theme())

The table below shows the MAE (mean absolute error) and SD_MAE (standard deviation of mean absolute error) by each regression. Similarly, regressions with spatial structural features perform better with lower MAE and SD_MAE and are more accurate and generalizable.

st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  summarize(MAE = round(mean(abs(Prediction - countNarcotics), na.rm = T),2),
            SD_MAE = round(sd(abs(Prediction - countNarcotics), na.rm = T),2)) %>% 
  kable(caption = "MAE by regression") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF") 

6.3 Generalizability by race context

To test the extent to which the model is generalizable across different race context, a dummy variable raceContext is added dividing the census tracts into two groups - Majority_White where the percentage of white residents in 2017 is greater and 50% and Majority_Non_White. Mean errors of each of the regressions are summarized by race context. The table below shows the MAE by race context and by each regression. Similarly, regressions with spatial structure features reports less errors and are more generalizable across different race contexts than those without spatial sturcture features. However, all the regression models tend to over predict in non-white dominated areas and under predict in white-dominated areas.

tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2017, state=17, county=031, geometry=T) %>%
  st_transform(102271)  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

final_reg <- reg.summary %>%
  mutate(uniqueID = rownames(.))

final_reg.tracts <- 
  st_join(st_centroid(final_reg), tracts17) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(final_reg, uniqueID)) %>%
  st_sf() %>%
  na.omit()

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF") 

7. Kernel Density

To compare this model with the traditional crime hotspots map, three kernel density maps are built and shown below with three different searching radii.

library(raster)
# Compute kernel density
narc_ppp <- as.ppp(st_coordinates(narcotics), W = st_bbox(final_net))
narc_KD.1000 <- spatstat::density.ppp(narc_ppp, 1000)
narc_KD.1500 <- spatstat::density.ppp(narc_ppp, 1500)
narc_KD.2000 <- spatstat::density.ppp(narc_ppp, 2000)
narc_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(narc_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(narc_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(narc_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

narc_KD.df$Legend <- factor(narc_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=narc_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(option = "inferno") +
  mapTheme()

Next, a goodness-of-fit indicator is created for both the 1000-ft kernel density model as well as the risk prediction model (LOGO-CV with spatial structures is used here). The indicator is created by scaling both the kernel densities and the predicted counts of narcotics into 100 deciles and reclassifying them into 5 risk categories. The maps below visualize the risk categories of Kernel Density model and the Risk Prediction model with observed narcotics points overlayed on them.

# Convert kernel density to grid cells taking the mean
narc_KDE_sf <- as.data.frame(narc_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fisnnet.
  bind_cols(
    aggregate(
      dplyr::select(narcotics) %>% mutate(narcCount = 1), ., length) %>%
    mutate(narcCount = replace_na(narcCount, 0))) %>%
#Select the fields we need
  dplyr::select(label, Risk_Category, narcCount)

##same for risk prediction
narc_risk_sf <-
  filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  bind_cols(
    aggregate(
      dplyr::select(narcotics) %>% mutate(narcCount = 1), ., length) %>%
      mutate(narcCount = replace_na(narcCount, 0))) %>%
  dplyr::select(label,Risk_Category, narcCount)

rbind(narc_KDE_sf, narc_risk_sf) %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Risk_Category), colour = NA) +
  geom_sf(data = sample_n(narcotics, 1500), size = .1, colour = "turquoise") +
  facet_wrap(~label, ) +
  scale_fill_viridis(discrete = TRUE, option = "inferno") +
  labs(title="Comparison of Kernel Density and Risk Predictions",
       subtitle="Relative to test set points (in black)") +
  mapTheme()

To make the comparison, the rate (percentage) of total observed counts of narcotics falling in each risk categories is calculated and plotted below. It is observed that the Risk Prediction model captures a greater share of observed narcotics in the highest risk category (90% to 100%).

rbind(narc_KDE_sf, narc_risk_sf) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countNarcotics = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countNarcotics / sum(countNarcotics)) %>%
  ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
  geom_bar(aes(fill=label), position="dodge", stat="identity") +
  scale_fill_viridis(discrete = TRUE)

8. Conclusion

A geospatial risk prediction model is useful when it is more accurate than the current decision-making process and it is generalizable across the city. Moreover, it should be able to predict explicit crime risk without being biased by the implicit risk resulted from selection bias or reporting bias or over-policing. This project focuses on narcotics since it is a type of crime with high level of selection bias. The objective of this project is to test if the model can identify areas at great risk of narcotics crime despite a lack of reporting.

To sum up, the algorithm developed in this project is not recommended to be put into production. First of all, the algorithm is highly relied upon the existing hotspots identified through Local Moran’s I, rendering the prediction result similar to the observation without other high-risk areas identified. Secondly, the model is not very generalizable across different race contexts since it tends to over predict in non-white dominated areas and under predict in white-dominated areas. Thirdly, although the Risk Prediction model captures a greater share of observed narcotics in the 90%-100% risk category, it captures less share in the 70% to 89% category, when comparing to the Kernel Density model. More spatial risk factors should be explored to further improve this model.