final_html.rmd

---
title:  Man**rat**tan - A Look into NYC's Rats
author:  Alex Cheng, Liz Chu, Arthur Jakobsson, Kevin Ren 
date:  December 12, 2022
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
    toc_float: yes
urlcolor: blue
---

# Rat Population Data Analysis - New York City

## Descriptions of Datasets

### Rat Sightings Dataset
*https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe*

Our analysis primarily revolves around this dataset, with several supplementary datasets appended to this one for further in-depth analysis. This dataset contains 208,000 different rat sightings in the City of New York between 2010 to the present day, reported by citizens to the City of New York and accessed from NYC Open Data. 38 different variables are recorded for each sighting; notably, geographic data such as latitude, longitude, and borough data, and the date of opening and closing of the complaint.

### Supplementary Datasets

We join various auxiliary datasets (described below) to our rat sightings dataset in order to better examine how rat sightings correlate to other demographic and geographic factors.

#### Subway Dataset
*https://data.cityofnewyork.us/Transportation/Subway-Entrances/drex-xx56*

This dataset, also sourced from NYC Open Data, contains the names, line numbers, and geographic coordinates of 1928 subways in New York City to date.

#### Tax Return Dataset
*https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi*

This is an 2019 IRS-sourced dataset which contains tax return information for each of the 178 zip codes in NYC; namely, the number of returns and total amounts requested by eligible citizens of each of the zip codes for their individual tax returns.

#### Restaurant Inspection Dataset
*https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j*

This is an NYC Open Data dataset, most recently updated on December 10, 2022, containing 231,000 data, each corresponding to a health violation citation given to a restaurant in NYC by the City of New York's Health Department. We are given 27 different variables that most importantly provide the location and zip code of each restaurant which was issued a citation.

## Research Questions
Going into this project, our group had several questions we wanted to answer regarding the distribution of rats in the city. Namely:

1. How do rat sightings differ geographically and by borough?
2. How has the number of rats reported changed over time?
3. How does well do wealth and geographic data combined correlate with rat sightings?
4. How do rat sightings correlate with candidate features such as subways and restaurants?

In all, we hope to make underlying observations that extend beyond the mere topic of rats, using rat sightings as a proxy for deeper conclusions about socioeconomic and geographic patterns in the City of New York.

## Graphs Made

``` {r, echo = FALSE}
knitr::opts_chunk$set(echo=FALSE)
knitr::opts_chunk$set(warning=FALSE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_chunk$set(results='hide')
```

```{r} 
# library imports
library(tigris)
library(dplyr)
library(leaflet)
library(tidyverse)
library(sp)
library(ggmap)
library(maptools)
library(broom)
library(httr)
library(rgdal)
library(gridExtra)
library(stringr)
library(ggseas)
library(geosphere)
library(stringr)
library(hydroTSM)
library(vcd)
library(magrittr)
library(knitr)
library(gpclib)
library(geojsonio)
library(plotly)
library(maps)
library(reshape2)
library(shiny)
if (!require(gpclib)) install.packages("gpclib", type="source")
gpclibPermit()
```

```{r, include=FALSE}
# Loading in Data
rats <- read.csv(file = 'Raw/Rat_Sightings.csv')
# # https://data.cityofnewyork.us/widgets/g642-4e55?mobile_redirect=true
health_inspection <- read.csv(file = 'Raw/DOHMH_New_York_City_Restaurant_Inspection_Results.csv')
# # https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/rs6k-p7g6
drinking_water_inspection <- read.csv(file = 'Raw/Self-Reported_Drinking_Water_Tank_Inspection_Results.csv')
# # https://data.cityofnewyork.us/Health/Self-Reported-Drinking-Water-Tank-Inspection-Resul/gjm4-k24g/data 

mayors <- read.csv(file = 'Raw/mayors.csv')

events <- read.csv(file = 'Raw/nycevents.csv')

pop2022 <- read.csv(file = 'Raw/nyc2022population.csv')

dailyCovid <- read.csv(file = 'Raw/nycdailycovid.csv')

subway_entrances <- read.csv(file = 'Raw/SubwayEntrances.csv')
subway_entrances <- subway_entrances %>% add_column(longitude = NA)
subway_entrances <- subway_entrances %>% add_column(latitude = NA)
lats = 1:nrow(subway_entrances)
longs = 1:nrow(subway_entrances)
for (i in (1:nrow(subway_entrances))) {
  longs[i] = as.numeric(sub(".", "", scan(text = subway_entrances$the_geom[i], what = "")[2]))
  lats [i] = as.numeric(str_sub(scan(text = subway_entrances$the_geom[i], what = "")[3],1,-2))
}
subway_entrances$longitude = longs
subway_entrances$latitude = lats


# #NYC map lines
r <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')
nyc_neighborhoods <- readOGR(content(r,'text'), 'OGRGeoJSON', verbose = F)

health_inspection = subset(health_inspection, select = -c(Location.Point, Zip.Codes, Community.Districts, Borough.Boundaries, City.Council.Districts, Police.Precincts))
```

```{r, include=FALSE}
zipnames <- read.csv(file = 'Raw/zipcodenames.csv')
tax <- read.csv(file = "Raw/nyctax-cleaned.csv")
pop <- read.csv(file = "Raw/nyc2022population.csv")
rest <- read.csv(file = "Raw/DOHMH_New_York_City_Restaurant_Inspection_Results.csv")
bar <- read.csv(file = "Raw/nycbarlocations.csv")
michelin <- read.csv(file = "Raw/nycmichelin.csv")


# NEW FOR THIS FILE: removing NA's for zipcodes
rats = subset(rats, !is.na(Incident.Zip), Incident.Zip != "")

nyczips <- geojson_read('Raw/NYC_ZIPS.geojson', what = 'sp')
nyczips <- tidy(nyczips, region = "postalCode")

# NUMBER OF RATS: n_rats
ratsumzip = rats %>% group_by(Incident.Zip) %>% tally()
nyczips = nyczips %>% left_join(., ratsumzip, by = c("id" = "Incident.Zip"))
nyczips = subset(nyczips, !is.na(n))
colnames(nyczips)[8] = "n_rats"
```

```{r, include=FALSE}
zipnames$ZipCode <- as.character(zipnames$ZipCode)
zipnames = subset(zipnames, select = -c(X, X.1, X.2, X.3))
nyczips = nyczips %>% left_join(., zipnames, by = c("id" = "ZipCode"))
```

```{r, include = FALSE}

rest$SCORE = replace_na(rest$SCORE, 0)
# rest = subset(rest, !is.na(SCORE))
restsumzip = rest %>% group_by(ZIPCODE) %>% tally()
colnames(restsumzip) = c("ZIPCODE", "n_rest")

# NUMBER OF RESTAURANTS: n_rest
nyczips = nyczips %>% left_join(., restsumzip, by = c("id" = "ZIPCODE"))

nyczips = subset(nyczips, !is.na(n_rest))

# RATIOS: rest_to_rat rat_to_rest
# note: we decided to only use rat_to_rest as the other one has too small vals
nyczips$rest_to_rat = nyczips$n_rest / nyczips$n_rats
nyczips$rat_to_rest = nyczips$n_rats / nyczips$n_rest

# AVERAGE RESTAURANT SCORE: score_avg
restscorezip = rest %>% group_by(ZIPCODE) %>% summarize(score_avg = mean(SCORE))
nyczips = nyczips %>% left_join(., restscorezip, by = c("id" = "ZIPCODE"))

# RATIOS: score_to_rat rat_to_score
# note: we decided to only use rat_to_score as the other one has too small vals
nyczips$score_to_rat = nyczips$score_avg / nyczips$n_rats
nyczips$rat_to_score = nyczips$n_rats / nyczips$score_avg

# POPULATION: population
# RATS PER CAPITA: rat_per_cap
pop = subset(pop, population != 0)
pop = subset(pop, select = c("zip", "population"))
pop$zip = as.character(pop$zip)
nyczips = nyczips %>% left_join(., pop, by = c("id" = "zip"))
nyczips$rat_per_cap = nyczips$n_rats / nyczips$population

tax = subset(tax, Zip.Code != 0 & Zip.Code != 99999)
tax$Zip.Code = as.character(tax$Zip.Code)

# this makes a tax rating (weighted average) score from 1 to 6, representing
# the dist. of wealth among the brackets
# NOTE FROM LIZ: this may be a really rarded way to get this info, open to other
# smarter ideas for representing tax data with one number 
tax$tax_rating = (tax$X.1.under..25.000 + 
                    2 * tax$X.25.000.under..50.000 + 
                    3 * tax$X.50.000.under..75.000 + 
                    4 * tax$X.75.000.under..100.000 + 
                    5 * tax$X.100.000.under..200.000 + 
                    6 * tax$X.200.000.or.more) / tax$Total

tax = subset(tax, select = c("Zip.Code", "tax_rating"))
nyczips = nyczips %>% left_join(., tax, by = c("id" = "Zip.Code"))

nyczips$rat_to_tax = nyczips$n_rats / nyczips$tax_rating

```

```{r, include=FALSE}
#this is to graph population by zip code
popzip <- geojson_read('Raw/NYC_ZIPS.geojson', what = 'sp')
popzip <- tidy(popzip, region = "postalCode")
popzip$id = strtoi(popzip$id)
population = read.csv("Raw/nyc2022population.csv")
colnames(population)[1] = "id"
popzip = popzip %>% left_join(., population, by = "id")
```


### Borough Bar Chart

Our first visualization performs some elementary EDA on the distribution of rat sighting counts given the borough of their reporting. We created this graph in order to very directly address our research question of how rat sightings differ by borough.

```{r}
borough.counts <- as.data.frame(table(subset(rats, Borough != "Unspecified")$Borough))
names(borough.counts) = c("Borough","Count")
borough.counts <- rownames_to_column(borough.counts)
borough.counts <- borough.counts %>%  filter(!row_number() %in% c(1))
ggplot(data = borough.counts, aes(x=Borough, y=Count)) +
  geom_col(aes(fill=Borough)) +
  labs(title="Number of Rat Sightings by Borough")
```

This bar chart displays the number of rats seen within each borough in New York City. Brooklyn had by far the most rat sightings at 74,302, followed by Manhattan, the Bronx, Queens, and Staten Island, in that order. The low number of rat sightings in Staten Island might reflect its more cut-off nature from the rest of the city, as well as its more suburban feel, which could plausibly explain why Staten Island suffers less from the very urban problem of rats compared to the other boroughs in the city. Similarly, Brooklyn's position in the dead center of the city may explain why it had so many rats. Despite being the smallest borough by land size, Manhattan had the second-most rats, which may reflect the fact that it is one of the main business centers in the city (and the world) which would obviously attract a large number of rats with high concentrations of people and food. Thus, this simple visualization of rat sighting counts allows for greater generalizations about the boroughs in the city.

### Rat Population Choropleth Map

To get a better sense of how the rats in New York City are distributed geographically, we decided to make a choropleth map that showcases the densities of rats by more specific subsections of the city. Upon looking at our data, we realized that each rat sighting was tagged with a zip code, so we created the following choropleth map that shows how many rats in our rat dataset were spotted in a given zip code region. 

```{r}
ggplot() +
  geom_polygon(data = nyczips, aes(x = long, y = lat, group = group, fill = n_rats)) +
  theme_void() +
  # scale_fill_gradient2(low = "darkblue", mid = "purple", high = "pink", midpoint=3000) + 
  coord_map() + labs(
    title = "Rat Sightings by Zip Code", 
    fill = "# Rats"
  )
```

This map tells us some key information about New York City's rats —  we see that a lot of rats are found in the center of the city (specifically, in the middle of Brooklyn) as well as in the northern part of Manhattan. Again, we notice that Staten Island is relatively dark. An interesting section of the map is upper Manhattan, as we notice that a few zip code regions to on the upper west side of Central Park (which is the rectangular cutout in Manhattan) have significantly higher numbers of rat sightings compared to lower Manhattan. The Upper West Side is known for its affluence, which may seem surprising considering the number of rat sightings, but a possible explanation is that official measures to mitigate rat problems in Manhattan have been more focused on the lower side of the borough, where most of the activity is. 


### Big Events Time Series

We hoped to garner a further understanding of what causes upticks in rat population, or at least rat spotting. We plotted a time series map of the progression of total rat population in our dataset from 2010, when the first rat was recorded, to present. We also added information which could be correlated potentially with changes in trend of the rat population or a sudden uptick in rats. This includes major events in New York that could be connected to sanitary conditions, number of people outside, or policy changes such as storms, protests, and changes in mayor. We also thought to explore the relative locations for which the rats were spotted to understand the locations in which the rats lived as well as to know if some events caused increases in some domains of the rats.

```{r}
private = c(
  "1-2 Family Dwelling",
  "1-2 Family Mixed Use Building", 
  "1-2 FamilyDwelling", 
  "1-3 Family Dwelling", 
  "1-3 Family Mixed Use Building", 
  "3+ Family Apartment Building", 
  "3+ Family Apt", 
  "3+ Family Apt.", 
  "3+ Family Apt. Building", 
  "3+ Family Mixed Use Building",
  "3+Family Apt.", 
  "Apartment",
  "Private House",
  "Residence",
  "Residential Building",
  "Residential Property",
  "Single Room Occupancy (SRO)"
)

commercial = c(
  "Cafeteria - Public School",
  "Catering Service", 
  "Commercial Building", 
  "Commercial Property", 
  "Construction Site", 
  "Day Care/Nursery", 
  "Government Building", 
  "Grocery Store",
  "Hospital",
  "Office Building", 
  "Restaurant", 
  "Restaurant/Bar/Deli/Bakery", 
  "Retail Store", 
  "School", 
  "School/Pre-School",
  "Store",
  "Street Fair Vendor", 
  "Summer Camp"
)

public = c(
 "Abandoned Building", 
 "Beach", 
 "Building (Non-Residential)", 
 "Catch Basin/Sewer",
 "Ground",
 "Parking Lot/Garage",
 "Public Garden", 
 "Public Stairs",
 "Street Area", 
 "Vacant Building", 
 "Vacant Lot", 
 "Vacant Lot/Property"
)

other = c(
  "",
  "N/A",
  "Other",
  "Other (Explain Below)"
)

ratsloc = rats
ratsloc$Location.Type[ratsloc$Location.Type %in% private] <- "Private"
ratsloc$Location.Type[ratsloc$Location.Type %in% public] <- "Public"
ratsloc$Location.Type[ratsloc$Location.Type %in% commercial] <- "Commercial"
ratsloc$Location.Type[ratsloc$Location.Type %in% other] <- "Other"
ratsloc$Location.Type <- factor(ratsloc$Location.Type)
ratsloc$Date = as.Date(ratsloc$Created.Date, "%m/%d/%Y")

rats_per_day = ratsloc %>% 
  group_by(Date, Location.Type) %>%
  tally() 

names(rats_per_day) = c("date", "locationtype", "n_rats")

# ggplot(data=rats_per_day, aes(x=date, y=n_rats, color=locationtype)) + 
#   geom_line(alpha=0.3) + labs(
#     title="Number of Rats Recorded on Each Day", 
#     subtitle="colored by the type of location",
#     x="Date", 
#     y="Number of Rats Recorded"
#   ) + 
#   scale_color_manual("Location Type", 
#                      values = c("Other" = "yellow", 
#                                 "Commercial" = "blue", 
#                                 "Private" = "red", 
#                                 "Public" = "green")) 

mayors$Date.Start <- as.Date(mayors$Date.Start)
mayors$Date.End <- as.Date(mayors$Date.End)
events$Date <- as.Date(events$Date)

ggplot(rats_per_day) +
  geom_line(aes(date, n_rats, color=locationtype), alpha=0.3) + labs(
    title="Number of Rats Recorded on Each Day (with Mayors)", 
    subtitle="colored by the type of location",
    x="Date", 
    y="Number of Rats Recorded"
  ) + 
  scale_color_manual("Location Type", 
                     values = c("Other" = "yellow", 
                                "Commercial" = "blue", 
                                "Private" = "orange", 
                                "Public" = "green")) +
  geom_rect( 
    data = mayors,
    aes(xmin = Date.Start, xmax = Date.End, fill = Party),
    ymin = -Inf, ymax = Inf, alpha = 0.1
  ) +
  geom_vline(
    aes(xintercept = as.numeric(Date.Start)),
    data = mayors,
    colour = "grey50", alpha = 0.5
  ) +
  geom_text(
    aes(x = Date.Start+60, y = 110, label = Name),
    data = mayors,
    size = 3, vjust = 0, hjust = 0, nudge_x = 50, angle = 90) +
  geom_segment(data = events, aes(x = Date, y = 40, xend = Date, yend = 55), color = "red") +
  geom_text(data = events, aes(x = Date-50, y = 95, label = Event.Name), angle=90, color = "red")+
  scale_fill_manual(values = c("blue", "red"))
```

We notice a clear increase over time of the rat population. The most evident trend can be correlated with the strong seasonal changes of the New York area which we dissect further in the next section. With regards to the relative location, we observe large number of rats spotted in private areas followed by public and then other, with the fewest rats being spotted in commercial areas. We infer that this is likely due to the fact that many rat spotting is likely connected to building complaints as well as calling of exterminators which is probably most common for private residences to do. Another large reason that this trend may be observed is that a large part of New York is residential, considerably more than any of the other sectors. Among significant events we observed very few major deviations in rat population after events. Several of the protests (Occupy Wall Street, People's Climate March and the Global Climate Strike) as well as the Brooklyn/Queens Tornado seem to be related to slight upticks in rat population, but without further analysis we cannot state whether this is statistically significant. The most noticeable, sudden increase in rat population follows the Women's March protest in 2017, but we again cannot draw any conclusions and must follow this up with statistical tests to draw any conclusions.

### Seasons Time Series

Building on this time series analysis, we now turn to a seasonal approach to modeling rat sightings over time, hoping to further address our research question of how temporal factors impact rat sighting counts.

```{r, warning=FALSE}
library(ggplot2)
rats$Date = as.Date(rats$Created.Date, "%m/%d/%Y")
rats_per_day = rats %>%
  group_by(Date) %>%
  tally()

rats_per_day$Season = time2season(rats_per_day$Date, out.fmt = "seasons")
rats_per_day$Season = ifelse(rats_per_day$Season == "autumm", "autumn", rats_per_day$Season)

ggplot(data=rats_per_day, aes(x=Date, y=n, color = Season)) +
  geom_line(alpha=0.3) + labs(
    title="Number of Rats Recorded on Each Day",
    subtitle="colored by the season",
    x="Date",
    y="Number of Rats Recorded"
  ) +
stat_rollapplyr(color = "red", width = 30, align = "left", alpha = 0.5) 
```

The above graph plots the moving average for the number of rats seen each month in the red line in order to track the trends, as well as the actual observed number of rats per day. Furthermore, we colored the observed rats by the season in which it was observed, and found a harmonic pattern - there would always be a lot of rats observed in the summer, and not many rats observed in the winter (except for one fateful day in 2017!).  This could reflect a few things - rats don't like the cold and tend to stay inside, so they are less likely to be seen. However, humans don't like cold either, so they are less likely to go outside and observe rats in New York. Overall, it is interesting to note the changes in rat observations each season.

### Tax Rating Choropleth Map 

To answer our question about wealth and rat sightings, we turn to the tax return dataset. In order to see the relationship between the affluence of residents and number of rat sightings in each zip code area, we turn to choropleth maps again. This time, we first establish a tax rating system, which is a weighted average of all the tax returns in a given zip code area. The weighted average is calculated by multiplying each individual by their tax "rating" (1 meaning they belong in the lowest bracket and 6 meaning the highest bracket), and then taking the average of all the tax ratings in each zip code area. Thus, this value ranges from 1 to 6, with 1 meaning that everyone from that area is in the lowest tax bracket and 6 meaning the same for the highest tax bracket. To relate this value to rats, we will display the ratio between tax rating and number of rat sightings for each region. 

```{r}
ggplot() +
  geom_polygon(data = nyczips, aes(x = long, y = lat, group = group, fill = rat_to_tax)) +
  theme_void() +
  scale_fill_gradient2(low = "#395184",
                       mid = "#A964B8", 
                       high = "#FFA9A9", midpoint = 1500) + 
  coord_map() + labs(
    title = "Rat to Tax Rating Ratio by Zip Code", 
    fill = "Number of Rats / Tax Rating (from 1 to 6)"
  )
```

This map has some clear differences when we compare it to the rat sightings choropleth map. First, the Upper West Side is no longer as extreme of a value, as we know that that area is very affluent (which means a smaller rat-to-tax rating ratio). However, we notice that Brooklyn still looks very similar to the original choropleth map, which may suggest a high amount of rats and a lower tax rating in these areas. As we know that Brooklyn is less affluent than Manhattan in general, a plausible explanation for the cluster of higher rat-to-tax rating ratios in the center of the city may be that there are more rats in this area compared to other areas with higher tax ratings (or more affluence). 

### Location Type Mosaic Plot

With some temporal and time series analysis of rat sightings done, we now turn to analyzing the conditional distribution of rat sighting locations given borough in an effort to address the degree to which this property of a given rat sighting differs between boroughs.

```{r}
rats_other = subset(rats, !grepl("Family", rats$Location.Type, fixed = TRUE))
rats_other = subset(rats_other, Location.Type != "Other (Explain Below)")
rats_other = subset(rats_other, Location.Type != "Street Area")
rats_other = subset(rats_other, (Borough == "BRONX" | Borough == "BROOKLYN" | Borough == "MANHATTAN" | Borough == "QUEENS" | Borough == "STATEN ISLAND"))
rats_other = rats_other %>% group_by(Location.Type) %>% filter(n() > 300 )
rats_other["Location.Type"][rats_other["Location.Type"] == "Vacant Building" | rats_other["Location.Type"] == "Vacant Lot"] <- "Unoccupied"
rats_other["Location.Type"][rats_other["Location.Type"] == "Government Building" | rats_other["Location.Type"] == "Commercial Building"] <- "Office Building"

par(mar = c(5,4,1,10))

mosaicplot(table(rats_other$Borough, rats_other$Location.Type), main = "Mosaic Plot of Non-Family Location Types by Borough", shade=TRUE, las=2)
```

This mosaic plot visualizes the conditional distribution of reporting sites of rats given borough. Based on this visualization, we see many statistically significantly high and low combinations of borough and reporting site under the assumption of independence between the two variables plotted. It is interesting to note the way rat reporting sites reflect the distinctive landscapes of each borough. For instance, we have significant evidence that Manhattan has higher proportions of rat sightings made at office buildings and construction sites than would be expected under independence, which reflects Manhattan's reputation as a bustling metropolis with many developed and developing commercial construction projects. It is also interesting to note that the statistical significance of the proportion of reports made in Unoccupied sites (which we categorized as reports made in either Vacant Buildings or Vacant Lots) for every single borough; the high proportion of such sightings in Staten Island, Brooklyn, and the Bronx may suggest the presence of pockets of high poverty or low economic development in these boroughs, and the significantly low proportion of sightings in Unoccupied regions may suggest a relatively high degree of property and economic development in these boroughs, where fewer spaces are left unused by homeowners or businesses. In all, reveals that rats are generally found in very different sets of locations in different boroughs.

### Restaurant Heat Map

```{r, message=FALSE}
# importing rats
#nice map (manhattan, brooklyn, queens, a bit of bronx)
left =-74.03 
bottom = 40.64
right = -73.87
top = 40.85
nyc_coords <- c(left, bottom, right, top)

#full map (all boroughs)
leftF = -74.2
bottomF = 40.55
rightF = -73.87
topF = 40.85
nyc_coordsF <- c(leftF, bottomF, rightF, topF)

#just dowtown manhattan
leftM = -74.03
bottomM = 40.69
rightM = -73.94
topM = 40.81
nyc_coordsM <- c(leftM, bottomM, rightM, topM)

nyc_map <- get_stamenmap(nyc_coords, maptype = "terrain", zoom = 11)
nyc_mapF <- get_stamenmap(nyc_coordsF, maptype = "terrain", zoom = 11)
nyc_mapM <- get_stamenmap(nyc_coordsM, maptype = "terrain", zoom = 11)

ratSubset <- subset(rats, Longitude<right & Latitude<top & Longitude > left & Latitude >bottom)
ratSubsetF <- subset(rats, Longitude<rightF & Latitude<topF & Longitude > leftF & Latitude >bottomF)
ratSubsetM <- subset(rats, Longitude<rightM & Latitude<topM & Longitude > leftM & Latitude >bottomM)

health_inspection[, c("SCORE")] <- sapply(health_inspection[, c("SCORE")], as.integer)
health_inspection["SCORE"][is.na(health_inspection["SCORE"])] <- 0

healthSubset <- subset(health_inspection, Longitude<right & Latitude<top & Longitude > left & Latitude >bottom & SCORE>50)
healthSubsetF <- subset(health_inspection, Longitude<rightF & Latitude<topF & Longitude > leftF & Latitude >bottomF & SCORE>50)
healthSubsetM <- subset(health_inspection, Longitude<rightM & Latitude<topM & Longitude > leftM & Latitude >bottomM & SCORE>50)

rat_map <- ggmap(nyc_map) +
  geom_point(data=ratSubset, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")

rat_mapF <- ggmap(nyc_mapF) +
  geom_point(data=ratSubsetF, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")

rat_mapM <- ggmap(nyc_mapM) +
  geom_point(data=ratSubsetM, aes(x=Longitude, y = Latitude), alpha=0.2, size =0.01, color = "coral3")
```

We decided to see if there were any evident relationships between geographic artifacts such as restaurants in New York with the density of the rat population. Our intuition was that an increases in restaurants would increase the number of food sources which would have an immediate impact on rat population. After initial results showing inconsequential results we decided to test whether restaurants that received a poor score on their health examination (arbitrarily 50+ scores, with 0 being a perfect score) had a relationship with rats. We speculated that that worse sanitary conditions of the restaurants may also share some relationship with rat population.

```{r}
#Restaurants (subset, score worse than 50)
ggmap(nyc_map) +
  geom_density_2d_filled(data=ratSubset, aes(x = Longitude, y = Latitude, fill = after_stat(level)), alpha = 0.4) + 
  geom_point(data=healthSubset, aes(x=Longitude, y = Latitude, color = SCORE), size = 0.3, alpha=0.1) + 
  scale_color_distiller(palette = "YlOrRd", name = "Restaurant Score") +
  ggtitle("Rat Density across Manhattan and Brooklyn", subtitle = "with restaurant health scores") +
  guides(fill = guide_legend(title = "Rat Density"))
```

We observe several intersting trends in this graph. We notice that the very dense area of Midtown which hosts most of the restaurants and most of the restaurants with bad health scores also contains very few rats. This in and of itself is worth further investigation. It is worth noting that Combs et. al. *https://www.theatlantic.com/science/archive/2017/11/rats-of-new-york/546959/* observed similar results in their genetic study of rats in New York, with Midtown serving as a natural genetic and physical boundary for the rat populations. Otherwise, the density of the rat populations as well as the restaurants seem to have little in common aside from some small areas. This is counter-intuitive, but supports the idea that rat population may be driven more greatly by other factors such as potentially human movement or human density. 

### Subway Heat Map

We thought to observe the relationship between human movement and rat population density. We decided to see if we could notice a relationship between subway entrances, a location comically famously associated with rat spotting, and rat density. While subway entrances provide a loose map of where people move and where people live it is important to note that because the New York subway map is designed similarly to most United States rail and public transport systems.The lines take people from the suburbs and the other boroughs into Manhattan to support the idea of a work day commute with people living in the outskirts of the city and traveling into the city for work.

```{r, warning=FALSE}
#SUBWAY STATIONS
ggmap(nyc_map) + 
  geom_density_2d_filled(data=ratSubset, aes(x = Longitude, y = Latitude, fill = after_stat(level)), alpha = 0.4) + 
  geom_point(data=subway_entrances, aes(x=longitude, y = latitude), color="red",  size = 0.5, alpha=0.2) + 
  scale_color_distiller(palette = "PiYG") +
  ggtitle("Rat Density across Manhattan and Brooklyn", subtitle = "with Subway Entrances") +
  guides(fill = guide_legend(title = "Rat Density"))
```

We observe again that the subways are densely populated in Midtown where there is a noticeable absence of rats. However, we do notice that there are some clearer trends in rats populations following approximate lines of the subway. We surmise that this is because living near subway lines is considered desirable and therefore these areas are likely to be more densely populated and also have much heavier foot traffic and daily activity. The areas in which these trends are most noticeable are in the Upper West Side, Harlem, Bedford-Stuyvesant, Williamsburg and even in Queens. It is also interesting to note that, similarly to Midtown, the Financial District in the South of Manhattan seems to host few rat populations. Therefore, we could potentially suggest that the similar industrial designs and lifestyles of these two neighborhoods may have some impact on the rat populations and density in these areas.


## Conclusions and Future Work

Through this analysis, we have learned a multitude of interesting things about the conditional distribution of rats in New York City given such variables as geography, temporal events, and physical landmarks. Clearly the distribution of rats in the city correlates highly with many of our tested variables, and displays significant geographic and temporal activity. It seems that the quantity of rat sightings differ greatly between boroughs, zip codes within boroughs, and even specific types of locations within different boroughs. Furthermore, rat sightings display a significant trend and seasonal over time, all the while responding to major events that occur in the city. Future analysis of this topic would do well to analyze a) different datasets that could potentially be compared to rat sighting distributions such as racial or age-related data in order to assess how people of different social groups experience varying levels of rats in their homes, and/or b) dive deeper into the auxiliary variables which we had already selected; for example, correcting for geographic area in our borough and zip code data in order to calculate and visualize how the rats per square mile (and by extension, variables involving rat sighing counts such as rat sighting density to tax rating ratio) changes between geographic regions. In all, this project provided a thoughtful insight into life in New York City from the perspective of its most mainstay citizens - the rats.