On this page, we describe how we accessed the data and the cleaning and manipulation steps that made it ready for analysis. The data is compiled by the New York City Police Department and owned by NYC OpenData.
The data was accessed through the SoDA (Socrata Open Data Application Program Interface) web API client on 17th November, 2018. Click here to read more about the API.
Click here to view the full R code used to access, clean and manipulate the data.
The RSocrata package is needed to query the SoDA web API. ggmap is also needed to reverse geocode the coordinate information. Standard packages like tidyverse and lubridate are also necessary for manipulation.
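As a rough sketch of the access step, the snippet below wraps RSocrata's read.socrata() in a small helper. The dataset ID in the URL is a placeholder and fetch_complaints is a hypothetical name; neither is taken from the project's code. The helper is only defined here, not called, because the call needs network access and downloads several million rows.

```r
# Placeholder endpoint on the NYC OpenData domain.
# "xxxx-xxxx" is NOT the real dataset ID; substitute the actual resource ID.
dataset_url <- "https://data.cityofnewyork.us/resource/xxxx-xxxx.json"

# Hypothetical helper: download the full dataset as a data frame.
# app_token is optional; Socrata throttles anonymous requests more heavily.
fetch_complaints <- function(url, app_token = NULL) {
  RSocrata::read.socrata(url, app_token = app_token)
}

# Example call (commented out because it requires network access):
# complaints <- fetch_complaints(dataset_url)
```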
The raw, acquired dataset has 6,036,805 observations and 35 variables. Broadly speaking, the variables contain information on the exact date, time and location of the crime, a description of the crime, demographic information on the victim and suspect, and police department information. For more information on the variables, click here.
We were interested in felony sex-related, weapon-related and drug-related crimes from 2014 through 2017. Therefore, we extracted the records for those crimes and years.
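The subsetting step can be sketched roughly as follows, assuming dplyr and lubridate. The toy rows, the ISO date format, and the three pd_cd values in the lookup table are invented placeholders, not the project's actual data or penal codes; subset_crimes is likewise a hypothetical helper name.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the raw complaints table (values invented for illustration).
raw <- tibble(
  cmplnt_num   = c("a1", "a2", "a3", "a4"),
  cmplnt_fr_dt = c("2013-03-15", "2015-07-04", "2017-11-30", "2016-01-02"),
  pd_cd        = c(901L, 901L, 999L, 903L)
)

# Placeholder mapping from internal classification code to crime group.
# These codes are NOT the real penal codes used in the project.
code_lookup <- tibble(
  pd_cd       = c(901L, 902L, 903L),
  crime_group = c("sex-related", "weapon-related", "drug-related")
)

# Keep only the years of interest, then keep only the codes of interest.
subset_crimes <- function(data, lookup, years = 2014:2017) {
  data %>%
    mutate(year = lubridate::year(lubridate::ymd(cmplnt_fr_dt))) %>%
    filter(year %in% years) %>%
    inner_join(lookup, by = "pd_cd")  # drops rows whose code is not in the lookup
}

crimes <- subset_crimes(raw, code_lookup)
```

Using an inner join against a lookup table both filters the rows and attaches the crime_group label in one step, which is one convenient way to derive that variable.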
Below is a list of the penal codes in the dataset that we used to select the felony crimes of interest.
The dataset comes without zip code or neighborhood information; there are only coordinates of the exact location where the crime happened. This makes aggregation beyond clustering of the geographic points impossible. Therefore, we used the Google Maps API to reverse geocode the coordinates and get the exact address of each crime occurrence. We then extracted the zip codes from those addresses for mapping. The full code for reverse geocoding the longitude and latitude information can be found in our repository on GitHub.
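A minimal sketch of the reverse geocoding step, assuming ggmap's revgeocode() with a Google Maps API key already registered through register_google(). The helper names, the regular expression, and the assumed address layout ("..., NY 10118, USA") are illustrative assumptions rather than the project's actual code.

```r
# Hypothetical helper: look up the street address for one coordinate pair.
# Not called here because it needs network access and a registered API key,
# e.g. ggmap::register_google(key = "<your key>").
reverse_geocode_address <- function(lon, lat) {
  ggmap::revgeocode(c(lon, lat), output = "address")
}

# Hypothetical helper: pull the five-digit zip code that follows the state
# abbreviation "NY" in a formatted address; returns NA when none is found.
extract_zip <- function(address) {
  pattern <- ".*\\bNY (\\d{5})\\b.*"
  ifelse(grepl(pattern, address, perl = TRUE),
         sub(pattern, "\\1", address, perl = TRUE),
         NA_character_)
}

# Illustrative address strings, not taken from the dataset.
zips <- extract_zip(c("350 5th Ave, New York, NY 10118, USA", "Unnamed Road"))
```

Returning NA for addresses without a recognizable zip code keeps failed lookups visible, so they can be counted or dropped before mapping.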
After cleaning, subsetting and reverse geocoding, we saved the resulting datasets in rds format. You can download them from this Google Drive. The code for reading the datasets can also be found in our repository, in the file acquiringdata.Rmd.
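For readers unfamiliar with the rds format, saving and reading a data frame looks roughly like this in base R. The file name crimes.rds and the two toy rows are hypothetical; the actual reading code is in acquiringdata.Rmd.

```r
# Toy data frame standing in for a cleaned dataset (values invented).
crimes <- data.frame(
  cmplnt_num = c("a2", "a4"),
  zip        = c("10118", "11201"),
  stringsAsFactors = FALSE
)

# Write to a temporary location; replace with the path of a downloaded file.
rds_path <- file.path(tempdir(), "crimes.rds")
saveRDS(crimes, rds_path)

# readRDS() restores the object with its column types intact.
restored <- readRDS(rds_path)
```

Unlike CSV, rds preserves column types such as dates and factors, so no re-parsing is needed after loading.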
The list below provides all 17 variables in the datasets with their brief descriptions:
cmplnt_num: randomly generated persistent ID for each complaint
boro_nm: the name of the borough in which the incident occurred
cmplnt_fr_dt: exact start date of occurrence for the reported incident
cmplnt_to_dt: exact end date of occurrence for the reported incident
cmplnt_fr_tm: exact time of occurrence for the reported incident
latitude: midblock latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
longitude: midblock longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
ky_cd: three-digit offense classification code
ofns_desc: description of offense corresponding with key code
pd_cd: three-digit internal classification code (more granular than Key Code)
pd_desc: description of internal classification corresponding with PD code
vic_race: victim’s race description
vic_sex: victim’s sex description (D=Business/Organization, E=PSNY/People of the State of New York, F=Female, M=Male)
year: year the incident occurred
prem_typ_desc: specific description of premises; grocery store, residence, street, etc.
crime_group: sex-related felony offenses, drug-related felony offenses, weapon-related felony offenses
zip: zip code taken from the reverse geocoded address of the incident