Accessing the Data

On this page, we describe how we accessed the data and the cleaning and manipulation steps we took to prepare it for analysis. The data is compiled by the New York City Police Department and owned by NYC OpenData.

The data was accessed through the SoDA (Socrata Open Data API) web API client on November 17, 2018. Click here to read more about the API.

Click here to view the full R code used to access, clean and manipulate the data.

The RSocrata package is needed to query the SoDA web API. ggmap is also needed to reverse geocode the coordinate information. Standard packages such as tidyverse and lubridate are also necessary for data manipulation.
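As a rough illustration of the access step, the sketch below builds a SoQL-filtered request URL and hands it to `RSocrata::read.socrata()`. The dataset ID `xxxx-xxxx`, the helper names `build_soda_url` and `fetch_complaints`, and the server-side date filter are assumptions made for this example; they are not taken from the project's code, which may simply have downloaded the full dataset.

```r
# Build a request URL with an optional SoQL $where clause.
# The endpoint used below is a placeholder, not the project's actual URL.
build_soda_url <- function(endpoint, where = NULL) {
  if (is.null(where)) {
    return(endpoint)
  }
  # Percent-encode the clause so spaces and quotes are safe in a URL.
  paste0(endpoint, "?$where=", utils::URLencode(where, reserved = TRUE))
}

# Download the query result as a data frame.
# Needs network access and the RSocrata package; app_token is optional.
fetch_complaints <- function(url, app_token = NULL) {
  RSocrata::read.socrata(url, app_token = app_token)
}

endpoint <- "https://data.cityofnewyork.us/resource/xxxx-xxxx.json"  # placeholder ID
query_url <- build_soda_url(
  endpoint,
  where = "cmplnt_fr_dt >= '2014-01-01T00:00:00'"  # illustrative filter only
)

# Uncomment to run the download:
# complaints_raw <- fetch_complaints(query_url)
```

Filtering on the server keeps the download small, but pulling the whole dataset and filtering locally works just as well if bandwidth is not a concern.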

The raw dataset has 6,036,805 observations and 35 variables. Broadly speaking, the variables contain information on the exact date, time and location of the crime, a description of the crime, demographic information on the victim and suspect, and police department information. For more information on the variables, click here.


Subsetting Data

We were interested in felony sex-related, weapon-related and drug-related crimes from 2014 through 2017, so we extracted the records for those crimes in those years.
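The subsetting step might look roughly like the sketch below. It is written in base R to stay dependency-free, although the project itself relies on tidyverse and lubridate. The `law_cat_cd` column (assumed here to hold the offense level, e.g. "FELONY"), the helper `subset_complaints`, and the PD codes in `crime_groups` are assumptions for illustration; the codes are placeholders, not the penal codes the project actually used.

```r
# Hypothetical lookup from PD code to crime group (placeholder codes).
crime_groups <- c(
  "101" = "sex-related",
  "201" = "drug-related",
  "301" = "weapon-related"
)

subset_complaints <- function(df, years = 2014:2017, groups = crime_groups) {
  # Derive the incident year from the complaint start date.
  df$year <- as.integer(format(as.Date(df$cmplnt_fr_dt), "%Y"))
  # Label each row with its crime group (NA when the code is not of interest).
  df$crime_group <- unname(groups[as.character(df$pd_cd)])
  keep <- !is.na(df$year) & df$year %in% years &
    df$law_cat_cd == "FELONY" & !is.na(df$crime_group)
  # which() drops rows where the condition is NA.
  df[which(keep), , drop = FALSE]
}

# Toy data standing in for the raw download.
toy <- data.frame(
  cmplnt_fr_dt = c("2015-03-01", "2013-07-04", "2016-11-20", "2017-01-15"),
  law_cat_cd = c("FELONY", "FELONY", "MISDEMEANOR", "FELONY"),
  pd_cd = c(101, 201, 301, 999),
  stringsAsFactors = FALSE
)
felonies <- subset_complaints(toy)
```

On the toy data only the first row survives: the second falls outside the year range, the third is not a felony, and the fourth has a code outside the lookup.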

Below is a list of the penal codes in the dataset that we used to select the felony crimes of interest.

Reverse Geocoding

The dataset comes without ZIP code or neighborhood information; it contains only the coordinates of the location where each crime happened. This makes any aggregation beyond clustering of the geographic points impossible. We therefore used the Google Maps API to reverse geocode the coordinates and obtain the address of each crime occurrence, and then extracted the ZIP codes for mapping. The full code for reverse geocoding the longitude and latitude information can be found in our repository on GitHub.
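A minimal sketch of this step, assuming the address comes back as a US-style formatted string in which the ZIP code follows a two-letter state abbreviation. The helpers `extract_zip` and `reverse_geocode_zip` are hypothetical names for this example; `ggmap::revgeocode()` and `ggmap::register_google()` are the ggmap calls it relies on, and the API key and coordinates are placeholders.

```r
# Pull the five-digit ZIP code out of formatted address strings.
# Assumes the ZIP directly follows a two-letter state code, e.g. "NY 10038".
extract_zip <- function(address) {
  hit <- regexpr("(?<=[A-Z]{2} )[0-9]{5}", address, perl = TRUE)
  found <- !is.na(hit) & hit > 0
  zip <- rep(NA_character_, length(address))
  zip[found] <- regmatches(address, hit)
  zip
}

# Reverse geocode one point and return its ZIP code.
# Needs network access, the ggmap package and a registered Google API key.
reverse_geocode_zip <- function(lon, lat) {
  # revgeocode() expects c(longitude, latitude), in that order.
  address <- ggmap::revgeocode(c(lon, lat), output = "address")
  extract_zip(address)
}

# Uncomment to run against the API (placeholder key and coordinates):
# ggmap::register_google(key = "YOUR_API_KEY")
# reverse_geocode_zip(-74.0014, 40.7128)

# The extraction itself can be tried offline:
zips <- extract_zip(c("1 Police Plaza, New York, NY 10038, USA", "no zip here"))
```

Each call to the API handles one coordinate pair, so for a large table it is worth reverse geocoding only the distinct coordinate pairs and joining the results back.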

Usage

After cleaning, subsetting and reverse geocoding, we saved the resulting dataset in rds format. You can download the resulting datasets from this Google Drive. The code for reading the datasets can also be found in the file acquiringdata.Rmd in our repository.
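For readers unfamiliar with the rds format, the sketch below shows the round trip with base R's `saveRDS()` and `readRDS()`. The file name and the two example rows are made up for illustration; substitute the path of the file you downloaded.

```r
# Write a small data frame to .rds and read it back.
# The file name and the example rows are placeholders.
rds_path <- file.path(tempdir(), "complaints_clean.rds")

example_df <- data.frame(
  cmplnt_num = c("100000001", "100000002"),
  zip = c("10038", "11201"),
  stringsAsFactors = FALSE
)

saveRDS(example_df, rds_path)

# readRDS() returns the object, so assign it to whatever name you like.
complaints <- readRDS(rds_path)
str(complaints)
```

Unlike CSV, an rds file preserves column types such as dates and factors, so no re-parsing is needed after loading.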

Data Dictionary

The list below provides all 17 variables in the datasets with their brief descriptions:

cmplnt_num: randomly generated persistent ID for each complaint

boro_nm: the name of the borough in which the incident occurred

cmplnt_fr_dt: exact start date of occurrence for the reported incident

cmplnt_to_dt: exact end date of occurrence for the reported incident

cmplnt_fr_tm: exact time of occurrence for the reported incident

latitude: midblock latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)

longitude: midblock longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)

ky_cd: three-digit offense classification code

ofns_desc: description of offense corresponding with key code

pd_cd: three-digit internal classification code (more granular than Key Code)

pd_desc: description of internal classification corresponding with PD code

vic_race: victim’s race description

vic_sex: victim’s sex description (D=Business/Organization, E=PSNY/People of the State of New York, F=Female, M=Male)

year: year the incident occurred

prem_typ_desc: specific description of premises; grocery store, residence, street, etc.

crime_group: sex-related felony offenses, drug-related felony offenses, weapon-related felony offenses

zip: ZIP code taken from the reverse geocoded address of the incident
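After loading a dataset, a quick check against the variable names listed above can catch a mismatched or outdated file. The helper `check_columns` is a hypothetical name for this sketch, and the toy data frame stands in for a loaded dataset.

```r
# The 17 variable names from the data dictionary.
expected_cols <- c(
  "cmplnt_num", "boro_nm", "cmplnt_fr_dt", "cmplnt_to_dt", "cmplnt_fr_tm",
  "latitude", "longitude", "ky_cd", "ofns_desc", "pd_cd", "pd_desc",
  "vic_race", "vic_sex", "year", "prem_typ_desc", "crime_group", "zip"
)

# Report which expected columns are absent and which extra ones are present.
check_columns <- function(df, expected = expected_cols) {
  list(
    missing = setdiff(expected, names(df)),
    unexpected = setdiff(names(df), expected)
  )
}

# Toy example: one expected column and one stray column.
toy <- data.frame(zip = "10038", extra = 1, stringsAsFactors = FALSE)
report <- check_columns(toy)
```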