Program 13: EMS Queries. Due noon, Thursday, 12 March.
This program continues the analysis of the emergency services calls that was introduced in Program 12.
We will be using Structured Query Language (SQL) to subset and aggregate data (see Lectures 25 & 26 and DS 100: Chapter 7). For this program, we will use the pandasql that provides an easy way to query pandas DataFrames using SQL syntax. To use it, you may need to install it on your machine (e.g.
The package pandasql provides an easy way to use SQL queries directly on a Pandas DataFrame. (You may need to install it in your hierarchy (e.g.
Once installed, you can run queries via the function Note that strings need to be surrounded by quotes in your query
(e.g. The assignment is broken into the following functions to allow for unit testing:
For example, if we use the dataset of midnight calls from January 2021:
The first function uses SQL to select the column containing the borough name:
The next function, takes the DataFrame and the name of a borough and uses SQL to select all rows with that borough name. For example, we can select We can use SQL to calculate how many incidents were called in on New Year's Day in Queens:
We can calculate summary counts for the DataFrame:
As well which are the 10 most common for Brooklyn and Staten Island:
Learning Objective: To reinforce new SQL skills to query and aggregate data.
Available Libraries: pandas, pandasql, and core Python 3.6+.
Data Sources: 911 System Calls (NYC OpenData)
Sample Datasets:
pip install pandasql
or pip3q install pandasql
). See pandasql for installation details.
pip install pandasql
or pip install pandasql
).
sqldf(queryName)
. For example, you could filter for all the neighborhood tabulation areas in the Bronx in the NYC_population_by_NTA.csv on the waitlist by:
which prints:
import pandas as pd
import pandasql as psql
pop = pd.read_csv('NYC_population_by_NTA.csv')
q = 'SELECT * FROM pop WHERE Borough = "Bronx"'
bronx = psql.sqldf(q)
print(bronx.head())
Borough Year ... NTA Name Population
0 Bronx 2000 ... Claremont-Bathgate 28149
1 Bronx 2000 ... Eastchester-Edenwald-Baychester 35422
2 Bronx 2000 ... Bedford Park-Fordham North 55329
3 Bronx 2000 ... Belmont 25967
4 Bronx 2000 ... Bronxdale 34309
[5 rows x 6 columns]
"Waitlist Student"
). To set up your queries, you may find Formatted Literal Strings ("f-strings") useful (see the python tutorial or Lecture 2 for more details).
make_df(file_name)
:
This function takes one input:
The data is read into a DataFrame. Rows that are have null values for the type description, incident date, incident time, borough name are dropped.
The resulting DataFrame is returned.
file_name
: the name of a CSV file containing 911 System Calls from OpenData NYC.
Hint: this is slightly different than the function from Program 12 in that different rows are dropped and it does not restrict to only ambulance calls.
select_boro_column(df)
:
This function takes one input:
Selects, using SQL, the df
: a DataFrame containing 911 System Calls from OpenData NYC created by make_df
.
BORO_NM
column from df
.
Returns the resulting DataFrame from the SQL query.
select_by_boro(df, boro_name)
:
This function takes two inputs:
Selects, using SQL, all rows from the DataFrame, df
: a DataFrame containing 911 System Calls from OpenData NYC created by make_df
.
boro_name
: a string containing the name of a borough. Can be lower or upper case.
df
, where the borough is boro_name
.
Returns the resulting DataFrame from the SQL query.
Hint: Since the boro_name
can be upper or lower case, convert it to upper case to match the way the borough names are stored.
new_years_count(df, boro_name)
:
This function takes two inputs:
Selects, using SQL, the number of incidents from df
: a DataFrame containing 911 System Calls from OpenData NYC created by make_df
.
boro_name
: a string containing the name of a borough. Can be lower or upper case.
df
, called in on New Year's Day (Jan 1, 2021) in the specified borough, boro_name
.
Returns the resulting DataFrame from the SQL query.
Hint: Since the boro_name
can be upper or lower case, convert it to upper case to match the way the borough names are stored.
incident_counts(df)
:
This function takes one inputs:
Selects, using SQL, the incident counts per radio code (df
: a DataFrame containing 911 System Calls from OpenData NYC created by make_df
.
TYP_DESC
), sorted alphabetically by radio code (TYP_DESC
). Returns the resulting DataFrame from the SQL query.
top_10(df, boro_name)
:
This function takes two inputs:
Selects, using SQL, the top 10 most commonly occurring incidence by radio code, and the number of incident occurrences, in specified borough. Returns the resulting DataFrame from the SQL query.
df
: a DataFrame containing 911 System Calls from OpenData NYC created by make_df
.
boro_name
: a string containing the name of a borough. Can be lower or upper case.
Hint: Since the boro_name
can be upper or lower case, convert it to upper case to match the way the borough names are stored.
would print:
df = make_df('NYPD_Calls_midnight_Jan2021.csv')
print(df[['BORO_NM','RADIO_CODE','TYP_DESC']])
BORO_NM RADIO_CODE TYP_DESC
0 BROOKLYN 10S2 INVESTIGATE/POSSIBLE CRIME: SHOTS FIRED/OUTSIDE
1 QUEENS 68Q1 SEE COMPLAINANT: OTHER/INSIDE
2 BRONX 54E1 AMBULANCE CASE: EDP/INSIDE
3 BRONX 54E1 AMBULANCE CASE: EDP/INSIDE
4 BROOKLYN 24Q6 ASSAULT (PAST): OTHER/FAMILY
... ... ... ...
5582 MANHATTAN 54E1 AMBULANCE CASE: EDP/INSIDE
5583 MANHATTAN 54E1 AMBULANCE CASE: EDP/INSIDE
5584 MANHATTAN 75D VISIBILITY PATROL: DIRECTED
5585 BROOKLYN 75D VISIBILITY PATROL: DIRECTED
5586 QUEENS 75D VISIBILITY PATROL: DIRECTED
[5587 rows x 3 columns]
would print:
boros = select_boro_column(df)
print(boros)
BORO_NM
0 BROOKLYN
1 QUEENS
2 BRONX
3 BRONX
4 BROOKLYN
... ...
5582 MANHATTAN
5583 MANHATTAN
5584 MANHATTAN
5585 BROOKLYN
5586 QUEENS
[5587 rows x 1 columns]
Bronx
would print:
df_bx = select_by_boro(df, "Bronx")
print(df_bx)
CAD_EVNT_ID CREATE_DATE ... Latitude Longitude
0 73973992 01/01/2021 ... 40.830860 -73.902887
1 73973992 01/01/2021 ... 40.830860 -73.902887
2 73974010 01/01/2021 ... 40.840868 -73.925150
3 73974019 01/01/2021 ... 40.870439 -73.890847
4 73974033 01/01/2021 ... 40.856654 -73.843872
... ... ... ... ... ...
1007 74629410 02/01/2021 ... 40.859192 -73.900677
1008 74629421 02/01/2021 ... 40.834562 -73.915443
1009 74630917 02/01/2021 ... 40.825472 -73.892941
1010 74642928 02/01/2021 ... 40.859640 -73.863235
1011 74643746 02/01/2021 ... 40.870371 -73.851975
[1012 rows x 18 columns]
would print:
df_nyd_q = new_years_count(df, "Queens")
print(df_nyd_q)
COUNT(*)
0 40
would print:
df_inc = incident_counts(df)
print(df_inc)
TYP_DESC COUNT(*)
0 ALARMS: AUDIBLE/INSIDE 1
1 ALARMS: AUDIBLE/OUTSIDE 6
2 ALARMS: AUDIBLE/TRANSIT 5
3 ALARMS: BANK/BURGLARY 3
4 ALARMS: COMMERCIAL/BURGLARY 146
.. ... ...
181 VISIBILITY PATROL: DIRECTED 1334
182 VISIBILITY PATROL: FAMILY/HOME VISIT 44
183 VISIBILITY PATROL: INTERIOR 61
184 VISIBILITY PATROL:PUBLIC/PRIVATE EDUCATIONAL F... 4
185 YOUTH HOME VISIT 5
[186 rows x 2 columns]
would print:
df_si = top_10(df, "Staten Island")
print('Top 10 for Staten Island:')
print(df_si)
df_bk = top_10(df, "Brooklyn")
print('\nTop 10 for Brooklyn:')
print(df_bk)
Top 10 for Staten Island:
TYP_DESC COUNT(*)
0 VISIBILITY PATROL: DIRECTED 97
1 COMMUNITY TIME 19
2 SEE COMPLAINANT: OTHER/INSIDE 18
3 AMBULANCE CASE: EDP/INSIDE 10
4 INVESTIGATE/POSSIBLE CRIME: SERIOUS/OTHER 8
5 DISPUTE: INSIDE 8
6 BUS INVESTIGATION 7
7 AMBULANCE CASE: SERIOUS/INSIDE 7
8 DISPUTE: FAMILY 5
9 DISORDERLY: GROUP/OUTSIDE 5
Top 10 for Brooklyn:
TYP_DESC COUNT(*)
0 VISIBILITY PATROL: DIRECTED 503
1 STATION INSPECTION BY TRANSIT BUREAU PERSONNEL 189
2 SEE COMPLAINANT: OTHER/INSIDE 189
3 TRAIN RUN/MOBILE ORDER MAINTENANCE SWEEP 72
4 AMBULANCE CASE: EDP/INSIDE 60
5 INVESTIGATE/POSSIBLE CRIME: SUSP VEHICLE/OUTSIDE 52
6 INVESTIGATE/POSSIBLE CRIME: CALLS FOR HELP/INSIDE 48
7 ALARMS: COMMERCIAL/BURGLARY 44
8 VISIBILITY PATROL: INTERIOR 39
9 INVESTIGATE/POSSIBLE CRIME: SERIOUS/OTHER 36