CSci 39542 Syllabus    Resources    Coursework



Program 13: EMS Queries
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022


Classwork    Quizzes    Homework    Project   

Program Description


Program 13: EMS Queries.Due noon, Thursday, 12 March.
Learning Objective: To reinforce new SQL skills to query and aggregate data.
Available Libraries: pandas, pandasql, and core Python 3.6+.
Data Sources:
911 System Calls (NYC OpenData)
Sample Datasets:


This program continues the analysis of the emergency services calls that was introduced in Program 12.

We will be using Structured Query Language (SQL) to subset and aggregate data (see Lectures 25 & 26 and DS 100: Chapter 7). For this program, we will use the pandasql that provides an easy way to query pandas DataFrames using SQL syntax. To use it, you may need to install it on your machine (e.g. pip install pandasql or pip3q install pandasql). See pandasql for installation details.

The package pandasql provides an easy way to use SQL queries directly on a Pandas DataFrame. (You may need to install it in your hierarchy (e.g. pip install pandasql or pip install pandasql).

Once installed, you can run queries via the function sqldf(queryName). For example, you could filter for all the neighborhood tabulation areas in the Bronx in the NYC_population_by_NTA.csv on the waitlist by:

import pandas as pd
import pandasql as psql
pop = pd.read_csv('NYC_population_by_NTA.csv')
q = 'SELECT * FROM pop WHERE Borough = "Bronx"'
bronx = psql.sqldf(q)
print(bronx.head())
which prints:
  Borough  Year  ...                         NTA Name Population
0   Bronx  2000  ...               Claremont-Bathgate      28149
1   Bronx  2000  ...  Eastchester-Edenwald-Baychester      35422
2   Bronx  2000  ...       Bedford Park-Fordham North      55329
3   Bronx  2000  ...                          Belmont      25967
4   Bronx  2000  ...                        Bronxdale      34309

[5 rows x 6 columns]

Note that strings need to be surrounded by quotes in your query (e.g. "Waitlist Student"). To set up your queries, you may find Formatted Literal Strings ("f-strings") useful (see the python tutorial or Lecture 2 for more details).

The assignment is broken into the following functions to allow for unit testing:


For example, if we use the dataset of midnight calls from January 2021:

df = make_df('NYPD_Calls_midnight_Jan2021.csv')
print(df[['BORO_NM','RADIO_CODE','TYP_DESC']])
would print:
        BORO_NM RADIO_CODE                                         TYP_DESC
0      BROOKLYN       10S2  INVESTIGATE/POSSIBLE CRIME: SHOTS FIRED/OUTSIDE
1        QUEENS       68Q1                    SEE COMPLAINANT: OTHER/INSIDE
2         BRONX       54E1                       AMBULANCE CASE: EDP/INSIDE
3         BRONX       54E1                       AMBULANCE CASE: EDP/INSIDE
4      BROOKLYN       24Q6                     ASSAULT (PAST): OTHER/FAMILY
...         ...        ...                                              ...
5582  MANHATTAN       54E1                       AMBULANCE CASE: EDP/INSIDE
5583  MANHATTAN       54E1                       AMBULANCE CASE: EDP/INSIDE
5584  MANHATTAN        75D                      VISIBILITY PATROL: DIRECTED
5585   BROOKLYN        75D                      VISIBILITY PATROL: DIRECTED
5586     QUEENS        75D                      VISIBILITY PATROL: DIRECTED

[5587 rows x 3 columns]

The first function uses SQL to select the column containing the borough name:

boros = select_boro_column(df)
print(boros)
would print:
        BORO_NM
0      BROOKLYN
1        QUEENS
2         BRONX
3         BRONX
4      BROOKLYN
...         ...
5582  MANHATTAN
5583  MANHATTAN
5584  MANHATTAN
5585   BROOKLYN
5586     QUEENS

[5587 rows x 1 columns]

The next function, takes the DataFrame and the name of a borough and uses SQL to select all rows with that borough name. For example, we can select Bronx

df_bx = select_by_boro(df, "Bronx")
print(df_bx)
would print:
      CAD_EVNT_ID CREATE_DATE  ...   Latitude  Longitude
0        73973992  01/01/2021  ...  40.830860 -73.902887
1        73973992  01/01/2021  ...  40.830860 -73.902887
2        73974010  01/01/2021  ...  40.840868 -73.925150
3        73974019  01/01/2021  ...  40.870439 -73.890847
4        73974033  01/01/2021  ...  40.856654 -73.843872
...           ...         ...  ...        ...        ...
1007     74629410  02/01/2021  ...  40.859192 -73.900677
1008     74629421  02/01/2021  ...  40.834562 -73.915443
1009     74630917  02/01/2021  ...  40.825472 -73.892941
1010     74642928  02/01/2021  ...  40.859640 -73.863235
1011     74643746  02/01/2021  ...  40.870371 -73.851975

[1012 rows x 18 columns]

We can use SQL to calculate how many incidents were called in on New Year's Day in Queens:

df_nyd_q = new_years_count(df, "Queens")
print(df_nyd_q)
would print:
   COUNT(*)
0        40

We can calculate summary counts for the DataFrame:

df_inc = incident_counts(df)
print(df_inc)
would print:
                                              TYP_DESC  COUNT(*)
0                               ALARMS: AUDIBLE/INSIDE         1
1                              ALARMS: AUDIBLE/OUTSIDE         6
2                              ALARMS: AUDIBLE/TRANSIT         5
3                                ALARMS: BANK/BURGLARY         3
4                          ALARMS: COMMERCIAL/BURGLARY       146
..                                                 ...       ...
181                        VISIBILITY PATROL: DIRECTED      1334
182               VISIBILITY PATROL: FAMILY/HOME VISIT        44
183                        VISIBILITY PATROL: INTERIOR        61
184  VISIBILITY PATROL:PUBLIC/PRIVATE EDUCATIONAL F...         4
185                                   YOUTH HOME VISIT         5

[186 rows x 2 columns]

As well which are the 10 most common for Brooklyn and Staten Island:

df_si = top_10(df, "Staten Island")
print('Top 10 for Staten Island:')
print(df_si)
df_bk = top_10(df, "Brooklyn")
print('\nTop 10 for Brooklyn:')
print(df_bk)
would print:
Top 10 for Staten Island:
                                    TYP_DESC  COUNT(*)
0                VISIBILITY PATROL: DIRECTED        97
1                             COMMUNITY TIME        19
2              SEE COMPLAINANT: OTHER/INSIDE        18
3                 AMBULANCE CASE: EDP/INSIDE        10
4  INVESTIGATE/POSSIBLE CRIME: SERIOUS/OTHER         8
5                            DISPUTE: INSIDE         8
6                          BUS INVESTIGATION         7
7             AMBULANCE CASE: SERIOUS/INSIDE         7
8                            DISPUTE: FAMILY         5
9                  DISORDERLY: GROUP/OUTSIDE         5

Top 10 for Brooklyn:
                                            TYP_DESC  COUNT(*)
0                        VISIBILITY PATROL: DIRECTED       503
1     STATION INSPECTION BY TRANSIT BUREAU PERSONNEL       189
2                      SEE COMPLAINANT: OTHER/INSIDE       189
3           TRAIN RUN/MOBILE ORDER MAINTENANCE SWEEP        72
4                         AMBULANCE CASE: EDP/INSIDE        60
5   INVESTIGATE/POSSIBLE CRIME: SUSP VEHICLE/OUTSIDE        52
6  INVESTIGATE/POSSIBLE CRIME: CALLS FOR HELP/INSIDE        48
7                        ALARMS: COMMERCIAL/BURGLARY        44
8                        VISIBILITY PATROL: INTERIOR        39
9          INVESTIGATE/POSSIBLE CRIME: SERIOUS/OTHER        36