Program 2: Tree Census. Due 10am, Wednesday, 9 February.
Note on data access: The data for this program comes from the NYC OpenData initiative which gives free access to public data published by New York City agencies. If you have not used NYC OpenData before, see the
How To page. Access to the data is free, but some functionality (such as multiple direct requests in very short amount of time might require a login). The examples below demonstrate using the data using the Socrata Open Data API (SODA) which allows you to pull data directly from the public repository (more details at Getting Started with SODA Consumer API) as well as reading from the CSV files that have already been downloaded to your computer or repo.
The New York City Street Tree Map Every ten years, New York City does a census of all street trees in the city. The most recent survey in 2015 mapped over 666,000 trees and included the location, size, species, and health of the trees. Similar data was collected each year but was stored slightly differently, most notably: some of the column names differ form 1995 to 2005 to 2015.
Learning Objective: to refresh students' knowledge of Pandas' functionality to manipulate and create columns from formatted data.
Available Libraries: Pandas and core Python 3.6+.
Data Sources: The New York City Street TreesCount Project
Sample Datasets:
clean_df(df, year = 2015)
:
This function takes two inputs:
df
: the name of a DataFrame containing TreesCount Data from OpenData NYC.
year
: the year of the data set. There are three possible years 1995
, 2005
, or 2015
. The default value is 2015
.2015
, the function should take df
and drop all columns except:
['tree_dbh', 'health', 'spc_latin', 'spc_common', 'address', 'zipcode', 'boroname', 'nta', 'latitude', 'longitude',
'council_district', 'census_tract']
2005
, the function should take df
and drop all columns except:
['tree_dbh', 'status', 'spc_latin', 'spc_common', 'address', 'zipcode', 'boroname', 'nta', 'latitude', 'longitude',
'cncldist', 'census_tract']
and rename the corresponding columns that differ from 2015 to the 2015 names. For example, status
is renamed to health
.1995
, the function should take df
and drop all columns except:
['diameter', 'condition', 'spc_latin', 'spc_common', 'address', 'zip_original', 'borough', 'nta_2010', 'latitude', 'longitude',
'council_district', 'censustract_2010']
and rename the corresponding columns that differ from 2015 to the 2015 names. For example, diameter
is renamed to tree_dbh
.filter_health(df, keep)
:
This function takes two inputs:
df
: a DataFrame that includes the health
column.
keep
: a list of values for the
health
column.
health
contains a value from the list keep
. All rows where the health
column contains a different value are dropped.
add_indicator(row)
:
This function takes one input:
row
: a Series (a row) containing values for tree_dbh
and health
.
1
if health
is not Poor
and tree_dbh
is larger than 10
. Otherwise, it should return 0
.
find_trees(df, species)
:
This function takes two inputs:
df
: a DataFrame that includes the spc_latin
column and the address
column.
species
: a string containing the Latin name of a tree.
address
for all trees of that species
in spc_latin
. If that species
does not occur in the DataFrame, then an empty list is returned.
count_by_area(df, area = "boroname")
:
This function takes two inputs:
df
: a DataFrame that includes the area
column.
area
: the name of a column in df
.
The default value is "boroname"
.
area
. For example if area = "boroname"
, your function should group by boroname
and return the number of each trees in each of the boroughs.
Let's run through some testing code to check if your program is written correctly.
For example, to open the 2015 dataset, let's use the SODA API to directly pull the data. We can also download the file, but since it has about a half million records, it is quite large (220MB), so, working with just the first 10,000 lines:
url = 'https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=10000'
data = pd.read_json(url)
print(data)
will print:
tree_id block_id created_at tree_dbh stump_diam ... y_sp council_district census_tract bin bbl
0 180683 348711 2015-08-27 3 0 ... 202756.7687 29.0 739.0 4052307.0 4.022210e+09
1 200540 315986 2015-09-03 21 0 ... 228644.8374 19.0 973.0 4101931.0 4.044750e+09
2 204026 218365 2015-09-05 3 0 ... 200716.8913 34.0 449.0 3338310.0 3.028870e+09
3 204337 217969 2015-09-05 10 0 ... 199244.2531 34.0 449.0 3338342.0 3.029250e+09
4 189565 223043 2015-08-30 21 0 ... 182202.4260 39.0 165.0 3025654.0 3.010850e+09
... ... ... ... ... ... ... ... ... ... ... ...
9995 213226 107016 2015-09-10 3 0 ... 227105.0006 6.0 175.0 1033301.0 1.012360e+09
9996 207273 222814 2015-09-06 16 0 ... 196176.2566 37.0 443.0 3074614.0 3.032720e+09
9997 207274 222814 2015-09-06 11 0 ... 196116.3506 37.0 443.0 3074610.0 3.032720e+09
9998 208443 225537 2015-09-07 6 0 ... 180013.4704 38.0 149.0 3016810.0 3.008750e+09
9999 195808 105937 2015-09-01 11 0 ... 215863.8898 5.0 10601.0 1078278.0 1.013700e+09
[10000 rows x 45 columns]
Note that the default only pulls the first 1000 lines. We have increased the limit to 10,000. See Getting Started with SODA API for more details.
Let's test the clean_df()
function with the first 10,000 entries of the 2015 data:
df = clean_df(data)
print(f'The new columns are: {df.columns}.')
will print:
The new columns are: Index(['tree_dbh', 'health', 'spc_latin', 'spc_common', 'address', 'zipcode',
'boroname', 'nta', 'latitude', 'longitude', 'council_district',
'census_tract'],
Next, let's use the first 1000 entries (default is 1000 entries if limit is not specified) of the 2005 Tree Census:
url05 = 'https://data.cityofnewyork.us/resource/29bw-z7pj.json'
data05= pd.read_json(url05)
df05 = clean_df(data05, year = 2005)
print(f'The new columns are: {df05.columns}.')
will print:
The new columns are: Index(['tree_dbh', 'health', 'spc_latin', 'spc_common', 'address', 'zipcode',
'boroname', 'nta', 'latitude', 'longitude', 'council_district',
'census_tract'],
dtype='object').
Similarly, for the first 1000 entries of the 1995 Tree Census:
url95 = 'https://data.cityofnewyork.us/resource/kyad-zm4j.json'
data95= pd.read_json(url95)
df95 = clean_df(data95, year = 1995)
print(df95)
will print:
tree_dbh health spc_latin spc_common address zipcode boroname nta latitude longitude council_district census_tract
0 8 Unknown PLATANUS ACERIFOLIA LONDON PLANETREE 245 E 17 ST 10003 Manhattan MN21 40.734551 -73.984235 2.0 48
1 7 Good ACER PLATANOIDES MAPLE, NORWAY 80 N MOORE ST 10013 Manhattan MN24 40.720159 -74.010532 1.0 39
2 6 Good ACER PLATANOIDES MAPLE, NORWAY 80 N MOORE ST 10013 Manhattan MN24 40.720159 -74.010532 1.0 39
3 7 Excellent ACER PLATANOIDES MAPLE, NORWAY 80 N MOORE ST 10013 Manhattan MN24 40.720159 -74.010532 1.0 39
4 6 Good ACER PLATANOIDES MAPLE, NORWAY 80 N MOORE ST 10013 Manhattan MN24 40.720159 -74.010532 1.0 39
.. ... ... ... ... ... ... ... ... ... ... ... ...
995 4 Good AILANTHUS ALTISSIMA TREE OF HEAVEN 342 E 13 ST 10003 Manhattan MN22 40.731228 -73.984424 2.0 40
996 9 Good SOPHORA JAPONICA JAPANESE PAGODA TREE 21 SPRING ST 10003 Manhattan MN24 40.721486 -73.995137 1.0 43
997 3 Good CORNUS MAS CHERRY, CORNELIAN 344 E 13 ST 10003 Manhattan MN22 40.731209 -73.984378 2.0 40
998 10 Poor SOPHORA JAPONICA JAPANESE PAGODA TREE 21 SPRING ST 10003 Manhattan MN24 40.721486 -73.995137 1.0 43
999 1 Poor GLEDITSIA TRIACANTHOS HONEYLOCUST 339 E 13 ST 10003 Manhattan MN22 40.731268 -73.984518 2.0 40
[1000 rows x 12 columns]
Continuing with our 2015 dataset, let's filter by health and compute the percentage of healthy trees:
df_h = filter_health(df, ['Fair','Good', 'Excellent'])
healthy = 100*len(df_h)/len(df)
print(f"In 2015, {healthy}% trees were healthy.")
will print:
In 2015, 88.54% trees were healthy.
Adding a new column using add_indicator()
print(df.iloc[0])
print(f'First tree has indicator {add_indicator(df.iloc[0])}'
df['Mature Trees'] = df.apply(add_indicator, axis=1)
print(f"New column:\n {df['Mature Trees']}")
will print:
First tree has indicator 0
New column:
0 0
1 1
2 0
3 0
4 1
..
9995 0
9996 1
9997 1
9998 0
9999 1
Name: Mature Trees, Length: 10000, dtype: int64
Using the find_trees()
, we can locate all the American elm trees in the 2015 sample:
print('American elm trees in our 2015 sample:')
print(find_trees(df, "Ulmus americana"))
will print:
American elm trees in our 2015 sample:
['75 FEATHERBED LANE', '1840 GRAND CONCOURSE', '13 WEST 61 STREET', '15 WEST 81 STREET', '475 RIVERSIDE DRIVE', '70 EAST 183 STREET', '1936 ANTHONY AVENUE', '110-022 COLONIAL AVENUE', '87-019 162 AVENUE', '415 RIVERSIDE DRIVE', '315 LEXINGTON AVENUE', '408 COLUMBUS AVENUE', '13 WEST 61 STREET', '485 OCEAN AVENUE', '515 OCEAN AVENUE', '127 EAST 38 STREET', '520 4 STREET', '505 OCEAN AVENUE', '220 CARLTON AVENUE', '21 HETT AVENUE', '133 PITT STREET', '34 DOVER STREET', '128 PITT STREET', '116-003 85 AVENUE', '499 OCEAN AVENUE', '1 WEST 81 STREET', '316 WEST 95 STREET', '135 WOLVERINE STREET', '318 WEST 90 STREET', '315 LEXINGTON AVENUE', '156 EAST 100 STREET', '51 WEST 81 STREET', '520 4 STREET', '97-033 DREW STREET', '33 HETT AVENUE', '49 8 AVENUE', '1 WEST 81 STREET', '4201 9 AVENUE', '471 OCEAN AVENUE', '1 WEST 85 STREET', '1 DUNHAM PLACE', '248 COURT STREET', '51 WEST 81 STREET', '1 WEST 81 STREET', '305 PARK AVENUE', '116-002 85 AVENUE', '1940 ADAM C POWELL BOULEVARD', '471 OCEAN AVENUE', '11 WEST 81 STREET', '4201 9 AVENUE', '1660 DEKALB AVENUE', '1332 METROPOLITAN AVENUE', '646 39 STREET', '70 WEST 93 STREET', '211-007 82 AVENUE', '305 PARK AVENUE', '50-062 45 STREET', '54-002 32 AVENUE', '277 WEST 4 STREET', '55 MONTGOMERY PLACE', '170 RIVERSIDE DRIVE', '31-021 90 STREET', '1681 DEKALB AVENUE', '30 FINLEY AVENUE', '4301 9 AVENUE', '31-033 90 STREET', '1 LEWIS AVENUE', '300 WEST 115 STREET', '328 WEST 115 STREET', '34 FINLEY AVENUE']
The function count_by_area()
can be called with different granularities of areas. For example, here is the number of trees by borough (boroname) and by neighborhood (NTA):
print('In 2015, the number of trees by borough')
print(count_by_area(df))
print()
print('In 2015, the number of trees by NTA')
print(count_by_area(df, area = 'nta'))
will print:
In 2015, the number of trees by borough
boroname
Bronx 889
Brooklyn 2930
Manhattan 1812
Queens 3262
Staten Island 1107
dtype: int64
In 2015, the number of trees by NTA
nta
BK09 3
BK17 49
BK19 24
BK21 21
BK23 7
...
SI36 57
SI37 12
SI45 155
SI48 131
SI54 70
Name: nta, Length: 179, dtype: int64
Notes: