Classwork: Regular Expressions

MHC 250/Seminar 4:
Shaping the Future of New York City
Spring 2017

Out-of-Class (Snow Day Make-up)

Since we missed a day of class due to the snow, we have an "asynchronous learning activity" to replace the missing instructional time. It is modeled on our typical class, and contains the following activities:

Discussion

For the discussion, use the discussion board under the "Classwork" folder.
  1. Complete the reading and associated question on Homework #10.
  2. Share an image of the property that you focused on for your rezoning, as well as its address, and its zoning under the new regulations. Include your plan and why it would enhance the urban experience.
  3. For two other properties, suggest a possible new building and explain why it is possible and why it would improve the livability of the neighborhood. That is, submit a comment in the discussion, with your suggestions.
  4. Pick another property that has its three possible designs uploaded and summarize the common features and contrast the differences between them. That is, choose a property that has three designs in it, and add your compare/contrast summary as a comment. Note: only one summary per entry (if someone has already written one for a property, find another one to summarize).

Tech Skills

The tech skills for today focus on extracting information from strings, files, and webpages. While we have extracted information from strings and files throughout the term, this session starts with review and moves to new ways to find patterns (called "regular expressions") as well as useful ways to scrape webpages (such as with the popular "Beautiful Soup" package).

Useful String Methods

A standard use of Python is to manipulate strings. Here are some useful methods for strings that we have used over the semester:
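For a quick refresher, here is a small sketch (the example string is made up for illustration; the methods shown are ones we have leaned on most: strip(), lower(), count(), find(), replace(), and split()):

quote = "  New York, New York  "

print(quote.strip())               # "New York, New York"
print(quote.lower())               # "  new york, new york  "
print(quote.count("New"))          # 2
print(quote.find("York"))          # 6, the index of the first "York"
print(quote.replace("New", "Old")) # "  Old York, Old York  "
print(quote.split(","))            # ["  New York", " New York  "]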

Challenges:

You can check if your answers are right by typing them into a Python shell.

More Tech Skills: Regular Expressions

Regular expressions provide a powerful way to find patterns, in particular, those that might vary in length. Patterns can be as simple as a single word, or a series of strings that can occur a fixed or varying number of times. For example, if you were searching for any number of the letter a, you could write:

	a*
which says you are looking for 0 or more a's. Similarly, if you wanted a word repeated:
	(hi)*
This pattern will match any number of copies of the word hi, such as: hi, hihihihihi, etc.
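As a quick check, you can try these patterns in a Python shell (a minimal sketch; re.fullmatch() succeeds only if the whole string fits the pattern):

import re

print(re.fullmatch("a*", "aaaa"))       # a match object: four a's fit the pattern
print(re.fullmatch("(hi)*", "hihihi"))  # a match object: three copies of "hi"
print(re.fullmatch("(hi)*", "hiho"))    # None: "ho" breaks the pattern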

This kind of pattern searching is quite useful in many fields, including biology (yup, this was added just for all the biology majors in the class). For example, in a DNA sequence, small patterns can occur a varying number of times (short tandem repeat polymorphisms). To find all runs of A's and T's that are at least 4 bases long (up to 100), we can use a regular expression:

import re
dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG" 
runs = re.findall("[AT]{4,100}", dna) 
print(runs)
To check whether a pattern occurs in a string, we can use re.search():
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
(for the biologists in the class: more examples like this from Python for Biologists).

The re library is distributed with Python. We will use two useful functions in the library: re.findall(), which returns a list of all the substrings that match a pattern, and re.search(), which returns a match object for the first occurrence of a pattern (or None if there is no match).

We often want more information than just whether a pattern occurred. To find the matching text and its starting (and stopping) location, we can use the match object that re.search() returns. Its most useful methods are group(), which returns the text that matched; start(), which returns the index where the match begins; and end(), which returns the index just past the end of the match.

(see Python regex tutorial for more details).

From our example above, we could store the match object:

m = re.search(r"GC[ATGC]GC", dna)
print("The matching string is", m.group())
print("Match starts at", m.start())
print("Match ends at", m.end())

These are more general (and more powerful) tools than the string methods above. In many cases, either can be used. For finding approximate matches or matches of varying lengths, using regular expressions is much easier. Here's a regex cheat sheet with an overview of the most common commands.
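For example (a small illustrative sketch with a made-up string): the string method find() can only locate an exact substring, while a regular expression can locate every match of a varying pattern:

import re

note = "Call 212-555-0123 or 646-555-0199 for details."

print(note.find("212-555-0123"))              # 5, but only works for an exact substring
print(re.findall(r"\d{3}-\d{3}-\d{4}", note)) # ['212-555-0123', '646-555-0199']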

Challenges

Try the following challenges (use the cheat sheet or google if you get stuck):

Tech Skills: Scraping Webpages

Webpages are formatted using the HyperText Markup Language (HTML), which is a series of commands describing how the page should be formatted, along with links and embedded calls to programs. For example, if you would like a word to show up in bold, you surround it with "tags" that say what to do:

<b>bold</b>
The opening tag starts the bold text and the closing tag (indicated by the '/') ends the bold text. Most HTML commands follow this same style: there's an opening tag, paired with a closing tag that includes a '/' and the same name.

We can access files stored on webpages inside Python. The built-in urllib module has functions to mimic the file I/O we have seen before. If we are reading in a CSV file, we can use pandas directly (see the citiBike example in Classwork 9).

Let's say we want to make a list of all the seminars at the American Museum of Natural History (we're using these, since I like to go to their seminars, and the format is a bit easier than the MHC events page, which is scattered across multiple pages). We can "scrape the data" on the comparative biology seminar page into our Python program. We can then search for strings that look like dates and print out the subsequent lines. The interface is very similar to opening files:

  1. Use urllib.request.urlopen() to open the webpage.
  2. Then use read(), readline(), or readlines() to traverse the file (see the sketch below).
(If you are going to be parsing lots and lots of HTML files, you should consider Beautiful Soup, which does a great job handling badly formatted files; see the tutorials below.)
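Here is a minimal sketch of that urllib approach, assuming Python 3's urllib.request (urlopen() returns bytes, so we decode them into a string before splitting into lines):

from urllib.request import urlopen

url = ("http://www.amnh.org/our-research/richard-gilder-graduate-school/"
       "academics-and-research/seminars-and-conferences")
page = urlopen(url)                               # open the webpage like a file
lines = page.read().decode("utf-8").split("\n")   # bytes -> text -> list of lines
print(len(lines))                                 # how many lines did we read?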

The museum's webpage is machine generated (you can look at the raw code by saving the HTML file and then opening it with TextEdit, TextWrangler, or some other editor meant for text files). The code is very clean with no missing ending tags (unlike the HTML for the page you're currently reading...).

Here are the first couple of lines with the seminar dates:

We can search the file for dates, and then print out the subsequent lines with the speaker and title. We can do this in several different ways. Here's one approach:

  1. Open the URL ('URL' stands for Uniform Resource Locator and is the location of the file. It usually starts out http://www...).
  2. Read the file into a list of strings, called lines.
  3. For each line in lines, check if it contains the date.
  4. If so, print out the date and the subsequent lines with name, affiliation, and title.
  5. Close file.

We are just missing the tools to open webpages. There are several options (both built-in and modules you can download). We are going to use requests since it automatically converts incoming data from bytes to text, making it simpler to use. First we need to import the module:

import requests

If you do not have requests on your machine, you can download it in a terminal with:

pip install requests

To get the contents of a webpage:

data = requests.get("http://www.amnh.org/our-research/richard-gilder-graduate-school/academics-and-research/seminars-and-conferences")
The variable data now contains the strings in the file as well as some other information (this will take a moment, depending on network connectivity). The text of the file can be accessed via:
data.text
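For example, to peek at the beginning of the page (an illustrative one-liner):

print(data.text[:500])   # the first 500 characters of the HTML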

It's stored as a single, very long string with \n separating the lines. So, if we want to look at the file line-by-line, we can use our friend, split()

lines = data.text.split("\n")

If you print out lines, you'll notice that most lines are blank or formatting statements. To find the seminars, we need to go through and check each line to see if it contains a seminar listing (hint: if statement!).

Since each line of the webpage is in the variable lines, we can loop through it. Here's an outline: it traverses the list by line number, since we'll want to refer to the lines after each match (where the name and title are stored):

for i in range(len(lines)):
	# Check if lines[i] has a date in it (can use find() or regular expressions)
	# If it does, print it,
	#	as well as the subsequent lines[i+2] (which has the name) and
	#	lines[i+4] (which has the title)
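Here is one possible way to fill in that check (a sketch only; it assumes the dates in the raw HTML are written out with the month name, such as "March 6"; adjust the pattern to match what the page actually contains):

import re

# Assumed pattern: a month name followed by a day number
date_pattern = r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2}"

for i in range(len(lines)):
    if re.search(date_pattern, lines[i]) and i + 4 < len(lines):
        print(lines[i])        # the line with the date
        print(lines[i + 2])    # two lines later: the speaker's name
        print(lines[i + 4])    # four lines later: the talk title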

Test and debug your program and then figure out how to print just the date, name, affiliation, and title (without the HTML formatting statements).

Challenges

Tech Skills Bonus: Beautiful Soup

For those who are going to be digging deep into webpages, there's a lovely package that makes it much easier (if the above web scraping was more than enough for you, skip this section).

The Beautiful Soup package for Python has the motto:

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

Like folium or pandas, you will need to download the package to your machine to use it. Here is the soup quick start.
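As a taste of what it can do, here is a minimal sketch (it assumes you have installed the package, e.g. with pip install beautifulsoup4; the tags that matter on the real seminar page may differ):

import requests
from bs4 import BeautifulSoup

url = ("http://www.amnh.org/our-research/richard-gilder-graduate-school/"
       "academics-and-research/seminars-and-conferences")
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Print the visible text of every paragraph tag, skipping blank ones
for p in soup.find_all("p"):
    text = p.get_text(strip=True)
    if text:
        print(text)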

Group Work

This week's readings focus on income inequality in New York City. The Esri storymap illustrates the income disparity across the city as well as several other metropolitan areas. Working with your group, answer the following questions and submit via Blackboard:
  1. Using the interactive map of New York City showing the number of households earning over $200,000 per year and those earning less than $25,000, find the census tract in each borough that has the highest ratio of high-to-low earners.
  2. Find the census tract in each borough with the fewest high income earners.
  3. The article suggests that, compared to other cities, New York City is more mixed by income, with extremely wealthy neighborhoods adjacent to poor neighborhoods. How could you use the data to test this statement? No need to write code, but as a group, come up with a way to analyze the data Esri has collated. Your answer should outline a way to determine whether the statement is true or false.