Program 12: Patterns & Testing. Due 10am, Wednesday, 3 May.
The case study in Lecture 11 focused on determining the accuracy of self-reported college rankings data. Much of the exploratory data analysis relied on "scraping" information from webpages. For this program, we will write functions that extract data from an HTML page, with a twist on this canonical problem: you can only use the re (regex) library and no additional libraries, such as pandas, numpy, BeautifulSoup, etc. (If you are rusty at HTML, here's a quick tutorial.)
Learning Objective: to use regular expressions with simple patterns to extract information from semi-structured data.
Available Libraries: re and core Python 3.6+ (Note: pandas and numpy are not included).
Sample Datasets: sample.html
The assignment is broken into the following functions to allow for unit testing:
rm_tags(data): This function takes one input:
data: a multiline string.
Returns the text of data with all HTML tags removed.
test_rm_tags(rm_tags_fnc): This function takes one input:
rm_tags_fnc: a function that takes a string and returns a string.
Returns True if the inputted function correctly strips out the text from an HTML file and False otherwise.
make_dict(data): This function takes one input:
data: a string.
Uses regular expressions (the re package in Python) to find all external links in data and stores each link's text as the key and its URL as the value in a dictionary. For the URL, keep the leading https:// or http://. Returns the resulting dictionary.
test_make_dict(make_dict_fnc): This function takes one input:
make_dict_fnc: a function that takes a string and returns a dictionary.
Returns True if the inputted function correctly returns a dictionary of links and False otherwise.
For example, if the input file is:
<html>
<head><title>Simple HTML File</title></head>
<body>
<p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>. </p>
<p> And for <a href="https://www.google.com/">google</a>
</body>
</html>
Then a sample run of the program:
Enter input file name: simple.html
Enter output file name: links.csv
And the links.csv would be:
Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
google,www.google.com
For example, let's try the rm_tags function on a string with no tags:
no_tags = "No tags\tbut new\nlines"
print('For data = "No tags\\tbut new\\nlines"')
print(rm_tags(no_tags))
will print:
No tags but new
lines
Trying with data that has tags, let's use the sample HTML file:
data = open('sample.html').read()
print(rm_tags(data))
will print:
Simple HTML File
Here's a link for Hunter CS Department
and for CSci 39542.
And for google
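One possible way to get this behavior is a single substitution that deletes everything between a `<` and the next `>`. The sketch below is only an illustration under that assumption (it ignores comments, scripts, and a stray `>` inside attribute values), and the name `rm_tags_sketch` is hypothetical, chosen to keep it distinct from the function you will write:

```python
import re

# Assumption: a "tag" is any run of characters from '<' up to the next '>'.
TAG_PATTERN = re.compile(r'<[^>]*>')

def rm_tags_sketch(data):
    """Return data with every <...> tag removed; all other text is kept."""
    return TAG_PATTERN.sub('', data)

no_tags = "No tags\tbut new\nlines"
print(rm_tags_sketch(no_tags) == no_tags)   # True: nothing to remove
print(rm_tags_sketch('<p>And for <a href="https://www.google.com/">google</a></p>'))
# prints: And for google
```

Because `[^>]` also matches newlines, the same pattern handles tags that are split across lines in a multiline string.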
Next, we'll build a function that pulls the titles and URLs of external links (i.e., those that start with http or https) from an HTML file. As an exercise in regular expressions, only standard Python and the re library are allowed.
For example, let's try make_dict with a single external link:
data = '<a href="https://mta.info">MTA</a>'
print(make_dict(data))
will print:
{'MTA': 'https://mta.info'}
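As a rough sketch of how such a function might be put together, `re.findall` with two capture groups returns (URL, link text) pairs that can be turned directly into a dictionary. The name `make_dict_sketch` and the pattern are illustrative assumptions: the pattern expects double-quoted `href` values and link text without nested anchors, so real pages may need a more forgiving expression.

```python
import re

# Assumptions: href values are double-quoted and start with http:// or https://;
# the link text runs up to the first closing </a>.
LINK_PATTERN = re.compile(
    r'<a\s[^>]*?href="(https?://[^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def make_dict_sketch(data):
    """Map link text to URL for every external (http/https) link in data."""
    return {text: url for url, text in LINK_PATTERN.findall(data)}

page = '''<p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
and a local one: <a href="index.html">Home</a></p>'''
print(make_dict_sketch(page))
# prints: {'Hunter CS Department': 'http://www.hunter.cuny.edu/csci'}
```

The relative link is skipped because its `href` does not begin with `http://` or `https://`, which is how this sketch interprets "external".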
Our functions test_rm_tags and test_make_dict are tester functions, not unlike the ones used to grade assignments in the Gradescope Autograder. Let's build a constant function that returns True no matter what the input is and see how it compares to rm_tags:
def always_true(data):
    return True
print(f'Testing rm_tags with constant function: {test_rm_tags(always_true)}')
print(f'Testing rm_tags with our function: {test_rm_tags(rm_tags)}')
will print:
Testing rm_tags with constant function: False
Testing rm_tags with our function: True
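A tester of this kind can be sketched as a loop over known input/expected-output pairs that returns False at the first mismatch. Everything in the block below is an assumption made for illustration: the name `rm_tags_tester` (standing in for test_rm_tags), the three test cases, and the stand-in `strip_tags` function. The actual autograder's cases are not given in this assignment.

```python
import re

def rm_tags_tester(rm_tags_fnc):
    """Return True only if rm_tags_fnc gives the expected output on every case."""
    # Hypothetical test cases: (input, expected output) pairs.
    cases = [
        ("No tags\tbut new\nlines", "No tags\tbut new\nlines"),
        ("<p>Hello</p>", "Hello"),
        ('<a href="https://mta.info">MTA</a>', "MTA"),
    ]
    for data, expected in cases:
        try:
            if rm_tags_fnc(data) != expected:
                return False
        except Exception:
            # A function that crashes on valid input is not correct.
            return False
    return True

def always_true(data):
    return True

def strip_tags(data):
    # Stand-in for a working rm_tags (assumed for this example).
    return re.sub(r'<[^>]*>', '', data)

print(rm_tags_tester(always_true))  # False
print(rm_tags_tester(strip_tags))   # True
```

Including a case with no tags at all guards against functions that alter ordinary text, and the constant function fails on the very first comparison because True is not equal to any string.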
The tester for making dictionaries should similarly return True when passed a correct function and False otherwise.
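The same idea carries over to dictionaries, with one extra check that the result really is a dict. As before, this is a sketch under stated assumptions: `make_dict_tester` (standing in for test_make_dict), the test cases, and the stand-in `find_links` are all hypothetical names and data chosen for the example.

```python
import re

def make_dict_tester(make_dict_fnc):
    """Return True only if make_dict_fnc returns the expected dict on every case."""
    # Hypothetical test cases: (input, expected dictionary) pairs.
    cases = [
        ('<a href="https://mta.info">MTA</a>', {'MTA': 'https://mta.info'}),
        ('<p>No links here</p>', {}),
        ('<a href="index.html">Home</a>', {}),  # relative link, not external
    ]
    for data, expected in cases:
        try:
            result = make_dict_fnc(data)
        except Exception:
            return False
        if not isinstance(result, dict) or result != expected:
            return False
    return True

def always_empty(data):
    return {}

def find_links(data):
    # Stand-in for a working make_dict (assumed for this example).
    pairs = re.findall(r'<a\s[^>]*?href="(https?://[^"]*)"[^>]*>(.*?)</a>',
                       data, re.DOTALL)
    return {text: url for url, text in pairs}

print(make_dict_tester(always_empty))  # False
print(make_dict_tester(find_links))    # True
```

Mixing cases whose expected result is empty with cases whose result is not keeps any constant function from passing.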