Program 12: Patterns & Testing. Due 10am, Wednesday, 3 May.
The case study in Lecture 11 focused on determining the accuracy of self-reported college rankings data. Much of the exploratory data analysis relied on "scraping" information from webpages. For this program, we will write functions that extract data from an HTML page, with a twist on this canonical problem: you can only use the re (regex) library and no additional libraries, such as pandas, numpy, BeautifulSoup, etc. (If you are rusty at HTML, here's a quick tutorial.)
Learning Objective: to use regular expressions with simple patterns to extract information from semi-structured data.
Available Libraries: re and core Python 3.6+ (Note: pandas and numpy are not included).
Sample Datasets: sample.html
The assignment is broken into the following functions to allow for unit testing:
rm_tags(data):
This function takes one input:
data: a multiline string.
test_rm_tags(rm_tags_fnc):
This function takes one input:
rm_tags_fnc: a function that takes a string and returns a string.
Returns True if the inputted function correctly strips the tags from the text of an HTML file and False otherwise.
make_dict(data):
This function takes one input:
data: a string
Uses regular expressions (the re package in Python) to find all external links in data and stores each link's text as the key and its URL as the value in a dictionary. For the URL, keep the leading https:// or http://. Returns the resulting dictionary.
test_make_dict(make_dict_fnc):
This function takes one input:
make_dict_fnc: a function that takes a string and returns a dictionary.
Returns True if the inputted function correctly returns a dictionary of links and False otherwise.
For example, if the input file is:
<html>
<head><title>Simple HTML File</title></head>
<body>
<p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>. </p>
<p> And for <a href="https://www.google.com/">google</a>
</body>
</html>
Then a sample run of the program:
Enter input file name: simple.html
Enter output file name: links.csv
And the links.csv would be:
Title,URL
Hunter CS Department,http://www.hunter.cuny.edu/csci
CSci 39542,https://stjohn.github.io/teaching/data/fall21/index.html
google,https://www.google.com/
For example, let's try the rm_tags function on a string with no tags:
no_tags = "No tags\tbut new\nlines"
print('For data = "No tags\\tbut new\\nlines"')
print(rm_tags(no_tags))
will print:
No tags but new
lines
Trying with data that has tags, let's use the sample HTML file:
data = open('sample.html').read()
print(rm_tags(data))
will print:
Simple HTML File
Here's a link for Hunter CS Department
and for CSci 39542.
And for google
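One way to approach tag removal is a single substitution with re.sub. The sketch below is a minimal illustration, not the reference solution. It assumes a tag is anything between < and the next >, and the name TAG_PATTERN is made up for this example. It does not tidy the whitespace left behind by removed tags, so matching the output above exactly may take additional cleanup.

```python
import re

# Assumed pattern: a tag is "<", then any characters other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

def rm_tags(data):
    """Return data with every substring matching TAG_PATTERN removed."""
    return TAG_PATTERN.sub("", data)

snippet = "<p>Here's a <a href=\"https://www.google.com/\">google</a> link</p>"
print(rm_tags(snippet))
```

Because [^>] also matches newlines, a tag that is split across two lines is still removed by this pattern.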
Next, we'll build a function that pulls the titles and URLs of external links (i.e., those that start with http or https) from an HTML file. As an exercise in regular expressions, only core Python and the re library are allowed.
For example, let's try make_dict with a single external link:
data = '<a href="https://mta.info">MTA</a>'
print(make_dict(data))
will print:
{'MTA': 'https://mta.info'}
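A possible starting point for the link extraction is re.findall with two capture groups, one for the URL and one for the link text. This is only a sketch under simplifying assumptions: href is the first attribute of the tag, the URL is in double quotes, and the link text contains no nested tags. The name LINK_PATTERN is hypothetical.

```python
import re

# Assumed pattern: an <a> tag whose href starts with http:// or https://.
# Group 1 captures the URL (scheme included), group 2 the link text.
LINK_PATTERN = re.compile(
    r'<a\s+href="(https?://[^"]+)"[^>]*>(.*?)</a>',
    flags=re.IGNORECASE | re.DOTALL,
)

def make_dict(data):
    """Map each external link's text to its URL."""
    return {text.strip(): url for url, text in LINK_PATTERN.findall(data)}

html = (
    '<p><a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a> '
    'and <a href="/local/page.html">a relative link</a></p>'
)
print(make_dict(html))
```

The relative link is skipped because its href does not begin with http:// or https://, which is how this sketch interprets "external".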
Our functions test_rm_tags and test_make_dict are tester functions, not unlike the ones used to grade assignments in the Gradescope Autograder. Let's build a constant function that returns True no matter what the input is and see how it compares to rm_tags:
def always_true(data):
    return True
print(f'Testing rm_tags with constant function: {test_rm_tags(always_true)}')
print(f'Testing rm_tags with our function: {test_rm_tags(rm_tags)}')
will print:
Testing rm_tags with constant function: False
Testing rm_tags with our function: True
The tester for making dictionaries should similarly return True when passed the correct function and False otherwise.