Program 12, CSci 39542: Data Science, Hunter College

Program 12: Patterns & Testing
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2023

Program Description

Program 12: Patterns & Testing. Due 10am, Wednesday, 3 May.
Learning Objective: to use regular expressions with simple patterns to extract information from semi-structured data.
Available Libraries: re and core Python 3.6+ (Note: pandas and numpy are not included).
Sample Datasets: sample.html

The case study in Lecture 11 focused on determining the accuracy of self-reported college rankings data. Much of the exploratory data analysis relied on "scraping" information from webpages. For this program, we will write functions that extract data from an HTML page, with a twist on this canonical problem: you can only use the re (regex) library and no additional libraries, such as pandas, numpy, BeautifulSoup, etc.

(If you are rusty at HTML, here's a quick tutorial.)

The assignment is broken into the following functions to allow for unit testing:

rm_tags(data): This function takes one input:
- data: a multiline string.
Returns a string with all HTML formatting removed. If the string was plain text, the contents are returned unaltered as a string.

test_rm_tags(rm_tags_fnc): This function takes one input:
- rm_tags_fnc: a function that takes a string and returns a string.
Returns True if the inputted function correctly strips out the text from an HTML file and False otherwise.

make_dict(data): This function takes one input:
- data: a string
Uses regular expressions (see Chapter 12.4 for using the re package in Python) to find all external links in data and store each link's text as the key and its URL as the value in a dictionary. For the URL, keep the leading https:// or http://. Returns the resulting dictionary.

test_make_dict(make_dict_fnc): This function takes one input:
- make_dict_fnc: a function that takes a string and returns a dictionary.
Returns True if the inputted function correctly returns a dictionary of links and False otherwise.

For example, if the input file is:


  <html>
  <head><title>Simple HTML File</title></head>

  <body>
    <p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
    and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>.  </p>

    <p> And for <a href="https://www.google.com/">google</a>
  </body>
  </html>

Then a sample run of the program:

Enter input file name: simple.html
Enter output file name:  links.csv

And the links.csv would be:


Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
google,www.google.com

For example, let's try the rm_tags function on a string with no tags:

no_tags = "No tags\tbut new\nlines"
print('For data = "No tags\\tbut new\\nlines"')
print(rm_tags(no_tags))

will print:

For data = "No tags\tbut new\nlines"
No tags	but new
lines

Trying with data that has tags, let's use the sample HTML file:

data = open('sample.html').read()
print(rm_tags(data))

will print:

    Simple HTML File
  
    
  Here's a link for Hunter CS Department
 and for CSci 39542.  

  And for google
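
One possible way to get output like the above is a single substitution that deletes anything between a < and the next >. The following is only a minimal sketch, not the required solution: the pattern and the name TAG_PATTERN are our own assumptions, and it assumes that no tag contains a > inside an attribute value and that comments and scripts need no special handling. Whitespace and line breaks between tags are left as they are.

```python
import re

# Hypothetical pattern: "<", then any run of characters other than ">", then ">".
# [^>] also matches newlines, so a tag split across lines is still removed.
TAG_PATTERN = re.compile(r'<[^>]*>')

def rm_tags(data):
    """Return data with every <...> span removed; plain text is unchanged."""
    return TAG_PATTERN.sub('', data)

html = '<p> And for <a href="https://www.google.com/">google</a></p>'
print(rm_tags(html))                       # tags removed, text kept
print(rm_tags("No tags\tbut new\nlines"))  # nothing to remove
```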

Next, we'll build a function that pulls the titles and URLs of external links (i.e., links that start with http or https) from an HTML file. As an exercise in regular expressions, only core Python and the re library are allowed.

For example, let's try make_dict with a single external link:

data = '<a href="https://mta.info">MTA</a>'
print(make_dict(data))

will print:

{'MTA': 'https://mta.info'}
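
A result like this can be produced with re.findall and two capture groups, one for the URL and one for the link text. The sketch below is one possible approach under stated assumptions, not the required solution: the pattern and the name LINK_PATTERN are our own, and it assumes href is the first attribute of the tag and is written with double quotes. Links that do not start with http:// or https:// are skipped.

```python
import re

# Hypothetical pattern: an <a> tag whose href starts with http:// or https://,
# capturing the URL (group 1) and the link text (group 2).
LINK_PATTERN = re.compile(
    r'<a\s+href="(https?://[^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def make_dict(data):
    """Map each external link's text to its URL."""
    return {text.strip(): url for url, text in LINK_PATTERN.findall(data)}

page = '''<p><a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
<a href="notes.html">Local notes</a>
<a href="https://www.google.com/">google</a></p>'''
print(make_dict(page))  # the relative link "notes.html" is not included
```

If two links share the same text, the later URL overwrites the earlier one, since dictionary keys are unique.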

Our functions test_rm_tags and test_make_dict are tester functions, not unlike the ones used to grade assignments in the Gradescope autograder. Let's build a constant function that returns True no matter what the input and see how it compares to rm_tags:

def always_true(data):
  return True
  
print(f'Testing rm_tags with constant function:  {test_rm_tags(always_true)}')
print(f'Testing rm_tags with our function:  {test_rm_tags(rm_tags)}')

will print:

Testing rm_tags with constant function:  False
Testing rm_tags with our function:  True

The tester for making dictionaries should similarly return True when passed a correct function and False otherwise.
