


Program 12: Patterns & Testing
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2023



Program Description


Program 12: Patterns & Testing. Due 10am, Wednesday, 3 May.
Learning Objective: to use regular expressions to extract simple patterns from semi-structured data.
Available Libraries: re and core Python 3.6+ (Note: pandas and numpy are not included).
Sample Datasets:
sample.html

The case study in Lecture 11 focused on determining the accuracy of self-reported college rankings data. Much of the exploratory data analysis relied on "scraping" information from webpages. For this program, we will write functions that extract data from an HTML page, with a twist on this canonical problem: you can only use the re (regex) library and no additional libraries, such as pandas, numpy, BeautifulSoup, etc.

(If you are rusty at HTML, here's a quick tutorial.)

The assignment is broken into the following functions to allow for unit testing:

For example, if the input file is:


  <html>
  <head><title>Simple HTML File</title></head>

  <body>
    <p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
    and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>.  </p>

    <p> And for <a href="https://www.google.com/">google</a>
  </body>
  </html>
Then a sample run of the program:
Enter input file name: simple.html
Enter output file name:  links.csv
And the links.csv would be:

Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
google,www.google.com
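
One way to produce CSV text like the sample above is sketched here. It assumes, based only on the sample output, that the leading http:// or https:// and a single trailing slash are dropped from each URL. The helper names strip_scheme and to_csv_text are hypothetical and not part of the assignment.

```python
import re

def strip_scheme(url):
    """Drop a leading http:// or https:// and one trailing slash.

    Assumption: this matches the URL column of the sample links.csv.
    """
    no_scheme = re.sub(r'^https?://', '', url)
    return re.sub(r'/$', '', no_scheme)

def to_csv_text(links):
    """Build the CSV text from a {title: url} dictionary (hypothetical helper)."""
    lines = ['Title,URL']
    for title, url in links.items():
        lines.append(f'{title},{strip_scheme(url)}')
    return '\n'.join(lines) + '\n'

links = {
    'Hunter CS Department': 'http://www.hunter.cuny.edu/csci',
    'CSci 39542': 'https://stjohn.github.io/teaching/data/fall21/index.html',
    'google': 'https://www.google.com/',
}
print(to_csv_text(links), end='')
```

This sketch does no quoting, so a title that itself contains a comma would produce a malformed row.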



For example, let's try the rm_tags function on a string with no tags:

no_tags = "No tags\tbut new\nlines"
print('For data = "No tags\\tbut new\\nlines"')
print(rm_tags(no_tags))
will print:
No tags	but new
lines
Trying with data that has tags, let's use the sample HTML file:
data = open('sample.html').read()
print(rm_tags(data))
will print:
    Simple HTML File
  
    
  Here's a link for Hunter CS Department
 and for CSci 39542.  

  And for google
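
A minimal sketch of how rm_tags might be approached with re.sub is below. It assumes a tag is a '<', followed by any characters other than '>', followed by '>'. That is one reasonable reading of the examples above, not necessarily the pattern the autograder expects.

```python
import re

def rm_tags(data):
    """Return data with everything that looks like an HTML tag removed."""
    # Assumed tag pattern: '<', any run of non-'>' characters, then '>'.
    # All text outside the tags (including whitespace) is left untouched.
    return re.sub(r'<[^>]*>', '', data)

print(rm_tags('<p>Hello <b>world</b></p>'))
```

Because [^>] also matches newlines, a tag that is split across lines is still removed.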


Next, we'll build a function that pulls the titles and URLs of external links (i.e., links whose URL starts with http or https) from an HTML file. As an exercise in regular expressions, only standard Python and the re library are allowed.

For example, let's try make_dict with a single external link:

data = '<a href="https://mta.info">MTA</a>'
print(make_dict(data))
will print:
{'MTA': 'https://mta.info'}
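
As a rough sketch of make_dict, re.findall with two capture groups can pull out each URL and its link text. The pattern is an assumption: it expects the href value in double quotes and the link text on the same anchor element, and it skips any link whose URL does not start with http or https.

```python
import re

def make_dict(data):
    """Map link text to URL for each external link found in data."""
    # Group 1: a URL starting with http:// or https:// inside href="...".
    # Group 2: the text between the opening <a ...> tag and </a>.
    pattern = r'<a\s+[^>]*href="(https?://[^"]*)"[^>]*>(.*?)</a>'
    matches = re.findall(pattern, data, flags=re.IGNORECASE | re.DOTALL)
    return {title.strip(): url for url, title in matches}

print(make_dict('<a href="https://mta.info">MTA</a>'))
```

If two links share the same text, the later one overwrites the earlier one, since dictionary keys are unique.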

Our functions test_rm_tags and test_make_dict are tester functions, not unlike the ones used to grade assignments in the Gradescope autograder. Each takes a function as its argument and returns True if that function behaves correctly and False otherwise. Let's build a constant function that returns True no matter what the input is and see how it compares to rm_tags:

def always_true(data):
  return True
  
print(f'Testing rm_tags with constant function:  {test_rm_tags(always_true)}')
print(f'Testing rm_tags with our function:  {test_rm_tags(rm_tags)}')
will print:

Testing rm_tags with constant function:  False
Testing rm_tags with our function:  True
The tester for making dictionaries should similarly return True when passed a correct function and False otherwise.