Program 5: Regex Logs. Due noon, Thursday, 10 March.
This program applies regular expressions (covered in Lecture 7 & DS 100: Sections 13.2-3) to parse information from Python logs.
Learning Objective: to use regular expressions to parse information from log data.
Available Libraries: Regular expressions (re) and core Python 3.6+. (Note: not pandas)
Sample Datasets: one_liner_log.txt, multi_liner_log.txt, traceback_log_simple.txt, traceback_log_complex.txt.
The assignment is broken into the following functions to allow for unit testing:
parse_date_from_one_line_log(file_name):
This function takes in a text file containing a single log line and parses out the log date, returning it as a string.
file_name: the name of a text file which contains one log line.
The date is returned in YYYY-MM-DD format.
For example, calling the function on a file that contains:
2022-02-22 21:05:13,191 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile_1.txt'
will return:
log_date = parse_date_from_one_line_log('one_liner_log.txt')
print(log_date)
2022-02-22
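The date-matching step can be sketched on a single string, leaving the file reading aside. This is only an illustration, assuming the log line always begins with a YYYY-MM-DD date; the names DATE_PATTERN and extract_log_date are hypothetical and not part of the assignment.

```python
import re

# Hypothetical pattern: four digits, two digits, two digits, anchored at the
# start of the line. Assumes the date is always the first token of the log line.
DATE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})")

def extract_log_date(line):
    """Return the leading YYYY-MM-DD date of a log line, or None if absent."""
    match = DATE_PATTERN.search(line)
    return match.group(1) if match else None

sample_line = (
    "2022-02-22 21:05:13,191 read_data - ERROR:[Errno 2] "
    "No such file or directory: 'inputfile_1.txt'"
)
print(extract_log_date(sample_line))  # 2022-02-22
```

Capturing the date in a group keeps the time portion (21:05:13,191) out of the result without any string slicing.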
parse_min_max_date_from_one_line_logs(file_name):
This function takes in a text file containing multiple log lines and parses out the first and last log dates, returning them as two strings.
file_name: the name of a text file which contains multiple log lines.
The dates are returned in YYYY-MM-DD format.
For example, calling the function on a file that contains:
2022-01-22 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'
2022-01-23 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'
2022-01-23 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'
...
will return:
min_log_date, max_log_date = parse_min_max_date_from_one_line_logs('multi_liner_log.txt')
print(min_log_date)
print(max_log_date)
2022-01-22
2022-02-14
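One way to collect every date and pick the earliest and latest is sketched below, on an in-memory string built from the three sample lines shown above. It assumes each log line starts with a YYYY-MM-DD date; the names DATE_AT_LINE_START and extract_min_max_dates are hypothetical.

```python
import re

# re.MULTILINE makes ^ match at the start of every line, not just the text.
DATE_AT_LINE_START = re.compile(r"^(\d{4}-\d{2}-\d{2})", re.MULTILINE)

def extract_min_max_dates(text):
    """Return (earliest, latest) date strings found at line starts."""
    dates = DATE_AT_LINE_START.findall(text)
    if not dates:
        return None, None
    # YYYY-MM-DD strings sort chronologically when compared as plain text.
    return min(dates), max(dates)

sample_text = (
    "2022-01-22 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'\n"
    "2022-01-23 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'\n"
    "2022-01-23 01:01:11,121 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile.txt'\n"
)
print(extract_min_max_dates(sample_text))  # ('2022-01-22', '2022-01-23')
```

Using min() and max() rather than the first and last matches is a design choice that still works if the lines are not in chronological order; if the log is known to be sorted, dates[0] and dates[-1] would give the same result.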
parse_missing_filename_from_one_line_log(file_name):
This function takes in a text file containing a single log line and parses out the missing filename, returning it as a string.
file_name: the name of a text file which contains one log line.
For example, calling the function on a file that contains:
2022-02-22 21:05:13,191 read_data - ERROR:[Errno 2] No such file or directory: 'inputfile_1.txt'
will return:
missing_filename = parse_missing_filename_from_one_line_log('one_liner_log.txt')
print(missing_filename)
inputfile_1.txt
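The filename can be captured by anchoring on the fixed message text that precedes it. The sketch below assumes the filename is always wrapped in single quotes after "No such file or directory:", as in the sample line; MISSING_FILE_PATTERN and extract_missing_filename are hypothetical names.

```python
import re

# Capture everything between the single quotes that follow the error message.
MISSING_FILE_PATTERN = re.compile(r"No such file or directory: '([^']+)'")

def extract_missing_filename(line):
    """Return the quoted filename from an [Errno 2] log line, or None."""
    match = MISSING_FILE_PATTERN.search(line)
    return match.group(1) if match else None

sample_line = (
    "2022-02-22 21:05:13,191 read_data - ERROR:[Errno 2] "
    "No such file or directory: 'inputfile_1.txt'"
)
print(extract_missing_filename(sample_line))  # inputfile_1.txt
```

The character class [^']+ stops at the closing quote, so the quotes themselves are not part of the returned string.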
parse_filepath_linenum_from_traceback_log(file_name):
In a typical Python traceback, the first line of each entry contains the file path, line number, and the name of the enclosing function (or module).
The second line contains the code that was executed (and that subsequently raised the error).
This function takes in a text file containing a multi-line Python traceback and parses out the file path and line number of the error.
file_name: the name of a text file which contains a single Python traceback error log spanning multiple lines.
For example, calling the function on a file that contains:
Traceback (most recent call last):
File "/home/datascience/logs/read_data.py", line 1, in word_count
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'inputfile.txt'
will return:
log_filepath, log_linenum = parse_filepath_linenum_from_traceback_log('traceback_log_simple.txt')
print(log_filepath)
print(log_linenum)
/home/datascience/logs/read_data.py
1
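Both values can be pulled out with one pattern that uses two capture groups. The sketch below works on a string copy of the sample traceback and assumes each entry has the form File "<path>", line <N>; FILE_ENTRY and extract_first_file_entry are hypothetical names, and returning the line number as a string is an assumption, since the expected type is not stated here.

```python
import re

# Group 1: the path inside the double quotes. Group 2: the digits after "line".
FILE_ENTRY = re.compile(r'File "([^"]+)", line (\d+)')

def extract_first_file_entry(text):
    """Return (file path, line number as a string) of the first File entry."""
    match = FILE_ENTRY.search(text)
    if match is None:
        return None, None
    return match.group(1), match.group(2)

sample_traceback = """Traceback (most recent call last):
File "/home/datascience/logs/read_data.py", line 1, in word_count
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'inputfile.txt'
"""
path, linenum = extract_first_file_entry(sample_traceback)
print(path)     # /home/datascience/logs/read_data.py
print(linenum)  # 1
```

Because search() scans the whole text and the pattern is not anchored, the match does not depend on how the File line is indented.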
parse_last_linenum_from_traceback_log(file_name):
Unlike stack traces in other programming languages, a Python traceback should be read from bottom to top.
When a traceback log contains multiple calls, always look for the most recent call, which is the last File entry before the error message.
This function takes in a text file containing a multi-line Python traceback and parses out the line number of the most recent call.
file_name: the name of a text file which contains a Python traceback error log with multiple calls, spanning multiple lines.
For example, calling the function on a file that contains:
Traceback (most recent call last):
File "build_model.py", line 52, in build_model
LogisticRegression()
File "clean_data.py", line 40, in create_dummies
create_dummies()
File "clean_data.py", line 22, in read_csv
df = read_csv(filename)
File "data/import_data.py", line 10, in process_data
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'inputfile.txt'
will return:
last_linenum = parse_last_linenum_from_traceback_log('traceback_log_complex.txt')
print(last_linenum)
10
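Finding the most recent call amounts to collecting every File entry and keeping the last one. The sketch below runs on a string copy of the sample traceback and assumes the same File "<path>", line <N> layout; LINE_NUMBER and extract_last_linenum are hypothetical names, and the line number is returned as a string as an assumption.

```python
import re

# Only the line number is captured; the path is matched but not kept.
LINE_NUMBER = re.compile(r'File "[^"]+", line (\d+)')

def extract_last_linenum(text):
    """Return the line number (as a string) of the last File entry, or None."""
    line_numbers = LINE_NUMBER.findall(text)
    return line_numbers[-1] if line_numbers else None

sample_traceback = """Traceback (most recent call last):
File "build_model.py", line 52, in build_model
LogisticRegression()
File "clean_data.py", line 40, in create_dummies
create_dummies()
File "clean_data.py", line 22, in read_csv
df = read_csv(filename)
File "data/import_data.py", line 10, in process_data
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'inputfile.txt'
"""
print(extract_last_linenum(sample_traceback))  # 10
```

findall() returns matches in the order they appear, so indexing with [-1] selects the entry closest to the error message.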