Laboratory Exercise 8
Filtering and Formatting Data
CMP 420: Database Systems
Lehman College, City University of New York
3 April 2003
Employee Database
Last lab, we used the Java interface to SQL,
called JDBC. Today, we will populate a simple database
using data from the web. The database will
contain two tables:
Employees(ID, Name, Address, StartYear, StopYear, Salary)
Colleges(SchoolName, StudentName)
Telnet into comet and create a database for today's
lab (or use the one you created for Lab 7). You should create tables corresponding to the schema
above, with the following constraints:
- ID is the primary key for Employee.
- StudentName is the primary key for Colleges.
- Address has a default value of "1600 Pennsylvania Ave."
- Name, SchoolName, and StudentName cannot be NULL.
Populating the Database
One way to enter the information into your database is to type
in every line by hand. Another way is to read in the database
stored as a file. For today's lab, we will use data from the
web to populate the database, found at:
http://www.presidentsusa.net/index.html.
We will use the information about the presidents found there
to populate our database.
Unfortunately, most data on the web cannot be immediately loaded
into your database. You first must format it, so, it can be
correctly loaded. For example, this page
gives most of the information needed for the first table, while information for the second table can be found here. However,
it is not in the correct format (for example of the correct
format, see the lab2data.txt).
The main part of today's lab is to write a filter program
that reads in the data from the presidential webpage (save it as
a file in your account), and writes out a file with the "INSERT ... VALUES"
statements needed to insert the information about the presidents
into the database.
Some hints on how to do this:
- You may use whichever programming language you feel most
comfortable. On the system is C, C++, perl, and Java with most
of the standard libraries. Ask if you would like to use a
language not listed above.
- Start by writing a program that reads in a file line by line,
and prints out each line to the screen.
- Next, break each line you read in into pieces (in Java, you
might want to use a string tokenizer class). Note that you
can throw away the information about the vice presidents at
the end of the lines since it's not needed in this database.
- Next, write out the lines with the necessary SQL commands to
insert the data into the database.
- We will use the presidental number as their ID. For example,
George Washington will have ID number 1, while George W. Bush
will have ID number 43.
- Note that two presidents served less than a year, so, have
only StartYears on the webpage. You should either "preprocess"
the file (add StopYears before running your program), or have
conditionals in your program that handle this situation.
[President William Harrison caught pneumonia during his
inaugeration and died during his first year in office.
President Garfield was assassinated by a rival's supporter
and died 80 days into his term.]
Once you have created you data file, you can read it into your
database using the "psql -f" option or using JDBC as described in the last lab.
Write a similar program to use the information about the colleges
that the presidents attended to populate the Colleges table.
Using your Database
Demonstrate your database works by:
- listing all presidents who served before 1800.
- listing all presidents who attended Harvard or Yale
and served at least 6 years.
Final Comments
Before leaving the lab, you need to log out and shut down your
computer. First, you should quit all applications. Then,
go to the K menu in the bottom left hand corner and
select Log Out. Choose the Shut Down option,
and shut off the monitor when you're done.
Remember to pick up any files you sent to the printer and to
take any personal belongings. It is difficult to get back
into the lab later (since it's locked when not in use), so,
it's worth making sure you haven't forgotten anything.