Lab 9, RGGS 670: Algorithmic Approaches to Biological Data, Spring 2016

Designing Modular Programs

Today's lab has several themes:

strengthening programming skills by implementing some canonical sorting algorithms,
empirically measuring running times of algorithms on different types of data, and
designing programs in pieces ('modularly') for reuse.

Let's start with designing a general sorting program that we can use as a framework. The basic outline of our sorting program will be:

Get list of items to sort.
Sort list.
Print sorted list.

We will translate this list into a "main" function, and then fill in the pieces. As we fill in the pieces, if easier to test each part ('unit test') independently, then waiting to test at the end (since if there's a problem at the end, we have to search through all the code to find the problem). Before we start our main, write down what the inputs and outputs for each part are:

Get list of items to sort.
Inputs: nothing obvious
Outputs: the list to sort
Sort list.
Inputs: the original list (sorts it in place)
Outputs: nothing obvious
Print sorted list.
Inputs: the sorted list
Outputs: nothing obvious

We start by creating a main function and including our "To Do" list as comments:

def main():
	print "Welcome to the sorting program!"
	#Get list of items to sort.
	#Sort list.
	#Print sorted list

if __name__ == "__main__":
	main()

Next, we'll add in empty functions (or "stubs") for each item on our annotated To Do list. We have included "dummy" variables corresponding to the inputs and outputs from the annotated list:

def getInputList():
	"""
	No inputs
	Returns the list to sort
	"""
	a = []
	return a
	
def sortList(a):
	"""
	Takes a list and sorts it in place
	"""

	
def printList(b):
	"""
	Prints sorted list.
	No outputs
	"""

def main():
	print "Welcome to the sorting program!"
	#Get list of items to sort.
	a = getInputList()
	#Sort list.
	sortList(a)
	#Print sorted list
	printList(a)

if __name__ == "__main__":
	main()

Make sure the program runs (you can also look at the completed program at sortingLab.py) before continuing.

Now, from the To Do list, let's write the easiest functions first. For printList(), we will just use a simple print() (we can improve on it later):

def printList(b):
	"""
	Prints sorted list.
	No outputs
	"""
	print "The sorted list is:"
	print b

The next easiest function is the getInputList. To start, let's ask the user for a list, separated by spaces (again, we can improve this function later and have it read from a file for really large lists):

def getInputList():
	"""
	No inputs
	Returns the list to sort
	"""
	s = raw_input('Please enter a list, separated by spaces: ')
	a = [int(w) for w in s.split()]
	return a

(We're assuming that we're sorting numbers-- if not we can remove the "int() from the loop comprehension.)

For our sorting algorithm, we'll write a couple from pseudo-code. Here's a popular one that we haven't discussed: selectionSort (but is in the textbook):

Search the list for the largest element.
Move it to last position.
Search the remaining list for the largest element.
Move to second to last position.
Repeat until list is sorted.

Let's translate that to pseudocode:

For each i, starting at len(aList)-1 and decrementing to 1.
1. Find the index of the largest element in the list aList[0:i+1]
2. Swap the largest element with the element at position i.

Try writing out each part (note finding the index of the largest element will take another loop). Then try out your code (or look below at the solution from the book).

def selectionSort(alist):
   for fillslot in range(len(alist)-1,0,-1):
       positionOfMax=0
       for location in range(1,fillslot+1):
           if alist[location]>alist[positionOfMax]:
               positionOfMax = location

       temp = alist[fillslot]
       alist[fillslot] = alist[positionOfMax]
       alist[positionOfMax] = temp

Try on several cases to make sure your program works. When choosing test cases, try 'border cases' where the items are close to the edges of the arrays or the edges of the conditional tests (broadly interpreted).

Challenges:

Replace the sortList() function with a function that implements InsertionSort. Keep your previous sorting function, since we will use it in the next section.
Replace the sortList() function with a function that implements MergeSort.
Replace the getInputList() function with a function that generates the first n positive numbers. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below).
Replace the getInputList() function with a function that generates the first n positive numbers, in reverse order. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below).
Replace the getInputList() function with a function that generates n random numbers. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below).

Timing Functions

In lecture, we analyzed running times theoretically. We discussed worst-case time bounds for several of the algorithms. But how does that work in practice? We saw the sorting algorithms demonstration that showed that different types of input yield very different running times. For example, while the worst case analysis says InsertionSort is O(n^2) and MergeSort is O(n log n), when the data is nearly sorted, then InsertionSort finishes more quickly.

Python provides several built-in methods for timing functions. While there is a complete module for analyzing complete programs to determine which functions are using the most time (called ``profiling'' the code), we will use a simpler function, timeit() that times how long a single function takes. In the simplest form, we can time how long a command will take:

timeit() runs in its own environment to give as accurate a measurement as possible. The environment knows about built-in functions, but to use other functions, we need to "import" them into the environment. For example, the following will time a simple function that compute the factorial:

import timeit

def factR(n):
     if n <= 1:
          return 1
     else:
          return n*factR(n-1)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("factR(100)", setup="from __main__ import factR", number=10))

The last parameter controls the number of times it runs the function to compute the average. The default value is 1,000,000 runs. The function returns the amount of time, in seconds, that it took to run through the code the specified number of times.

Challenges

How long does SelectionSort and MergeSort take on an ordered list of 100 items? How about 10,000 items?
How long does SelectionSort and MergeSort take on a reversed ordered list of 100 items? How about 10,000 items?
And lastly, how long does SelectionSort and MergeSort take on a random list of 100 items? How about 10,000 items?

Recursion vs. Iteration

Everything you can write with recursion, you can write with iteration (just loops). Should you? Each function call requires Python to set up a new frame that keeps track of the parameters passed and current state. For small datasets, that a small portion of the overall running time, but what happens for larger computations?

Let's use a very simple algorithm to test recursion versus iteration running times:

def factR(n):
    if n <= 1:
        return 1
    else:
        return n*factR(n-1)

and

def factI(n):
    prod = 1
    for i in range(2,n):
        prod = i*prod
    return prod

Challenges

How much does recursion add to the running time? Test the two versions of the factorial function on n = 100, 200, 300, 400, and 500, and report the differences in time.

Side Note: In practice, after you have a program running, you profile the code to determine which functions are using the most time. Then, you look at the most time-intensive function and see if there are ways to improve its efficiency. Sometimes, this can be accomplished by simple changes to the code. Other times, it might be necessary to rewrite the entire function. For time-critical programs, this could mean rewriting code in a specialized language for the task or one that allows low-level manipulation to optimize efficiency (e.g. c or assembly language).