RGGS 670: Algorithmic Approaches to Biological Data

Spring 2016

Topics: Modular Design, Analyzing sorts by running time on varying data sets (sorted, almost sorted, and random data); Traversing Trees with Recursive Functions;

Today's lab has several themes:

- strengthening programming skills by implementing some canonical sorting algorithms,
- empirically measuring running times of algorithms on different types of data, and
- designing programs in pieces ('modularly') for reuse.

Let's start with designing a general sorting program that we can use as a framework. The basic outline of our sorting program will be:

- Get list of items to sort.
- Sort list.
- Print sorted list.

We will translate this list into a "main" function, and then fill in the pieces. As we fill in the pieces, if easier to test each part ('unit test') independently, then waiting to test at the end (since if there's a problem at the end, we have to search through all the code to find the problem). Before we start our `main`, write down what the inputs and outputs for each part are:

- Get list of items to sort.

Inputs: nothing obvious

Outputs: the list to sort - Sort list.

Inputs: the original list (sorts it in place)

Outputs: nothing obvious - Print sorted list.

Inputs: the sorted list

Outputs: nothing obvious

def main(): print "Welcome to the sorting program!" #Get list of items to sort. #Sort list. #Print sorted list if __name__ == "__main__": main()

Next, we'll add in empty functions (or "stubs") for each item on our annotated To Do list. We have included "dummy" variables corresponding to the inputs and outputs from the annotated list:

def getInputList(): """ No inputs Returns the list to sort """ a = [] return a def sortList(a): """ Takes a list and sorts it in place """ def printList(b): """ Prints sorted list. No outputs """ def main(): print "Welcome to the sorting program!" #Get list of items to sort. a = getInputList() #Sort list. sortList(a) #Print sorted list printList(a) if __name__ == "__main__": main()

Make sure the program runs (you can also look at the completed program at sortingLab.py) before continuing.

Now, from the To Do list, let's write the easiest functions first. For `printList()`, we will just use a simple `print()` (we can improve on it later):

def printList(b): """ Prints sorted list. No outputs """ print "The sorted list is:" print b

The next easiest function is the `getInputList`. To start, let's ask the user for a list, separated by spaces (again, we can improve this function later and have it read from a file for really large lists):

def getInputList(): """ No inputs Returns the list to sort """ s = raw_input('Please enter a list, separated by spaces: ') a = [int(w) for w in s.split()] return a

(We're assuming that we're sorting numbers-- if not we can remove the "`int()` from the loop comprehension.)

For our sorting algorithm, we'll write a couple from pseudo-code. Here's a popular one that we haven't discussed: selectionSort (but is in the textbook):

- Search the list for the largest element.
- Move it to last position.
- Search the remaining list for the largest element.
- Move to second to last position.
- Repeat until list is sorted.

Let's translate that to pseudocode:

- For each i, starting at len(aList)-1 and decrementing to 1.
- Find the index of the largest element in the list aList[0:i+1]
- Swap the largest element with the element at position i.

Try writing out each part (note finding the index of the largest element will take another loop). Then try out your code (or look below at the solution from the book).

def selectionSort(alist): for fillslot in range(len(alist)-1,0,-1): positionOfMax=0 for location in range(1,fillslot+1): if alist[location]>alist[positionOfMax]: positionOfMax = location temp = alist[fillslot] alist[fillslot] = alist[positionOfMax] alist[positionOfMax] = temp

Try on several cases to make sure your program works. When choosing test cases, try 'border cases' where the items are close to the edges of the arrays or the edges of the conditional tests (broadly interpreted).

- Replace the
`sortList()`function with a function that implements InsertionSort. Keep your previous sorting function, since we will use it in the next section. - Replace the
`sortList()`function with a function that implements MergeSort. - Replace the
`getInputList()`function with a function that generates the first n positive numbers. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below). - Replace the
`getInputList()`function with a function that generates the first n positive numbers, in reverse order. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below). - Replace the
`getInputList()`function with a function that generates n random numbers. Test your program with n = 100, 1000, and 10000 (we will use this for testing running times of sorts below).

In lecture, we analyzed running times theoretically. We discussed worst-case time bounds for several of the algorithms. But how does that work in practice? We saw the sorting algorithms demonstration that showed that different types of input yield very different running times. For example, while the worst case analysis says InsertionSort is O(n^2) and MergeSort is O(n log n), when the data is nearly sorted, then InsertionSort finishes more quickly.

Python provides several built-in methods for timing functions. While there is a complete module for analyzing complete programs to determine which functions are using the most time (called ``profiling'' the code), we will use a simpler function, timeit() that times how long a single function takes. In the simplest form, we can time how long a command will take:

`timeit()` runs in its own environment to give as accurate a measurement as possible. The environment knows about built-in functions, but to use other functions, we need to "import" them into the environment. For example, the following will time a simple function that compute the factorial:

import timeit def factR(n): if n <= 1: return 1 else: return n*factR(n-1) if __name__ == '__main__': import timeit print(timeit.timeit("factR(100)", setup="from __main__ import factR", number=10))The last parameter controls the number of times it runs the function to compute the average. The default value is 1,000,000 runs. The function returns the amount of time, in seconds, that it took to run through the code the specified number of times.

- How long does SelectionSort and MergeSort take on an ordered list of 100 items? How about 10,000 items?
- How long does SelectionSort and MergeSort take on a
*reversed*ordered list of 100 items? How about 10,000 items? - And lastly, how long does SelectionSort and MergeSort take on a random list of 100 items? How about 10,000 items?

Everything you can write with recursion, you can write with iteration (just loops). Should you? Each function call requires Python to set up a new frame that keeps track of the parameters passed and current state. For small datasets, that a small portion of the overall running time, but what happens for larger computations?

Let's use a very simple algorithm to test recursion versus iteration running times:

def factR(n): if n <= 1: return 1 else: return n*factR(n-1)and

def factI(n): prod = 1 for i in range(2,n): prod = i*prod return prod

- How much does recursion add to the running time? Test the two versions of the factorial function on n = 100, 200, 300, 400, and 500, and report the differences in time.

*
Side Note: In practice, after you have a program running, you profile the code to determine which functions are using the most time. Then, you look at the most time-intensive function and see if there are ways to improve its efficiency. Sometimes, this can be accomplished by simple changes to the code. Other times, it might be necessary to rewrite the entire function. For time-critical programs, this could mean rewriting code in a specialized language for the task or one that allows low-level manipulation to optimize efficiency (e.g. c or assembly language).*

For each lab, you should submit a lab report by the target date to: `kstjohn AT amnh DOT org`. The reports should be about a page for the first labs and contain the following:

Target Date: 6 April 2016

Title: Lab 9: Sorting & Recursion

Name & Email:

Purpose: Give summary of what was done in this lab.

Procedure: Describe step-by-step what you did (include programs or program outlines).

Results: If applicable, show all data collected. Including screen shots is fine (can capture via the `Grab` program).

Discussion: Give a short explanation and interpretation of your results here.

This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.