This lab works with associative arrays (dictionaries).
This program analyzes word frequency in a file
and prints a report on the n most frequent words.

File to analyze: wordfreq.py
Output analysis of how many words? 10
'                   6
text                6
words               6
counts              5
items               5
n                   5
in                  4
of                  4
word                4
1                   3

To count the number of words, Zelle uses an associative array, or dictionary. These are similar to arrays, but only store entries where they are needed. For our counting of words above, instead of having an entry for every word in the English language, we set up a dictionary:
counts = {}

It starts out empty, but as we encounter new words, we add to the count for those words in a similar fashion as for lists or arrays:
counts[w] = counts.get(w,0) + 1

After we have processed the entire file, we turn our dictionary into a list that we can sort, and print out the n items that have occurred most often:
items = list(counts.items())
items.sort()
items.sort(key=byFreq, reverse=True)
for i in range(n):
    word, count = items[i]
    print("{0:<15}{1:>5}".format(word, count))
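Here is a sketch of the whole pipeline on a tiny made-up list (the words are invented for illustration). Note why we sort twice: Python's sort is stable, so sorting alphabetically first means that words with equal counts come out in alphabetical order after the second sort by frequency.

```python
def byFreq(pair):
    return pair[1]

# A small, made-up word list, standing in for the words read from the file.
words = ["spam", "eggs", "spam", "toast", "spam", "eggs"]

# Count with get(): 0 is returned for words not yet in the dictionary.
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# Sort alphabetically first; the second sort is stable, so words with
# equal counts stay in alphabetical order.
items = list(counts.items())
items.sort()
items.sort(key=byFreq, reverse=True)
for word, count in items:
    print("{0:<15}{1:>5}".format(word, count))
```

This prints spam (3), then eggs (2), then toast (1), with eggs before toast because of the alphabetical tiebreak.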
Let's try this on another file, sandwich.txt (from the nifty assignment developed by John DeNero and Aditi Muralidharan). This file is a collection of tweets, gathered over 24 hours, that contain the word "sandwich" (the format follows the JSON used by Twitter). We run our program again, with a small change to the open() call so it can handle the extra characters that occur in tweets:
text = open(fname,'r',encoding='utf8').read() #Added in encoding to handle tweets

Now we can print out the most common words:
This program analyzes word frequency in a file
and prints a report on the n most frequent words.

File to analyze: sandwich.txt
Output analysis of how many words? 10
sandwich         783
2011             779
08               503
09               471
co               353
http             351
t                350
a                307
3                205
4                183

Why so many numbers? Let's look at the first couple of lines of sandwich.txt:
[33.979703999999998, -118.037312]	6	2011-08-28 19:35:49	Flat Iron steak sandwich, Arnold Palmer (under contract at BIOLA) (@ The 6740) http://t.co/xCE6jou
[42.917841000000003, -78.877071999999998]	6	2011-08-28 19:41:00	Pulled pork sandwich, mac-n-cheese from Fat Bob's. (@ Elmwood Avenue Festival of the Arts) http://t.co/pkedqkn
[47.864685469999998, -122.28599858]	6	2011-08-28 19:59:49	Why is lettuce so important to a sandwich/burger when it taste like nothing and has no nutritional value?
[40.70896407, -73.818375709999998]	6	2011-08-28 20:10:57	Hurricane Tuna Sandwich fix (@ Dunkin' Donuts w/ 3 others) [pic]: http://t.co/qzXaFD2

Each line has a lot of additional information besides the twitter message itself, separated by tab characters: first the location, then a callback (which we will ignore), then the date and time, and lastly the tweet itself.
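Because the fields are tab-separated, split("\t") pulls a line apart. A quick sketch using the first sample line above (the tab separators are written here as \t):

```python
# First sample line from sandwich.txt, with the tab separators written as \t.
line = ("[33.979703999999998, -118.037312]\t6\t2011-08-28 19:35:49\t"
        "Flat Iron steak sandwich, Arnold Palmer (under contract at BIOLA) "
        "(@ The 6740) http://t.co/xCE6jou")

fields = line.split("\t")
print(fields[0])  # the location: [33.979703999999998, -118.037312]
print(fields[1])  # the callback, which we ignore
print(fields[2])  # the date and time: 2011-08-28 19:35:49
print(fields[3])  # the tweet itself
```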
Let's break up the information and create two different dictionaries:
First, we'll go through the file and split each line into fields, keeping only the hour that the tweet occurred and the tweeted message:
infile = open(fname,'r',encoding='utf8')

# Split up the lines to count the number of events each hour,
# and put the tweets in text to be processed as before.
text = ""
times = {}
for line in infile.readlines():
    fields = line.split("\t")
    text = text + fields[3]
    time = fields[2].split()[1]
    hour = time[0:2]
    times[hour] = times.get(hour,0) + 1
    print(hour, fields[3])  # Print for testing; remove before running final program
Try running the program with the test print above. Then comment out the print, since it's not needed once you've verified it is getting the right fields.
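To see how the hour is extracted, trace the splitting and slicing by hand on the date/time field from the first sample line:

```python
# The date/time field from the first sample line of sandwich.txt.
field = "2011-08-28 19:35:49"

time = field.split()[1]   # split on whitespace: ["2011-08-28", "19:35:49"]
hour = time[0:2]          # first two characters of the time string
print(hour)               # 19
```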
Next, we'll print out the hours and the number of tweets in the dictionary for each hour. We'll follow the same format as Zelle used for counting words:
#Print out when tweets occur:
items = list(times.items())
items.sort()
for item in items:
    #Each item is a tuple of the key and its associated count:
    print("At hour", item[0], ", there were", item[1], "tweets.")

At what hour do most people tweet about sandwiches? Try your program on the other tweet files collected at the nifty site. Each file there is labelled by its common topic (e.g. "soup", "obama", or "party").
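You can also answer the question programmatically: Python's max accepts a key function, so it can pick out the hour with the largest count directly from the dictionary. A sketch with made-up counts (the values below are hypothetical, not real data):

```python
# Made-up hourly counts for illustration.
times = {"19": 42, "20": 57, "21": 36}

# max with key=times.get returns the key whose associated count is largest.
busiest = max(times, key=times.get)
print("Most tweets at hour", busiest, "with", times[busiest], "tweets.")
```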
# wordfreq.py

def byFreq(pair):
    return pair[1]

def main():
    print("This program analyzes word frequency in a file")
    print("and prints a report on the n most frequent words.\n")

    # get the sequence of words from the file
    fname = input("File to analyze: ")
    text = open(fname,'r',encoding='utf8').read() #Added in encoding to handle tweets
    text = text.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()

    # construct a dictionary of word counts
    counts = {}
    for w in words:
        counts[w] = counts.get(w,0) + 1

    # output analysis of n most frequent words.
    n = int(input("Output analysis of how many words? "))
    items = list(counts.items())
    items.sort()
    items.sort(key=byFreq, reverse=True)
    for i in range(n):
        word, count = items[i]
        print("{0:<15}{1:>5}".format(word, count))

if __name__ == '__main__':
    main()

tweetFreq.py
# Modified version of wordfreq.py (from Zelle's 2nd Edition)
# to analyze tweet times

def byFreq(pair):
    return pair[1]

def main():
    print("This program analyzes word frequency in a file")
    print("and prints a report on the n most frequent words.\n")

    # get the sequence of words from the file
    fname = input("File to analyze: ")

    ### New code between these comments: ###
    infile = open(fname,'r',encoding='utf8')
    # Split up the lines to count the number of events each hour,
    # and put the tweets in text to be processed as before.
    text = ""
    times = {}
    for line in infile.readlines():
        fields = line.split("\t")
        text = text + fields[3]
        time = fields[2].split()[1]
        hour = time[0:2]
        times[hour] = times.get(hour,0) + 1
        print(hour, fields[3])  # Print for testing; remove before running final program

    # Print out when tweets occur:
    items = list(times.items())
    items.sort()
    for item in items:
        # Each item is a tuple of the key and its associated count:
        print("At hour", item[0], ", there were", item[1], "tweets.")
    ### New code between these comments: ###

    # Process the text as before:
    text = text.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()

    # construct a dictionary of word counts
    counts = {}
    for w in words:
        counts[w] = counts.get(w,0) + 1

    # output analysis of n most frequent words.
    n = int(input("Output analysis of how many words? "))
    items = list(counts.items())
    items.sort()
    items.sort(key=byFreq, reverse=True)
    for i in range(n):
        word, count = items[i]
        print("{0:<15}{1:>5}".format(word, count))

if __name__ == '__main__':
    main()