lab07_extra : String Formatting and File IO (cont'd)
num | ready? | description | assigned | due |
---|---|---|---|---|
lab07_extra | true | String Formatting and File IO (cont'd) | Wed 12/04/2019 09:00AM | Wed 12/11/2019 08:59AM |
In this lab, you’ll get more practice with:
- Testing your functions with pytest
- Using String Formatting
- Reading and processing files
This lab should be done solo.
This lab is optional and is designed to help you test and get credit for your previous solution. Additionally, there are slight modifications you can make to your implementation, which will make the resulting output more interesting to analyze.
You will get credit for the lab only if you use the material we covered in class.
Looking up how to solve a problem counts as a violation of academic integrity, so refrain from doing so. Instead, think about how what you learned in class can help you solve this problem. Using functions we have not covered in class is a red flag: we are interested in how well you can apply the material that you learned in this class, not in how well you can look up someone else’s solution.
Part 0
If you open and close the file within every function to get the list of words, consider writing a helper function, which you can call instead.
def getAllWords(filename):
    '''
    Returns a list of all words from the given filename.
    '''
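One possible shape for such a helper is sketched below. It is only an illustration, not the expected solution: it assumes that words are separated by whitespace and that the `,.!?;` characters only need to be stripped from the ends of each word. The file name `demo_words.txt` and the temporary directory are made up for the demonstration.

```python
import os
import tempfile


def getAllWords(filename):
    '''
    Returns a list of all words from the given filename.
    '''
    words = []
    with open(filename) as infile:
        for line in infile:
            for word in line.split():
                # Assumption: strip only the ,.!?; characters from the ends
                word = word.strip(",.!?;")
                if word != "":
                    words.append(word)
    return words


# Small demonstration on a throwaway file (hypothetical name)
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_words.txt")
    with open(demo_path, "w") as outfile:
        outfile.write("Hello, world! Hello again.\n")
    demo_words = getAllWords(demo_path)

print(demo_words)  # ['Hello', 'world', 'Hello', 'again']
```

With a helper like this, each of your other functions can start from the returned list instead of opening and closing the file itself.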
Part 1
If you correctly implemented the previous lab, you should be able to run the following test file and pass all tests.
Note that the input file “the-gift-of-the-magi.txt” (source: Gutenberg project) contains double-quotes and dollar signs, which are not alpha-numeric characters. This should not be a problem if you stuck to our original assumption that “the text only contains the ,.!?; punctuation characters”.
#test_functions.py (lab 07)
# test file without the em dash
import pytest

input_file = "the-gift-of-the-magi.txt"
total_words = 2093
unique_words = 859

def test_totalWords_prophet():
    '''Test totalWords(input_file)'''
    from lab07 import totalWords
    assert totalWords(input_file) == total_words

# Tests for longestWord
def test_longestWord_prophet():
    '''Test longestWord(input_file)'''
    from lab07 import longestWord
    assert longestWord(input_file) == "inconsequential"

def test_charactersPerWord_prophet():
    '''Test charactersPerWord(input_file)'''
    from lab07 import charactersPerWord
    assert round(charactersPerWord(input_file), 5) == 4.22121

def test_mostCommomWords_1():
    ''' Test mostCommonWords, N=1 '''
    from lab07 import mostCommonWords
    assert mostCommonWords(input_file, 1) == ['the']

def test_mostCommomWords_3():
    ''' Test mostCommonWords, N=3 '''
    from lab07 import mostCommonWords
    assert mostCommonWords(input_file, 3) == ['the', 'and', 'a']

def test_mostCommomWords_5():
    ''' Test mostCommonWords, N=5 '''
    from lab07 import mostCommonWords
    assert mostCommonWords(input_file, 5) == ['the', 'and', 'a', 'of', 'to']

def test_mostCommomWords_30():
    ''' Test mostCommonWords, N=30 '''
    from lab07 import mostCommonWords
    words = mostCommonWords(input_file, 30)
    assert words[29] == "all"

def test_mostCommomWords_last():
    ''' Test mostCommonWords, last word '''
    from lab07 import mostCommonWords
    words = mostCommonWords(input_file, unique_words)
    assert words[unique_words-1] == "\"Cut"

def test_mostCommomWords_prophet_neg():
    ''' Test mostCommonWords, N is negative '''
    from lab07 import mostCommonWords
    assert mostCommonWords(input_file, -1) == None

def test_mostCommomWords_prophet_too_large():
    ''' Test mostCommonWords, N is too large '''
    from lab07 import mostCommonWords
    assert mostCommonWords(input_file, unique_words+1) == None
Part 2
Now that you verified that your previous code works as expected, let’s add a few modifications to it.
- Let’s also remove double-quotes when we read in the words (in addition to the ,.!?; characters that you are already removing; don’t remove any other characters).
- The lower/upper case or capitalization shouldn’t matter when counting words, so let’s enforce that by converting words to lowercase. Notice that “hi”, “Hi” and “HI” will all be counted as the same word “hi”. Dealing with cases like acronyms and state abbreviations is outside the scope of this assignment.
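The two modifications above can be illustrated on individual words. This is a minimal sketch under the assumption that the characters are stripped from the ends of each word; `cleanWord` is a hypothetical helper name, not something the lab requires.

```python
def cleanWord(word):
    '''
    Hypothetical helper: strips the ,.!?; characters and double-quotes
    from both ends of the word, then converts it to lowercase.
    '''
    return word.strip(',.!?;"').lower()


# A few sample words, including one that starts with a double-quote
for raw in ['"Cut', 'Hi', 'HI!', 'hi,']:
    print("{} -> {}".format(raw, cleanWord(raw)))
```

All four sample words end up without punctuation and in lowercase, so “Hi”, “HI!” and “hi,” are counted as the same word.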
Part 3
As you may have noticed, the most common words usually include articles and prepositions, which are not very interesting for text analysis.
- Remove stop words using the provided “stopwords.txt” (source: natural language toolkit (nltk) module)
  - Hint: since remove will take out only the first occurrence of an item in the list, you can use a while loop and keep removing the item while it is still found in the list.
  - Alternatively, you can create a new list and save in it all words that are not in the list of stopwords.
def removeStopwords(wordList, stopwords_file):
    '''
    Given a list of words (wordList) and a file that
    contains stopwords separated by newlines (stopwords_file),
    remove from wordList all words that are in stopwords_file.
    '''
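The two hints above can be sketched side by side. This is an illustration under stated assumptions, not the required implementation: `readStopwords` and `removeStopwordsNewList` are hypothetical names, the demo file `demo_stopwords.txt` is made up, and both versions are assumed to return the resulting list.

```python
import os
import tempfile


def readStopwords(stopwords_file):
    '''Hypothetical helper: returns the stopwords as a list, one per line.'''
    stopwords = []
    with open(stopwords_file) as infile:
        for line in infile:
            word = line.strip()
            if word != "":
                stopwords.append(word)
    return stopwords


def removeStopwords(wordList, stopwords_file):
    '''First hint: keep calling remove() while the stopword is still found.'''
    for stopword in readStopwords(stopwords_file):
        while stopword in wordList:
            wordList.remove(stopword)  # removes only the first occurrence
    return wordList


def removeStopwordsNewList(wordList, stopwords_file):
    '''Second hint: build a new list of the words that are not stopwords.'''
    stopwords = readStopwords(stopwords_file)
    kept = []
    for word in wordList:
        if word not in stopwords:
            kept.append(word)
    return kept


# Demonstration with a tiny, made-up stopwords file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_stopwords.txt")
    with open(path, "w") as outfile:
        outfile.write("the\nof\n")
    words = ["the", "gift", "of", "the", "magi"]
    result_new = removeStopwordsNewList(words, path)  # words is left unchanged
    result_inplace = removeStopwords(words, path)     # words is modified in place

print(result_new)      # ['gift', 'magi']
print(result_inplace)  # ['gift', 'magi']
```

Note the difference in behavior: the first version changes the list that was passed in, while the second leaves the original list untouched and returns a new one.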
Now that we’ve added an option to run our code with the stopwords file, let’s change our function signatures to add a parameter stopwords_file, which by default will be set to None.
For example:
def totalWords(filename, stopwords_file = None):
    '''
    Reads the file from filename in your function and returns
    the number of words in the file.
    '''
    allWords = getAllWords(filename)
    if stopwords_file != None:
        # a stopwords file was given, so filter the words first
        allWords = removeStopwords(allWords, stopwords_file)
    # whether or not stopwords_file was given,
    # continue with the regular processing of allWords
Doing so will allow you to run the function with and without the default argument:
totalWords(input_file)
totalWords(input_file, stopwords_file)
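To see how a default argument behaves on its own, here is a small sketch that is independent of the lab functions. `describeCall` is a hypothetical function that only reports which branch was taken, and the file names are placeholders.

```python
def describeCall(filename, stopwords_file=None):
    '''
    Hypothetical function: reports whether a stopwords file was given.
    '''
    if stopwords_file is None:
        return "processing {} without stopwords".format(filename)
    return "processing {} with stopwords from {}".format(filename, stopwords_file)


print(describeCall("story.txt"))
print(describeCall("story.txt", "stopwords.txt"))
print(describeCall("story.txt", stopwords_file="stopwords.txt"))
```

Parameters with default values must come after the parameters without defaults, which is why the tests below call mostCommonWords(input_file, N, stopwords_file) with the stopwords file last.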
Check that adding stopwords_file = None to all your functions still works correctly with all the tests from Part 2.
Below are the tests for checking whether the stopwords are being removed correctly. Notice now that by looking at the top 3 common words you can get a better sense of what this story is about.
#test_functions.py (lab 07_extra)
# With stopwords (no em dash)
import pytest

input_file = "the-gift-of-the-magi.txt"
stopwords_file = "stopwords.txt"
total_words = 1041
unique_words = 654

def test_totalWords_input1_sw():
    '''Test totalWords(input_file, stopwords_file)'''
    from lab07_extra import totalWords
    assert totalWords(input_file, stopwords_file) == total_words

# Tests for longestWord
def test_longestWord_input1_sw():
    '''Test longestWord(input_file, stopwords_file)'''
    from lab07_extra import longestWord
    assert longestWord(input_file, stopwords_file) == "inconsequential"

def test_charactersPerWord_sw():
    '''Test charactersPerWord(input_file, stopwords_file)'''
    from lab07_extra import charactersPerWord
    assert round(charactersPerWord(input_file, stopwords_file), 5) == 5.54467

def test_mostCommomWords_1_sw():
    ''' Test mostCommonWords, N=1 '''
    from lab07_extra import mostCommonWords
    assert mostCommonWords(input_file, 1, stopwords_file) == ['jim']

def test_mostCommomWords_3_sw():
    ''' Test mostCommonWords, N=3 '''
    from lab07_extra import mostCommonWords
    assert mostCommonWords(input_file, 3, stopwords_file) == ['jim', 'della', 'hair']

def test_mostCommomWords_5_sw():
    ''' Test mostCommonWords, N=5 '''
    from lab07_extra import mostCommonWords
    assert mostCommonWords(input_file, 5, stopwords_file) == ['jim', 'della', 'hair', '--', 'said']

def test_mostCommomWords_30_sw():
    ''' Test mostCommonWords, N=30 '''
    from lab07_extra import mostCommonWords
    words = mostCommonWords(input_file, 30, stopwords_file)
    assert words[29] == "without"

def test_mostCommomWords_last_sw():
    ''' Test mostCommonWords, last word '''
    from lab07_extra import mostCommonWords
    words = mostCommonWords(input_file, unique_words, stopwords_file)
    assert words[unique_words-1] == "$20"

def test_mostCommomWords_neg_sw():
    ''' Test mostCommonWords, N is negative '''
    from lab07_extra import mostCommonWords
    assert mostCommonWords(input_file, -1, stopwords_file) == None

def test_mostCommomWords_too_large_sw():
    ''' Test mostCommonWords, N is too large '''
    from lab07_extra import mostCommonWords
    assert mostCommonWords(input_file, unique_words+1, stopwords_file) == None
Upload lab07_extra.py to Gradescope.
Once you’re done writing your functions, navigate to the Lab assignment “lab07_extra” on Gradescope and upload your lab07_extra.py.
Remember to add your name and perm number at the start of the file.