lab07 : String Formatting, and File IO

num	ready?	description	assigned	due
lab07	true	String Formatting, and File IO	Wed 11/20 09:00AM	Wed 11/27 08:59AM

In this lab, you’ll get more practice with:

Testing your functions with pytest
Using String Formatting
Reading and processing files

This lab should be done solo.

Instructions

In this lab, you will need to create two files:

lab07.py - file containing function definitions
lab07_tests.py - file containing test cases
Please comment your name / perm at the top of each file.
Make sure to have a docstring for every function.

Starter code is provided for you at the bottom of this page.

Create a directory called ~/cs8/lab07 (using the mkdir command) and cd into that directory.
Use idle3 (you might try idle3 & if you want to be able to type commands on your terminal window after IDLE opens).
Use “New File” to create empty files called lab07.py and lab07_tests.py in that ~/cs8/lab07 directory.

Some notes about the File I/O functions

Create two files input1.txt and input2.txt in the ~/cs8/lab07 directory. Use these two input files to test your functionality: add a small number of words (2-10) first to make sure that you can verify that your function is working correctly. Add punctuation next to make sure you can correctly remove it; test for contractions, e.g., “don’t”. Make sure to test a file that contains more than one line.

These files are used when running pytest on lab07_tests.py. We will be using additional input files to test your submission on Gradescope.

Note that words are separated by whitespace characters, and a word consists of alphanumeric characters and does not include punctuation characters (apart from the apostrophe in contractions such as “don’t”, which count as one word). For simplicity, you may assume the text only contains the ,.!?; punctuation characters. Your code will need to split the text read from the file and strip each word appropriately.
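As a small illustration of the splitting and stripping described above, here is a sketch that cleans a hard-coded sample string rather than a file. The sample text and the variable names are made up for this example, and this is only one possible approach.

```python
# Sample text is an assumption for illustration; in the lab the text comes from a file.
text = "Hello, world! Don't stop;\nkeep going."

# split() with no arguments splits on any whitespace, including newlines.
raw_words = text.split()

# strip(",.!?;") removes those characters from both ends of each piece,
# so the apostrophe inside "Don't" is left alone.
words = [w.strip(",.!?;") for w in raw_words]

print(words)  # ['Hello', 'world', "Don't", 'stop', 'keep', 'going']
```

Note that a piece made up entirely of punctuation (for example a lone "!") would become an empty string after stripping, so you may want to decide how your functions treat that case.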

Notes on computing the most common words

If you run mostCommonWords("input1.txt", 1), the function should essentially return the mode of the file (the word that occurs most often). To return the list of most common words, you need to count how many times each word occurs in the file. Implement wordFrequency() first to count the words; mostCommonWords() can then sort the words by frequency and store the top N in the returned list.
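The count-then-sort idea can be sketched on a plain list of words, without any file handling. The word list below is a hypothetical stand-in for the cleaned words of a file, and the variable names are chosen only for this example.

```python
# Hypothetical word list standing in for the cleaned words of a file.
words = ["world", "hello", "world", "hello", "world", "bye"]

# Count occurrences with a plain dictionary.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort the words by their count, largest first.
by_frequency = sorted(counts, key=counts.get, reverse=True)

N = 2
print(counts)            # {'world': 3, 'hello': 2, 'bye': 1}
print(by_frequency[:N])  # ['world', 'hello']
```

The lab text does not say how ties should be ordered. With this approach, words with equal counts stay in the order in which they were first seen, because sorted() is stable.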

Here are simple examples you should try:

input1.txt

hello
hello
hello world

input2.txt

hello world
world
world
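If you would rather create these two sample files from Python than type them in an editor, a minimal sketch could look like this. It assumes you run it from the ~/cs8/lab07 directory, since it writes into the current directory and overwrites any existing files with these names.

```python
# Contents of the two sample files shown above, one word group per line.
samples = {
    "input1.txt": "hello\nhello\nhello world\n",
    "input2.txt": "hello world\nworld\nworld\n",
}

# Write each file into the current directory (existing files are overwritten).
for name, contents in samples.items():
    with open(name, "w") as f:
        f.write(contents)
```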

mostCommonWords("input1.txt", 1): test that the function returns “hello” as the most frequently occurring word.
mostCommonWords("input1.txt", 2), the function should return ['hello', 'world'].
mostCommonWords("input2.txt", 2), the function should return ['world', 'hello'].
mostCommonWords("input2.txt", 3), the function should print [Error] The "input2.txt" contains 2 unique words (you asked for 3). Check that it also returns None.
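Since this lab also practices string formatting, here is one way the error message could be assembled with str.format(). The values are hard-coded for illustration and the variable names are assumptions; in your function they would come from the arguments and from the word counts.

```python
# Hard-coded values for illustration only.
filename = "input2.txt"
unique_count = 2
N = 3

# Single quotes around the template make it easy to include the double quotes.
message = '[Error] The "{}" contains {} unique words (you asked for {}).'.format(
    filename, unique_count, N
)
print(message)  # [Error] The "input2.txt" contains 2 unique words (you asked for 3).
```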

Test the other functions accordingly, verifying on a simple input file that the results are correct.
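To show the general shape of a pytest test on a small file, here is a sketch. The function countLines and the file name sample_for_test.txt are hypothetical and not part of the lab; they are defined here only so the example runs on its own. In lab07_tests.py you would instead import your own functions from lab07.py and use your input files.

```python
def countLines(filename):
    '''Hypothetical stand-in (not part of the lab): returns the number of lines in the file.'''
    f = open(filename, "r")
    lines = f.readlines()
    f.close()
    return len(lines)


def test_countLines_multiple_lines():
    # Create a small, known input file so the expected answer is easy to verify by hand.
    with open("sample_for_test.txt", "w") as f:
        f.write("hello\nhello\nhello world\n")

    # The file has three lines, so the function should return 3.
    assert countLines("sample_for_test.txt") == 3
```

pytest collects functions whose names start with test_ and reports each failing assert.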

Upload `lab07.py` and `lab07_tests.py` to Gradescope.

Once you’re done with writing your functions, navigate to the Lab assignment “lab07” on Gradescope and upload your lab07.py and lab07_tests.py files.

`lab07.py`

# Student: (insert name and perm number here)

def totalWords(filename):
    '''
    (20 points)
    Reads the file from filename in your function and returns
    the number of words in the file.
    - Words are separated by whitespace characters, and a word does not include
    the following punctuation characters (,.!?;). You can assume contractions
    count as one word (e.g. "don't", "you'll", etc. are one word).
    - The split and strip functions may be useful in your implementation.
    - Your function should open the file for reading, and close
    the file before returning.
    '''
    return "stub"


def longestWord(filename):
    '''
    (20 points)
    Reads the file from filename in your function and returns
    the longest word in the text file.
    - Words are separated by whitespace characters, but do not include
    the following punctuation characters (,.!?;). You can assume contractions
    count as one word (e.g. "don't", "you'll", etc. are one word).
    - In the case of a tie, the 1st occurrence of the longest word
    is returned.
    - The split and strip functions may be useful.
    - Your function should open the file for reading, and close
    the file before returning.
    '''
    return "stub"


def charactersPerWord(filename):
    '''
    (20 points)
    Reads the file from filename in your function and returns
    the average number of characters per word.
    - Words are separated by whitespace characters, but do not include
    the following punctuation characters (,.!?;). You can assume contractions
    count as one word (e.g. "don't", "you'll", etc. are one word).
    - The split and strip functions may be useful.
    - Your function should open the file for reading, and close
    the file when done.
    '''
    return "stub"


def wordFrequency(filename):
    '''
    (20 points)
    Reads the file from filename in your function and returns a dictionary
    with each word as a key and the frequency of that word as its value.
    - Words are separated by whitespace characters, but do not include
    the following punctuation characters (,.!?;). You can assume contractions
    count as one word (e.g. "don't", "you'll", etc. are one word).
    - The split and strip functions may be useful.
    - Your function should open the file for reading, and close
    the file before returning.
    '''
    return "stub"
    
    
def mostCommonWords(filename, N):
    '''
    (20 points)
    Reads the file from filename in your function and returns a list of N most
    common words in the text file (i.e., N words with the highest frequency),
    sorted by the number of times they occurred in the file (most common first).
    - Use wordFrequency() helper function to count the frequency of each word.
    - Print "[Error] The "<filename>" contains <X> unique words (you asked for <N>)."
    and return None if N is larger than the number of unique words in the file
    (substitute "<filename>", <X>, <N> with the actual values).
    '''
    return "stub"