Homework 5 — Strings and Files
- Become familiar with using strings, lists, dictionaries, and files
- Learn how to sort a list
- Learn how to read data from and write data to files
- Write a program in more than one module
In preparation for this assignment, read Chapter 11 of the textbook.
Assignment — Create a List of Distinct Words in a set of Text Documents
Write a program that opens and reads one or more files containing English text, scans each one to accumulate a list of individual words appearing in that file, and then, when every input file has been scanned, writes to another file the list of unique words appearing in the entire set of input files, in alphabetical order, along with the number of occurrences of each word. Your program must also write the total number of words read (including duplicates).
Your program must ignore the case of words. For example, ‘This’ and ‘this’ are the same word. On the other hand, singular and plural words are usually different (for example, ‘file’ and ‘files’), as are different verb forms such as ‘eat’ and ‘eats.’
Hyphenated words must be treated as one word; for example, ‘deep-seated’ is different from the words ‘deep’ and ‘seated.’ You must also take into account possessives and contractions such as “don’t” and “Bob’s”. Note that “Bob’s” is a separate word from “Bob”. Numbers such as 10,000 or $50 or 45¢ should also be treated as words. You may assume that words do not break across lines.
A sample output file should be in a format resembling the following:–
   20 Number of distinct words
  790 Total words read
Note: This is a traditional problem in C and C++ courses. The solution in Python is much simpler than in C. This problem is discussed in §11.6.3. However, there are significant differences between this assignment and the solution presented in that section of the textbook.
To allow for very long input files, the field width of the number of occurrences of each word should be at least five decimal digits. The counts must be right-justified in the output. You must also print the number of distinct words and the total number of words read. All of the words should be in lower case.
For debugging, you may use the following text files:–
For fun, you may use any of Shakespeare’s plays, which can be downloaded from the Internet.
Structure and organization
Structure this homework in three or more separate .py modules, plus a wrapper. No individual function in any file may be more than 25 lines in length.
- One module should handle the opening and scanning of input
- A separate module should handle the accumulation of words, taking into consideration duplicates, and
- Yet a third module should organize and write out the output
In addition, you must have a separate “wrapper” module for invoking your program with test cases. During development, you will want the wrapper to prompt you for simple test cases, and you will also want it to test particular functions or parts of your program. In the end, the wrapper will “manage” the other functions to complete the assignment.
Read each file a line at a time. You may use a for-loop as described on p. 156 of the textbook. For example:–
    file = open(filename, 'r')
    for line in file:
        # process line here
        # Note that line is a string
Each line is a string ending in the ‘\n’ character. Split the string into “words” delimited by white space using the string.split() method. Next, strip out the punctuation at the beginning and end of each word using the string.strip() method (see the two notes following these steps). By this means, you will automatically handle contractions (i.e., words with apostrophes) and hyphenated words. You may, however, end up with a number of “null” words, i.e., strings of length zero left by strip() applied to non-words. These must be discarded before taking the next step.
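The splitting and stripping step might be sketched as follows. This is only an illustration under stated assumptions: the function name line_to_words is hypothetical, and the choice of string.punctuation as the set of characters to strip is an assumption, since the assignment does not list which characters count as punctuation. Note that strip() called with no argument removes only white space, so the characters to remove must be passed explicitly.

```python
import string

def line_to_words(line):
    """Split one line into words, stripping punctuation from both ends.

    Hypothetical helper; string.punctuation is an assumed strip set.
    """
    words = []
    for token in line.split():                   # split at white space
        word = token.strip(string.punctuation)   # ends only; interior is kept
        if word:                                 # discard "null" words
            words.append(word)
    return words
```

Because only the ends of each token are stripped, the apostrophe in a contraction and the hyphen in a hyphenated word survive. Be aware that string.punctuation contains "$" but not "¢" or curly quotation marks, so the strip set may need adjusting for your input files.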
Finally, convert the words to lower case, and then add them to a dictionary. Use the approach described in §11.6.3 and the dictionary method get(). That is, if a word is not yet in the dictionary, add an entry to the dictionary using the word as the key and initialize its value (i.e., count) to 1. If a word is already in the dictionary, increment its count in the dictionary by one. By the time you have completed scanning all of the input documents, the dictionary will contain entries for all of the words, along with the number of occurrences of each one.
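A minimal sketch of the counting step, assuming the words arrive as a list of strings. The function name add_words is hypothetical; only the dictionary method get() comes from the assignment text.

```python
def add_words(counts, words):
    """Add each word, lower-cased, to the counts dictionary (hypothetical helper)."""
    for word in words:
        key = word.lower()
        # get() returns 0 for a word not yet seen, so one statement
        # covers both the "new word" and the "seen before" cases.
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The total number of words read, including duplicates, is then the sum of the values in the dictionary, and the number of distinct words is its length.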
Next, extract all of the keys in the dictionary to a list, and sort that list using the list.sort() method. The list will now be in alphabetical order, but it will not have any information about the number of occurrences of each word.
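For illustration, the extraction and sorting might look like the sketch below. The function name sorted_words is hypothetical.

```python
def sorted_words(counts):
    """Return the dictionary's keys as an alphabetically sorted list."""
    words = list(counts.keys())   # copy the keys into a list
    words.sort()                  # sorts in place and returns None
    return words
```

Strings are compared character by character by code point, so words that begin with digits sort ahead of words that begin with lower-case letters.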
In another function, go through the list. For each word, look up the count that has been accumu- lated in the dictionary, and then write the word and its count to the output using the file.write() method.
The output file needs to be formatted correctly, with the word counts right-justified on each line. It is strongly suggested to use the string.format() method, described in §5.8.2 of the textbook.
Formatting the output according to the problem specifications is an important requirement. You must learn how to use string.format() to right justify the numbers on each line. This creates a formatted string that you can then pass to file.write(). Note that file.write() works a lot like print() but applies to files rather than to the standard output.
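As a sketch of the formatting step, the helper below builds one output line. The name format_line is hypothetical, and placing the count before the text is an assumption based on the summary lines of the sample output.

```python
def format_line(count, text):
    """Return one output line: the count right-justified in a field of
    at least five columns, a space, the text, and a newline."""
    # ">" means right-justify, "5" is the minimum width, "d" is decimal.
    return "{:>5d} {}\n".format(count, text)
```

For example, format_line(20, "Number of distinct words") yields "   20 Number of distinct words\n". A count with more than five digits simply widens the field. Unlike print(), file.write() does not add a newline, which is why the format string ends with "\n".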
Finally, remember to close the output file, so that it is safely stored away.
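One way to make sure the file is closed even if a write fails is a try/finally block, sketched below. The function name write_lines is hypothetical, and it assumes the lines have already been formatted as strings ending in a newline.

```python
def write_lines(filename, lines):
    """Write already-formatted lines to filename, always closing the file."""
    out_file = open(filename, "w")
    try:
        for line in lines:
            out_file.write(line)
    finally:
        out_file.close()   # runs whether or not a write raised an error
```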
Note that string.strip() is not shown in the textbook under string methods in §5.6. However, it is just a combination of string.lstrip() and string.rstrip(), both of which are shown in Table 5.2.
Do not copy the code in the middle of p. 373 for replacing punctuation characters with spaces. That will give incorrect answers for this assignment.
Specifying input and output files
At the very minimum, your wrapper must prompt the user for the name of the output file, and then it must prompt the user for the names of the input files, one at a time, in a loop. It should accumulate the names of the input files in a list, each element of which is a string. The loop should terminate when the user enters a null (i.e., empty) string.
The wrapper should then call your function to read and process the input, with the list of input file names as argument. After all inputs have been read and processed, the wrapper should call another function to prepare the output and write it to the output file.
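The prompting loop described above could be sketched as follows. The function name prompt_for_files, the prompt wording, and the ask parameter are all hypothetical. The parameter exists only so that the loop can be tested without typing; by default it is the built-in input().

```python
def prompt_for_files(ask=input):
    """Prompt for the output file name, then for input file names
    until the user enters an empty string."""
    output_name = ask("Output file name: ")
    input_names = []
    while True:
        name = ask("Input file name (blank to finish): ")
        if name == "":          # empty string ends the loop
            break
        input_names.append(name)
    return output_name, input_names
```

The wrapper can then pass the returned list of input file names to the function that reads and processes the input.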
Extra credit: Getting file names from command line
For extra credit, augment your wrapper module to look for file names on the command line or command prompt. A typical command line will look something like the following:–
python HW5_username.py outputFile inputFile1 inputFile2 …
That is, HW5_username.py is the name of the Python program to be executed. The outputFile is the name of the file to which to write the output, and following that are one or more names of input files, each one of which is to be read and scanned for text.
If no output and/or input files are specified on the command line, then your wrapper should prompt for file names from the user in the IDLE or Python shell. However, if files are specified on the command line, then the wrapper must not prompt for anything from the shell.
Note on command line arguments
Most programs in real life take commands from a command line (Macintosh or Linux terminal window) or a command prompt (Windows Command Prompt window). The general format of such a command is
commandName argument1 argument2 argument3 …
The command name is, by convention, argument0 and is, in general, the name of the file containing the program that implements the command. Each argument is a string of characters. Also by convention, the operating system loads the file containing the command, prepares a list of the arguments along with a count, and then calls the function named main(), passing the list and count to it. The main() function then picks apart the list of arguments, readies the program, and runs it, passing the argument strings to its various internal components.
Command line arguments are supported by Windows, Macintosh, Linux, and most other systems. Graphical user interfaces (GUIs) typically associate specific file types with specific programs. For example, .docx is associated with Microsoft Word, .pdf is associated with Adobe Acrobat, and .py is associated with Python. In most cases, “opening” a file in the GUI causes the operating system to create a command line containing the associated program name and the file name and then to “run” that command as if it had been typed into a command shell or command prompt.
In some installations, Python programs can be run this way — by simply double-clicking or opening them in the GUI. But in many cases, Python is not sufficiently integrated with the GUI. Therefore, it has to be run separately from a command line or command prompt. In this case, you would run the program as follows:
python HW5_username.py argument1 argument2 argument3 …
The number of arguments is, of course, variable and depends upon the program.
To get access to these arguments in Python, import the sys module. The arguments are then available in a list named sys.argv. (“argv” is a traditional name used in Unix and Linux meaning “argument vector.”) Note that sys.argv[0] is the file name of your “main” Python program, i.e., the file that is “run” by the Python interpreter. The length of the list is len(sys.argv). See §29.1 of the online Python documents for more information.
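A sketch of how a wrapper might pick the file names out of the argument list is shown below. The function name files_from_argv is hypothetical, and so is the decision to treat anything short of one output name plus one input name as "not specified."

```python
import sys

def files_from_argv(argv):
    """Return (output_name, input_names) taken from argv, or None if
    the command line does not name an output and at least one input."""
    # argv[0] is the program name, argv[1] the output file,
    # and argv[2:] the input files.
    if len(argv) < 3:
        return None             # caller should fall back to prompting
    return argv[1], argv[2:]

if __name__ == "__main__":
    print(files_from_argv(sys.argv))
```

Taking the list as a parameter, rather than reading sys.argv inside the function, lets the wrapper test the function with made-up argument lists.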
Submit your project in the form of a zip-file to the WPI Canvas system at
This assignment is HW5. Zip together all of your Python modules, any input files that you used in addition to the three listed above, an output file showing the words from all of your inputs, and a file named README.txt that lists everything that you are submitting, along with any notes or other information that you want the graders to know about.
Your wrapper module must be named HW5_username.py (where any hyphens are replaced by underscores). Other modules may be named what you wish.
The zip-file must be named HW5_username.zip.
When the graders start to grade your submission, they will attempt to run HW5_username.py by invoking the Run > Run Module command from an IDLE edit window. They will enter the names of the graders’ test files when prompted. If you have implemented the extra credit option of this assignment, they will repeat the tests from the command line. In the second case, the wrapper should detect that the file names are provided as arguments to the command line as presented in sys.argv (see §29.1 of the online Python documents).
This assignment is worth 120 points, plus 25 points extra credit. Points are allocated as follows:–
- Program organization:– (25 points)
- Properly structured into three .py modules, with appropriate importing (10)
- All input file opening, closing, and reading in one module (2)
- All word counting, sorting, and dictionary management in a separate module (2)
- All output file opening, formatting, writing, and closing in another separate module (2)
- Wrapper for testing, prompting user, interpreting command line, combining words from multiple input files, etc. (9)
- File input:– (25 points)
- Correctly opening file for reading and closing when done (5)
- Reading lines or whole file (5)
- Splitting input into words at white-space boundaries (5)
- Correctly stripping leading and trailing punctuation (5)
- Building up and returning list of words (5)
- Sorting, duplicate removal, and counting:– (20 points)
- Correct use of a dictionary for storing and counting unique words (8) (includes removal of null words)
- Correct use of sort() method (7)
- Correctly returning a list of unique words (5)
- Output – (15 points)
- Correct use of format() method for each output line (5)
- Correct use of write() method (5)
- Correctly opening and closing output file (5)
- Properly structured submission (15 points)
- Output file showing words from all of your inputs
- README.txt listing everything in your submission plus notes and information to graders
- Graders’ test cases – (20 points)
- Extra credit, command line arguments:– (25 points)
- Correct use of argv list for getting command line arguments (15)
- Detecting when no command line arguments and prompting instead (10)
- Penalty for any function longer than 25 lines – 10 points per function
- Penalty for forgetting or failing to close each open file – 10 points per file open statement
- Late penalty per course rules