
CS 325.001
Software Engineering
Spring 2017 / Team Project #1
Use Case Diagrams, Unit Modules, and Required Documentation due by 8 AM on Thursday, 1/19/17.
Fully Integrated Code and Required Documentation due by 8 AM on Thursday, 2/2/17.

Team Cheddar: Joshua L. Baldwin, Dane A. Grote, Ryan O. Wheeler
Team Chorizo: Paul F. Barkley, Jonathan Hearn, Adam Wooldridge
Team Garlic: Alexander J. Brotherton, Andrew M. Henry, William J. Rohrkaste
Team Hickory: Tyler M. Burch, Bryant M. Peppler, Frankie Seavey
Team Jalapeño: Clayton E. Calabrese, John B. Lednicky, Kyle T. Shriver
Team Pepper: Hernan Cortez, Lindsey N. McCall, Samuel R. Smyth
Team Tabasco: Cristian Diaz, Cory M. McCunney, Maximilian Spicer
Team Teriyaki: Maksym O. Doroshenko, Diane R. Meng, Taylor R. Thul
Team Tocino: Brandon T. Faulkner, Adam W. Papirnik, Justin A. Vartanian

Your first CS 325 team (listed above) is being assigned a programming project (in HTML and CSS) to be conducted in two phases. In Phase One, each team member will develop and test independent unit modules to perform very specific input, output, and processing tasks. Each team member is completely responsible for his or her modules, but the team will work together to develop use case diagrams to model the practical application of the integrated code that will be produced in Phase Two of the project. In Phase Two, the entire team is responsible for integrating the various code segments into a single piece of software. When the deadline for the assignment is reached, the program must be completely implemented, tested, documented, and functional.
The project involves the development of a spam filter that will be based on the statistics that have already been accumulated from large sets of spam and ham (i.e., non-spam) messages. In the final integrated program, a user will specify three pieces of information:

The name of a text file on the user’s system. The file consists of ordered triples, each of which starts with a word, followed by the percentage of the spam messages that contained that word and the percentage of the ham messages that contained that word. For instance, one such text file might contain the line:

mortgage 0.23 0.02
This indicates that the word “mortgage” occurred in 23% of that file’s spam sample set and in 2% of that file’s ham sample set. This file, known as a “spam dictionary”, will normally require hundreds of triples of this form.

The name of another text file on the user’s system. This file constitutes a message that the user wishes to test to determine the likelihood that it is a spam message. The details for this calculation are supplied below.
A percentage cutoff specifying the spam probability above which a message will be considered spam. For example, if the user specifies a cutoff of 90%, then a tested message that yields a 92% spam probability would be considered spam, but a tested message that yields an 88% spam probability would not be considered spam.

A spam filter begins by processing the spam dictionary and calculating the probability that each word in the dictionary is a “spam indicator”. For each word, the percentage $p_{spam}$ of spam messages containing the word and the percentage $p_{ham}$ of ham messages containing the word are retrieved from the first input file, and the probability $P$ that the word is a spam indicator is calculated as follows:

$$P = \frac{p_{spam}}{p_{spam} + p_{ham}}$$

For instance, in our example above, the word “mortgage” occurs in 23% of the spam sample set (i.e., $p_{spam} = 0.23$) and in 2% of the ham sample set (i.e., $p_{ham} = 0.02$), so the probability $P$ that a new message containing the word “mortgage” would be spam is $0.23 / (0.23 + 0.02) = 0.92$, or 92%.
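The per-word calculation can be sketched as follows, in Python purely for illustration (the project itself targets a web page). The sketch assumes the formula is spam percentage divided by the sum of the spam and ham percentages, which reproduces the 92% figure for “mortgage”; the function name `word_spam_probability` is hypothetical.

```python
def word_spam_probability(p_spam: float, p_ham: float) -> float:
    """Return the spam-indicator probability for one dictionary word.

    Assumed formula: P = p_spam / (p_spam + p_ham).
    """
    total = p_spam + p_ham
    if total == 0:
        # A word seen in neither sample set carries no evidence either way.
        raise ValueError("word occurs in neither the spam nor the ham sample set")
    return p_spam / total


# The "mortgage" triple from the example dictionary line.
print(f"{word_spam_probability(0.23, 0.02):.0%}")  # prints 92%
```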
Rather than calculating a spam dictionary from a large group of spam and ham messages, your application will merely retrieve one from storage by means of user selection. When testing a user-specified text message to see if it qualifies as spam, the fifteen words in the test message with the most extreme probabilities (i.e., furthest in either direction from 0.5) should be determined and inserted into the following formula to calculate the probability that the new message is spam:

$$P = \frac{\prod_{i=1}^{15} P_i}{\prod_{i=1}^{15} P_i + \prod_{i=1}^{15} (1 - P_i)}$$

where $P_i$ is the spam probability associated with the $i$th word and $\prod$ is the product symbol. Any test message with a high enough spam probability will be considered spam by the filter.
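A minimal sketch of the message-level calculation, again in Python for illustration only. It assumes the formula is the usual combination of independent word probabilities: the product of the $P_i$ divided by the sum of that product and the product of the $(1 - P_i)$. The function name `message_spam_probability` is hypothetical.

```python
import math


def message_spam_probability(word_probs):
    """Combine per-word spam probabilities into one message probability.

    Assumed formula: prod(P_i) / (prod(P_i) + prod(1 - P_i)).
    Every P_i must lie strictly between 0.0 and 1.0.
    """
    spam_product = math.prod(word_probs)
    ham_product = math.prod(1.0 - p for p in word_probs)
    return spam_product / (spam_product + ham_product)


# Fifteen neutral words leave the message exactly undecided.
print(message_spam_probability([0.5] * 15))  # prints 0.5
```

Because every word probability is strictly between 0.0 and 1.0, neither product is zero for fifteen ordinary values, so the division is well defined.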
Keep in mind the following considerations:

All probabilities of words in the spam dictionary must be between 0.0 and 1.0, with none actually equaling 0.0 or 1.0.
If a test message does not contain at least fifteen words from the spam dictionary, then it cannot be tested.
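The selection step and the fifteen-word minimum can be sketched together, in Python for illustration. This sketch assumes that repeated words in the message count once and that ties are broken alphabetically; the handout specifies neither, and the name `most_extreme_words` is hypothetical.

```python
def most_extreme_words(message_words, dictionary, count=15):
    """Pick the dictionary words in a message that lie farthest from 0.5.

    message_words: iterable of words taken from the test message.
    dictionary:    mapping of word -> spam probability.
    Returns a list of (word, probability) pairs, or None when the message
    shares fewer than `count` distinct words with the dictionary.
    """
    # Assumption: each distinct word is considered once.
    shared = {word for word in message_words if word in dictionary}
    if len(shared) < count:
        return None  # the message cannot be tested
    # Sort by distance from 0.5, largest first; alphabetical order breaks ties.
    ranked = sorted(shared, key=lambda w: (-abs(dictionary[w] - 0.5), w))
    return [(word, dictionary[word]) for word in ranked[:count]]


sample = {"mortgage": 0.95, "meeting": 0.2, "hello": 0.6}
print(most_extreme_words(["hello", "mortgage", "meeting"], sample, count=2))
```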

For Phase One of this project, members of your team will develop and test each of the following self-contained modules:

A. Input a user-specified text file (i.e., one with suffix .txt) and confirm that it is in the format of the triples described above (i.e., word, spam percentage, ham percentage). If it is incorrectly formatted, give the user a message indicating the nature of the formatting problem (e.g., a word contains a non-alphabetic character other than a hyphen or apostrophe, a percentage contains a non-numeric character other than a decimal point, or a percentage is not between 0.0 and 1.0). Otherwise, compute the spam probability of each word in the file and display a scrollable list of all of the words and their spam probabilities.
B. Input a user-specified text file (i.e., one with suffix .txt) and display it in a scrollable textbox. Allow the user to toggle a checkbox for case sensitivity; when case sensitivity is on, the text is displayed in its original format, but when case sensitivity is off, the text is displayed entirely in lower case.
C. Input a user-specified text file assumed to be a list of words, each on a separate line of text (no error-checking for format). Input a second user-specified text file. Search the second file and count how many times each word from the first file occurs in the second file. Display a scrollable list of all of the words from the first file and their counts in the second file.
D. Input a user-specified text file consisting of pairs of values, each on a separate line of text, with the first value in the pair being a word and the second being a percentage (no error-checking for format). Input a second user-specified text file assumed to be a list of words, each on a separate line of text (no error-checking for format). Display a scrollable list of the fifteen most extreme words from the second file (i.e., the fifteen words that are in both files and have first-file percentages farthest from 0.5). Display the percentages next to the displayed words. If there are not at least fifteen words shared by the two files, just display a message to that effect.
E. Input a user-specified text file assumed to be a list of fifteen words (no error-checking for format) and a second user-specified text file. Display the contents of the second file in a scrollable textbox, but with each occurrence of one of the fifteen first-file words highlighted in some way (e.g., bold font, underlined, italicized, red letters). Note that substring occurrences (e.g., “vita” is a substring of “inevitable”) should not be highlighted.
F. Input a user-specified text file consisting of fifteen percentages (no error-checking for format). Use a textbox to allow the user to enter a cutoff percentage (with error-checking for format). Use the formula above to calculate and display the spam message percentage. Also display a message indicating whether or not the spam message percentage exceeds the user-supplied cutoff percentage.
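The format checks described for the first module in the list above can be sketched as a per-line validator, shown in Python for illustration. The message wording, the function name `check_triple`, and the choice to accept 0.0 and 1.0 themselves as in-range percentages are all assumptions of this sketch rather than requirements stated in the handout.

```python
import re

# Letters plus hyphen and apostrophe; the trailing hyphen is literal.
WORD_RE = re.compile(r"[A-Za-z'-]+")
# Digits with at most one decimal point, e.g. "0.23", "1.", ".5".
NUMBER_RE = re.compile(r"\d+(\.\d*)?|\.\d+")


def check_triple(line):
    """Return None if the line is a valid triple, else a problem description."""
    fields = line.split()
    if len(fields) != 3:
        return "expected three fields: word, spam percentage, ham percentage"
    word, *percentages = fields
    if not WORD_RE.fullmatch(word):
        return "word contains a character other than a letter, hyphen, or apostrophe"
    for label, text in zip(("spam", "ham"), percentages):
        if not NUMBER_RE.fullmatch(text):
            return f"{label} percentage contains a non-numeric character"
        # Assumption: the bounds 0.0 and 1.0 are themselves accepted here.
        if not 0.0 <= float(text) <= 1.0:
            return f"{label} percentage is not between 0.0 and 1.0"
    return None


for sample in ("mortgage 0.23 0.02", "mort9age 0.23 0.02", "mortgage 1.5 0.02"):
    print(repr(sample), "->", check_triple(sample))
```

Returning a description instead of raising an exception lets the caller collect one message per bad line and show them all to the user at once.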

Phase One Deliverables (Due on every team member’s Moodle drop-box by 8:00 AM on Thursday, January 19, 2017):

Unit Modules (using HTML and CSS) implementing the six components listed above. The components are assigned as follows:

Module assignments, by each member's alphabetical position (by last name) within the team; the letters A through F refer to the six modules above, in the order listed:

Team size         | First   | Second  | Third   | Fourth
Three-person team | A and E | B and D | C and F |
Four-person team  | B and F | C       | E       | A and D

Each module must be completely implemented, tested, documented, and functional. Credit will be awarded to the individual assigned each module.

UML Use Case Diagrams (using Microsoft Visio) illustrating every actor and use case associated with the final application. Keep in mind that the final application could be used to test whether an incoming message is spam, to test whether a particular spam dictionary is adequate, to determine what spam probabilities are considered appropriate by users, to determine how a message might be adjusted so that it would not be deemed spam, etc. These diagrams should be developed by the entire team, working together, and credit for this portion of the assignment will be awarded with that collaboration in mind.
Each team member will submit a copy of the team’s use case diagrams and his or her individual unit modules only, in a single zip-compressed folder to the Moodle drop-box.

Phase Two Deliverables (Due on every team member’s drop-box by 8:00 AM on Thursday, February 2, 2017):

Integrated Spam Filter Program (in HTML and CSS) with all of the team’s unit modules fully integrated into a single program that implements a full spam filter on a single web page. The user should be able to enter the three pieces of input information (the spam dictionary text file name, the test message text file name, and the spam message percentage cutoff). The scrollable spam dictionary, the scrollable test message (with highlighted extreme words), and the spam message probability (including whether the message is deemed spam based on the cutoff) should then be displayed as output. All appropriate input errors (i.e., formatting problems) should be displayed.
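The highlighted test message depends on whole-word matching, so that a word such as “vita” is not marked inside “inevitable”. A minimal sketch of that matching rule, in Python for illustration: it treats letters, hyphens, and apostrophes as word characters and wraps matches in `<b>` tags, both of which are assumptions, as is the function name `highlight_words`.

```python
import re


def highlight_words(text, words, case_sensitive=False):
    """Wrap whole-word occurrences of `words` in <b>...</b>.

    A match is rejected when the character just before or just after it is a
    letter, hyphen, or apostrophe, so substrings of longer words are skipped.
    """
    if not words:
        return text
    # Longest alternatives first, so a longer word wins over its own prefix.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    pattern = re.compile(
        rf"(?<![A-Za-z'-])(?:{alternatives})(?![A-Za-z'-])",
        0 if case_sensitive else re.IGNORECASE,
    )
    return pattern.sub(lambda match: f"<b>{match.group(0)}</b>", text)


print(highlight_words("Vita is inevitable", ["vita"]))
# prints <b>Vita</b> is inevitable
```

Two limitations are worth noting. A word enclosed in single quotes is not highlighted, because the apostrophe counts as a word character here. The sketch also does not escape HTML special characters in the message, which a real page would need to do before display.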

Team Time Log document (using Microsoft Word) detailing the specific Phase Two tasks performed by each team member and precise measurements (in minutes) of each task’s duration. Every team member must log the exact time when a task was begun, when a task was halted temporarily, when a task was resumed, and when a task was ultimately completed. Any task performed by multiple team members should have log entries by each participating member. Provide full details on each task (e.g., “Integrated code for generating 15-word extreme list with code for searching input file for words with extreme spam probabilities” is superior to “coded”).
Postmortem document (using Microsoft Word) detailing what aspects of this team project went right and what aspects went wrong. For the latter, the team should specify how they could have modified their development process to avoid the problems. This document should be at least 500 words in length.