Lab 4
(75 pts) Due Sun, Mar 25
One topic of research today is accurately determining the most relevant topic(s) of a document. While this is useful for many reasons (e.g., search engines, finding relevant court documents, relevant medical documents, determining redundancy among documents when, for instance, determining whether a document contains fake news, etc.), determining the topic of a document is not always straightforward.
In this lab you’ll be doing a first step in determining a document’s topic(s). For this lab, you’ll be reading in a web page with html tags. Below is an example of a very simple web page (note that this is a shortened version of the actual web page I used for the output):
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title> Dogs and Puppies</title>
</head>
<body>
<h1> Dog and Puppy Care </h1>
<h2>Links Section </h2>
<p> SPCA, ESPN </p>
<p>Check regularly for more relevant links </p>
<h3> Dogs</h3>
<p>A dog can be a wonderful addition to any home. It's important
to keep your canine companion's health and happiness a top priority.
Below are some useful tips for all dog parents.
</p>
<h3>Feeding</h3>
<p>Premium-quality dry food provides a well-balanced diet for adult
dogs and may be mixed with water, broth or canned food. Your dog may
enjoy cottage cheese, cooked egg or fruits and vegetables, but these
additions should not total more than ten percent of his daily food intake.
</p>
<h3>Puppy Care</h3>
<p>Puppies are adorable, and quite tempting when one is looking to adopt.
However, keep in mind that for the first year the amount of care a
puppy requires is quite high, almost to the level of a toddler. If you
think you will be unable to provide that level of care, please consider
adopting an older dog.
</p>
</body>
</html>
In a web page, the tags themselves indicate to some degree the most relevant information on the web page. Text surrounded by the <title> … </title> tags should strongly indicate a web page’s topics, as should text surrounded by <h1> … </h1> tags and, to a slightly lesser extent, text surrounded by <h2> … </h2> tags. Tags <h3>, <h4>, <h5>, and <h6> all indicate that the text is a subheading and, while relevant to the topic, is slightly less likely to be the dominant topic of a document. Text surrounded by <p> and other tags is even less likely to be the dominant topic. The tags themselves most likely have no relevance to the topic of the document and can be stripped out.
Clearly there is much, much more that should be done to accurately determine the topic of a document, but we’ll start with prioritizing the text within the document so that the most relevant text is listed first, followed by slightly less relevant text, and then by the least relevant text.
For this lab you’ll be writing methods for the SNode class, with nodes consisting of a word and the word’s priority, the methods for the SLL class to create, insert, and print nodes in the linked list that is ordered by a word’s priority, and some of the methods for the WebTopic class, which reads in a web page, strips out the html tags, determines a priority for each word that isn’t in a tag on the web page, and inserts that word and its priority into the web page’s linked list. It also prints out the linked list, broken down into categories.
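As a rough illustration of the "strip out the html tags" step described above, here is a minimal sketch that works on an in-memory string rather than a file. The names `Chunk` and `splitPage` are hypothetical and are not part of the lab's classes; the sketch only shows the idea of separating what is inside `<...>` from the text between tags.

```cpp
#include <string>
#include <vector>

// Hypothetical type: one piece of a page, either the inside of a tag
// (without the angle brackets) or the text found between two tags.
struct Chunk {
    bool isTag;
    std::string text;
};

// Walk the page one character at a time. A '<' ends any pending text,
// a '>' ends a tag, and every other character is collected.
// Text after the final '>' is ignored in this sketch.
std::vector<Chunk> splitPage(const std::string &page) {
    std::vector<Chunk> chunks;
    std::string current;
    for (char c : page) {
        if (c == '<') {
            if (!current.empty()) {
                chunks.push_back({false, current});
            }
            current.clear();
        } else if (c == '>') {
            chunks.push_back({true, current});
            current.clear();
        } else {
            current += c;
        }
    }
    return chunks;
}
```

For example, `splitPage("<h1>Dog Care</h1>")` yields the tag `h1`, the text `Dog Care`, and the tag `/h1`. Note that a closing tag such as `/h1` starts with a slash, so a rule that matches on the first characters of the tag name would treat it as "anything else".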
Once you are done inserting the words into the linked list based on their priority, you will then remove all the words that have little to no real meaning in terms of identifying the topic of the document. For instance, the words “the”, “a”, “and”, etc. are just filler words (we refer to them as stop words) and don’t tell us anything about the topic of the document. So I’ve included a list of stopwords in the WebTopic class header file. You will remove all of those stopwords from the linked list.
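The idea of stop-word removal can be sketched independently of the linked list. The example below uses a `std::vector` and a `std::unordered_set` instead of the lab's SLL class, and the function name `dropStopWords` is hypothetical. It assumes an exact, case-sensitive match, so "Your" would not be removed by the stop word "your".

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: removes every word that appears in the stop set
// and returns how many words were removed.
std::size_t dropStopWords(std::vector<std::string> &words,
                          const std::unordered_set<std::string> &stop) {
    const std::size_t before = words.size();
    // remove_if shifts the kept words to the front; erase trims the rest.
    words.erase(std::remove_if(words.begin(), words.end(),
                               [&stop](const std::string &w) {
                                   return stop.count(w) > 0;
                               }),
                words.end());
    return before - words.size();
}
```

In the lab itself the same effect is reached by calling removeAll once per stop word on the linked list; the sketch is only meant to show what the result should look like.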
You will be given the following files and must write the code described for each:
SNode.hpp is defined as follows:
#ifndef SNODE_HPP_
#define SNODE_HPP_
#include <stdlib.h>
#include <iostream>
using namespace std;
class SNode {
    friend class SLL;
    friend class WebTopic;
    string word; // instead of int data, now the data is a string
    int priority; // the priority of a node (1, 2, or 3)
    SNode *next;
public:
    SNode(string w, int p);
    ~SNode();
    void printNode();
};
#endif /* SNODE_HPP_ */
(5 pts) Write the accompanying SNode.cpp
------
SLL.hpp is defined as follows:
#ifndef SLL_HPP_
#define SLL_HPP_
#include "SNode.hpp"
#include <stdlib.h>
#include <iostream>
using namespace std;
class SLL {
    friend class WebTopic;
    SNode *first;
    SNode *last;
    SNode *p2; // points to the node in the list that is the last node with a
    // priority of 2. If you add another node with a priority of 2 to the list,
    // it will be added after this node.
    // Note that you still only have one list
    int size;
public:
    SLL();
    ~SLL();
    void printSLL();
    // (4 pts) used for testing purposes – not actually needed for this
    // lab, but useful
    void priorityInsert(string s, int p);
    // (8 pts)
    // This method creates a new node with s as the word and p as the priority and,
    // if the priority is 1, adds the new node to the beginning of the list, if it
    // is 3, adds the node to the end of the list, and if it’s 2, it will insert it
    // into the list right after pointer p2 (which is the last node with a priority
    // of 2). In essence, all the nodes with a priority of 1 are at the beginning
    // of the list, all the nodes with a priority of 2 are in the middle of the
    // list, and all the nodes with a priority of 3 are at the end of the list.
    void push(string s, int p);
    // (4 pts)
    // pushes a new node (with priority p and word s) onto the end of the list
    // (remember to reset the last pointer) – I called this from the
    // priorityInsert method.
    void addAtFront(string s, int p);
    // (5 pts)
    // adds a new node (made from word s and priority p) to the beginning of the
    // list (remember to reset the first pointer) – I called this from
    // priorityInsert
    void addFirst(string s, int p);
    // (4 pts)
    // adds the very first node (made from word s and priority p) to an empty list
    // I called this from priorityInsert
    void addAtP2(string s, int p);
    // (6 pts)
    // inserts a new node into the middle of the list right after the priority-2
    // pointer p2 – I called this from priorityInsert
    int removeAll(string w);
    // (10 pts)
    // removes all occurrences of word w from the linked list
    // this is used to remove every word in the array of stop words from the
    // linked list - I returned the number of times the word w was removed from
    // the list.
    string pop();
    // (4 pts)
    // removes the last node from the linked list, and returns the word from the
    // node removed. – I called this from removeAll
};
#endif /* SLL_HPP_ */
(45 pts, as defined above) Write the accompanying SLL.cpp file with the methods as defined above
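To make the three-region layout concrete, here is a minimal sketch of a priority insert on a singly linked list. It is not the SLL class: `Node`, `MiniList`, `insert`, and `describe` are hypothetical names, and everything is folded into one method instead of the separate push/addAtFront/addFirst/addAtP2 helpers. The header does not say what to do when a priority-2 word arrives before any other priority-2 word, so the sketch assumes it should go right after the last priority-1 node, or at the front if there is none.

```cpp
#include <string>

// Hypothetical stand-in for SNode.
struct Node {
    std::string word;
    int priority;
    Node *next;
    Node(const std::string &w, int p) : word(w), priority(p), next(nullptr) {}
};

// Hypothetical stand-in for SLL; assumes priorities are 1, 2, or 3.
struct MiniList {
    Node *first = nullptr;
    Node *last = nullptr;
    Node *p2 = nullptr;  // last priority-2 node, or nullptr if there is none

    MiniList() = default;
    MiniList(const MiniList &) = delete;             // no copying: the list
    MiniList &operator=(const MiniList &) = delete;  // owns its nodes
    ~MiniList() {
        while (first != nullptr) {
            Node *tmp = first;
            first = first->next;
            delete tmp;
        }
    }

    void insert(const std::string &w, int p) {
        Node *n = new Node(w, p);
        if (first == nullptr) {          // empty list: n is the only node
            first = last = n;
        } else if (p == 1) {             // priority 1: new front
            n->next = first;
            first = n;
        } else if (p == 3) {             // priority 3: new end
            last->next = n;
            last = n;
        } else {                         // priority 2: after p2
            Node *anchor = p2;
            if (anchor == nullptr && first->priority == 1) {
                // Assumption: with no priority-2 node yet, go after the
                // last priority-1 node.
                anchor = first;
                while (anchor->next != nullptr && anchor->next->priority == 1) {
                    anchor = anchor->next;
                }
            }
            if (anchor == nullptr) {     // only priority-3 nodes so far
                n->next = first;
                first = n;
            } else {
                n->next = anchor->next;
                anchor->next = n;
                if (anchor == last) last = n;
            }
        }
        if (p == 2) p2 = n;
    }

    // Returns the list as "word:priority, word:priority, ...".
    std::string describe() const {
        std::string out;
        for (const Node *n = first; n != nullptr; n = n->next) {
            if (!out.empty()) out += ", ";
            out += n->word + ":" + std::to_string(n->priority);
        }
        return out;
    }
};
```

Because priority-1 words are always added at the front, they end up in the reverse of the order in which they were read, while priority-2 and priority-3 words keep their reading order. This matches the shape of the sample output, where the title words appear last within Priority 1.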
------
And finally this is the WebTopic.hpp file (for reading in a web page, stripping out the html tags while determining the priority of words within the tags, then creating a list of those words ordered by their priority):
#ifndef WEBTOPIC_HPP_
#define WEBTOPIC_HPP_
#include "SNode.hpp"
#include "SLL.hpp"
#include <stdlib.h>
using namespace std;
class WebTopic {
    int currpriority;
    // This priority changes as the web page is traversed based on the last tag
    // that has been read in
    SLL *wordlist;
    // the list of words ordered by priority
    string file;
    // The name of the web page file being read in
    // (below) the ARRAY of stopwords, to be removed from the linked list
    string stopwords[50] = {"a",
        "along",
        "although",
        "am",
        "among",
        "and",
        "are",
        "as",
        "at",
        "be",
        "because",
        "between",
        "can",
        "do",
        "dont",
        "either",
        "for",
        "got",
        "has",
        "have",
        "havent",
        "he",
        "i",
        "in",
        "is",
        "isnt",
        "it",
        "more",
        "much",
        "neither",
        "no",
        "none",
        "nor",
        "not",
        "of",
        "one",
        "or",
        "that",
        "the",
        "they",
        "this",
        "though",
        "was",
        "when",
        "while",
        "why",
        "with",
        "without",
        "you",
        "your"
    };
    int swlen = 50; // the length of the ARRAY stopwords, above
public:
    WebTopic(string filename);
    // constructor
    void ReadFile();
    // reads in the web page, character by character, into a line, setting the
    // current priority based on the latest tag read in. This method calls
    // getPriority when a tag has been read in, and it calls parseString when the
    // text between tags has been read in.
    void parseString(string line);
    // takes a line of characters and breaks the line up into words by creating a
    // new string of only alphabetical characters. Note: I did this by first
    // breaking the string into individual words separated by spaces, and then used
    // the function stripSpace to remove anything that wasn’t an alphabetic
    // character using the built-in isalpha function.
    string stripSpace(string s);
    // Strips out any character that isn’t alphabetic (using isalpha) and returns
    // the stripped string
    void getPriority(string line);
    // The line is the tag, without the first <. This method uses the line to
    // determine the current priority as follows:
    // If the first 5 characters in the line match the word “title” (I used
    // line.compare for this), or the first 2 characters match either h1 or h2,
    // then the current priority is set to 1. If the first 2 characters are
    // anything between h3 and h6, then the priority is set to 2, and if it’s
    // anything else, it’s set to 3.
    void printPage();
    // (6 pts) – YOU WRITE
    // prints out the list of words on the web page, ordered by their priority (and
    // listing each word’s priority)
    void removeStopWords();
    // (4 pts) – YOU WRITE
    // after the linked list has been created, this method removes all the
    // stopwords in the array of stopwords from the linked list (using the
    // removeAll method)
};
#endif /* WEBTOPIC_HPP_ */
You must fill in the missing code in the WebTopic.cpp source file below:
#include "SLL.hpp"
#include "SNode.hpp"
#include "WebTopic.hpp"
#include <iostream>
#include <stdlib.h>
#include <cstring>
#include <stdio.h>
#include <fstream>
#include <cctype>
using namespace std;
WebTopic::WebTopic(string filename) {
    file = filename;
    wordlist = new SLL();
    currpriority = 3;
}
void WebTopic::getPriority(string line) {
    //cout << endl;
    //cout << line << endl;
    if (line.compare(0,5,"title") == 0 || line.compare(0,2,"h1") == 0 || line.compare(0,2,"h2") == 0) {
        currpriority = 1;
    }
    else if (line.compare(0,2,"h3") == 0 || line.compare(0,2,"h4") == 0 || line.compare(0,2,"h5") == 0 || line.compare(0,2,"h6") == 0) {
        currpriority = 2;
    }
    else {
        currpriority = 3;
    }
    //cout << "Curr Priority: " << currpriority << endl;
}
void WebTopic::printPage() {
    // YOU WRITE
}
// Named ReadFile to match the declaration in WebTopic.hpp and the call in main
void WebTopic::ReadFile() {
    ifstream infile(file.c_str(), ios::in); // open file
    string line = "";
    char c;
    while (infile.get(c)) {
        if (c == '<') {
            if (!line.empty()) {
                //cout << "Line outside of tag: " << endl;
                //cout << line << endl << endl;
                parseString(line);
                line = "";
            }
        }
        else if (c == '>') {
            //cout << "Line inside of tag: " << endl;
            //cout << line << endl << endl;
            getPriority(line);
            line = "";
        }
        else {
            line += c;
        }
    }
    cout << "*****************************************" << endl;
    cout << "BEFORE REMOVING STOP WORDS" << endl;
    printPage();
    cout << endl;
    removeStopWords();
    cout << "*****************************************" << endl;
    cout << "AFTER REMOVING STOP WORDS" << endl;
    printPage();
    cout << endl;
    infile.close();
}
string WebTopic::stripSpace(string s) {
    unsigned int i = 0;
    while (i < s.length()) {
        if (!isalpha(s[i])) {
            s.erase(i); // erases everything from position i to the end of s
        }
        i++;
    }
    return s;
}
void WebTopic::parseString(string line) {
    char *l = const_cast<char *>(line.c_str());
    char *token = strtok(l, " ");
    while (token != NULL) {
        string s = stripSpace(token);
        if (s != "") {
            wordlist->priorityInsert(s, currpriority);
        }
        token = strtok(NULL, " ");
    }
}
void WebTopic::removeStopWords() {
    // YOU MUST WRITE
}
(10 pts, as defined above) Fill in the methods in the WebTopic.cpp file with the methods as defined above
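For comparison with the strtok-based parseString above, here is a sketch of the same word-splitting idea using `std::istringstream`, which avoids the `const_cast` on `c_str()`. The helper names `leadingLetters` and `splitWords` are hypothetical, and returning a vector is an assumption made only so the result can be inspected; the lab's version inserts each word into the linked list instead.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Keep the leading run of alphabetic characters and drop the rest,
// which is what stripSpace does when it calls s.erase(i).
std::string leadingLetters(const std::string &token) {
    std::string out;
    for (unsigned char ch : token) {
        if (!std::isalpha(ch)) break;
        out += static_cast<char>(ch);
    }
    return out;
}

// Hypothetical helper: split a line on whitespace and clean each token,
// skipping tokens that have no leading letters at all.
std::vector<std::string> splitWords(const std::string &line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        std::string w = leadingLetters(token);
        if (!w.empty()) {
            words.push_back(w);
        }
    }
    return words;
}
```

One difference to be aware of: `strtok(l, " ")` splits on the space character only, so a token can contain a line break and stripSpace then cuts it off at the newline. The stream-based sketch splits on any whitespace, so it may produce more words than the sample output lists.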
------
And my main is:
#include "SLL.hpp"
#include "SNode.hpp"
#include "WebTopic.hpp"
#include <iostream>
#include <stdlib.h>
#include <cstring>
using namespace std;
int main() {
    WebTopic *x = new WebTopic("webpage.html"); // or another web page – I haven’t
    // tested it extensively, but it should work for any basic html page
    x->ReadFile();
    x->printPage();
    return 0;
}
//15 pts for getting everything to work together
Altogether, you should:
- Write the SNode.cpp file (5 pts)
- Write the SLL.cpp file methods (45 pts)
- Write the 2 methods in the WebTopic.cpp file (10 pts)
- Get everything to work together (15 pts)
------
To turn in (zipped):
- SNode.hpp //no changes from original
- SNode.cpp
- SLL.hpp //no changes from original
- SLL.cpp
- WebTopic.hpp
- WebTopic.cpp
- WebPageMain.cpp //no changes from original
- webpage.html //no changes from original
______
And my output for webpage.html:
*****************************************
BEFORE REMOVING STOP WORDS
Priority 1:
Puppies:1, Section:1, Donation:1, Welfare:1, Animal:1, Section:1, Links:1, Organizations:1, Welfare:1, Animal:1, Care:1, Puppy:1
and:1, Dog:1, Puppies:1, and:1, Dogs:1,
Priority 2:
Dogs:2, Feeding:2, your:2, Dog:2, Exercise:2, Puppy:2, Care:2,
Priority 3:
SPCA:3, ESPN:3, Check:3, regularly:3, for:3, more:3, relevant:3, links:3, Make:3, sure:3, you:3, donate:3
to:3, your:3, local:3, animal:3, shelter:3, Animal:3, Welfare:3, Institute:3, American:3, Humane:3, Association:3, Humane:3
Farming:3, Association:3, dog:3, can:3, be:3, a:3, wonderful:3, addition:3, to:3, any:3, home:3, but:3
whether:3, you:3, experienced:3, pet:3, parent:3, or:3, a:3, first:3, adopter:3, it:3, important:3, keep:3
your:3, canine:3, companion:3, health:3, and:3, happiness:3, a:3, top:3, priority:3, are:3, some:3, useful:3
tips:3, for:3, all:3, dog:3, parents:3, Premium:3, dry:3, food:3, provides:3, a:3, well:3, diet:3
for:3, adult:3, and:3, may:3, be:3, mixed:3, with:3, water:3, broth:3, or:3, canned:3, food:3
Your:3, dog:3, may:3, cottage:3, cheese:3, cooked:3, egg:3, or:3, fruits:3, and:3, vegetables:3, but:3
these:3, should:3, not:3, total:3, more:3, than:3, ten:3, percent:3, of:3, his:3, daily:3, food:3
Exercise:3, your:3, dog:3, regularly:3, to:3, keep:3, your:3, dog:3, stimulated:3, and:3, healthy:3, to:3
keep:3, their:3, weight:3, in:3, check:3, Different:3, breeds:3, need:3, different:3, of:3, exercise:3, Before:3
adopting:3, a:3, dog:3, research:3, the:3, energy:3, level:3, of:3, breed:3, to:3, ensure:3, you:3
will:3, be:3, able:3, to:3, provide:3, the:3, proper:3, level:3, of:3, Our:3, newest:3, puppy:3
are:3, adorable:3, and:3, quite:3, tempting:3, when:3, one:3, is:3, looking:3, to:3, adopt:3, keep:3
in:3, mind:3, that:3, for:3, the:3, first:3, year:3, the:3, amount:3, of:3, care:3, a:3
requires:3, is:3, quite:3, high:3, almost:3, to:3, the:3, level:3, of:3, a:3, toddler:3, If:3
you:3, you:3, will:3, be:3, unable:3, to:3, provide:3, that:3, level:3, of:3, care:3, please:3
consider:3, an:3, older:3, dog:3, Copyright:3, DY:3, All:3, rights:3, reserved:3,
removing a
deleting a may cause a memory leak
deleting a may cause a memory leak
deleting a may cause a memory leak
deleting a may cause a memory leak
deleting a may cause a memory leak
deleting a may cause a memory leak
deleting a may cause a memory leak
removing along
removing although
removing am
removing among
removing and
deleting and may cause a memory leak
deleting and may cause a memory leak
deleting and may cause a memory leak
deleting and may cause a memory leak
deleting and may cause a memory leak
deleting and may cause a memory leak
deleting and may cause a memory leak
removing are
deleting are may cause a memory leak
deleting are may cause a memory leak
removing as
removing at
removing be
deleting be may cause a memory leak
deleting be may cause a memory leak
deleting be may cause a memory leak
deleting be may cause a memory leak
removing because
removing between
removing can
deleting can may cause a memory leak
removing do
removing dont
removing either
removing for
deleting for may cause a memory leak
deleting for may cause a memory leak
deleting for may cause a memory leak
deleting for may cause a memory leak
removing got
removing has
removing have
removing havent
removing he
removing i
removing in
deleting in may cause a memory leak
deleting in may cause a memory leak
removing is
deleting is may cause a memory leak
deleting is may cause a memory leak
removing isnt
removing it
deleting it may cause a memory leak
removing more
deleting more may cause a memory leak
deleting more may cause a memory leak
removing much
removing neither
removing no
removing none
removing nor
removing not
deleting not may cause a memory leak
removing of
deleting of may cause a memory leak
deleting of may cause a memory leak
deleting of may cause a memory leak
deleting of may cause a memory leak
deleting of may cause a memory leak
deleting of may cause a memory leak
deleting of may cause a memory leak
removing one
deleting one may cause a memory leak
removing or
deleting or may cause a memory leak
deleting or may cause a memory leak
deleting or may cause a memory leak
removing that
deleting that may cause a memory leak
deleting that may cause a memory leak
removing the
deleting the may cause a memory leak
deleting the may cause a memory leak
deleting the may cause a memory leak
deleting the may cause a memory leak
deleting the may cause a memory leak
removing they
removing this
removing though
removing was
removing when
deleting when may cause a memory leak
removing while
removing why
removing with
deleting with may cause a memory leak
removing without
removing you
deleting you may cause a memory leak
deleting you may cause a memory leak
deleting you may cause a memory leak
deleting you may cause a memory leak
deleting you may cause a memory leak
removing your
deleting your may cause a memory leak
deleting your may cause a memory leak
deleting your may cause a memory leak
deleting your may cause a memory leak
deleting your may cause a memory leak
*****************************************
AFTER REMOVING STOP WORDS
Priority 1:
Puppies:1, Section:1, Donation:1, Welfare:1, Animal:1, Section:1, Links:1, Organizations:1, Welfare:1, Animal:1, Care:1, Puppy:1
Dog:1, Puppies:1, Dogs:1,
Priority 2:
Dogs:2, Feeding:2, Dog:2, Exercise:2, Puppy:2, Care:2,
Priority 3:
SPCA:3, ESPN:3, Check:3, regularly:3, relevant:3, links:3, Make:3, sure:3, donate:3, to:3, local:3, animal:3
shelter:3, Animal:3, Welfare:3, Institute:3, American:3, Humane:3, Association:3, Humane:3, Farming:3, Association:3, dog:3, wonderful:3
addition:3, to:3, any:3, home:3, but:3, whether:3, experienced:3, pet:3, parent:3, first:3, adopter:3, important:3
keep:3, canine:3, companion:3, health:3, happiness:3, top:3, priority:3, some:3, useful:3, tips:3, all:3, dog:3