Assignment – Extra Credit (Text Mining)
This assignment is an individual effort.
Problem Definition
As analysts, we have a huge variety of data mining software packages at our disposal. These packages typically provide a whole slew of parameters that we can tweak in the models that we are implementing. It is therefore easy to get into the habit of taking a black-box approach, where we just try a bunch of parameters and parameter values, and see what sticks. The problem with this approach is that we then don’t know why some set of parameters worked, and worse, why they stopped working with a different data set.
As analysts, therefore, we should always start with a fundamental understanding of the theory behind a model. Then, based on this understanding of how the model works, we should decide which parameters we believe will help. Finally, we should test those parameters and, if they don’t work, revisit the data and partial results to see why that might be the case.
In this Assignment, you have been given an IMDB data set. The data set has been pre-labeled as positive or negative sentiment. You have also been given Python code that implements a Naïve Bayes Classifier for Sentiment Analysis based on this data. The code as it stands right now will give you an accuracy of ~10%. Your task is to start with the theory we covered on Text Preprocessing, TF-IDF, and Naïve Bayes Classification in Week 3, decide on what analysis steps would make the most sense for this data set, and then tweak the model parameters (on lines 62, 63, and 64 of the Python script) in order to achieve an accuracy of >= 80%.
Requirement for this Assignment
To get started, download and unzip TextMining.zip. Once unzipped, the TextMining directory will contain the following files:
- MoviePosNeg Directory: Contains data pertaining to Movie Reviews.
- SKLearnNB.py: Python Code that implements the Naïve Bayes classifier, and that you will update for this assignment.
The code in SKLearnNB.py as it stands runs the following steps:
- Reads in the Movie Data
- Creates a “Pipeline” of data transformation and classification
- Uses Naïve Bayes Classifier to classify the Movie data into Positive and Negative Sentiment
- Uses 6-fold cross-validation, and then outputs the accuracy and confusion matrix
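For orientation, the steps above can be sketched roughly as follows. This is a minimal sketch and not the contents of SKLearnNB.py: the toy reviews, the step names ("vect", "tfidf", "clf"), and the use of cross_val_predict are assumptions made for illustration, and all three estimators are left at their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews standing in for the MoviePosNeg data.
docs = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "I loved this movie, truly great",
    "an excellent and moving film",
    "wonderful direction, excellent cast",
    "loved the story, great movie",
    "a terrible film with a boring plot",
    "boring story and awful acting",
    "I hated this movie, truly awful",
    "a dull and terrible film",
    "awful direction, boring cast",
    "hated the plot, terrible movie",
]
labels = ["pos"] * 6 + ["neg"] * 6

# "Pipeline" of data transformation and classification:
# raw text -> token counts -> TF-IDF weights -> Naive Bayes classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# 6-fold cross-validation: every review is predicted by a model
# that was trained on the other five folds.
cv = StratifiedKFold(n_splits=6)
predicted = cross_val_predict(pipeline, docs, labels, cv=cv)

accuracy = accuracy_score(labels, predicted)
matrix = confusion_matrix(labels, predicted, labels=["neg", "pos"])

print("Accuracy:", accuracy)
print("Confusion matrix (rows = actual, columns = predicted):")
print(matrix)
```

The numbers printed for this toy data say nothing about the accuracy you should expect on the Movie data set; the sketch only shows how the pieces fit together.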
Take the following steps to complete this assignment:
- Review the theory on Preprocessing Steps, TF-IDF, and Naïve Bayes Classification in the Text Mining Lectures in Week 3
- Run the Python code as-is and note the accuracy of the classification
- Based on the Text Mining Theory we covered in Week 3, come up with a strategy of analysis steps you’d want to perform on your dataset (for instance, which preprocessing steps would you want to use, would you want to use TF-IDF scores, would you want your NB classifier to use Laplace smoothing, etc). Know WHY you believe these steps will give you the best results.
- Browse the documentation on the functions in lines 62, 63, and 64 of SKLearnNB.py. Documentation for each of the functions is available in the links below. Note that the documentation calls out a lot of parameters. You’ll want to stay focused on the parameters that we went over in the course, and that you decided upon in the previous step. Also note that there may be no pre-canned ways of tweaking some of the parameters that you may have decided upon in the previous step. Take note of that.
- CountVectorizer
- TfidfTransformer
- MultinomialNB
- Finally, update the parameters you decided upon per the documentation to improve the prediction accuracy. Note: your improved Accuracy must be >= 80% for the Movie data set.
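As a hedged illustration of what "updating the parameters" looks like in scikit-learn, the sketch below passes a few documented keyword arguments to the three constructors and prints the full parameter list of each. The values shown are placeholders for illustration only, not recommended settings and not the contents of lines 62, 63, and 64; which parameters to change, and why, is for you to decide from the theory.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Preprocessing choices are keyword arguments of CountVectorizer,
# e.g. lowercasing, stop-word removal, and the n-gram range.
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 1),
)

# use_idf switches inverse-document-frequency weighting on or off.
tfidf = TfidfTransformer(use_idf=True)

# alpha is the additive smoothing parameter; alpha=1.0 is Laplace smoothing.
classifier = MultinomialNB(alpha=1.0)

# get_params() lists every parameter an estimator accepts,
# together with its current value.
for estimator in (vectorizer, tfidf, classifier):
    print(type(estimator).__name__)
    for name, value in sorted(estimator.get_params().items()):
        print("   ", name, "=", value)
```

Comparing the printed parameter names against the strategy you wrote down in the earlier step is a quick way to see which of your planned steps map directly onto a parameter and which do not.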
Submission for this Assignment
Submit the following for this Assignment:
- Provide a screenshot of the BEST Accuracy and Confusion Matrix you obtained with the Updated Code. Note: you must improve Accuracy to >= 80% for the Movie data set. (2 points)
- Attach your Final Updated Code that gave you the BEST reported accuracy above (3 points). Note: upload the actual script. DO NOT attach a screenshot of the script! Use the following naming convention for your script:
<FirstName><LastName>ExtraCredit.py
[Example: HinaAroraExtraCredit.py]
- Explain (in no more than half a page) the different combinations of parameter values you tried, and WHY you believe the final set of parameter values you picked gave you the best Classifier Accuracy (5 points)
Note:
- If you are running on a personal installation of Spyder (instead of the installation on AWS that was provided to you), please be sure to have completed the steps called out in “Download Python Packages.Pdf” under Week 1 before attempting this assignment. Note: you don't need to worry about the last command on page 4 (from community import best_partition), or page 5 in that document.