DATA MINING USING WEKA:
AN ANALYSIS OF WEB BROWSING BEHAVIOR EVALUATION USING FIREFOX CACHE
TEAM-3:
Mike Egbert
Gonzalo Perez
Diah Schur
Novelle Maxwell-Sinclair
For:
Emerging IT
Dr. Charles Tappert
Dr. Sun-Hyuk Cha
Pace University
2012
TABLE OF CONTENTS
Introduction
Common Browsers
Internet Explorer
Google Chrome
Mozilla Firefox
Safari
Opera
Figure 1: Browser Market Share
Previous Work in Cache Analysis
Figure 2: IE Browser Cache Clearing Dialog Box
Browser Cache and Forensics
Figure 3: Browser Cache and Forensics
Browser-based Forensics: History as a Cache Artifact
Figure 4: Browser History
Persistence Browser Cache
Figure 5: Browser Cache left on a Hard Drive
Web Browsing Cache Log
Data Collection
Figure 6: About:Cache in Firefox
Methodology
DataSet
Categories
Data Loading
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
Analysis
Figure 8: Categories vs. ID
Conclusion and Recommendation
References
List of Figures
Introduction
This project evaluates whether a browser cache or history log can be used to determine a user’s browsing behavior, especially in the workplace. The first step is to acquire the actual log file(s). The second step is to identify the relevant file that can be analyzed by the tool. The third step is to evaluate the recovered data to determine whether the user has been misusing his or her time at work. The last step is to present the findings at a high level. For this project, we use Firefox as the subject browser. However, several other browsers are commonly used by individuals and companies.
Previous work in computer forensics has used cache analysis as a main methodology. The following sections discuss common web browsers and previous work in cache analysis.
Common Browsers
The web provides a diverse source of information and services, and many browsers compete to deliver the experience users desire at the click of a button. Browsers are the tools that request content from web servers, understand the markup language, interpret the content and then present it to users. These web browsers are compatible with PCs, Macs and other Internet-capable devices such as the Apple iPod (though Internet Explorer is compatible only with PCs running Windows). The most popular browser today is Internet Explorer with Google Chrome close behind, then Mozilla Firefox, Safari, Opera and a few others.
Internet Explorer
Internet Explorer was introduced in the mid-1990s and is bundled with every Windows-based PC.
Google Chrome
Google Chrome has gained serious market share within the past few years. Chrome has ranked very high in independent tests with regard to speed and page load times. Chrome also features an “incognito mode” in which users can visit web sites without any cookies remaining on their PCs.
Mozilla Firefox
Firefox is an open source web browser with a very simple interface with enhanced security features to help protect users. Since Firefox is open source, it affords users a healthy library of add-ons that can be installed to the browser to augment and customize the user experience. Different types of add-ons include extensions, themes, search providers, dictionaries, language packs and plugins.
Safari
Safari, which runs on Mac OS and the iPhone, offers many browser extensions, including an eBay manager and Twitter integration.
Opera
The Opera browser has some unique features such as text and graphic enlarging on a web page.
Figure 1: Browser Market Share
Recent trends show browsers taking on the role of operating systems, integrating many functions that were historically performed on the local machine. The Google Chromebook doesn’t have a Windows operating system installed; Chrome assumes that function. Users can create, save and edit spreadsheets or word processing documents directly through the Google browser.
Previous Work in Cache Analysis
Browser cache is a set of temporary file folders where content from web sites you’ve visited (e.g., graphics, static pages, cookies, entire web pages) is stored. The purpose of the browser cache is to improve performance: when you revisit a site, much of the content is already cached, so the browser serves pages from the local cache instead of retrieving them from the site again. [1]
What kinds of files are stored in the browser cache? The following is a list of typical files that are stored in browser cache:
- Files from the web sites you've opened in the browser
- Entire web pages
- Images
- CSS
- Audio
- Video
Figure 2: IE Browser Cache Clearing Dialog Box
For example, if you save web pages for offline browsing, all the files are stored in the browser cache. Depending on your browser and operating system, both the hard disk and RAM are used to store the cache files. [1]
Browser Cache and Forensics
Prior work in forensics has used the browser cache when investigating cybercrime. While the browser cache is an important part of web architecture that provides many benefits, it is also a forensic artifact that may be used in cybercrime investigations. Users might regard this as a negative, but law enforcement officials regard it as a positive and powerful artifact.
Figure 3: Browser Cache and Forensics
In addition to informational artifacts, browser cache can also be tied to geo-location. This is possible because information from map web pages and Internet addresses may be in the cache, and forensic investigators will look for it. Chad Tilbury states in an article in Forensic Methods: “Geo-location artifacts demonstrate an interesting concept with regard to browser-based evidence. Among the various browser artifacts, Internet history is a fan favorite because it provides such rich information. There is no easier place to look to identify sites visited by a specific user at a specific time. Browser history is so useful, a critical shortcoming is often ignored; with today’s dynamic web pages, the vast number of web page requests goes unrecorded.” [2]
Browser-based Forensics: History as a Cache Artifact
Historical data is kept in the browser cache. Most browsers today store multiple days or weeks of browsing history. The figure below is an example of historical browser cache. Again, forensic investigators can utilize such historical data.
Figure 4: Browser History
Persistence Browser Cache
Peter Grant stresses the importance of browser cache: even after a user deletes the cache, the data can still be retrieved in most cases. Federal and state cybercrime investigators can use this data in a court of law. [3] The figure below is an example of browser cache that remains on the hard disk even though the user deleted it.
Figure 5: Browser Cache left on a Hard Drive
The development of browser cache has thus produced a good tool for web performance as well as an important tool for cyber forensics. Keith J. Jones and Rohyt Belani state: “Critical electronic evidence is often found in the suspect's web browsing history in the form of received emails; sites visited and attempted Internet searches.” [4]
Prior work on browser cache technology has therefore yielded both the modern browser cache and a powerful cyber forensics tool.
Web Browsing Cache Log
Cache analysis can be done by using the cache log of browsing history from a user’s computer to determine his or her browsing behavior. This is particularly useful in the workplace. Other potential evidence should be available from registry entries, temporary files, index.dat, cookies, bookmarked pages, saved HTML pages, emails sent and received by the user, etc. A study conducted by Junghoon Oh, Seungbong Lee and Sangjin Lee asserts that searching for evidence left by web browsing activity is typically a crucial component of digital forensic investigations. [5] Almost every action a suspect performs while using a web browser leaves a trace on the computer, even a simple search for information. Therefore, when an investigator analyzes the suspect’s computer, this evidence can provide useful information. After retrieving data such as cache, history, cookies, and download list from a suspect’s computer, it is possible to analyze this evidence for web sites visited, time and frequency of access, and search engine keywords used by the suspect.
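As an illustration of the last point, the following is a minimal Python sketch of pulling the visited host and any search keywords out of a single cached URL. It is not part of the project's toolchain; the function name `summarize_url`, the sample URL, and the list of query parameter names in `SEARCH_PARAMS` are assumptions made for this example, since different search engines use different parameter names.

```python
from urllib.parse import urlparse, parse_qs

# Assumed list of query parameter names that may carry search terms.
# Real search engines differ, so this list is illustrative only.
SEARCH_PARAMS = ("q", "query", "p")


def summarize_url(url):
    """Return (host, search_terms) for one cached or history URL."""
    parts = urlparse(url)
    params = parse_qs(parts.query)  # decodes '+' and %XX escapes
    terms = []
    for name in SEARCH_PARAMS:
        terms.extend(params.get(name, []))
    return parts.netloc.lower(), terms


# Made-up URL used only to exercise the function.
host, terms = summarize_url("http://www.example.com/search?q=cheap+flights&hl=en")
print(host, terms)
```

Counting how often each host appears across all entries would then give a rough frequency-of-access figure.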
Computer forensic analysis can also be done from the server side by analyzing access logs, error logs and FTP log files, as well as network traffic. For this project, we employ a simple cache log analysis method using Firefox Disk Cache.
Data Collection
At the beginning of the project, we decided to use the Firefox cache to obtain the browsing history of an actual user. We further decided to use 499 of the cache entries, after acquiring the data by entering “about:cache” in the URL address box of the Firefox browser. The screenshot is as follows:
Figure 6: About:Cache in Firefox
As shown in the figure above, there are three cache devices, but only two contain entries: the memory cache and the disk cache device. The memory cache is held in RAM by the running application, while the disk cache is stored on the user’s hard drive. The disk cache is where temporary Internet files are stored, so it has more entries.
To obtain a varied sample, we used the log of the disk cache device and took 499 of its entries as samples.
Methodology
Once we acquired the log from the cache, the data was prepared so that it could be loaded and processed by Weka seamlessly.
DataSet
The log file was transferred to an Excel spreadsheet and then saved as a CSV file. We encountered a few errors while trying to load the data into Weka for pre-processing. However, we resolved these errors by adding an ID attribute to the file so that Weka could process the data.
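The project performed this step by hand in Excel. As a rough sketch of the same preparation done programmatically, the Python snippet below prepends a sequential ID column to CSV rows. The function name `add_id_column`, the column order, and the sample rows are assumptions for illustration and are not the project's data.

```python
import csv
import io


def add_id_column(src, dst):
    """Copy CSV rows from src to dst, prepending a sequential ID attribute."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow(["ID"] + header)
    # IDs start at 1 so that each browsing instance has a unique identifier.
    for row_id, row in enumerate(reader, start=1):
        writer.writerow([row_id] + row)


# Made-up sample rows using the five Firefox cache attributes.
sample = io.StringIO(
    "Key,Datasize,Fetchcount,Lastmodified,Expires\n"
    "http://example.com/a.png,1024,3,2012-10-01,2012-11-01\n"
    "http://example.org/index.html,2048,1,2012-10-02,2012-10-09\n"
)
out = io.StringIO()
add_id_column(sample, out)
print(out.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.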
Categories
A Categories attribute was also added to classify the “key” attribute (the visited site) as general browsing, shopping, entertainment, lifestyle, social network, foreign news, or alert.
- General Browsing is harmless browsing that might be work-related research activity, so this type of behavior is acceptable
- Shopping is a category that is not acceptable in a work environment
- Entertainment is a category that is not acceptable
- Lifestyle is another category that is not acceptable
- Social Network is a category that is not acceptable
- Foreign News is not acceptable
- Alert is a heightened-level category; this is beyond unacceptable, e.g. visiting porn sites or monitored sites
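The report does not say how each entry was assigned to a category, so the labeling may well have been manual. Purely as a hypothetical sketch, a simple keyword rule on the host name could produce a first-pass label for review. The keyword lists, the function name `categorize`, and the assumption that the cache key is a plain URL are all illustrative.

```python
from urllib.parse import urlparse

# Hypothetical keyword rules; a real deployment would need a curated list
# and would also have to cover the Social Network, Foreign News and Alert
# categories.
CATEGORY_KEYWORDS = {
    "Shopping": ("shop", "store", "cart"),
    "Entertainment": ("video", "game", "music"),
    "Lifestyle": ("recipe", "fitness"),
}
DEFAULT_CATEGORY = "General Browsing"


def categorize(key):
    """Return a category for a cache key, assuming the key is a plain URL."""
    host = urlparse(key).netloc.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in host for word in keywords):
            return category
    return DEFAULT_CATEGORY


# Made-up keys used only to exercise the function.
print(categorize("http://shop.example.com/item?id=1"))
print(categorize("http://www.example.org/docs"))
```

A rule like this misclassifies many sites, so its output is better treated as a suggestion to be checked by a person than as a final label.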
With the five original attributes from the Firefox cache (Key, Datasize, Fetchcount, Lastmodified, and Expires) plus the added ID and Categories attributes, the data now has seven attributes and is ready to be pre-processed in Weka.
Data Loading
The data loading process starts by opening Weka Explorer and loading the CSV file prepared from the log file. The result is as follows:
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
Analysis
The pre-processing panel reveals that our log consists of:
- General Browsing = 371 entries
- Lifestyle = 63 entries
- Shopping = 11 entries
- Social network = 9 entries
- Entertainment = 16 entries
- Foreign News = 16 entries
- Alert = 13 entries
The ID attribute is a unique identifier for each browsing instance. The data has 499 unique instances, which were analyzed against the Categories attribute. The majority of IDs are in the General Browsing category: 371 out of 499 instances (74.35%), which is the only acceptable browsing behavior category. The remaining 128 instances (25.65%) fall in the unacceptable behavior categories.
Figure 8: Categories vs. ID
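The shares reported in this section can be rechecked from the category counts with a few lines of Python. The counts are those listed above; the variable names are chosen for this sketch only.

```python
# Category counts as reported by the Weka pre-processing panel.
counts = {
    "General Browsing": 371,
    "Lifestyle": 63,
    "Shopping": 11,
    "Social network": 9,
    "Entertainment": 16,
    "Foreign News": 16,
    "Alert": 13,
}
# General Browsing is the only category treated as acceptable.
ACCEPTABLE = {"General Browsing"}

total = sum(counts.values())
acceptable = sum(n for c, n in counts.items() if c in ACCEPTABLE)
unacceptable = total - acceptable

print(f"total entries: {total}")
print(f"acceptable:   {acceptable} ({acceptable / total:.2%})")
print(f"unacceptable: {unacceptable} ({unacceptable / total:.2%})")
```

These figures are shares of cached entries, not of time spent browsing.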
Conclusion and Recommendation
Firefox cache can be used to analyze web browsing behavior, especially for browsing activities during working hours at the workplace. These kinds of logs can be easily acquired since browsers have caching capability. For each cached page, this capability provides the address/URL from which the page was fetched, the name of the file, the size, the time it was last modified, and its expiry date.
From the data acquired in this project, the web browsing cache log shows misuse of Internet access at work: about 25.65% of the user’s cached entries (128 of 499) fall into unacceptable browsing categories.
Most web browsers provide an erase function for log information such as cache, history, cookies, and download list. If a user runs this function to erase log information, investigation becomes difficult. [5] Consequently, we recommend that companies disable their employees’ ability to delete the browsing history and cache, or retain a copy outside of the user’s computer.
References
- Web Developers Notes. Browser Cache – What is It?, 2012
- Chad Tilbury. Big Brother Forensics: Device Tracking Using Browser-Based Artifacts, April 11, 2012
- Peter Grant. Forensic Tools for Internet Activity. 2012
- Keith J. Jones, Rohyt Belani. Web Browser Forensics, 2012
- Junghoon Oh, Seungbong Lee, Sangjin Lee. Advanced evidence collection and analysis of web browser activity
- Jones, Keith J. Forensic analysis of internet explorer activity files. Foundstone, dat.pdf; 2003.
- Jones, Keith J., Rohyt Belani. Web browser forensics. SecurityFocus, 2005a.
- Jones, Keith J., Rohyt Belani. Web browser forensics. SecurityFocus, 2005b.
- Arvidson, Erick. Types of Web Browsers. 2012. <
- Davison, Brian D. Web Caching. Feb 2008. <
- Jaroslovsky, Rich. Bloomberg Technology Columnist Renee Montagne. 12 August 2012.
List of Figures
Figure 1: Browser Market Share
Figure 2: IE Browser Cache Clearing Dialog Box
Figure 3: Browser Cache and Forensics
Figure 4: Browser History
Figure 5: Browser Cache left on a Hard Drive
Figure 6: About:Cache in Firefox
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
Figure 8: Categories vs. ID