DATA MINING USING WEKA:

AN ANALYSIS OF WEB BROWSING BEHAVIOR EVALUATION USING FIREFOX CACHE

TEAM-3:

Mike Egbert

Gonzalo Perez

Diah Schur

Novelle Maxwell-Sinclair

For:

Emerging IT

Dr. Charles Tappert

Dr. Sun-Hyuk Cha

Pace University

2012

TABLE OF CONTENTS

Introduction

Common Browsers

Internet Explorer

Google Chrome

Mozilla Firefox

Safari

Opera

Figure 1: Browser Market Share

Previous Work in Cache Analysis

Figure 2: IE Browser Cache Clearing Dialog Box

Browser Cache and Forensics

Figure 3: Browser Cache and Forensics

Browser-based Forensics: History as a Cache Artifact

Figure 4: Browser History

Persistence Browser Cache

Figure 5: Browser Cache left on a Hard Drive

Web Browsing Cache Log

Data Collection

Figure 6: About:Cache in Firefox

Methodology

DataSet

Categories

Data Loading

Figure 7: Loaded Data on WEKA Pre-processing with Category shown

Analysis

Figure 8: Categories vs. ID

Conclusion and Recommendation

References

List of Figures

Introduction

This project evaluates whether a browser cache or history log can be used to determine user browsing behavior, especially in the workplace. The first step is to acquire the actual log file(s). The second step is to identify the relevant file that can be analyzed by the tool. The third step is to evaluate the recovered data to determine whether the user has been misusing his or her time at work. The last step is to present the findings at a high level. For this project, we use Firefox as the subject browser, although several other browsers are commonly used by individuals and companies.

Previous work in computer forensics has used cache analysis as a main methodology. The following sections discuss common web browsers and previous work in cache analysis.

Common Browsers

The web provides a diverse source of information and services, and many browsers compete to deliver the experience users desire at the click of a button. Browsers are the tools that request content from web servers, interpret the markup language, and then present the content to users. These web browsers are compatible with PCs, Macs and other Internet-capable devices such as the Apple iPod (though Internet Explorer is compatible only with PCs running Windows). The most popular browser today is Internet Explorer, with Google Chrome close behind, followed by Mozilla Firefox, Safari, Opera and a few others.

Internet Explorer

Internet Explorer was introduced in the mid-1990s and is included with every Windows-based PC.

Google Chrome

Google Chrome has gained significant market share within the past few years. Chrome has ranked very high in independent tests of speed and page load times. Chrome also features an “incognito mode” in which users can visit web sites without cookies remaining on their PCs.

Mozilla Firefox

Firefox is an open source web browser with a simple interface and enhanced security features to help protect users. Since Firefox is open source, it offers users a large library of add-ons that can be installed in the browser to augment and customize the user experience. Types of add-ons include extensions, themes, search providers, dictionaries, language packs and plugins.

Safari

Safari, which runs on Mac OS and the iPhone, offers many browser extensions, including an eBay manager and Twitter integration.

Opera

The Opera browser has some unique features such as text and graphic enlarging on a web page.

Figure 1: Browser Market Share

Recent trends show browsers becoming operating systems in their own right, integrating many functions that were historically performed on a local machine. The Google Chromebook does not have a Windows operating system installed; Chrome assumes that function. Users can create, save and edit spreadsheets or word processing documents directly through the Google browser.

Previous Work in Cache Analysis

Browser cache is a set of temporary file folders in which content from web sites you have visited (e.g., graphics, static pages, cookies, entire web pages) is stored. The purpose of browser cache is to improve performance: when you revisit a site, much of the content is already cached, so the browser serves local copies instead of retrieving them from the site again. [1]

What kinds of files are stored in the browser cache? The following is a list of typical files that are stored in browser cache:

  1. Files from the web sites you've opened in the browser
  2. Entire web pages
  3. Images
  4. CSS
  5. Audio
  6. Video

Figure 2: IE Browser Cache Clearing Dialog Box

For example, if you save web pages for offline browsing, all the files are stored in the browser cache. Depending on your browser and operating system, both the hard disk and RAM are used to store the cache files. [1]

Browser Cache and Forensics

Prior work in forensics has used browser cache when investigating cybercrime. While browser cache is an important part of web architecture that provides many benefits, it is also a forensic artifact that may be used in cybercrime investigations. Users might regard this as a negative, but law enforcement officials regard it as a positive and powerful artifact.

Figure 3: Browser Cache and Forensics

In addition to informational artifacts, browser cache can also be tied to geo-location. This is possible because information from map web pages and Internet addresses may be in the cache, and forensic investigators will look for it. Chad Tilbury states in an article in Forensic Methods: “Geo-location artifacts demonstrate an interesting concept with regard to browser-based evidence. Among the various browser artifacts, Internet history is a fan favorite because it provides such rich information. There is no easier place to look to identify sites visited by a specific user at a specific time. Browser history is so useful, a critical shortcoming is often ignored; with today’s dynamic web pages, the vast number of web page requests goes unrecorded.”[2]

Browser-based Forensics: History as a Cache Artifact

Historical data is kept in browser cache. Most browsers today retain multiple days or weeks of stored browsing history. The figure below is an example of historical browser cache. Again, forensic investigators can use such historical data.

Figure 4: Browser History

Persistence Browser Cache

Peter Grant emphasizes the importance of browser cache and notes that, even after a user deletes the cache, the data can still be retrieved in most cases. Federal and state cybercrime investigators can use this data in a court of law. [3] The figure below is an example of browser cache that remains on the hard disk even though the user deleted it.

Figure 5: Browser Cache left on a Hard Drive

Browser cache has thus proven to be a good tool for web performance as well as an important tool for cyber forensics. Keith J. Jones and Rohyt Belani state: “Critical electronic evidence is often found in the suspect's web browsing history in the form of received emails; sites visited and attempted Internet searches.” [4]

Prior work on browser cache technology has therefore yielded both modern browser caching and a powerful cyber forensics tool.

Web Browsing Cache Log

Cache analysis uses the cache log of browsing history from a user’s computer to determine his or her browsing behavior. This is particularly useful in the workplace environment. Other potential evidence may be available from registry entries, temporary files, index.dat, cookies, bookmarked pages, saved HTML pages, emails sent and received by the user, etc. A study conducted by Junghoon Oh, Seungbong Lee, and Sangjin Lee asserts that searching for evidence left by web browsing activity is typically a crucial component of digital forensic investigations. [5] Almost every action a suspect performs while using a web browser leaves a trace on the computer, even a simple search for information. Therefore, when an investigator analyzes the suspect’s computer, this evidence can provide useful information. After retrieving data such as cache, history, cookies, and download list from a suspect’s computer, it is possible to analyze this evidence for web sites visited, time and frequency of access, and search engine keywords used by the suspect.
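As a rough illustration of the "frequency of access" analysis mentioned above, the following Python sketch totals fetch counts per web site. It is only a sketch under stated assumptions: the function name `fetches_per_site`, the sample entries, and the idea that each cache key is a plain URL paired with a fetch count are all hypothetical and are not taken from the cited study.

```python
from collections import Counter
from urllib.parse import urlparse


def fetches_per_site(entries):
    """Sum fetch counts per host for (key, fetch_count) pairs.

    Assumes each key is a plain URL such as http://example.com/page.
    """
    totals = Counter()
    for key, fetch_count in entries:
        totals[urlparse(key).netloc] += fetch_count
    return totals


# Made-up sample entries for illustration only.
sample_entries = [
    ("http://example.com/index.html", 3),
    ("http://example.com/style.css", 2),
    ("http://example.org/logo.png", 1),
]

site_totals = fetches_per_site(sample_entries)
print(site_totals.most_common())
```

Grouping by host rather than by full URL keeps images, style sheets and pages from the same site together, which is usually closer to what an investigator means by "sites visited".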

Computer forensic analysis can also be done from the server side by analyzing access logs, error logs and FTP log files, as well as network traffic. For this project, we employ a simple cache log analysis method using Firefox Disk Cache.

Data Collection

At the beginning of the project, we decided to use the Firefox cache to obtain the browsing history of an actual user. We acquired the data by entering “about:cache” in the URL address box of the Firefox browser and used 499 of the cache entries. The screenshot is as follows:

Figure 6: About:Cache in Firefox

As shown in the figure above, there are three cache devices, but only two have entries: the memory cache and the disk cache device. Memory cache is held in RAM (application processed) while disk cache is stored on the user’s hard drive. Disk cache is where temporary Internet files are stored, and thus it has more entries.

To obtain a varied sample, we used the disk cache device log and took 499 of its entries as samples.

Methodology

Once we acquired the log from the cache, we prepared the data so that it could be loaded and processed by WEKA seamlessly.

DataSet

The log file was transferred to an Excel spreadsheet and then saved as a CSV file. We encountered a few errors while first trying to load the data into Weka for pre-processing. These errors were resolved after an ID attribute was added to the file.
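The ID step above was done in a spreadsheet, but it can also be sketched in a few lines of Python. The sketch below is an illustration only: the function name `add_id_column` and the sample rows are hypothetical, and the column names are assumed to match the five Firefox cache attributes named in this report.

```python
import csv
import io

# Column names follow the five cache attributes named in this report.
CACHE_COLUMNS = ["Key", "Datasize", "Fetchcount", "Lastmodified", "Expires"]


def add_id_column(csv_text):
    """Return CSV text with a sequential ID column prepended to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["ID"] + CACHE_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for row_id, row in enumerate(reader, start=1):
        record = {name: row[name] for name in CACHE_COLUMNS}
        record["ID"] = row_id
        writer.writerow(record)
    return out.getvalue()


# Made-up sample rows for illustration only.
sample = (
    "Key,Datasize,Fetchcount,Lastmodified,Expires\n"
    "http://example.com/index.html,1024,3,2012-10-01 09:15:00,2012-11-01 09:15:00\n"
    "http://example.org/logo.png,2048,1,2012-10-02 14:30:00,2012-12-02 14:30:00\n"
)

with_ids = add_id_column(sample)
print(with_ids)
```

Numbering from 1 gives each cache entry a unique identifier that can later be plotted against the category attribute.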

Categories

A Categories attribute was also added to classify the “key” attribute (the visited sites) into: general browsing, shopping, entertainment, lifestyle, social network, foreign news, and alert.

  • General Browsing is harmless browsing that might be work-related research, so this type of behavior is acceptable
  • Shopping is not acceptable in a work environment
  • Entertainment is not acceptable
  • Lifestyle is not acceptable
  • Social Network is not acceptable
  • Foreign News is not acceptable
  • Alert is a heightened category that is beyond unacceptable, e.g., visiting porn sites or monitored sites

With the 5 original attributes from the Firefox cache (Key, Datasize, Fetchcount, Lastmodified, and Expires) plus the added ID and Categories attributes, the data now has 7 attributes and is ready to be pre-processed in Weka.
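This report does not describe how each key was assigned to a category. One possible way to automate the labeling is a simple keyword lookup on the host name, sketched below in Python. Everything specific here is an assumption made for illustration: the keyword lists in `CATEGORY_KEYWORDS`, the function name `categorize`, and the premise that each key is a plain URL. These are not the rules used in this project.

```python
from urllib.parse import urlparse

# Hypothetical keyword rules, for illustration only. Any key whose host
# matches none of them falls back to "General Browsing".
CATEGORY_KEYWORDS = {
    "Shopping": ["shop", "store"],
    "Entertainment": ["video", "games"],
    "Lifestyle": ["recipes", "travel"],
    "Social Network": ["social"],
    "Foreign News": ["worldnews"],
    "Alert": ["blocked"],
}


def categorize(key):
    """Return the first category whose keyword appears in the key's host."""
    host = urlparse(key).netloc.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in host for word in keywords):
            return category
    return "General Browsing"


print(categorize("http://shop.example.com/cart"))
print(categorize("http://www.example.com/docs"))
```

A keyword lookup of this kind is easy to audit but will mislabel sites whose names do not reveal their content, so manual review of the resulting labels would still be needed.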

Data Loading

The data loading process starts by opening Weka Explorer and loading the CSV file prepared from the log file. The result is as follows:

Figure 7: Loaded Data on WEKA Pre-processing with Category shown

Analysis

The pre-processing panel reveals that our log consists of:

  • General Browsing = 371 entries
  • Lifestyle = 63 entries
  • Shopping = 11 entries
  • Social network = 9 entries
  • Entertainment = 16 entries
  • Foreign News = 16 entries
  • Alert = 13 entries

The ID attribute is a unique identifier for each individual browsing instance. The data has 499 unique instances, which were analyzed against the Categories attribute. The majority of IDs are in General Browsing, the only acceptable browsing behavior category: 371 out of 499 instances (74.35%). The remaining 128 instances (25.65%) are in the unacceptable behavior categories.
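The percentages above can be checked with a short Python calculation. The counts are the ones reported in this section; the variable names are chosen here for illustration.

```python
# Category counts as reported by the Weka pre-processing panel.
category_counts = {
    "General Browsing": 371,
    "Lifestyle": 63,
    "Shopping": 11,
    "Social network": 9,
    "Entertainment": 16,
    "Foreign News": 16,
    "Alert": 13,
}

# General Browsing is the only category treated as acceptable.
ACCEPTABLE = {"General Browsing"}

total = sum(category_counts.values())
acceptable = sum(n for c, n in category_counts.items() if c in ACCEPTABLE)
unacceptable = total - acceptable

acceptable_pct = round(100 * acceptable / total, 2)
unacceptable_pct = round(100 * unacceptable / total, 2)

print(f"{acceptable} of {total} acceptable ({acceptable_pct}%)")
print(f"{unacceptable} of {total} unacceptable ({unacceptable_pct}%)")
```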

Figure 8: Categories vs. ID

Conclusion and Recommendation

Firefox cache can be used to analyze web browsing behavior, especially for browsing activities during working hours at the workplace. These kinds of logs can be easily acquired since browsers have caching capability. For each cached page, this capability provides the address/URL from which the page was fetched, the name of the file, the size, the time it was last modified, and its expiry date.

From the data acquired in this project, the web browsing cache log shows misuse of Internet access at work: about 25% of the user’s cached entries (128 of 499) fall into unacceptable browsing categories.

Most web browsers provide an erase function for log information such as cache, history, cookies, and download list. If a user runs this function to erase log information, investigation becomes difficult. [5] Consequently, we recommend that companies either disable their employees’ ability to delete browsing history and cache or retain a copy outside the user’s computer.

References

  1. Web Developers Notes. Browser Cache – What is It? 2012.
  2. Chad Tilbury. Big Brother Forensics: Device Tracking Using Browser-Based Artifacts. April 11, 2012.
  3. Peter Grant. Forensic Tools for Internet Activity. 2012.
  4. Keith J. Jones, Rohyt Belani. Web Browser Forensics. 2012.
  5. Junghoon Oh, Seungbong Lee, Sangjin Lee. Advanced evidence collection and analysis of web browser activity.
  6. Jones, Keith J. Forensic analysis of Internet Explorer activity files. Foundstone, dat.pdf; 2003.
  7. Jones, Keith J., Rohyt Belani. Web browser forensics. SecurityFocus, 2005a.
  8. Jones, Keith J., Rohyt Belani. Web browser forensics. SecurityFocus, 2005b.
  9. Arvidson, Erick. Types of Web Browsers. 2012.
  10. Davison, Brian D. Web Caching. Feb 2008.
  11. Jaroslovsky, Rich. Bloomberg Technology Columnist Renee Montagne. 12 August 2012.

List of Figures

Figure 1: Browser Market Share

Figure 2: IE Browser Cache Clearing Dialog Box

Figure 3: Browser Cache and Forensics

Figure 4: Browser History

Figure 5: Browser Cache left on a Hard Drive

Figure 6: About:Cache in Firefox

Figure 7: Loaded Data on WEKA Pre-processing with Category shown

Figure 8: Categories vs. ID
