Keylogger Keystroke Biometric System

Brian Tschinkel, Bernard Esantsi, Dominick Iacovelli, Padma Nagesar, Richard Walz, Vinnie Monaco, and Ned Bakelman

Pace University Seidenberg School of CSIS, White Plains, NY 10606, USA

{bt66343n, be55006w, di80499p, pn53434n, rw195606p}@pace.edu, {vinmonaco, nbakelman}@gmail.com


Abstract

The system developed uses an open-source keylogger to capture data samples of all keystroke input. The keylogger output is converted to a data file format appropriate for processing by the Pace Keystroke Biometric System (PKBS). This study evaluates the overall system to determine the accuracy of correctly authenticating users based on their recorded keystroke patterns.

  1. Introduction

Biometrics can be defined as the study of human traits to identify and verify a person based on their physiological and behavioral characteristics. Physiological characteristics include fingerprint, DNA, iris recognition, facial recognition, palm print, and hand geometry, while behavioral characteristics include typing rhythm, voice, and gait [1]. People have long identified one another by their faces. In ancient civilizations, palm prints and fingerprints were used to differentiate one person from another. In the 1880s, Francis Galton discovered that fingerprints do not change over time; he calculated the odds of two people having exactly the same fingerprint to be 1 in 64 billion [7]. These methods worked well and were sufficient in small communities. However, as society progressed and people began to migrate, changes were necessary to prevent theft, fraud, and other criminal activities.

Computer technology has contributed significantly to the collection and measurement of raw data, improving the accuracy and distinctiveness of biometric systems. A biometric system is essentially a pattern recognition system that consists of three main components: data collection, feature extraction, and classification. These components are used to collect data samples and compare them to verify a person’s identity. Figure 1 shows the process of matching samples with the data stored in a biometric database system [1].

Technology has paved the way for private industries and government agencies to incorporate biometric technologies that provide solutions for crime prevention, positive identification, and various other methods that increase security. The government utilizes these technologies to issue passports, driver’s licenses, visas, and other identification cards. For example, government-issued driver’s licenses include eye color, hair color, and height. These physiological characteristics are used to identify and authenticate a person as the person they claim to be. Businesses employ these methods for validating users’ information online, especially in the electronic banking industry. Banks often ask users to provide a user ID, password, answers to personal security questions, date of birth, and social security number. This information is validated against the stored data in the bank’s database system to verify that the right person is accessing the right account.

Figure 1. Basic block diagram of a biometric system [1]

Biometric technologies have advanced the study of behavioral characteristics, such as typing rhythm, voice, and walking gait. For instance, biometric computing can be applied to voice recognition to identify the person speaking (known as speaker recognition) and what the person is saying (known as speech recognition). Law enforcement officials use this technology to measure voice pitch, speaking style, vocal cord vibration, and formant frequencies to prove or disprove the identity and authenticity of people involved in crimes. A biometric voice print can be just as unique as a fingerprint.

Gait biometrics is the study of human bodily movement. Gait can be quite distinctive, particularly in people who have walking disabilities. Gait analysis is also used to identify and treat individuals with injuries and to help professional athletes with their performance [1].

Keystroke dynamics is the process of capturing typing rhythms, typically through the use of timing measurements such as key press (key down) and release (key up) times [1]. Features can then be determined from these timings and used to discriminate one individual’s typing pattern from another with a fair amount of accuracy.
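As a minimal sketch of how press and release times might be turned into timing features, the Python example below computes how long each key is held and the gap between consecutive keys. The `KeyEvent` type, the field names, and the sample timestamps are assumptions made for illustration; they are not the PKBS feature set.

```python
from dataclasses import dataclass


@dataclass
class KeyEvent:
    key: str           # key label, e.g. "a"
    press_ms: float    # key-down timestamp in milliseconds
    release_ms: float  # key-up timestamp in milliseconds


def timing_features(events):
    """Return (durations, transitions) for a sequence of key events.

    durations: how long each key was held (release minus press).
    transitions: time from releasing one key to pressing the next;
    a negative value means the two keystrokes overlapped.
    """
    durations = [e.release_ms - e.press_ms for e in events]
    transitions = [
        nxt.press_ms - cur.release_ms
        for cur, nxt in zip(events, events[1:])
    ]
    return durations, transitions


# Hypothetical timestamps for the keys "p", "a", "c".
sample = [
    KeyEvent("p", 0.0, 95.0),
    KeyEvent("a", 150.0, 230.0),
    KeyEvent("c", 210.0, 300.0),
]
durations, transitions = timing_features(sample)
```

Statistics over such durations and transitions (for example, means and standard deviations per key) are one common way to build a typing profile.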

Our study focuses on keystroke dynamics, with particular interest in keystrokes generated from spreadsheet and Web browsing input. We are mainly interested in authentication using both text and numeric keypad entries. We will generate spreadsheet data samples using a specific template and data samples from the Internet, which will then be transmitted to a centralized server. These samples will be collected and converted to run through the Pace Keystroke Biometric System (PKBS) for analysis and performance evaluation.

  2. Fimbel’s Basic Keylogger

Keyloggers have been used in many different environments to record data from a user at a terminal or workstation. A keylogger is a type of surveillance software or hardware that can record every single keystroke a user makes with a keyboard to a log file [6]. The log file can then be sent to a security analyst for inspection, or the file can be used as spyware and the data can be sent to a hacker. In malicious applications, keyloggers are embedded into spyware applications that may compromise a user’s sensitive information and identity by recording account passwords and transaction data. While the keylogger used in this study is not designed for spyware purposes, it does record every keystroke while the data session is active.

Eric J Fimbel, a native of Venezuela, has created the Basic Keylogger that records mouse and keyboard events regardless of any application that might be running in parallel [5]. Developed in Python, this keylogger is an evaluation tool for the study of human-computer interaction and was not created for malicious intent [5].

In Fimbel’s keylogger, events are stored in memory during the recording and written to a file at the end of the session. The keylogger produces two data files: a KEY log and a KPC log. The former records input events, such as pressing keys, releasing keys, and mouse movements. The latter records operations, which are more “concise than input events and show what a user is doing” [5]. Such operations include typing keys, pointing movements, and mouse clicks. Log files are stored in tab-separated values format (TSV), which is easy to view in any spreadsheet application.

Fimbel’s Basic Keylogger is particularly useful to the PKBS, mostly due to its ability to record events regardless of what application(s) the user may be using. The KPC log can be used to track the number of operations per task, execution time of an operation, length of a mouse pointer movement, typing rates, and other common mouse gestures. The KEY log aids in analyzing mouse trajectories and kinematics as well as mouse click densities, all of which can be used to analyze the physical motion of a user’s hand or finger.

  3. New Pace Keystroke Biometric System

The Pace Keystroke Biometric System (PKBS) has undergone many revisions in its seven years of existence. Using the Fimbel Keylogger, the system has now been adapted to incorporate keystroke and mouse input completely independent of any application(s) running on an individual’s computer (e.g., open-based web browsing, spreadsheets, instant messaging).

While the old system used a Java-based application to capture data input, the new system relies on Fimbel’s keylogger to do so. The Java input system had several limitations that impacted the flexibility of the data being collected. Specifically, the Java application only captured data for text-based inputs within a confined text box [2] and therefore could not capture data from a spreadsheet application or a Web browser.

To provide a more thorough analysis of different kinds of input, the current system was revised to incorporate data from any source. The new frontend of the PKBS now relies on Fimbel’s keylogger to collect input from any application while removing limitations on the amount of data that can be collected. Because Fimbel’s keylogger produces keystroke data files in TSV format, a converter was necessary to prepare these files for the existing backend PKBS tools (feature extraction and classification). This new system therefore relies on a Java-based converter application to parse the keylogger recordings and produce files in Extensible Markup Language (XML) format. This format fits the input requirements of the Feature Extractor tool (Figure 2).
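The general shape of a TSV-to-XML conversion can be sketched as follows. This is an illustration in Python, not the project's Java converter, and the column names (`time_ms`, `event`, `key`) and the XML element and attribute names are hypothetical: the real keylogger columns and the PKBS XML schema are not given here.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(tsv_text, user):
    """Convert tab-separated key events to an XML string.

    Column names and element names are assumptions for illustration;
    they are not the actual keylogger or PKBS formats.
    """
    root = ET.Element("sample", {"user": user})
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # One XML element per recorded key event.
        ET.SubElement(root, "keyevent", {
            "time": row["time_ms"],
            "type": row["event"],
            "key": row["key"],
        })
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-event log: press and release of the "A" key.
example_tsv = "time_ms\tevent\tkey\n0\tpress\tA\n95\trelease\tA\n"
xml_text = tsv_to_xml(example_tsv, "Nagesar_Padma")
```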

  4. Methodology

The overall system consists of the following components:

·  Keylogger data collection process

·  Data converter

·  PKBS backend processor

·  ROC curve generator

4.1. Keylogger Data Collection Process

Our experiment for the Keylogger Keystroke Biometric System uses Fimbel’s Basic Keylogger to collect samples. In addition, a utility program was developed to control starting and stopping the keylogger and to transmit samples to a centralized server. This utility program also allows tag information to be specified, such as user name and sample type (e.g., Word-processing, Spreadsheet, Browser, or Open). This tag information, along with a time stamp, is used to uniquely describe and identify samples. For example, a user generating a sample from a multiple-application environment (such as surfing the Internet, working on a spreadsheet, checking e-mail, and creating a PowerPoint presentation) would choose Open as the sample type.


Figure 2. Pace Keystroke Biometric System (New Additions) [2]


The experiment focused on collecting Microsoft Excel and Web browsing data samples. A standard Excel template was used for entering numeric data, and the keystrokes and mouse movements were captured using Fimbel’s Basic Keylogger.

The recording session begins by launching the utility program, which requires first and last name and application type. Since our experiment focused on capturing Excel and Web data samples, Spreadsheet and Browser were selected for the application types. Clicking the Start button launches the keylogger as indicated by the appearance of a blue icon in the taskbar.

The user proceeds by entering the required data in the Excel template (spreadsheet experiment) or searching for directions or recipes in a Web browser (Web data experiment). Once finished, the user clicks on the Stop button, which closes the keylogger and prevents any more data from being captured. The last step in the process is the transmission of the keylogger files. When the user clicks on the Transmit button, two actions occur. First, the key_log.tsv and kpc_log.tsv files are renamed using the user’s name, the log type (KEY or KPC), the application type (Spreadsheet or Browser), and a time stamp. Second, the renamed files are transmitted to the centralized server. Examples of the resulting file names are Nagesar_Padma_KEY_Spreadsheet_2011-11-05-12-39-23.txt and Nagesar_Padma_KEY_Browser_2011-11-28-20-24-25.txt. The file name was changed so that each user’s files are easy to identify (Nagesar_Padma_KEY and Nagesar_Padma_KPC), and the extension was changed to .txt so that the files open easily in Notepad. Figure 3 shows the first step for collecting the data samples via the Fimbel Keylogger.
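The naming convention can be reproduced with a short Python sketch. The function name and parameter names are hypothetical, and the time stamp layout is inferred from the example file names above rather than taken from the utility program's source.

```python
from datetime import datetime


def sample_file_name(last, first, log_type, app_type, when):
    """Build a sample file name in the style of the examples above.

    log_type is "KEY" or "KPC"; app_type is e.g. "Spreadsheet" or "Browser".
    The time stamp layout (year-month-day-hour-minute-second) is an
    assumption inferred from the example names.
    """
    stamp = when.strftime("%Y-%m-%d-%H-%M-%S")
    return f"{last}_{first}_{log_type}_{app_type}_{stamp}.txt"


name = sample_file_name(
    "Nagesar", "Padma", "KEY", "Spreadsheet",
    datetime(2011, 11, 5, 12, 39, 23),
)
# name == "Nagesar_Padma_KEY_Spreadsheet_2011-11-05-12-39-23.txt"
```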

Figure 3. Fimbel Keylogger

4.2. Data Converter

All the sample files are converted by a new, improved Java-based converter to produce XML files that are compatible with the Pace Keystroke Biometric System.

The new converter eliminates the login prompt that was part of the old converter. Instead it parses the user’s first and last name from the KPC file name generated by the Fimbel Keylogger, shown in Figure 4. The user name is important because it distinguishes the samples for each person in the Feature Vector file, which is generated during feature extraction.
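Extracting the user name from a file name might look like the Python sketch below. It is not the Java converter's code: the regular expression assumes the file name layout used during data collection (name parts, log type, application type, time stamp, .txt) and assumes that none of the name parts contains an underscore.

```python
import re

# Assumed layout: Last_First_LOG_App_YYYY-MM-DD-HH-MM-SS.txt
FILE_NAME_PATTERN = re.compile(
    r"^(?P<last>[^_]+)_(?P<first>[^_]+)_(?P<log>KEY|KPC)_(?P<app>[^_]+)_"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.txt$"
)


def parse_sample_file_name(file_name):
    """Return the parts of a sample file name as a dict, or None if the
    name does not follow the assumed layout."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return match.groupdict()


parts = parse_sample_file_name(
    "Nagesar_Padma_KPC_Spreadsheet_2011-11-05-12-39-23.txt"
)
```

Returning `None` for a name that does not match corresponds to the "user name (if any)" case: the converter can still run, but the sample carries no user label.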

Figure 4. New Converter (Pre-Processing)

Figure 5. New Converter (Post-Processing)

The new converter displays the number of keystrokes and anomalies, as shown in Figure 5. The anomalies refer to the keystrokes the converter could not successfully convert. An anomaly file is automatically generated with any “bad” keystrokes and is used for debugging purposes.

The original converter was refactored to address some conversion problems and to improve maintainability. These modifications include the use of the KPC log file as the main driver for the output. This file is easier to parse; it was originally discarded because it was thought not to include keystroke release times. The KEY log file is still used, but mainly as a lookup to obtain data elements not found in the KPC file, such as scan and key codes. Using the KPC file as the driver eliminated the “Out of Bounds” error that would occasionally occur during processing.

The converter program is made up of a series of method calls that display a user dialog to capture the name and location of the input and output files, obtain the user name (if any) from the input file name, parse the keystroke data, and generate the converted output in the appropriate XML format. Once the input and output paths are specified, clicking on the Convert to XML button starts the process that converts the input and produces the output file. Figures 4 and 5 show the converter at the pre- and post-processing stages.

4.3. PKBS Backend Processor

PKBS backend processing consists of feature extraction and user authentication. The XML files are processed by the feature extractor to generate a single feature vector file. The feature vector file is then split into two files, one for testing samples and the other for training samples.

The testing and training samples are then passed to the BAS authentication system to classify and determine the performance results. The dichotomy model is applied during this process which generates metadata files consisting of intra-class and inter-class sizes from the test and train samples. The subsequent results obtained from this process are in the form of False Acceptance and False Rejection Rates (FAR and FRR).
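The two error rates can be illustrated with a small Python sketch. It assumes a simple threshold rule on a distance score (accept when the score is at or below the threshold), which is a simplification for illustration and not necessarily the classifier used by the authentication system; the scores and labels are made up.

```python
def far_frr(scores, genuine, threshold):
    """Compute (FAR, FRR) for a threshold on distance scores.

    scores: one distance score per comparison (smaller = more similar).
    genuine: True for same-person (intra-class) comparisons,
             False for different-person (inter-class) comparisons.
    A comparison is accepted when its score is <= threshold.
    """
    n_genuine = sum(1 for g in genuine if g)
    n_impostor = len(genuine) - n_genuine
    if n_genuine == 0 or n_impostor == 0:
        raise ValueError("need at least one genuine and one impostor comparison")
    false_accepts = sum(
        1 for s, g in zip(scores, genuine) if not g and s <= threshold
    )
    false_rejects = sum(
        1 for s, g in zip(scores, genuine) if g and s > threshold
    )
    return false_accepts / n_impostor, false_rejects / n_genuine


# Made-up scores: three genuine and three impostor comparisons.
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
genuine = [True, True, False, False, True, False]
strict = far_frr(scores, genuine, threshold=0.3)   # fewer false accepts
lenient = far_frr(scores, genuine, threshold=0.5)  # fewer false rejects
```

Sweeping the threshold and plotting the resulting (FAR, FRR) pairs is the usual way to obtain an ROC curve, the job of the ROC curve generator listed among the system components.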

For authentication (verification), a vector-difference model transforms a multi-class problem into a two-class problem. The resulting two classes are “within-class (intra-person), you are authenticated” and “between-class (inter-person), you are not authenticated.” This is a strong inferential statistics method found to be particularly effective for multidimensional feature-space problems [9].
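A rough Python sketch of the vector-difference idea follows. It assumes the difference between two samples is the element-wise absolute difference of their feature vectors; the sample values are made up, and three people with three samples each are used only to keep the counts small.

```python
from itertools import combinations


def abs_diff(u, v):
    """Element-wise absolute difference of two feature vectors (assumed metric)."""
    return tuple(abs(a - b) for a, b in zip(u, v))


def dichotomy_transform(samples):
    """Map per-person feature vectors to two classes of difference vectors.

    samples: dict mapping a person label to a list of feature vectors.
    Returns (intra, inter): differences between samples of the same person
    and differences between samples of different people.
    """
    intra, inter = [], []
    for vectors in samples.values():
        for u, v in combinations(vectors, 2):
            intra.append(abs_diff(u, v))
    for (_, vecs_a), (_, vecs_b) in combinations(samples.items(), 2):
        for u in vecs_a:
            for v in vecs_b:
                inter.append(abs_diff(u, v))
    return intra, inter


# Made-up two-dimensional features: three people, three samples each.
samples = {
    "P1": [(1, 1), (2, 1), (1, 2)],
    "P2": [(5, 5), (6, 5), (5, 6)],
    "P3": [(9, 1), (9, 2), (8, 1)],
}
intra, inter = dichotomy_transform(samples)
# 3 people x 3 pairs each = 9 intra-person differences;
# 3 pairs of people x 3 x 3 samples = 27 inter-person differences.
```

Whatever the number of people, the classifier only ever sees these two classes, which is what turns the multi-class problem into a two-class one.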

(a) Feature space (b) Feature-difference space

Figure 6. Transformation from feature space (a) to feature-difference space (b), adapted from [9]

To explain the dichotomy transformation process, take an example of three people {P1, P2, P3} where each person supplies three biometric samples. Figure 6(a) plots the biometric sample data for these three people in two-dimensional feature space. This feature space is transformed into a feature-difference space by calculating vector distances between pairs of samples of the same person (intra-person distances, denoted by x⊕) and distances between pairs of samples of different people (inter-person distances, denoted by x∅). Let d_ij represent the feature vector of the i-th person’s j-th biometric sample; then x⊕ and x∅ are calculated as follows: