Title: Malicious URL Detection Using Machine Learning

04-04-2017 PROGRESS REPORT - MACHINE LEARNING (CS 219) PRAGATHI NARENDRA

ABSTRACT:

Purpose:

A malicious URL is a threat to cyber security. Information loss, monetary loss, and spyware and malware installations cost billions of dollars every year, so it is important to browse safely. The project aims to improve cyber security by detecting malicious URLs and preventing access to such sites. Blacklisting malicious URLs calls for an automated, machine learning based approach that learns the features of known malicious URLs and uses them to detect new ones.

Goal:

Given a URL together with its lexical, host-based, and popularity features, classify the URL as malicious or benign.

Design:

In lexical feature analysis, we first split the URL into two parts: the host name and the path. We then count the tokens in the domain name and the path, since malicious URLs tend to contain a large number of tokens. Next, we look at the length of the URL, since malicious URLs tend to be long, and at the presence of suspicious word tokens. In host-based feature analysis, we assess the authenticity and reputation of the host, since malicious websites tend to be hosted by less reputable and less authenticated hosts. In popularity feature analysis, we measure the popularity of the URL, since malicious URLs tend to be less popular than benign ones.
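The lexical step described above can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the function names, the tokenization rule (split on non-alphanumeric characters), and the SUSPICIOUS_WORDS list are all assumptions made for the example, since the report does not specify them.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration only; the report does not
# list the suspicious words it actually checks for.
SUSPICIOUS_WORDS = {"login", "verify", "update", "secure", "account", "banking", "confirm"}


def tokenize(text):
    """Split text on any run of non-alphanumeric characters; drop empty tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def lexical_features(url):
    """Return simple lexical features for one URL as a dict."""
    # urlparse only fills in the host when a scheme is present,
    # so prepend one if the URL does not have it.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    path = parsed.path
    host_tokens = tokenize(host)
    path_tokens = tokenize(path)
    all_tokens = [t.lower() for t in host_tokens + path_tokens]
    return {
        "url_length": len(url),
        "host_token_count": len(host_tokens),
        "path_token_count": len(path_tokens),
        "suspicious_word_count": sum(t in SUSPICIOUS_WORDS for t in all_tokens),
    }


if __name__ == "__main__":
    print(lexical_features("http://secure-login.example.com/account/verify.php"))
```

Each returned value would become one column of the feature table; host-based and popularity features need external lookups and are not covered by this sketch.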

PROGRESS STATUS:

Deviations:

The only deviation from the proposal concerns the class labels. In the proposal I planned to classify URLs as benign (0), spam (1), or malicious (2). Because spam URLs make up a very small fraction of the dataset compared to benign and malicious URLs, I relabeled the spam URLs as malicious. The final classification is therefore binary: benign (0) or malicious (1).
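The relabeling described above amounts to a small mapping over the label column. The sketch below is in Python for illustration only (the project's own code is in R), and the function name is hypothetical.

```python
def merge_spam_into_malicious(labels):
    """Map proposal labels (0 benign, 1 spam, 2 malicious) to binary labels
    (0 benign, 1 malicious)."""
    merged = []
    for label in labels:
        if label not in (0, 1, 2):
            raise ValueError(f"unexpected label: {label!r}")
        # Benign stays 0; spam and malicious both become 1.
        merged.append(0 if label == 0 else 1)
    return merged


if __name__ == "__main__":
    print(merge_spam_into_malicious([0, 1, 2, 0]))
```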

Project Development:

i) Dataset description:

I downloaded the dataset from the GitHub repository of a developer who had done a similar project in Python.
Size: 832 rows, 22 variables
Each attribute is a lexical or host-based feature of the URL.
I used regression in R to explore the dataset for useful attributes, so that only the informative attributes are used for classification. (R code is uploaded on my website.)

ii) Design Methodology:

I have decided to use various supervised learning classification algorithms for the project.

The algorithms applied successfully so far are Naive Bayes and neural networks. (R code is uploaded on my website.)
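To make the Naive Bayes step concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It is an illustration, not the project's R code: the report does not say which Naive Bayes variant it uses, so the Gaussian likelihood, the class name, and the small variance floor are all assumptions.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Gaussian Naive Bayes for numeric features (illustrative sketch)."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, rows in rows_by_class.items():
            # Class prior: fraction of training rows with this label.
            self.priors[label] = len(rows) / len(X)
            col_stats = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col)
                # Small floor (an assumption) avoids division by zero
                # when a feature is constant within a class.
                col_stats.append((mean, var + 1e-9))
            self.stats[label] = col_stats
        return self

    def _log_posterior(self, row, label):
        # log prior + sum of per-feature Gaussian log-likelihoods.
        total = math.log(self.priors[label])
        for value, (mean, var) in zip(row, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total

    def predict(self, X):
        return [max(self.priors, key=lambda c: self._log_posterior(row, c)) for row in X]


if __name__ == "__main__":
    # Toy rows of [url_length, token_count]; the values are made up.
    X = [[20, 2], [25, 3], [22, 2], [90, 9], [110, 12], [95, 10]]
    y = [0, 0, 0, 1, 1, 1]
    model = GaussianNaiveBayes().fit(X, y)
    print(model.predict([[23, 2], [100, 11]]))
```

The classifier treats features as independent given the class, which is the simplifying assumption that gives Naive Bayes its name.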

The remaining analytic task is to select the best attributes for a decision tree, which I also plan to use for classification.
Performance is evaluated with a confusion matrix for each method.
Tools used: RStudio, MS Excel
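The confusion-matrix evaluation mentioned above can be sketched as follows for the binary case (0 benign, 1 malicious). This is a Python illustration rather than the project's R code; the function names and the choice of summary metrics (accuracy, precision, recall) are assumptions, since the report only states that a confusion matrix is used.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives, treating 1 (malicious) as positive."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            raise ValueError(f"labels must be 0 or 1, got {a!r} and {p!r}")
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the four counts."""
    tp, tn, fp, fn = counts["tp"], counts["tn"], counts["fp"], counts["fn"]
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    counts = confusion_matrix([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
    print(counts)
    print(summarize(counts))
```

Computing the same counts for each classifier on the same held-out rows makes the methods directly comparable.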

iii) Schedule of remaining tasks

Weeks 11 and 12 - Decision trees (and random forests, if possible)
Week 13 - Project report

References:

1. Doyen Sahoo, Chenghao Liu, and Steven C.H. Hoi. Malicious URL Detection using Machine Learning: A Survey.