Predicting the Outcome of Police Calls in San Francisco

John Van Gilder

Introduction

San Francisco is known to many as the tech capital of the world: new computer science graduates from across the country move there to start their careers at new and exciting startup companies. To others, however, the city is known as one of the crime capitals of the United States.1 Data on these crimes is easy to access, but because each incident is unique and the volume of data is so large, it is difficult to use in a way that is helpful to law enforcement agencies. Previous groups have tried2,3 to apply machine learning techniques to this dataset, but have so far not succeeded in developing an accurate model that uses all of the given features to predict an emergency call's specific outcome2 or the crime committed.3 This paper presents a different approach, using a more limited set of features to predict a more general, but still useful, label.

Work Performed

Data Preparation

The dataset was taken from the Kaggle "San Francisco Crime Classification" challenge.4 The raw data was in CSV format and contained nine fields: the date, the day of the week, a description of the incident, the name of the police district where the crime was reported, the address of the reported crime, the latitude and longitude of the report, the type of crime, and its resolution. During development of this model, several of these features were judged extraneous and removed: address and police district, for instance, were redundant with the latitude and longitude data, and the free-text incident descriptions, written by different people for different events, were too varied to be immediately useful, so they were discarded as well. Finally, predicting the specific type of crime committed was deemed beyond the scope of this project, as no ranking of the relative severity of crimes was available and the intended model focused solely on the resolution of the incident report. Once these fields were removed, the dataset was left with three features and one label: day of the week, latitude, and longitude made up the feature vector, and resolution (a binary variable: 0 for no arrest made, 1 for arrest made) served as the label. The data was then converted to HDF5 format with a MATLAB script written specifically for this project (Appendix A).
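The feature reduction described above can be sketched in Python. The column names (DayOfWeek, X, Y, Resolution) and the "arrest" substring test are assumptions about the Kaggle file's layout, shown here for illustration only:

```python
import csv
import io

# Map day names to integers so the day of the week can serve as a numeric feature.
DAYS = {d: i for i, d in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday",
     "Friday", "Saturday", "Sunday"])}

def prepare(csv_text):
    """Reduce raw incident rows to ([day, latitude, longitude], label) pairs.

    The label is binarized: 1 if the resolution mentions an arrest, 0 otherwise.
    Column names are assumptions and should be checked against the real file.
    """
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append([DAYS[row["DayOfWeek"]],
                         float(row["Y"]),    # latitude
                         float(row["X"])])   # longitude
        labels.append(1 if "ARREST" in row["Resolution"].upper() else 0)
    return features, labels
```

The resulting arrays can then be written to HDF5 (the MATLAB script in Appendix A performs that step for the actual project).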

Algorithm

The model was developed with Caffe, a deep learning framework for convolutional neural networks. Documentation on the specifics of this toolset can be found on its website,5 but this implementation used two convolution layers, which play a role similar to the perceptron layers of an MLP implementation, and two pooling layers, which lie between the convolution layers and reduce the feature space to ease computation and lower the probability of overfitting.
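A network of this shape can be described in Caffe's .prototxt format roughly as follows. This is an illustrative skeleton only: the layer names, kernel sizes, and batch size are examples, not the exact values used in this project (those are in the attached SFData.prototxt):

```
layer { name: "data"  type: "HDF5Data" top: "data" top: "labelvec"
        hdf5_data_param { source: "test4.txt" batch_size: 100 } }
layer { name: "conv1" type: "Convolution" bottom: "data"  top: "conv1"
        convolution_param { num_output: 20 kernel_h: 1 kernel_w: 2 } }
layer { name: "pool1" type: "Pooling"     bottom: "conv1" top: "pool1"
        pooling_param { pool: MAX kernel_h: 1 kernel_w: 1 } }
layer { name: "conv2" type: "Convolution" bottom: "pool1" top: "conv2"
        convolution_param { num_output: 20 kernel_h: 1 kernel_w: 1 } }
layer { name: "pool2" type: "Pooling"     bottom: "conv2" top: "pool2"
        pooling_param { pool: MAX kernel_h: 1 kernel_w: 1 } }
layer { name: "loss"  type: "SoftmaxWithLoss"
        bottom: "pool2" bottom: "labelvec" top: "loss" }
```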

Experiments Performed

The main parameters varied in this experiment were the number of neurons in each layer and the base learning rate. The Caffe framework reports the accuracy and loss of its model after training; here, accuracy was measured by training the model on 3/4 of the data and testing on the remaining 1/4, in a procedure known as 4-fold cross-validation. The loss is calculated as a softmax log-loss, as detailed in the Caffe documentation.6 Since 4-fold cross-validation produces four different train/test pairs, the best of the four is reported here. The base learning rate applies to the first set of iterations; the learning rate is then decreased by a factor of 10 every 2,000-10,000 iterations, depending on the total number of iterations to be run.
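For reference, the two quantities just described can be computed directly. This is a generic Python sketch of the softmax log-loss and of a step learning-rate schedule, not Caffe's internal code:

```python
import math

def softmax_log_loss(scores, label):
    """Softmax log-loss (cross-entropy) for a single sample.

    scores: raw class scores from the network; label: index of the true class.
    """
    m = max(scores)  # shift by the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[label] / sum(exps))

def step_lr(base_lr, iteration, stepsize):
    """Learning rate dropped by a factor of 10 every `stepsize` iterations."""
    return base_lr * 0.1 ** (iteration // stepsize)
```

With equal scores for both classes, the loss is -log(1/2) ≈ 0.693, i.e. the loss of a model that is purely guessing on a binary label.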

Results and Discussion

                              Trial 1    Trial 2    Trial 3    Trial 4    Trial 5    Trial 6    Trial 7
Base learning rate            0.01       0.001      0.02       0.02       0.01       0.001      0.001
Neurons, convolution layer 1  20         20         20         10         10         5          2
Neurons, convolution layer 2  20         20         20         10         10         5          2
Accuracy                      0.662727   0.662727   -          0.652548   0.662727   0.652548   -
abs(Loss)                     1.66733    1.66733    NaN*       1.21764    1.66733    1.39102    NaN*

*NaN as Caffe’s reported loss value signals that learning has become divergent and training cannot continue.7

As the table above shows, the parameters that minimize the loss of the model are those of trial 4, while trials 1, 2, and 5 maximize accuracy. Because the label is binary, an accuracy of 0.5 would mean the model is no better than random guessing, so a value of 0.663 indicates that the model performs at least somewhat better than chance. The time to train each model is quite high, as 10,000-100,000 or more iterations are performed, so more trials could be run in the future.

Conclusions and Future Directions

This project succeeded in generating a model that can predict, with accuracy somewhat better than simple guessing, the outcome of an emergency call in San Francisco given its location. If refined, this kind of model could be used by law enforcement agencies to allocate their resources more efficiently, or by newcomers to the city to decide where to live.

Future directions for this project may include the addition of features (including the type of crime) into the dataset, given some way to rank them relative to each other in terms of severity. Discussion with an expert in the field of law enforcement or criminal justice may yield insights into this problem.

References

[1] Neighborhood Scout. Crime Rates for San Francisco, CA.

[2] Ang, S.T.; Wang, W.; Chyou, S. San Francisco Crime Classification. Fall 2015.

[3] Damien, RJ. Machine learning to predict San Francisco crime. July 20, 2015.

[4] Kaggle. San Francisco Crime Classification.

[5] Jia, Y; Shelhamer, E; Donahue, J; Karayev, S; Long, J; Girshick, R; Guadarrama, S; Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. 2014.

[6] Jia, Y; Shelhamer, E; Donahue, J; Karayev, S; Long, J; Girshick, R; Guadarrama, S; Darrell, T. Caffe: Layers Tutorial.

[7] Jia, Y; Shelhamer, E; Donahue, J; Karayev, S; Long, J; Girshick, R; Guadarrama, S; Darrell, T. Caffe: Solver Tutorial.

Appendix A: MATLAB Code for Conversion from CSV to HDF5 Format

clc; clear all;

filename = './CSVdata/train2r.csv'; %Replace with the pathway to the dataset

M = csvread(filename);

datawidth = 3;

labelwidth = 1;

[l,w] = size(M);

prtsz = floor(l/4); % size of each of the four cross-validation partitions

M1 = M(1:prtsz,:);

M2 = M((prtsz+1):(2*prtsz),:);

M3 = M((2*prtsz+1):(3*prtsz),:);

M4 = M((3*prtsz+1):(4*prtsz),:);

%Mf =[M1;M2;M3]; %uncomment one of these when making a training set

%Mf =[M1;M2;M4];

%Mf =[M1;M3;M4];

%Mf =[M2;M3;M4];

%Mf = M1; %uncomment one of these when making a test set

%Mf = M2;

%Mf = M3;

Mf = M4;

n = size(Mf,1); % rows in this split (3*prtsz for training sets, prtsz for test sets)

% Transpose before reshaping: MATLAB is column-major, so the transpose keeps
% each sample's features together instead of interleaving samples.
traindata = reshape(Mf(:,2:datawidth).',[1,datawidth-1,1,n]);

trainlabel = Mf(:,datawidth+labelwidth).'; % row vector of binary labels

h5create('test4.h5', '/data',size(traindata))

h5create('test4.h5', '/labelvec',size(trainlabel))

h5write('test4.h5', '/data',traindata)

h5write('test4.h5', '/labelvec',trainlabel)

h5disp('test4.h5'); % verify the written file

FILE=fopen('test4.txt', 'w');

fprintf(FILE, './examples/SF/%s', 'test4.h5');

fclose(FILE);

fprintf('HDF5 filename listed in %s \n', 'test4.txt');

Appendix B: Other Files

The solver, data definitions, and other Caffe files I wrote (some based heavily on existing Caffe files) are attached in the same ZIP file as this document. The dataset was too large a file to include, as were the rest of the Caffe tools. All attached files can be opened in a text editor regardless of extension; Caffe uses its own .prototxt format, but it is plain text readable by TextEdit or Notepad. Notably, SF_solver.prototxt contains a number of parameters governing training (number of iterations, base learning rate, etc.), and SFData.prototxt contains the configuration of the network.
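As an illustration of the kinds of parameters SF_solver.prototxt controls, a Caffe solver file typically looks something like the following. The values shown are examples consistent with the ranges described in this paper, not the exact contents of the attached file:

```
net: "examples/SF/SFData.prototxt"
base_lr: 0.01          # base learning rate (varied across trials)
lr_policy: "step"      # drop the rate by gamma every stepsize iterations
gamma: 0.1
stepsize: 2000
max_iter: 10000        # total training iterations
test_interval: 1000    # run the test phase every 1000 iterations
test_iter: 100         # batches evaluated per test phase
solver_mode: CPU
```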