Impact of Automatic Feature Extraction in Deep Learning Architecture

In Accordance with the American Psychological Association Style Guide

Research Advisor: Prof. Brijesh Verma

Course Coordinator: Dr. Jo Luck

Fatma Sheen

S0348133

RSCH2001 Fundamentals of Research

Friday 12 August 2016

Introduction

The Convolutional Neural Network (CNN) is one of the most successful recent deep learning techniques for image classification. CNNs are generally regarded as biologically inspired variants of Multi-Layer Perceptrons (MLPs). Deep learning involves multiple processing layers composed of linear and non-linear transformations. The CNN is motivated by the animal visual cortex, specifically the arrangement of its cells and its learning process. The MLP, by contrast, is a popular form of artificial neural network that can classify either manually extracted features or the raw input without any feature extraction. Unlike a CNN, an MLP performs no automatic feature extraction.
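The contrast between the two kinds of input can be illustrated with a minimal sketch. The image, the kernel values and the function names below are assumptions chosen for illustration only; in a real CNN the kernel weights are learned during training rather than set by hand.

```python
# Hypothetical 4x4 grayscale image with a vertical edge down the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-set 2x2 vertical-edge kernel (an assumption; a CNN would learn this).
kernel = [
    [-1, 1],
    [-1, 1],
]


def convolve2d(img, ker):
    """Valid (no padding) 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = len(ker), len(ker[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [
        [
            sum(img[r + i][c + j] * ker[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Non-linear transformation applied element by element."""
    return [[max(0, v) for v in row] for row in feature_map]


def flatten(img):
    """Full-image input for an MLP: every pixel becomes one input value."""
    return [v for row in img for v in row]


feature_map = relu(convolve2d(image, kernel))  # responds only where the edge is
mlp_input = flatten(image)                     # 16 raw pixel values, no structure

print(feature_map)     # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(len(mlp_input))  # 16
```

The feature map keeps the spatial location of the edge, whereas the flattened vector leaves the MLP to discover any such structure from the raw pixels.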

This research project will examine the impact of the automatic feature extraction used in deep learning architectures such as Convolutional Neural Networks (CNNs). A systematic study of CNN feature extraction will be presented. A CNN with automatic feature extraction will first be evaluated on a number of benchmark datasets, and then a simple traditional Multi-Layer Perceptron (MLP) will be evaluated on the same datasets, once with the full image as input and once with manually extracted features. The purpose is to determine whether feature extraction in a CNN performs better than an MLP with simple manual features and an MLP with the full image. Experiments will be conducted systematically by varying the number of epochs and the number of hidden neurons.
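The experimental design described above could be organised as a grid of configurations, as in the following sketch. The model labels, dataset placeholders, epoch counts and hidden-neuron counts are all hypothetical; the proposal does not state the actual settings.

```python
from itertools import product

# All values below are assumed for illustration, not taken from the proposal.
MODELS = ["cnn_auto_features", "mlp_full_image", "mlp_manual_features"]
DATASETS = ["dataset_a", "dataset_b"]  # placeholders for the benchmark datasets
EPOCHS = [10, 50, 100]
HIDDEN_NEURONS = [16, 64, 256]


def build_experiment_grid(models, datasets, epochs, hidden_neurons):
    """Return one configuration dict per combination of settings."""
    return [
        {"model": m, "dataset": d, "epochs": e, "hidden_neurons": h}
        for m, d, e, h in product(models, datasets, epochs, hidden_neurons)
    ]


grid = build_experiment_grid(MODELS, DATASETS, EPOCHS, HIDDEN_NEURONS)
print(len(grid))  # 3 models x 2 datasets x 3 epoch settings x 3 sizes = 54
print(grid[0])
```

Enumerating the grid up front makes it easy to confirm that every model is run under identical epoch and hidden-neuron settings on every dataset.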

To inform the later literature review, the following annotated bibliography has been collected and summarized from related deep learning papers on image classification and recognition. Each annotation covers the techniques proposed or investigated, the type of data used, and the accuracies obtained. This will help establish what related work has been done in the field and how this research topic will add to it.

Annotated Bibliography

Ba, J., Mnih, V., & Kavukcuoglu, K. (2015). Multiple Object Recognition with Visual Attention. ICLR Conference 2015. Published as a conference paper. Retrieved from

A new version of a deep recurrent attention model (DRAM) is described which uses “an attention mechanism to decide where to focus its computation” (Ba et al., 2015) and is trained end-to-end to classify, in sequence, multiple objects found in a single image. The model was more successful at multi-digit house number recognition than the then state-of-the-art convolutional neural networks (ConvNets). Because the model uses fewer parameters and less computation than ConvNets, it is suggested that both accuracy and efficiency can be improved. Data was obtained using a variation of the handwritten MNIST dataset for multi-object classification tasks. The multi-digit SVHN (Street View House Numbers) data was used for real-world object recognition.

Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Retrieved from

A recurrent convolutional architecture that is end-to-end trainable and suitable for large-scale visual learning was developed, and the value of such models was demonstrated on benchmark video recognition tasks. The architecture was evaluated on the UCF-101 dataset of over 12,000 videos categorized into 101 human action classes. The best results were improvements over the baseline network of 0.49% for RGB input and 5.27% for flow input.

Dundar, A., Jin, J., & Culurciello, E. (2016). Convolutional Clustering for Unsupervised Learning. arXiv:1511.03241v2 [cs.LG].

Labeling data for training deep neural networks is daunting, and the authors propose that a clustering algorithm (k-means) can be applied to reduce the number of correlated parameters and increase test classification accuracy. A new input patch extraction method was used to reduce the redundancy between filters at neighboring locations. A test accuracy of 74.1% was obtained on the STL-10 image recognition dataset and a test error of 0.5% on MNIST.
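The general idea of learning filters by clustering image patches can be sketched with plain k-means. This is a generic illustration only and is not the convolutional clustering procedure of Dundar et al.; the patch values, the function names and the choice of initial centroids are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical flattened 2x2 patches: three dark ones and three bright ones.
patches = [
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.9, 1.0],
    [1.0, 0.9, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

# One starting centroid from each group (an assumption made for determinism).
filters, assignments = kmeans(patches, [patches[0], patches[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Each resulting centroid has the same shape as a patch and can therefore serve as a filter without any labels being required.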

Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Retrieved from

Using five convolutional layers and three fully connected layers, the model took between five and six days to train on two NVIDIA GTX 580 3GB GPUs. The data comprised 1.2 million high-resolution images from the ImageNet Large Scale Visual Recognition Challenge 2010 (LSVRC-2010), which the network classified into 1000 different classes. Top-1 and top-5 test error rates of 37.5% and 17.0% respectively were achieved. Non-saturating neurons and an efficient GPU implementation were used to speed up training, and a method called “dropout” was effective in reducing “overfitting in the fully-connected layers” (Krizhevsky et al., 2012). A variant of the model also won the ILSVRC-2012 competition with a top-5 test error rate of 15.3%, compared with 26.2% for the next best entry.
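The top-1 and top-5 error rates reported above follow a simple definition: a prediction counts as correct if the true class is among the k highest-scoring classes. The sketch below computes both on a small made-up set of scores; the scores, labels and function name are assumptions for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        ranked = sorted(
            range(len(class_scores)), key=lambda c: class_scores[c], reverse=True
        )
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for four images over six classes.
scores = [
    [0.10, 0.50, 0.05, 0.20, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.10, 0.25, 0.15, 0.12, 0.08],  # true class 2 is ranked second
    [0.40, 0.20, 0.15, 0.12, 0.10, 0.03],  # true class 5 is ranked last
    [0.05, 0.05, 0.10, 0.60, 0.10, 0.10],  # true class 3 is ranked first
]
labels = [1, 2, 5, 3]

print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=5))  # 0.25
```

The top-5 error is never higher than the top-1 error, because any prediction that is correct under top-1 is also correct under top-5.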

Lenz, I., Lee, H., & Saxena, A. (2013). Deep Learning for Detecting Robotic Grasps. Retrieved from

Features are learned, and robotic grasps are detected and classified, using a two-step cascaded structure with two deep networks in which the top detections from the first network are re-evaluated by the second. The data used was an extended version of the Cornell grasping unlabeled dataset, which consists of 280 graspable objects (approximately half have graspable handles and half do not). Recognition performance improved significantly (by 9%) using deep learning methods over previous models that relied on hand-engineered features.
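The two-step cascade can be sketched as a shortlist-then-rescore procedure. In the sketch below the two scoring functions are simple placeholders standing in for the two deep networks, and the candidate records and their scores are invented for illustration.

```python
def cascade_detect(candidates, fast_score, accurate_score, keep_top):
    """Shortlist with a cheap scorer, then pick the best with a costlier one."""
    shortlisted = sorted(candidates, key=fast_score, reverse=True)[:keep_top]
    return max(shortlisted, key=accurate_score)


# Hypothetical grasp candidates with precomputed scores from each stage.
candidates = [
    {"id": "a", "coarse": 0.9, "fine": 0.20},
    {"id": "b", "coarse": 0.8, "fine": 0.70},
    {"id": "c", "coarse": 0.3, "fine": 0.99},
    {"id": "d", "coarse": 0.1, "fine": 0.10},
]

best = cascade_detect(
    candidates,
    fast_score=lambda c: c["coarse"],    # stand-in for the first, smaller network
    accurate_score=lambda c: c["fine"],  # stand-in for the second, larger network
    keep_top=2,
)
print(best["id"])  # b
```

The example also shows the trade-off of a cascade: candidate "c" would have scored highest in the second stage but is never evaluated there, because the first stage discarded it.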

Romaszko, L. (2013). A Deep Learning Approach with an Ensemble-Based Neural Network Classifier for Black Box ICML 2013 Contest. Retrieved from

To classify a dataset provided by the organizers of the International Conference on Machine Learning contest, an ensemble of neural network classifiers was trained, with a reduced input vector used to increase accuracy. The solution achieved an accuracy of 67.18%, which placed it in the top 3% of submissions in the contest.
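One common way to combine an ensemble of classifiers is majority voting, sketched below. The annotation does not state how the ensemble members were combined, so majority voting is an assumption here, as are the example predictions.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model class predictions by taking the most common label."""
    n_examples = len(predictions_per_model[0])
    combined = []
    for i in range(n_examples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical predictions from three classifiers on four examples.
model_1 = [0, 1, 2, 1]
model_2 = [0, 2, 2, 1]
model_3 = [1, 1, 2, 0]

print(majority_vote([model_1, model_2, model_3]))  # [0, 1, 2, 1]
```

Each model makes one mistake relative to the combined answer, yet the vote recovers a single consistent prediction for every example.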

Syafeeza, A. R., Khalil-Hani, M., Liew, S. S., & Bakhteri, R. (2014). Convolutional Neural Network for Face Recognition with Pose and Illumination Variation. International Journal of Engineering and Technology, 6(1), 44-57.

A four-layer Convolutional Neural Network (CNN) architecture was used. Rather than using the common LeNet-5 architecture, the authors fused the convolution and subsampling layers to form a simplified version of the CNN. On the AR database, consisting of over 4000 images of 126 people (male and female), the proposed CNN achieved 99.5% recognition accuracy, while tests on the 35-subject Face Recognition Technology (FERET) database achieved an accuracy of 85.13%.

Tishby, N., & Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. arXiv:1503.02406v1 [cs.LG].

It was shown that any deep neural network (DNN) can be quantified by the mutual information between its layers and the input and output variables. This representation was then used to calculate the optimal information theoretic limits of the DNN and to obtain finite sample generalization bounds. A purely information theoretic view of DNNs was proposed, allowing their performance to be quantified, providing a theoretical limit on their efficiency, and giving new finite sample bounds on their generalization.
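Mutual information, the quantity on which this analysis rests, can be computed directly from a joint probability table. The sketch below uses two small made-up distributions and is not taken from the paper; the function name and the tables are assumptions.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits, where joint[x][y] is the probability of the pair (x, y)."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]  # Y tells us nothing about X
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y determines X exactly

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```

In the setting of the paper, X and Y would be a network layer and the input or output variable, so a layer that preserves more of the relevant information has a higher value.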

Valpola, H., & Karhunen, J. (2002). An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models. Neural Computation, 14(11), 2647-2692. MIT Press.

A Bayesian ensemble learning method is introduced for unsupervised extraction of dynamic processes from noisy data generated by an unknown nonlinear mapping from unknown factors. The nonlinear mappings are represented using multilayer perceptron networks. The results showed that the method can learn state spaces of up to approximately 15 dimensions, or blindly extract about 15 source processes, which is fewer than current unsupervised linear techniques such as PCA and ICA can handle. However, because the method is nonlinear, comparisons of results are more difficult to make. Further work is needed to decrease the computational load of the method and to simplify the method itself. Because of the complicated approach, learning can sometimes fail.

Yin, Q., Zhang, J., Zhang, C., & Ji, N. (2014). A Novel Selective Ensemble Algorithm for Imbalanced Data Classification based on Exploratory Undersampling.

To enhance the diversity between individual classifiers through feature extraction, a selective ensemble construction method based on exploratory undersampling was used. Comprehensive comparisons between this method and some state-of-the-art imbalanced learning methods, carried out on 20 real-world imbalanced data sets using a nonparametric statistical test and various evaluation criteria, revealed a significant increase in performance.
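The basic undersampling step behind such ensembles can be sketched as follows: each ensemble member is trained on all minority-class examples plus an equally sized random sample of the majority class. This is plain random undersampling, not the selective method of Yin et al., and the data and function name are assumptions.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build balanced training sets by randomly undersampling the majority class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        rng.sample(majority, len(minority)) + list(minority)
        for _ in range(n_subsets)
    ]


# Hypothetical imbalanced data: 20 majority examples and 4 minority examples.
majority = [("maj", i) for i in range(20)]
minority = [("min", i) for i in range(4)]

subsets = balanced_subsets(majority, minority, n_subsets=3)
for subset in subsets:
    print(len(subset))  # 8 each: 4 sampled majority plus all 4 minority
```

Because each subset draws a different sample of the majority class, the classifiers trained on them see different data, which is one source of the diversity the annotation refers to.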
