Vision Recognition

Final Report

Eric Johnson

Ben Baird

David Dzado


Table of Contents

Executive Summary
Review
Challenges
Source Code
Functional Specification Document
Product Description
Project Requirements
Customer Needs
Interpretation of Customer Needs
Project Specifications
Concept Generation and Selection
Introduction
Body of Facts
Alternatives
Image descriptors SIFT or SURF comparison
Concept Scoring
Summary and Selection
Result


Executive Summary

The overall goal for the semester was to develop a vision recognition system using OpenCV that could take in live video and determine whether the camera had previously seen the current image. The program was implemented in C++ and divided into three major tasks: storing the vocabulary in a database, identifying features using words found in the database, and running the word infrequency algorithm. Together these three tasks form a program, based on the ‘Bag of Words’ concept, that fairly accurately determines whether the current image captured on video has been seen before by the program.

Review

The database is formed from test images collected over the semester. This part of the program takes in an image and runs a feature detection algorithm on it. We determined that the fastest way to accomplish this was to use SURF features and descriptors. The resulting features represent words. The words are then mapped into a descriptor space and clustered with k-means; words that fall closest together in this space are considered close enough to be deemed the same word. This establishes a dictionary of sorts that is used to describe the features in the sampled frames of the video.
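
As a rough illustration of this step, the sketch below builds such a dictionary with OpenCV's BOWKMeansTrainer, the BOW helper mentioned in the BOWdatabase.cpp comments as a possible replacement for our hand-rolled code. The function name buildVocabulary and the path handling are illustrative, not part of the project code; the cluster count mirrors the Nw = 150 used in BOWdatabase.cpp.

    #include "cv.h"
    #include "highgui.h"
    #include <string>
    #include <vector>

    cv::Mat buildVocabulary(const std::vector<std::string>& paths, int numWords)
    {
        cv::SurfFeatureDetector detector;          // same detector as BOWdatabase.cpp
        cv::SurfDescriptorExtractor extractor;
        cv::BOWKMeansTrainer trainer(numWords);    // k-means over the pooled descriptors

        for (size_t i = 0; i < paths.size(); i++) {
            cv::Mat img = cv::imread(paths[i], CV_LOAD_IMAGE_GRAYSCALE);
            if (img.empty()) continue;             // skip unreadable files
            std::vector<cv::KeyPoint> keypoints;
            cv::Mat descriptors;
            detector.detect(img, keypoints);       // SURF features become candidate "words"
            extractor.compute(img, keypoints, descriptors);
            trainer.add(descriptors);
        }
        return trainer.cluster();                  // one row per cluster center ("word")
    }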

At this point, video is taken in by the laptop's webcam. A similar feature detection algorithm is run on the incoming frames. These frames are not stored, and the features found in them are not added to the database. Instead, each feature is labeled according to how closely it resembles the words in the k-means map: we take the center of each cluster and compute the Mahalanobis distance from that center to the feature in question, assigning the feature to the nearest cluster. Once each frame is described in terms of the dictionary words, this information is stored in vectors and handed to the word infrequency algorithm.
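
A minimal sketch of this labeling step, assuming the cluster centers and the inverted covariance matrices produced by BOWdatabase.cpp are already loaded into memory (nearestWord and the variable names are illustrative):

    #include "cv.h"
    #include <float.h>
    #include <vector>

    int nearestWord(const cv::Mat& feature,              // 1 x 64 SURF descriptor
                    const cv::Mat& centers,              // numWords x 64 cluster centers
                    const std::vector<cv::Mat>& invCov)  // inverted covariance per cluster
    {
        int best = 0;
        double bestDist = DBL_MAX;
        for (int j = 0; j < centers.rows; j++) {
            // Mahalanobis distance from the cluster center to the feature in question
            double d = cv::Mahalanobis(centers.row(j), feature, invCov[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best; // the dictionary word this feature is labeled with
    }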

Word infrequency is a measure of which features are most unique to an image. The features most unique to an image describe it best, so they are weighted more heavily in the vision recognition process. All of the features found in every image are counted, along with the total number of images, and a weighted logarithm function (the tf-idf weighting from the Video Google paper) determines how unique a feature is within an image and across the current set of images. These weights are saved in a vector: each vector has one entry per word in the database, recording how often that word shows up in the image in question. Because the vectors share this uniform structure, a dot product of two of them determines whether the images are exact matches, similar, or not matches. The result has a maximum of ‘1’ for an exact match and falls toward ‘0’ as the images become less similar.
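
Assuming the word frequencies (nid/nd) and inverse document frequencies (log(N/ni)) have already been computed as column vectors, as in word_frequency and inverse_doc_frequency below, the comparison reduces to a normalized dot product. A sketch (compareDocuments is an illustrative name, not part of the program):

    #include "cv.h"

    double compareDocuments(const cv::Mat& tf1, const cv::Mat& tf2, const cv::Mat& idf)
    {
        cv::Mat v1 = tf1.mul(idf);        // weight each word's frequency by its rarity
        cv::Mat v2 = tf2.mul(idf);
        double denom = cv::norm(v1) * cv::norm(v2);
        if (denom == 0.0) return 0.0;     // no dictionary words found in one of the images
        return v1.dot(v2) / denom;        // 1 = exact match, near 0 = no match
    }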

The program is able to fairly accurately identify images that it has seen before. Other images with similar features do produce high scores, giving somewhat of a false positive. Looking forward, this can be addressed by establishing thresholds and by tweaking the current program to discriminate against these false positives.

Challenges

One of our biggest challenges throughout the semester was the steep learning curve of OpenCV. Because of the time constraints imposed by the structure of the senior project, we jumped straight into coding and learned OpenCV as we went. This introduced bugs into the program that might have been avoided if we had taken more time to learn the OpenCV library and its data structures.


Source Code

/*
 *  BOWdatabase.h
 *  BOW
 *
 *  Created by Eric Johnson on 3/30/11.
 *  Copyright 2011 All rights reserved.
 *
 */

#include "highgui.h"
#include "cv.h"
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // for sprintf/strcpy
#include <iostream>
#include <vector>

using namespace std;
using namespace cv;

#define ARRAY_SIZE 1024
#define NUM_IMG 299   // The number of images that will be used to create the "dictionary"

void BOWdatabase();

/*
 *  BOWdatabase.cpp
 *  BOW
 *
 *  Created by Eric Johnson on 3/30/11.
 *  Copyright 2011 All rights reserved.
 *
 */

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  All block comments refer directly to the "Video Google: A Text Retrieval Approach to Object Matching
//  in Videos" paper.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

#include "BOWdatabase.h"

void BOWdatabase(){
    char path[NUM_IMG][ARRAY_SIZE];

    // Initialize image data set.
    vector<IplImage*> img (NUM_IMG);

    // Read image paths into the path array. This block of code could be combined with the next set of logic.
    for (int i = 0; i < NUM_IMG; i++){
        char buffer [33];
        sprintf(buffer, "DatabaseFixed/Image_%03d.JPG", i+1);
        strcpy(path[i], buffer);
    }

    /*for (int i = 0; i < NUM_IMG; i++){
        char buffer [33];
        sprintf(buffer, "rect-left%03d.pgm", i+1);
        strcpy(path[i], buffer);
    }*/

    // First data set:
    /*strcpy(path[0],"Picture 36.jpg");
    strcpy(path[1],"Picture 37.jpg");
    strcpy(path[2],"Picture 38.jpg");
    strcpy(path[3],"Picture 39.jpg");
    strcpy(path[4],"Picture 40.jpg");
    strcpy(path[5],"Picture 41.jpg");*/

    // New data set:
    //strcpy(path[6],"Picture 42.jpg");
    //strcpy(path[7],"Picture 43.jpg");
    //strcpy(path[8],"Picture 44.jpg");
    //strcpy(path[9],"Picture 45.jpg");
    //strcpy(path[10],"Picture 46.jpg");
    //strcpy(path[11],"Picture 47.jpg");

    //////////////////////////////////////////////////////////////////////////////////////////////////////////
    //
    //  2. Viewpoint invariant description -- SURF features used with SURF descriptors.
    //     Much of this code could be replaced with OpenCV's BOW library, which is based directly on the
    //     Video Google paper.
    //
    //////////////////////////////////////////////////////////////////////////////////////////////////////////

    // Load image data set (grayscale). This block could probably be combined with the block above.
    for (int i = 0; i < NUM_IMG; i++)
    {
        img[i] = cvLoadImage( path[i], CV_LOAD_IMAGE_GRAYSCALE );
    }

    double t = cvGetTickCount();

    // Detect features. The features can be changed to MSER, SURF, SIFT, etc.
    // by changing the detector declaration.
    printf("Finding features\n");
    vector<vector<KeyPoint> > keypoints(NUM_IMG);
    for (int i = 0; i < NUM_IMG; i++)
    {
        printf("Image %d\n", i);
        SurfFeatureDetector detector;
        detector.detect(img[i], keypoints[i]);
    }

    // Extract SURF descriptors. SIFT descriptors can be used, but Label.cpp will need to be adjusted
    // to work with the 128-element vectors.
    printf("Extracting descriptors\n");
    vector<Mat> descriptors(NUM_IMG);
    for (int i = 0; i < NUM_IMG; i++)
    {
        SurfDescriptorExtractor extractor;
        extractor.compute(img[i], keypoints[i], descriptors[i]);
    }
    t = cvGetTickCount() - t;
    printf( "Descriptors found in %g ms\n", t/((double)cvGetTickFrequency()*1000.) );

    // Create memory matrix.
    int N = NUM_IMG; // number of images in the data set
    int Nw = 150;    // number of clusters in kmeans; this can be changed to adjust and tune the results
    cv::Mat NoF(N,1,CV_32SC1);

    // Create matrix that describes the number of features in each document.
    int Nf = 0;
    for (int i = 0; i < N; i++)
    {
        NoF.row(i) = descriptors[i].rows;
        Nf += descriptors[i].rows;
    }
    printf("Total number of pictures %d\n", N);
    printf("Total number of features %d\n", Nf);

    cv::Mat m_image1(Nf, descriptors[1].cols, CV_32F);
    cv::Mat clusters(Nf, 1, CV_32SC1);
    cv::Mat cluster_center(Nw, descriptors[1].cols, CV_32F);

    // Concatenate descriptor matrices into m_image. This can be improved because of conversion
    // issues between cv::Mat and CvMat.
    CvMat temp_m_image = m_image1;
    int count1 = 0;
    for (int i = 0; i < N; i++) {
        CvMat tempDescriptor = descriptors[i];
        for (int j = 0; j < descriptors[i].rows; j++) {
            for (int k = 0; k < descriptors[i].cols; k++) {
                *( (float*)CV_MAT_ELEM_PTR( temp_m_image, count1, k) ) = CV_MAT_ELEM(tempDescriptor, float, j, k);
            }
            count1++;
        }
    }

    // Convert back to cv::Mat.
    cv::Mat m_image(&temp_m_image, true);

    //////////////////////////////////////////////////////////////////////////////////////////////////////////
    //
    //  3. Building a visual vocabulary -- Here we create a database for the features we find in the images
    //     using kmeans. One matrix is output with the cluster label of each feature, and another with the
    //     center point of each cluster.
    //
    //////////////////////////////////////////////////////////////////////////////////////////////////////////

    // Perform kmeans. This is also a point of adjustment, using different parameters for the clusters.
    printf("Performing kmeans\n");
    cv::kmeans (m_image, Nw, clusters, cv::TermCriteria (), 4, cv::KMEANS_RANDOM_CENTERS, &cluster_center);

    // Write out matrices from kmeans. These come out in XML format.
    printf("Writing cluster and center_cluster matrices\n");
    CvMat desc = descriptors[0];
    desc = clusters;
    CvFileStorage* fs = cvOpenFileStorage("clusters.xml", 0, CV_STORAGE_WRITE);
    cvWrite(fs, "kmeans_clusters", &desc);
    cvReleaseFileStorage( &fs );
    desc = cluster_center;
    fs = cvOpenFileStorage("cluster_center.xml", 0, CV_STORAGE_WRITE);
    cvWrite(fs, "kmeans_cluster_centers", &desc);
    cvReleaseFileStorage( &fs );

    // Create a separate matrix for each cluster. This is to help calculate the covariance.
    vector<CvMat*> sampleMatrices (Nw);
    for (int i = 0; i < Nw; i++){
        int count2 = 0;
        for (int j = 0; j < clusters.rows; j++) {
            if (clusters.at<int>(j,0) == i) {
                count2++;
            }
        }
        sampleMatrices[i] = cvCreateMat(count2, cluster_center.cols, CV_32F);
    }
    for (int i = 0; i < Nw; i++){
        int count2 = 0;
        for (int j = 0; j < clusters.rows; j++){
            if (clusters.at<int>(j,0) == i) {
                for (int k = 0; k < m_image.cols; k++) {
                    *( (float*)CV_MAT_ELEM_PTR( *sampleMatrices[i], count2, k) ) = CV_MAT_ELEM(temp_m_image, float, j, k);
                }
                count2++;
            }
        }
    }

    //////////////////////////////////////////////////////////////////////////////////////////////////////////
    //
    //  4. Calculating the covariance matrix. This is for future use by Label.cpp to calculate Mahalanobis
    //     distance. It is done here to save on calculation time and computing resources.
    //
    //////////////////////////////////////////////////////////////////////////////////////////////////////////

    // Calculate the inverse covariance of each cluster group.
    // Outputs a vector of Nw matrices, each a square matrix of the size of the descriptors.
    printf("Calculating covariance of each cluster\n");
    vector<cv::Mat> covarianceMatrices(Nw);
    for (int i = 0; i < Nw; i++) {
        covarianceMatrices[i] = cv::Mat::eye(m_image.cols, m_image.cols, CV_32F);
        if (sampleMatrices[i]->rows > 0) {
            cv::Mat samples(sampleMatrices[i], true);
            cv::Mat center(cluster_center.row(i));
            calcCovarMatrix(samples, covarianceMatrices[i], center, CV_COVAR_USE_AVG + CV_COVAR_ROWS + CV_COVAR_NORMAL, CV_32F);
        }
    }
    printf("Writing covariance file\n");
    fs = cvOpenFileStorage("covariance_inverted.xml", 0, CV_STORAGE_WRITE);
    for (int i = 0; i < Nw; i++) {
        cv::Mat temp = covarianceMatrices[i].inv(CV_SVD);
        desc = temp;
        char buffer [33];
        sprintf(buffer, "Cluster_%d", i);
        cvWrite(fs, buffer, &desc);
        cvStartNextStream(fs);
    }
    cvReleaseFileStorage( &fs );

    for (int i = 0; i < N; i++) {
        cvReleaseImage(&img[i]);
    }
}


/*
 *  Label.h
 *  label
 *
 *  Created by Ben Baird on 3/24/11.
 *  Copyright 2011 BYU. All rights reserved.
 *
 */

#include "highgui.h"
#include "cv.h"
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <algorithm>  // for min_element
#include <time.h>
#include <math.h>

using namespace std;
using namespace cv;

#define NUM_WORDS 150 // The number of words in the "dictionary"
#define SURF 1        // These were going to be used to dynamically choose feature detectors
#define SIFT 2
#define STAR 4

cv::Mat Label(IplImage* test_img);

/*
 *  Label.cpp
 *  label
 *
 *  Created by Ben Baird on 3/24/11.
 *  Copyright 2011 BYU. All rights reserved.
 *
 */

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  This function takes in a new image and finds which features it has, based on the features we have seen
//  in other images. A lot of cleanup can be done; most of the code is straightforward. One improvement would
//  be placing the covariance_inverted.xml file into memory once, since it is read every time this function
//  is called. If improvements can be made in the speed of calculating the actual distance, make them, since
//  that is the weakest link of the function.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

#include "Label.h"

cv::Mat Label(IplImage* test_img)
{
    // Retrieve the covariance matrices from an XML file.
    clock_t start_program = clock();
    CvFileStorage* fs_cm = cvOpenFileStorage("covariance_inverted.xml", 0, CV_STORAGE_READ); // TAKES FOREVER, only run once
    cout << "Loading time: " << (clock() - start_program)/(double)CLOCKS_PER_SEC << endl;

    vector<CvMat*> cov_mat(NUM_WORDS);
    int count(0);
    clock_t Extract_time = clock();
    for (int i = 0; i < NUM_WORDS; i++)
    {
        count++;
        char buffer [33];
        sprintf(buffer, "Cluster_%d", i); // bug fix: was "Cluster_%d",1, which read the same cluster every time
        cov_mat[i] = (CvMat*) cvReadByName(fs_cm, NULL, buffer);
    }
    cout << "Time to extract matrices: " << (clock() - Extract_time)/(double)CLOCKS_PER_SEC << endl;

    // Retrieve the cluster centers from an XML file.
    clock_t Extract_centers = clock();
    CvFileStorage* fs_cc = cvOpenFileStorage("cluster_center.xml", 0, CV_STORAGE_READ);
    CvMat* cluster_centers = (CvMat*) cvReadByName(fs_cc, NULL, "kmeans_cluster_centers");
    cout << "Time to extract centers: " << (clock() - Extract_centers)/(double)CLOCKS_PER_SEC << endl;
    cvReleaseFileStorage(&fs_cm);
    cvReleaseFileStorage(&fs_cc);

    // Extract SURF features. The detector must match the one used to build the database.
    vector<KeyPoint> keypoints;
    SurfFeatureDetector detector; // was MserFeatureDetector, which did not match the SURF-based dictionary
    detector.detect(test_img, keypoints);
    cv::Mat descriptors;
    SurfDescriptorExtractor extractor;
    extractor.compute(test_img, keypoints, descriptors);
    int Num_features = descriptors.rows;

    cv::Mat cluster_cent(cluster_centers, true);
    vector<double> distance(NUM_WORDS);
    //vector<int> feature_cluster(Num_features);
    cv::Mat feature_cluster(Num_features, 1, CV_32FC1);
    // cv::Mat t2(64,1,CV_32FC1);
    // cv::Mat t3(1,1, CV_32FC1);
    // cv::Mat t4(1,1, CV_32FC1);

    clock_t Mahalanobis_time = clock();
    for (int i = 0; i < Num_features; i++)
    {
        for (int j = 0; j < NUM_WORDS; j++)
        {
            cv::Mat cov_matrix(cov_mat[j], true);

            // Calculate Mahalanobis by hand:
            //   cv::Mat t1(1,64,CV_32FC1);
            //   subtract(cluster_cent.row(j), descriptors.row(i), t1);
            //   t2 = t1*cov_matrix;
            //   t3 = t2*t1.t();
            //   sqrt(t3.row(0), t4);
            //   distance[j] = t4.at<float>(0,0);
            // Or use the built-in Mahalanobis function:
            distance[j] = Mahalanobis( cluster_cent.row(j), descriptors.row(i), cov_matrix);
            //distance[j] = rand();
            //cout << distance[j] << endl;
        }
        int min_loc = min_element(distance.begin(), distance.end()) - distance.begin();
        feature_cluster.row(i) = min_loc;
        //cout << "Location: " << min_loc << " Value: " << distance[min_loc] << endl;
    }
    cout << "Time to calculate Mahalanobis distance: " << (clock() - Mahalanobis_time)/(double)CLOCKS_PER_SEC << endl;

    // for (int i = 0; i < Num_features; i++)
    // {
    //     cout << feature_cluster.row(i) << endl;
    // }

    return (feature_cluster);
}

/*
 *  main.cpp
 *  BOW
 *
 *  Created by Eric Johnson on 3/30/11.
 *  Copyright 2011 All rights reserved.
 *
 */

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  All block comments refer directly to the "Video Google: A Text Retrieval Approach to Object Matching
//  in Videos" paper.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

#include "BOWdatabase.h"
#include "Label.h"
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <time.h>
#include <math.h>

#define TOT_DATABASE 6
#define NUM_TEST 6

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  inverse_doc_frequency calculates the log(N/ni) expression of the document vector described in the
//  Video Google paper. The function outputs a new matrix every time it is called. It should be called
//  every time a new image is retrieved for comparison.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

cv::Mat inverse_doc_frequency(cv::Mat inverse_doc_freq, cv::Mat Ni, int count){
    // It probably isn't necessary to read in the clusters XML file any more.
    // This was based on the assumption that log(N/ni) needed information from the "dictionary".
    CvFileStorage* fs = cvOpenFileStorage("clusters.xml", 0, CV_STORAGE_READ);
    CvMat* clusters_temp = (CvMat*) cvReadByName(fs, 0, "kmeans_clusters");
    cvReleaseFileStorage(&fs);
    cv::Mat clusters(clusters_temp, true);

    // This can most likely be ignored. It was used for testing purposes.
    /*cv::Mat Ni = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < clusters.rows; j++) {
            if (clusters.at<int>(j,0) == i) {
                Ni.row(i) = Ni.row(i) + 1;
            }
        }
    }*/

    for (int i = 0; i < NUM_WORDS; i++) {
        if (Ni.at<float>(i,0) > 0) {
            double cc = double(count)/double(Ni.at<float>(i,0));
            inverse_doc_freq.row(i) = log(cc);
        }
    }
    return inverse_doc_freq;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  The word_frequency function calculates the nid/nd expression of the document vector described in the
//  Video Google paper. It takes in a new document vector and outputs the frequency of features in that
//  image/document.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

cv::Mat word_frequency(cv::Mat vi){
    cv::Mat word_freq = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    cv::Mat nid = cv::Mat::zeros(NUM_WORDS, 1, CV_32F);
    double nd = (double) vi.rows;
    for (int i = 0; i < NUM_WORDS; i++) {
        for (int j = 0; j < vi.rows; j++) {
            if (vi.at<float>(j,0) == i) {
                nid.row(i) = nid.row(i) + 1;
            }
        }
    }
    for (int i = 0; i < NUM_WORDS; i++) {
        if (nid.at<float>(i,0) > 0) {
            word_freq.row(i) = double(nid.at<float>(i,0))/nd;
        }
    }
    return word_freq;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
//  This was to be a function that would do the rest of the math when comparing images. Similar code is
//  located in the main function. Eventually, the code in the main function should be replaced by this
//  function for better coding style and portability.
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////////

void compare_images(IplImage* img, vector<cv::Mat> word_freq, vector<cv::Mat> vdoc, cv::Mat inverse_doc_freq){
    cv::Mat vi = Label(img);
    word_freq.push_back(word_frequency(vi));