Food Label Data Collection Using OCR

Royce Nobles, UNCW CSC592 Pattern Recognition, Spring 2004


1.1 Motivation for Food Label Data Collection Using OCR

The difficulty of nutrient data collection is a major issue facing the nutrient analysis community. Nutrient data collection is generally undertaken by researchers from the United States Department of Agriculture’s Nutrient Data Laboratory and by major nutrient analysis software companies. The Nutrient Data Laboratory periodically releases its findings in the form of a database known as the National Nutrient Database for Standard Reference (NNDSR), which is freely available to the general public. The heart of the problem lies in the frequency with which the NNDSR is released and in the completeness and resolution of the database. The most current release as of April 26, 2004, NNDSR16-1, contains fewer than 7,000 foods and is not likely to be updated within the next year. A further difficulty is that values for specific foods are often averaged and presented as categories, which can be quite general.

This poses a major problem for nutrient analysis software companies, as end users demand that nutrient analysis software products contain a broad range of specific food items. As a result, these companies must collect nutrient data for foods not specifically covered in the most current NNDSR version. The process of collecting and categorizing foods and their corresponding nutrient data currently involves individuals who manually enter nutrient data from Food Labels into a food database. This method of data collection has major drawbacks, including high labor costs, the introduction of human data entry errors, and the relatively slow speed at which data can be collected manually. The focus of this project is to automate this process, thereby reducing the negative impact of these issues.

1.2 Specific Aim And Implementation

The specific aim of this project is to develop an Optical Character Recognition system to automate the process of collecting nutrient data from standard Nutrient Facts Food Label images. The system is divided into five discernible components, each of which is discussed in detail. A brief description of each component is listed below.

1. Optical Sensing – A computer-attached optical scanner is used to collect an image from a Nutrient Facts Food Label. The image is then filtered and prepared for segmentation and grouping.

2. Segmentation and Grouping – The image is first segmented into horizontal rows of pixel data. Each row is then further divided into specific words which correspond to classes.

3. Feature Extraction – Specific features are extracted from each word and prepared for classification.

4. Classification – Individual words are compared with template data using documented pattern recognition techniques and classified.

5. Post Processing – According to the recommendation of the classifier, nutrients and their values are united and stored in a food database.


2.1 Collection of Sample Data

An initial set of eight Nutrient Facts Food Labels ranging in height from 399 to 794 pixels was selected for use as sample data. These labels were chosen primarily because of similarities in their font sizes and weights, and because they contain a good sampling of the set of nutrients displayed on Food Labels. Each label consists of a light-colored background, dark blue or black text, and a dark border. Food Labels are stored as 256-color GIF images, allowing them to be opened and manipulated easily.

2.2 Sensing Procedures

The Food Label image is opened and stored as an object containing an image and the label height and width in pixels. The Food Label object is then prepared for segmentation by an image filter class consisting of a border filter, a Prewitt edge detector, and a median filter. Sobel and Laplacian edge detectors were also tested in place of the Prewitt edge detector but yielded significantly less helpful results.

The border filter is a simple set of procedures that removes any existing borders from the image to simplify the segmentation process. It begins by scanning inward from the top edge of the image for a horizontal region of high pixel intensity. Once such a region is located near the top of the image, it is marked as the lower boundary of the border and all pixels above that region are converted to white. The same basic procedure is used to locate and erase the bottom, left, and right borders.
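The border filter described above can be sketched as follows. This is a minimal illustration and not the project's actual code: it assumes the image is a list of rows of 8-bit intensities (255 is white), treats a row as "high intensity" when its mean reaches a hypothetical threshold of 200, and handles the four sides by rotating the image so that each border in turn becomes the top edge. The function names are invented for the example.

```python
WHITE = 255

def clear_top_border(image, threshold=200):
    """Whiten every row above the first mostly-light row (assumed end of the border)."""
    for r, row in enumerate(image):
        if sum(row) / len(row) >= threshold:
            break
    else:
        return image  # no light row found, so leave the image unchanged
    width = len(image[0])
    return [[WHITE] * width for _ in range(r)] + [list(row) for row in image[r:]]

def rotate(image):
    """Rotate the image 90 degrees clockwise (the old left column becomes the top row)."""
    return [list(col) for col in zip(*image[::-1])]

def remove_borders(image, threshold=200):
    """Clear the top, left, bottom, and right borders in turn.

    Four clear-and-rotate passes bring the image back to its original orientation.
    """
    for _ in range(4):
        image = rotate(clear_top_border(image, threshold))
    return image
```

A mean-based test is used here because a row just inside the top border still contains the dark pixels of the left and right borders; the threshold would need tuning on real scans.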

Figure 1 – Shows the set of two 3x3 convolution masks used by the Prewitt edge detector.

Once any borders have been removed from the image, a Prewitt edge detector is used to reduce noise and improve the system's ability to locate related regions of low pixel intensity. The Prewitt edge detector is designed to respond to edges of contrast running vertically and horizontally relative to the pixel grid, with one mask for each orientation. The primary benefit of choosing this algorithm is that it partially fills the space between individual characters in words while reducing pixelation and maintaining word spacing.
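For reference, the two standard Prewitt masks and their application can be sketched as below. The masks themselves are the textbook ones; how the two responses are combined is not stated in this report, so the sum of absolute values, the clipping to 255, and the inversion that keeps edges dark on a white background are all assumptions made for this example.

```python
# Standard 3x3 Prewitt masks: one responds to vertical edges, one to horizontal edges.
PREWITT_X = ((-1, 0, 1), (-1, 0, 1), (-1, 0, 1))
PREWITT_Y = ((-1, -1, -1), (0, 0, 0), (1, 1, 1))

def apply_mask(image, r, c, mask):
    """Weighted sum of the 3x3 neighbourhood centred on (r, c).

    The mask is not flipped; since only the absolute response is used,
    this gives the same result as a true convolution.
    """
    return sum(mask[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))

def prewitt(image):
    """Return an edge image in which strong edges are dark (0) and flat areas white (255)."""
    h, w = len(image), len(image[0])
    out = [[255] * w for _ in range(h)]  # the one-pixel frame is left white
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply_mask(image, r, c, PREWITT_X)
            gy = apply_mask(image, r, c, PREWITT_Y)
            out[r][c] = 255 - min(255, abs(gx) + abs(gy))
    return out
```

Because each mask spans three pixels, an edge is marked on both sides of the transition, which is one way to see why narrow gaps between adjacent characters tend to be filled in.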

A 3x3 median filter is used to remove any residual noise left behind by the Prewitt edge detector. The median filter is used because it is very effective at removing speckling, a common side effect of optical scanning, without the loss of image quality caused by other commonly used filters such as low-pass and mean filters. The 3x3 mask was chosen over the more common 5x5 mask to preserve the delicate edges associated with alphanumeric characters.
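A 3x3 median filter is short enough to show in full. The sketch below assumes the same list-of-rows image representation used in this report's other examples and leaves the outermost pixel frame unchanged, which is one of several reasonable ways to treat the image boundary.

```python
def median3x3(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = [list(row) for row in image]  # copy, so the input is not modified
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [image[r + i][c + j] for i in (-1, 0, 1) for j in (-1, 0, 1)]
            out[r][c] = sorted(window)[4]  # fifth of nine sorted values is the median
    return out
```

An isolated dark speck is outvoted by its eight light neighbours and disappears, while a straight edge survives because at least five of the nine pixels on either side of it agree.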

The result of filtering a Nutrient Facts Food Label image with the techniques described above is illustrated in Figure 2. The contrast between the foreground text and background is greatly enhanced, and nearly all noise generated by scanning has been removed. Also notice that the white space between characters has been greatly reduced, thereby decreasing the likelihood that words will be incorrectly segmented internally.

Figure 2 - Shows a section of Sample Image 1 before filtering (left), and after filtering (right).


The image is first divided into rows of data by locating horizontal areas of high pixel intensity separating horizontal areas of low pixel intensity. The low pixel intensity areas are considered to be rows of potentially valuable text and are collected for further segmentation. Each row is stored as a list containing a one-dimensional array of pixel intensity values and the height and width of the row. Rows measuring fewer than eight pixels in height are discarded, based on the observation that actual rows of text range from fifteen to seventeen pixels in height. These narrow rows are commonly residual noise left behind after filtration or the horizontal dividing lines between rows containing nutrient data.
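The row segmentation step can be illustrated with a short sketch. It assumes a scan line belongs to a text row if it contains at least one pixel darker than a hypothetical threshold of 128; the eight-pixel minimum height comes from the text above, while the dictionary layout and the function name are invented for the example.

```python
def segment_rows(image, dark_threshold=128, min_height=8):
    """Split an image into horizontal bands of text separated by light scan lines."""
    width = len(image[0])
    rows, start = [], None
    # A white sentinel line is appended so that a band touching the bottom edge is closed.
    for y, line in enumerate(image + [[255] * width]):
        has_text = min(line) < dark_threshold
        if has_text and start is None:
            start = y                      # first scan line of a new band
        elif not has_text and start is not None:
            if y - start >= min_height:    # discard thin bands (noise, divider lines)
                band = image[start:y]
                rows.append({
                    "top": start,
                    "height": y - start,
                    "width": width,
                    "pixels": [v for ln in band for v in ln],  # one-dimensional array
                })
            start = None
    return rows
```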

Each row is then divided into words based on vertical columns of high pixel intensity separating areas of lower pixel intensity. These words are treated as specific classes by the system and are stored as word objects with properties including a pixel intensity array, width, height, row number and the order of occurrence within the row.
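Word segmentation within a row follows the same idea in the vertical direction. In the sketch below, the Word class mirrors the properties listed above, but its field names, the dark threshold, and the min_gap parameter (the number of consecutive light columns required to end a word) are assumptions introduced for illustration; min_gap is expected to be at least one.

```python
from dataclasses import dataclass

@dataclass
class Word:
    pixels: list   # two-dimensional list, height x width
    width: int
    height: int
    row: int       # row number on the label
    order: int     # position of the word within its row

def segment_words(band, row_number, dark_threshold=128, min_gap=3):
    """Split one row band (list of scan lines) into words at runs of light columns."""
    height, width = len(band), len(band[0])
    dark = [any(band[y][x] < dark_threshold for y in range(height))
            for x in range(width)]
    spans, start, gap = [], None, 0
    # Trailing light columns are appended so a word touching the right edge is closed.
    for x, is_dark in enumerate(dark + [False] * min_gap):
        if is_dark:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))  # end index is exclusive
                start, gap = None, 0
    return [Word([ln[a:b] for ln in band], b - a, height, row_number, i)
            for i, (a, b) in enumerate(spans)]
```

Requiring several light columns before closing a word is one way to tolerate the small gaps that remain between characters after filtering.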


Features used for classification are extracted from the pixel data of each word. In this system, the power spectrum consisting of eight normalized measurements is collected as the primary feature. This is generated by collecting a 64x16 pixel sample from the upper left corner of the word. Words narrower than 64 pixels are padded with white space to fill 64 pixels, while wider words are truncated. As demonstrated in Figure 3, the percentage of the window filled by each word is quite variable.

Figure 3 - Shows the 64x16 pixel data sample window with collected words.
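The padding and truncation rule for the sample window can be sketched as follows. The text above only describes the horizontal case; padding words shorter than sixteen pixels with white rows and cropping taller ones is an assumption made here so that the window always has the same size.

```python
def sample_window(word_pixels, win_w=64, win_h=16, white=255):
    """Copy the upper left corner of a word into a fixed win_w x win_h window.

    Narrow or short words are padded with white; wide or tall words are truncated.
    """
    window = []
    for y in range(win_h):
        src = word_pixels[y] if y < len(word_pixels) else []
        line = list(src[:win_w])
        line += [white] * (win_w - len(line))
        window.append(line)
    return window
```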

A two-dimensional Fourier Transform is then applied to a complex number representation of the 64x16 pixel data window. The powers are collected by summing the squares of the real and imaginary components of the transform data and stored in a 64x16 array as shown in Figure 4. The positive harmonics are collected from the upper left 32x8 section of the array and divided into eight 8x4 feature bins. The values in each bin are summed and normalized by the maximum component to yield the normalized power spectrum with values ranging from zero to one.

Figure 4 - Shows the eight 8x4 power spectrum bins collected from the spatial frequency information.
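The feature computation can be sketched in plain Python as below. A direct (slow) discrete Fourier transform is used instead of an FFT to keep the example self-contained; for a 64x16 window this is still fast. Two details are assumptions: the bins are ordered left to right and then top to bottom, and "normalized by the maximum component" is read as dividing every bin sum by the largest bin sum.

```python
import cmath

def dft(seq):
    """Direct discrete Fourier transform of a one-dimensional sequence."""
    n = len(seq)
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    return [sum(seq[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]

def power_spectrum_bins(window):
    """Return eight normalized power sums from a 16-row by 64-column window."""
    rows = [dft(line) for line in window]            # transform along x
    cols = [dft(list(col)) for col in zip(*rows)]    # then along y; indexed cols[u][v]
    # Power is the squared real part plus the squared imaginary part.
    power = [[cols[u][v].real ** 2 + cols[u][v].imag ** 2
              for u in range(len(cols))] for v in range(len(window))]
    bins = []
    for by in range(2):          # two bins down: v in 0..3 and 4..7
        for bx in range(4):      # four bins across: u in 0..7, 8..15, 16..23, 24..31
            bins.append(sum(power[v][u]
                            for v in range(by * 4, by * 4 + 4)
                            for u in range(bx * 8, bx * 8 + 8)))
    top = max(bins)
    return [b / top for b in bins] if top > 0 else bins
```

Note that the first bin contains the zero-frequency term, which is proportional to the squared mean brightness of the window, so under this reading it will often be the largest bin for mostly white windows.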


5.1 The Standard Template

Several attempts at creating a template for use with the different classification techniques failed early on. The most notable problem occurred with the first template which consisted of slightly modified versions of each nutrient found on Sample Label 1. The lack of variance demonstrated by the template resulted in the failure of the Bayesian classifier and greatly degraded the performance of the Feed Forward Back Propagation Neural Network when attempting to classify sample labels 2 through 7.

These issues gave rise to the creation of the standard template which consists of every nutrient name found on each of five Food Labels chosen from the original set of eight. The five labels used to create the standard template were chosen because they exhibited font size and color patterns that appeared to be representative of the complete set of eight labels. Once created, the standard template was processed and filtered by the same system used to collect features from the sample labels to ensure the template data would accurately reflect the sample data.

A value key consisting of a five-digit binary number for every word found on the standard template was created as a means of assigning values for the classification of collected words. The key values for each word range from zero to seventeen and are assigned in alphabetical order. Figure 5 shows the complete binary key set as well as a small portion of the standard template.

Figure 5 - Shows the binary key for the class values (left) and a sample of the standard template (right).
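The key assignment amounts to numbering the alphabetically sorted words and writing each number in five binary digits. A minimal sketch follows; the three words used in the usage note are placeholders rather than the actual template vocabulary, and the decode helper, which rounds classifier outputs back to a class number, is an illustrative addition.

```python
def build_key(words):
    """Map each distinct word to a list of five binary digits, in alphabetical order."""
    ordered = sorted(set(words), key=str.lower)
    return {w: [int(bit) for bit in format(i, "05b")] for i, w in enumerate(ordered)}

def decode(outputs):
    """Round five values in [0, 1] to bits and return the class number they encode."""
    return int("".join(str(round(v)) for v in outputs), 2)
```

For example, build_key(["Sodium", "Calories", "Fat"]) assigns Calories the key 00000, Fat the key 00001, and Sodium the key 00010. Five digits can represent 32 values, so the eighteen classes (zero through seventeen) leave fourteen codes unused.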

5.2 The Bayesian Classifier

The a priori data for the Bayesian Classifier is collected from the eight power spectrum bin values associated with each word from the standard template. The bin values constitute class exemplars which are used as input vectors to calculate class means. The class mean is then subtracted from each input vector for the class, and the resulting difference vector is multiplied by its transpose to create a matrix. These matrices are then added together and multiplied by the reciprocal of the number of exemplars to yield a covariance matrix. The determinant and the inverse of the covariance matrix are then calculated. From there the Bayesian formula is used to create a discriminant function for the class. The discriminant function for each class is applied to the bin value input vector of the word object being classified. The class whose discriminant function returns the largest value is then chosen as the match.
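The steps above can be written compactly. The notation is introduced here for illustration and is not taken from the project: x_i^{(c)} is the i-th exemplar vector of class c, N_c is the number of exemplars, and P(\omega_c) is the prior probability of the class. The discriminant shown is the usual one for a Gaussian class-conditional density, which is assumed to be the form intended by "the Bayesian formula".

```latex
\mu_c = \frac{1}{N_c} \sum_{i=1}^{N_c} x_i^{(c)}
\qquad
\Sigma_c = \frac{1}{N_c} \sum_{i=1}^{N_c}
           \bigl(x_i^{(c)} - \mu_c\bigr)\bigl(x_i^{(c)} - \mu_c\bigr)^{T}

g_c(x) = -\tfrac{1}{2}\,(x - \mu_c)^{T}\,\Sigma_c^{-1}\,(x - \mu_c)
         \;-\; \tfrac{1}{2}\ln\lvert\Sigma_c\rvert
         \;+\; \ln P(\omega_c)

\hat{c} = \arg\max_{c}\; g_c(x)
```

With eight features, \Sigma_c is an 8x8 matrix whose rank is at most N_c - 1, so it cannot be inverted unless a class has at least nine sufficiently varied exemplars. This may be one way to understand why a template with very little variance caused the classifier to fail.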

The first attempted implementation of the Bayesian classifier failed due to a lack of variability in the original template as mentioned in Section 5.1. The Bayesian classifier now executes correctly with the standard template, but statistics have not yet been compiled to indicate the accuracy of the predictions.

5.3 The Feed Forward Back Propagation Neural Network

The Feed Forward Back Propagation Neural Network is composed of eight input neurons corresponding to the eight power spectrum bin values, one hidden layer with fifty neurons, and five output neurons corresponding to the five binary digits of the class key values. A sigmoid response function is used with a learning rate of one to calculate connection weights between neurons, and each neuron is assigned a bias of one. The network was trained on 200 patterns per session randomly chosen from the template until a Total Sum Square Error of 0.001 was reached. The network typically trained successfully in 30,000 to 40,000 sessions.
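A network with this layout can be sketched in plain Python as shown below. It is an illustrative reconstruction, not the project's code: the initial weight range, the fixed random seed, and the per-pattern (online) weight update are assumptions, and the "bias of one" is interpreted as a constant input of 1 with a trainable weight on every hidden and output neuron.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Network:
    def __init__(self, n_in=8, n_hidden=50, n_out=5, rate=1.0, seed=0):
        rng = random.Random(seed)
        # Each weight row has one extra entry: the weight on the constant bias input.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]
        self.rate = rate

    def forward(self, x):
        """Return the biased input, the biased hidden activations, and the outputs."""
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return xb, hb, out

    def train(self, x, target):
        """One back propagation step; returns the sum square error before the update."""
        xb, hb, out = self.forward(x)
        # Output deltas: error times the derivative of the sigmoid.
        d_out = [(t - o) * o * (1 - o) for t, o in zip(target, out)]
        # Hidden deltas: output deltas propagated back through the output weights.
        d_hid = [h * (1 - h) * sum(d * row[j] for d, row in zip(d_out, self.w_out))
                 for j, h in enumerate(hb[:-1])]
        for row, d in zip(self.w_out, d_out):
            for j, v in enumerate(hb):
                row[j] += self.rate * d * v
        for row, d in zip(self.w_hidden, d_hid):
            for j, v in enumerate(xb):
                row[j] += self.rate * d * v
        return sum((t - o) ** 2 for t, o in zip(target, out))
```

In use, train would be called once for each of the 200 randomly chosen template patterns in a session, with the per-pattern errors summed to give the Total Sum Square Error that is compared against the 0.001 stopping criterion.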

Results with the Feed Forward Back Propagation Neural Network are encouraging, as it correctly classifies 93.40% of words when false positives generated by the collection of non-features during feature extraction are omitted. When those false positives are counted together with the actual features classified, the percentage of correct classification is somewhat lower.


Post processing has not been implemented in the system at this point. The major goals of post processing will be to link the features collected to form nutrients with corresponding values and to check for obvious errors generated during classification. The row and order of classified words will be used to link them together, forming complete nutrient names. Words that are combined to form non-existent nutrient names can be flagged as classification errors.

Figure 6 – Shows an example of constructing full nutrient names from words.
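Since post processing is not yet implemented, the following is only a sketch of how the linking step might work. It assumes each classified word is available as a (row, order, word) tuple and that a set of valid nutrient names exists; both the data layout and the example names in the usage note are hypothetical.

```python
from itertools import groupby

def link_words(classified, known_names):
    """Join the words of each row in order and flag names that are not recognized.

    classified is an iterable of (row, order, word) tuples.
    Returns a list of (row, name, is_known) tuples sorted by row.
    """
    results = []
    ordered = sorted(classified)  # sorts by row first, then by order within the row
    for row, group in groupby(ordered, key=lambda item: item[0]):
        name = " ".join(word for _, _, word in group)
        results.append((row, name, name in known_names))
    return results
```

For instance, with known names {"Total Fat", "Sodium"}, the words "Total" and "Fat" found in positions 0 and 1 of the same row are joined into "Total Fat" and accepted, while a row classified as "Fat" followed by "Total" would be flagged as a probable classification error.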


Much work remains to be done in several areas of the project. The key topics for exploration include better methods for determining which features are relevant to extract from the labels, a separate classification system for identifying nutrient values, scaling methods to allow a greater size range of labels to be classified, experimentation with other classifiers such as ART-2 and Kohonen networks, and post processing methods for error checking and data storage. Experimental results thus far are encouraging, and further investigation of the issues mentioned above will most likely yield improved results.