Physiological Data Analysis
J. E. Mott and R. M. Pipke
SmartSignal Corporation
901 Warrenville Road, Suite 300
Lisle, IL 60532
208.522.6656
17 June 2004
Introduction
Discussed here are the analyses of data acquired from a variety of subjects wearing a BodyMedia SenseWear Pro armband. The data, which are recorded every minute, are divided into identified sessions that begin when the armband is put on and end when it is removed. Two data sets are analyzed: a training set comprising almost 10,000 hours involving 18 users and a test set comprising a little more than 12,000 hours involving an unknown number of users. In both data sets information was recorded at one-minute intervals for 13 fields: 2 user characteristics, 9 armband sensor values, the session number, and the session time. The training set contains 3 additional fields: an annotation number indicating the activity being performed, a gender number, and a user identification number.
Analyses of the training data set are used to determine relations between subsets of 11 independent variables – the 2 user characteristics and 9 armband sensor values – and the gender number, annotation 3004, and annotation 5102, both for individual records and for blocks of records. The analyses are entirely data-driven, with no knowledge of the meanings of the user characteristics, armband sensor values, gender numbers, or activities indicated by the annotation numbers.
Method
The modeling method we use is one of a class of empirical methods that rely on historical data to develop and refine a model. Neural nets and several classical optimization methods are in this class. Assume there are reference column vectors refX_i, each with m variable values, that are collected at i = 1 to n different times during the operation of a system. The variable values may be values from multiple sensors, or may be sets of features. The different times should encompass all the operating states of the system. The historical data are a collection of these n vectors that comprise a data matrix H with m rows and n columns. Typically n is much greater than m. Now assume that there is a new observation of the system contained in the vector newX_j. The model newY_j of newX_j is a vector function of H and newX_j as follows:
newY_j = F(H, newX_j)    (1)
Note that this is an asynchronous method that does not depend on any time series, i.e. the model newY_j of newX_j is based only on the single instance of the pattern of elements in newX_j. Equation 1 is quite general and describes several different modeling methods that employ different vector functions F(H, newX_j).
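To fix notation for the sketches that follow, the Python fragment below (ours, not the paper's; all names are assumptions) shows the shape of the structures in equation 1: H collects the reference vectors refX_i as columns, and any concrete model is a function of H and a new observation.

```python
# Minimal sketch of the data structures in equation 1 (names are ours).
import numpy as np

m, n = 11, 3000                    # e.g. 11 variables, ~3,000 reference vectors
rng = np.random.default_rng(0)
H = rng.normal(size=(m, n))        # columns are the reference vectors refX_i

def model(H: np.ndarray, new_x: np.ndarray) -> np.ndarray:
    """Equation 1: newY_j = F(H, newX_j). Equations 2 and 3 below
    give two concrete choices of F."""
    raise NotImplementedError      # placeholder for a concrete F
```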
In one classical method, for a given newX_j, not all the refX_i in the H matrix are used. Only a local D matrix is used, containing the subset of refX_i that are closest to newX_j by a Euclidean distance measure. This subset must be chosen such that the D matrix has substantially more rows than columns. In this case newY_j is given by
newY_j = D (D^T D)^{-1} D^T newX_j    (2)
Equation 2 arises from the assumption that newY_j is a linear combination of the refX_i. The linear combination coefficients, given by the term (D^T D)^{-1} D^T newX_j, may be derived by minimizing the square of the Euclidean distance between newY_j and newX_j. Because of this minimization, newY_j may be distorted by outlying or faulted elements in newX_j.
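As an illustration, a minimal Python sketch of equation 2 follows (our reconstruction, not SmartSignal code; the function and parameter names are assumptions). It forms the local D matrix from the k nearest reference vectors and computes the least-squares estimate:

```python
import numpy as np

def classical_model(H, new_x, k=3):
    """Sketch of equation 2: pick the k columns of H closest to new_x in
    Euclidean distance to form D, then project new_x onto the span of D.
    As noted above, k must be well below the number of rows m, so that D
    has substantially more rows than columns."""
    dists = np.linalg.norm(H - new_x[:, None], axis=0)  # distance to each refX_i
    D = H[:, np.argsort(dists)[:k]]                     # local D matrix, m x k
    # Coefficients (D^T D)^{-1} D^T new_x; lstsq avoids forming the inverse.
    coeffs, *_ = np.linalg.lstsq(D, new_x, rcond=None)
    return D @ coeffs                                   # newY_j
```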
Our similarity-based method may be understood as a generalization of equation 2 in which a generalized inner product, referred to as similarity and indicated by the # symbol, replaces the normal inner product between two vectors in two places:
newY_j = D (D^T # D)^{-1} (D^T # newX_j)    (3)
The similarity, based on a series of patents and patent applications, is unity for identical vectors and zero for vectors that have nothing in common. The D matrix contains the subset of refX_i that have the highest similarities to newX_j. Equation 3 also arises from the assumption that newY_j is a linear combination of the refX_i. The linear combination coefficients are given by the term (D^T # D)^{-1} (D^T # newX_j) and are normalized to unity. It is particularly important to note that the similarity is insensitive to outlying elements in newX_j. Also especially important is the fact that equation 3 can be applied regardless of the relative numbers of rows and columns in the D matrix. This is quite unlike the classical situation, where equation 2 is useful only when there are significantly more rows than columns.
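The actual similarity operator is proprietary (the patents cited above). Purely to illustrate the structure of equation 3, the sketch below substitutes a Gaussian kernel, which is unity for identical vectors and tends to zero for very different ones; the kernel and all names are our assumptions:

```python
import numpy as np

def sim(a, b, h=1.0):
    """Illustrative stand-in for the patented similarity operator: 1 for
    identical vectors, approaching 0 for vectors with nothing in common."""
    return float(np.exp(-np.sum((a - b) ** 2) / h))

def sim_product(A, B, h=1.0):
    """The # operation: entry (i, j) is the similarity of column i of A
    to column j of B (a generalized inner product A^T # B)."""
    return np.array([[sim(a, b, h) for b in B.T] for a in A.T])

def similarity_model(D, new_x, h=1.0):
    """Sketch of equation 3: newY_j = D (D^T # D)^{-1} (D^T # newX_j),
    with the combination coefficients normalized to unity as in the text."""
    G = sim_product(D, D, h)                         # D^T # D, k x k
    s = sim_product(D, new_x[:, None], h).ravel()    # D^T # newX_j, length k
    w = np.linalg.solve(G, s)                        # combination coefficients
    return D @ (w / w.sum())                         # normalize to unity
```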
For the purposes of this investigation the H matrix was always first examined for the presence of the newX_j vector. If the newX_j vector was present in the H matrix, it was removed before modeling. This feature allows automatic leave-one-out cross-validation, and allows modeling of the large training data set from which the much smaller H matrix was derived without biasing the results. D matrices with 10 vectors were used, based on a study of the effects of the D matrix size on modeling accuracy.
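A sketch of this D-matrix selection with the automatic leave-one-out behavior might look as follows (again our assumptions, reusing sim() from the previous sketch):

```python
import numpy as np

def build_local_D(H, new_x, k=10):
    """Drop any copy of new_x from H (automatic leave-one-out), then keep
    the k = 10 reference vectors with the highest similarity to new_x."""
    keep = ~np.all(np.isclose(H, new_x[:, None]), axis=0)
    H_loo = H[:, keep]                       # H with newX_j removed, if present
    sims = np.array([sim(col, new_x) for col in H_loo.T])
    return H_loo[:, np.argsort(sims)[-k:]]   # the 10 most similar vectors
```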
Reference Data Analyses
A training data set of almost 10,000 hours recorded every minute contains 16 fields for 2 user characteristic numbers, an annotation number for the activity being performed, a gender number, 9 armband sensor values, a user identification number, the session number, and the session time. These data were first examined for anomalies, and band-pass filters were applied to produce a filtered data set from which the H matrices appropriate for modeling purposes were created. This filtering process removed about 1% of the original data. Users 1, 2, 4, 5, 6, 9, 11, 13, 14, 15, 17, 18, 19, 20, 23, 25, 26, and 32 appear in the unfiltered training data set. Data for user 17 are completely absent from the filtered training data set, so these data were investigated further. User 17 was omitted because its sensor 3 values were all equal to zero; normally sensor 3 values lie between approximately 10 and 40. Furthermore, user 17 data contained only annotation 0. For the analyses discussed below, user 17 was never a member of the H matrix.
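A minimal sketch of such band-pass filtering follows. The actual bands are not given in the text; the sensor 3 band below follows the "approximately 10 and 40" remark, and the field names are our placeholders:

```python
import numpy as np

# (lo, hi) bands per field; only the sensor 3 band is suggested by the text.
BOUNDS = {"sensor3": (10.0, 40.0)}  # extend with the remaining fields' bands

def band_pass(fields):
    """Given a dict mapping field name -> per-minute value array, return a
    boolean mask selecting records whose values all lie inside their bands."""
    n = len(next(iter(fields.values())))
    mask = np.ones(n, dtype=bool)
    for name, (lo, hi) in BOUNDS.items():
        mask &= (fields[name] >= lo) & (fields[name] <= hi)
    return mask
```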
The gender numbers in the training data set are either 0 or 1 and occur with vastly different base rates: gender number 0 comprises 96.1% of the filtered data while gender number 1 comprises only 3.9%. Similarly, users in the training data set occur with vastly different base rates: user 25 comprises 24.3% of the unfiltered data while user 18 comprises only 0.3%. To accommodate such vastly different base rates, the approach was to include in the gender H matrix approximately equal numbers of approximately equally spaced vectors from each user, and an algorithm to achieve this was developed. This resulted in a gender H matrix comprising about 3,000 vectors. Additionally, the vectors in the gender H matrix were restricted to be not too similar to each other, a restriction that removed about 500 vectors.
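The balancing and de-duplication algorithm is not published; a plausible sketch, under our assumptions (a per-user quota chosen so that roughly 18 users yield the ~3,000 vectors mentioned, and sim() from the Method section as the de-duplication test), is:

```python
import numpy as np

def balanced_columns(user_ids, per_user=170):
    """Approximately equal numbers of approximately equally spaced records
    from each user; per_user is a placeholder quota."""
    cols = []
    for u in np.unique(user_ids):
        idx = np.flatnonzero(user_ids == u)
        take = min(per_user, len(idx))
        cols.extend(idx[np.linspace(0, len(idx) - 1, take).astype(int)])
    return np.array(sorted(set(cols)))

def prune_similar(X, max_sim=0.99):
    """Drop vectors too similar to ones already kept (the step that removed
    about 500 vectors); max_sim is a placeholder threshold."""
    kept = []
    for j in range(X.shape[1]):
        if all(sim(X[:, j], X[:, k]) < max_sim for k in kept):
            kept.append(j)
    return X[:, kept]
```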
The column vectors in the gender H matrix initially contained the 11 independent variable values of characteristic 1, characteristic 2, and sensors 1 through 9, plus the 1 dependent variable value of gender number. Leave-one-out cross-validation was then performed to determine the effects of removing different independent variables on the correlation coefficient and rms deviation between the actual and modeled values of the H matrix. Removal of sensor 4, 7, or 8 yielded higher correlation coefficients and lower rms deviations than the case where all 11 independent variables were present. Thus the 8 independent variables chosen to model gender number were characteristic 1, characteristic 2, and sensors 1, 2, 3, 5, 6, and 9.
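In outline, this variable-selection step can be expressed as the ablation loop sketched below. This is our reconstruction under stated assumptions: model_y is only an illustrative stand-in for the dependent-variable estimate (a simplification of equation 3, reusing sim() from the Method section), not the method actually used.

```python
import numpy as np

def model_y(X_ref, y_ref, x, k=10):
    """Illustrative stand-in: similarity-weighted average of y over the
    k most similar reference vectors."""
    s = np.array([sim(col, x) for col in X_ref.T])
    top = np.argsort(s)[-k:]
    return float(s[top] @ y_ref[top] / s[top].sum())

def loo_scores(X, y):
    """Leave-one-out cross-validation over the H matrix columns: correlation
    coefficient and rms deviation between actual and modeled values."""
    n = X.shape[1]
    y_hat = np.array([model_y(np.delete(X, j, axis=1), np.delete(y, j), X[:, j])
                      for j in range(n)])
    return np.corrcoef(y, y_hat)[0, 1], float(np.sqrt(np.mean((y - y_hat) ** 2)))

def ablate(X, y, names):
    """Score the model with each independent variable removed in turn;
    variables whose removal improves both scores are dropped."""
    return {name: loo_scores(np.delete(X, i, axis=0), y)
            for i, name in enumerate(names)}
```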
During the process of determining the 8 independent variables to use in modeling gender number, it was noticed that characteristic 2 was very important. Further investigation showed that characteristic 2 had a value of 0 for all gender number 0 vectors in the training data set, while characteristic 2 had a value of 1 for all vectors associated with 4 of the 6 users with gender number 1. In the unfiltered training data set there are 556,335 records with gender number equal to 0 and characteristic 2 equal to 0, but no records with gender number equal to 0 and characteristic 2 equal to 1. On the other hand, there are 23,929 records with gender number equal to 1, of which 21,153 records have characteristic 2 equal to 1. The major indicator of gender number is therefore characteristic 2, and a record with characteristic 2 equal to 1 is overwhelmingly likely to also have gender number equal to 1.
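This relation is a simple contingency count; under assumed field names, it can be checked as follows:

```python
import numpy as np

def gender_char2_counts(gender, char2):
    """Records per (gender number, characteristic 2) cell. The text reports
    556,335 at (0, 0), none at (0, 1), and 21,153 of the 23,929 gender-1
    records having characteristic 2 equal to 1."""
    return {(g, c): int(np.sum((gender == g) & (char2 == c)))
            for g in (0, 1) for c in (0, 1)}
```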
The annotation numbers in the training data set vary over a wide range and occur with vastly different base rates: annotation 5102 comprises 16.97% of the filtered data while annotation 3004 comprises only 0.77%. The vectors known not to be annotation 5102 vectors included all annotations that were not 0, 2901, 2902, 3004, or 5103. Again, to accommodate such vastly different base rates, and to differentiate between annotations 5102 and 3004, the approach was to include in the annotation 5102 H matrix approximately equal numbers of approximately equally spaced vectors from each user having annotation 3004, each user having annotation 5102, and each user not having annotations 0, 2901, 2902, 3004, or 5103. An algorithm to achieve this was developed that produced an H matrix with about 3,000 vectors. Additionally, the vectors in the annotation 5102 H matrix were restricted to be not too similar to each other, a restriction that removed about 500 vectors. Class number 1 was assigned to all the annotation 5102 vectors and class number 0 was assigned to all vectors that did not have annotation 5102. Using a process similar to that described for gender modeling, a final set of 8 independent variables to model the annotation 5102 class number was chosen: characteristic 1, characteristic 2, and sensors 1, 2, 5, 6, 7, and 8.
The vectors known not to be annotation 3004 vectors included all annotations that were not 0, 3003, 3004, 5101, or 5199. Again, to accommodate vastly different base rates, and to differentiate between annotations 3004 and 5102, the approach was to include in the annotation 3004 H matrix approximately equal numbers of approximately equally spaced vectors from each user having annotation 3004, each user having annotation 5102, and each user not having annotations 0, 3003, 3004, 5101, or 5199. An algorithm to achieve this was developed that produced an H matrix with about 3,000 vectors. Additionally, the vectors in the annotation 3004 H matrix were restricted to be not too similar to each other, a restriction that removed about 500 vectors. Class number 1 was assigned to all the annotation 3004 vectors and class number 0 was assigned to all vectors that did not have annotation 3004. Using a process similar to that described for gender modeling, a final set of 9 independent variables to model the annotation 3004 class number was chosen: characteristic 1, characteristic 2, and sensors 1, 2, 3, 5, 6, 7, and 8.
Unfiltered Training Data Analyses
The 580,264 vectors available in the unfiltered training data set were separately modeled with the appropriate H matrix and independent/dependent variables for gender number, class number for annotation 3004, and class number for annotation 5102 according to equation 3. In this modeling process the local D matrices contained 10 vectors and were never allowed to contain the newX_j vector itself, a situation that arose infrequently because the H matrix comprises only a small fraction of the unfiltered training data set from which it was derived. These efforts produced a modeled gender number, a modeled class number for annotation 3004, and a modeled class number for annotation 5102 for each of the 580,264 vectors in the unfiltered training data set.
The modeled gender number and class numbers were all continuously variable, assuming values from a little less than 0 to a little more than 1. A moving window was applied to these continuous values to assign integer values of 0 or 1 to blocks of vectors. A moving window approach requires several parameters to be defined and actions to be taken based on measurements of those parameters. First, a threshold is needed above which the continuous numbers produce a count. The window width and the acceptable fraction of the width for which these counts exist are also necessary. Finally, the step size with which the window is moved over the data set must be chosen. If the acceptable fraction occurs, an action assigns integer numbers to particular vectors in the window. This window approach requires the data blocks to be separated by more than their widths. Note that for a window width of 1 the window method reduces to the usual practice of rounding the continuously variable modeled gender number or class numbers.
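A sketch of this moving-window assignment follows (our code; the parameter defaults are placeholders, not the values used in the analyses):

```python
import numpy as np

def window_classify(scores, threshold=0.5, width=30, min_frac=0.6, step=1):
    """Assign integer 0/1 labels to blocks of vectors: a vector counts when
    its continuous modeled number exceeds `threshold` (choice 1); when at
    least `min_frac` (choice 3) of a window of `width` vectors (choice 2)
    counts, the action (choice 4) here labels the whole window 1."""
    labels = np.zeros(len(scores), dtype=int)
    counts = np.asarray(scores) > threshold
    for start in range(0, len(scores) - width + 1, step):
        if counts[start:start + width].mean() >= min_frac:
            labels[start:start + width] = 1
    return labels
```

With width = 1 this reduces to rounding each modeled number at the threshold, as noted above.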
The key base-rate-insensitive parameters that measure classification of a set of vectors are Sensitivity and Specificity. In our case, Sensitivity is defined as the fraction of the vectors with actual gender number or class number equal to 1 that also have the integer modeled gender number or class number equal to 1. Specificity is defined as the fraction of the vectors with actual gender number or class number equal to 0 that also have the integer modeled gender number or class number equal to 0. The values of Sensitivity and Specificity are functions of four choices: 1) the threshold above which the continuous numbers produce a count, 2) the window width, 3) the acceptable fraction of the window width for which counts exist, and 4) the action taken to assign integer modeled values to blocks of vectors. As a choice is varied, the values of Sensitivity and Specificity change and can be used to evaluate the effects of the choice on the classification. The four choices all have different, discrete effects on Sensitivity and Specificity. Choice 1 generally can be varied over its full range with Sensitivity having a maximum value of 1 at one end of the range and Specificity having a maximum value of 1 at the other end of the range. Choices 2, 3, and 4 generally cause Sensitivity and Specificity to achieve maxima somewhere near the middle of their acceptable ranges.
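These two definitions translate directly into code:

```python
import numpy as np

def sensitivity_specificity(actual, predicted):
    """Sensitivity: fraction of actual-1 vectors modeled as 1.
    Specificity: fraction of actual-0 vectors modeled as 0."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    sensitivity = float(np.mean(predicted[actual == 1] == 1))
    specificity = float(np.mean(predicted[actual == 0] == 0))
    return sensitivity, specificity
```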
For choice 1, Sensitivity as a function of 1 – Specificity forms a receiver operating characteristic (ROC) curve that visually summarizes and quantitatively measures the accuracy of the classification methodology and determines an optimal value for choice 1. Accuracy of a classification method is indicated by the area under the ROC curve. The optimum value for choice 1 occurs where the tangent to the ROC curve assumes a 45-degree angle: before that point, any decrease in Specificity is more than made up for by an increase in Sensitivity, while after it the reverse is true. While choices 2, 3, and 4 do not exhibit the kind of behavior that makes an ROC curve useful, we can still examine the effects of their variations on Sensitivity and Specificity and evaluate when any change in Specificity is more than made up for by the change in Sensitivity or vice versa. Our goal was to find a set of the four choices that produced the best tradeoff between Sensitivity and Specificity that was reasonably possible.
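The sweep over choice 1 can be sketched as follows (our code, using sensitivity_specificity and window_classify from above). For a concave ROC curve the 45-degree-tangent point is equivalently the threshold maximizing Sensitivity + Specificity, which is how the sketch locates it; the labeler argument stands for the full window classification at a given threshold:

```python
import numpy as np

def roc_sweep(actual, scores, labeler, thresholds=np.linspace(0.0, 1.0, 101)):
    """Vary the count threshold (choice 1), collect (1 - Specificity,
    Sensitivity) points, and return the ROC curve, its area, and the
    threshold at the 45-degree tangent (max Sensitivity + Specificity)."""
    curve, best_t, best = [], None, -np.inf
    for t in thresholds:
        sens, spec = sensitivity_specificity(actual, labeler(scores, t))
        curve.append((1.0 - spec, sens))
        if sens + spec > best:
            best, best_t = sens + spec, t
    fpr, tpr = map(np.array, zip(*sorted(curve)))
    return curve, float(np.trapz(tpr, fpr)), best_t
```

For example, the window classification above can serve as the labeler via `lambda s, t: window_classify(s, threshold=t)`.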
For the case of gender number we know that gender is constant during each session, so the choice of window width is simply the length of each session encountered. Basing the other analysis choices on the tradeoffs between Sensitivity and Specificity produced a value for Sensitivity equal to 1 and a value for Specificity equal to 1. These are ideal results for which it is not necessary to produce an ROC curve. Our methodology identifies 23,929 records, or 4% of the 580,264 records in the unfiltered training data set, as consistent with having gender number 1.
Basing all analysis choices on the tradeoffs between Sensitivity and Specificity produced the ROC curve for annotation 5102 shown in Figure 1. The results are quite good, with an area under the ROC curve of about 0.99. Based on a detailed analysis of the data producing Figure 1, a value of 0.58 was chosen for the threshold above which continuous modeled class numbers produce a count. Of the 580,264 records in the unfiltered training data set, 98,172 are known to have annotation 5102 while 73,668 are known not to have annotation 5102. Our methodology correctly identified 96,288 and 72,251 of these records, respectively. Including the unknown records, our methodology identified 173,759 records, or 30% of the 580,264 records in the unfiltered training data set, as consistent with having annotation 5102.
Basing all analysis choices on the tradeoffs between Sensitivity and Specificity produced the ROC curve for annotation 3004 shown in Figure 2. The results are quite good, with an area under the ROC curve of about 0.96. Based on a detailed analysis of the data producing Figure 2, a value of 0.48 was chosen for the threshold above which continuous modeled class numbers produce a count. Only 4,413 of the 580,264 records in the unfiltered training data set actually have annotation 3004, while 167,368 records are known not to have annotation 3004. Our methodology correctly identified 4,129 and 157,993 of these records, respectively. Including the unknown records, our methodology identifies 80,511 records, or 11% of the 580,264 records in the unfiltered training data set, as consistent with having annotation 3004.
Test Data Analyses
A test data set of over 12,000 hours recorded every minute contains 13 fields for 2 user characteristic numbers, 9 armband sensor values, session number, and session time. No filtering was applied to the test data. The gender number, the class number for annotation 3004, and the class number for annotation 5102 were modeled as outlined below.