Bench press exercise detection and repetition counting

Martin Barus

Introduction

The most common way to track gym workouts and personal progress in weightlifting is to write down the number of repetitions in each set along with the weight used. I am working on a project that aims to automate this process using a wearable device and a smartphone.

The user wears two armbands, connects them to the mobile application over Bluetooth, enters the weight and the exercise she or he is about to do (as in a notebook), hits the start button, exercises, hits the end button and then sees descriptive statistics about the workout, such as the number of repetitions and the minimum, maximum and average velocity.


This task is hard because, besides the movements corresponding to the exercise, a recording contains many movements from other activities such as stretching, re-racking weights and walking.

The following figure shows:

a) raw acceleration data from one recording

b) some of the extracted features of the same signal

c) the corresponding ground truth label: exercise (value 1) or non-exercise (value -1).

Notice the black line in the second plot (b), which remains constant during periodic movement (exercise). It is the RMS value of one of the accelerometer signals. However, it sometimes takes similar values during non-exercise movements.

This project has two goals: to classify movement as either exercise or non-exercise, and to count the number of repetitions in each exercise recording.

Previous achievements by other people

The Berkeley research paper[2] achieves 90% accuracy in exercise classification (although it focuses on distinguishing between different types of exercises) and an average repetition miscount of 5% across all exercises. For the bench press, however, the miscount was 16%. The paper uses Naive Bayes and Hidden Markov Models for classification and Hidden Markov Models for repetition counting.

The Microsoft research paper[1] does three things: 1. segmentation into exercise and non-exercise, 2. classification of the exercise type and 3. repetition counting.

They report three different precision values for segmentation, based on how a predicted exercise set overlaps the true one:

100% for traditional - precision requires only that a predicted set uniquely overlaps a ground truth set, i.e. the segmenter said a set happened here

98.8% for close - precision requires both boundaries of a predicted set to fall within five seconds of the corresponding ground truth

85.6% for tight - precision requires both boundaries of a predicted set to fall within two seconds of the corresponding ground truth set

They process many exercises; surprisingly, the bench press is not one of them.

The exercise classification phase of their paper is not relevant in my case, as I focus on a single exercise. The segmentation results are therefore what I should compare my results to.

For repetition counting, they report that they counted the number of repetitions exactly for 77% of all exercises, within 1 repetition for 93% and within 2 repetitions for 97%.

They count by projecting the 3D signal into 1D using PCA and counting the number of autocorrelation peaks of the projected signal.

There are many more papers about exercise type classification (e.g. bench press vs. squat). However, in our application people create their workout plan beforehand, so we know which exercise they are going to do and do not need this step.

All the papers use the same sensors: accelerometers and gyroscopes. Our wearable device uses the same sensors.

Data preparation

We collected nearly 130 minutes (390 000 discrete signal samples at a 50 Hz sampling rate) of bench press exercise data consisting of 9 signals (acceleration, gyroscope and gravity vectors in X, Y, Z). The data was recorded by a mobile application connected to the device while people were exercising.

I then manually labeled each signal sample as belonging to the actual bench press exercise or not, using the following notation: Exercising (1), Non Exercising (-1).

Following the approach other researchers took to this task, I used a moving window to extract characteristic features of the signal, which yields pairs of (window features, exercise/non-exercise label).

From each 5-second signal window, consisting of 250 samples of 9 different signals (acceleration in X, Y, Z; gyroscope in X, Y, Z; gravity in X, Y, Z), I extracted the following features for each signal separately: root mean square (RMS), mean, standard deviation and variance, each for the first and second half of the window separately (4*2 features); power band bins (a frequency bin histogram, 10 bins); and autocorrelation features: maximum value, index of the maximum value, number of peaks, number of prominent peaks, number of weak peaks, first peak after zero crossing and its index (7 features).

These features were also used by the Microsoft research paper[1].

The resulting number of features is 8 + 10 + 7 = 25 for each signal; with 9 signals this gives 25 * 9 = 225 features altogether.

I extracted these features from all non-overlapping windows: 384 windows with label 1 (32 minutes of exercise) and 1165 windows with label -1 (97 minutes of non-exercise), each window being a 5-second part of the recording.
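As a rough illustration of the windowing step, the sketch below computes only part of the feature set (the half-window statistics and the power band bins, 18 values per signal). The autocorrelation features are left out, the function names are hypothetical, and the majority-vote rule for labeling a window is an assumption, since the text does not say how sample labels were turned into window labels.

```python
import numpy as np

FS = 50            # sampling rate in Hz (from the text)
WINDOW = 5 * FS    # 5-second window = 250 samples

def half_stats(x):
    """RMS, mean, standard deviation and variance of one half-window."""
    return [np.sqrt(np.mean(x ** 2)), np.mean(x), np.std(x), np.var(x)]

def power_bins(x, n_bins=10):
    """Sum the power spectrum into n_bins roughly equal-width frequency bins."""
    power = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    return [chunk.sum() for chunk in np.array_split(power, n_bins)]

def window_features(window):
    """Features for one window of shape (WINDOW, n_signals)."""
    feats = []
    half = window.shape[0] // 2
    for signal in window.T:
        feats += half_stats(signal[:half]) + half_stats(signal[half:])
        feats += power_bins(signal)
    return np.array(feats)

def extract(recording, labels):
    """Split a recording into non-overlapping windows and label each one."""
    X, y = [], []
    for start in range(0, len(recording) - WINDOW + 1, WINDOW):
        X.append(window_features(recording[start:start + WINDOW]))
        # Assumption: a window counts as exercise if most samples are labeled 1.
        y.append(1 if np.mean(labels[start:start + WINDOW] == 1) > 0.5 else -1)
    return np.array(X), np.array(y)

# Synthetic stand-in for a recording: 20 s of 9-channel noise.
rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 9))
demo_labels = np.where(np.arange(1000) < 500, 1, -1)
X_demo, y_demo = extract(demo, demo_labels)
```

With 9 signals this sketch yields 9 * 18 = 162 values per window; adding the 7 autocorrelation features per signal would bring it to the 225 described above.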

Classification

I decided to compare different types of classifiers and also to see whether unsupervised learning would yield reasonable results, as labeling the data was time-consuming.

I tried various classifiers (linear and kernel SVMs, Naive Bayes and ANNs, each with and without PCA) and compared them to the results of two clustering algorithms, K-means and hierarchical clustering, with the number of clusters set to 2.

For the clustering algorithms, I mapped clusters to classes so that label 1 was assigned to the cluster containing the most data samples with true label 1. This is 'cheating', as it is not completely unsupervised, but I needed to design the mapping somehow.

I would like to mention two Python libraries I worked with.
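The mapping heuristic can be sketched as follows. The function name is hypothetical and the cluster ids are assumed to be 0 and 1, as returned by a two-cluster model.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Give label 1 to the cluster holding most true-exercise samples, -1 to the other."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    # Count samples with true label 1 in cluster 0 and in cluster 1.
    counts = [np.sum((cluster_ids == c) & (true_labels == 1)) for c in (0, 1)]
    exercise_cluster = int(np.argmax(counts))
    return np.where(cluster_ids == exercise_cluster, 1, -1)

# Toy example: cluster 1 contains both true-exercise samples.
clusters = np.array([0, 0, 1, 1, 1, 0])
truth = np.array([-1, -1, 1, 1, -1, -1])
predicted = map_clusters_to_labels(clusters, truth)
```

Because the true labels are needed to pick the exercise cluster, the reported clustering accuracies should be read as an optimistic estimate.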

Scikit-learn

This Python library was truly amazing. It made it extremely easy to try different classifiers and preprocessing steps and to chain them all for cross-validation evaluation.

Using the same code to evaluate an SVM with PCA and data normalization, Naive Bayes, or K-means clustering with normalization requires only slight changes:

pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('pca', PCA()), ('svm', svm.SVC(kernel='poly', tol=1e-4))])

pipeline = Pipeline(steps=[('classifier', GaussianNB())])

pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('classifier', KMeans(init='k-means++', n_clusters=2, n_init=10))])

It also provides various helpful cross-validation utilities, such as stratified K-fold data partitioning and a grid search for the best parameters.

TensorFlow

I used this library for training and using multilayer perceptrons. Working with it was much more low-level and less fun, as everything had to be designed as matrix multiplication. Having a 5-year-old laptop did not help with the experience; the computations took some time.

Classification results

To evaluate the different classification techniques I used 5-fold stratified cross-validation. For algorithms with tunable parameters (C, gamma for the RBF kernel, the degree of the polynomial kernel, the number of principal components when PCA was used, etc.), I used the GridSearchCV class of the scikit-learn library, which finds the best parameters by cross-validation on the training data.
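A minimal sketch of this evaluation scheme is shown below, run on synthetic data rather than the project's features. The parameter grid, the 3-fold inner search and the row order of the confusion matrix (label 1 first) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature windows (not the project's data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
y = np.where(y == 1, 1, -1)

pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('svm', SVC(kernel='rbf')),
])
# Hypothetical grid; keys are '<step name>__<parameter>'.
param_grid = {'pca__n_components': [3, 8], 'svm__C': [1, 10], 'svm__gamma': [0.1, 1]}

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
total_cm = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in outer.split(X, y):
    # Parameters are searched on the training part of the fold only.
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X[train_idx], y[train_idx])
    print('Best params:', search.best_params_)
    pred = search.predict(X[test_idx])
    scores.append(np.mean(pred == y[test_idx]))
    total_cm += confusion_matrix(y[test_idx], pred, labels=[1, -1])

print('Accuracy: %0.2f (+/- %0.2f)' % (np.mean(scores), np.std(scores)))
print(total_cm)
```

Searching the parameters inside each outer fold keeps the held-out fold untouched by the tuning, which is why each fold can report a different set of best parameters.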

I chose the candidate numbers of principal components for cross-validation based on the following plot of the singular values of the data matrix.


I tried different classifiers, with parameters found by cross-validation. Below are the confusion matrices and the corresponding accuracy for the best variant of each classifier, along with the best parameters found in each cross-validation fold.
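Singular values of this kind can be obtained roughly as in the sketch below, which uses synthetic data. Centering the features first and using a 95% explained-variance cut-off are assumptions; the candidate values used in this project were read off the plot by eye.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 windows, 20 features driven by 3 latent factors.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(300, 20))

# Singular values of the centered data matrix, in decreasing order.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Cumulative fraction of variance explained by the first k components.
explained = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
# Smallest k that explains at least 95% of the variance (assumed threshold).
n_components_95 = int(np.searchsorted(explained, 0.95) + 1)
```

A sharp drop in the singular values after the first few components suggests that a small number of principal components carries most of the information.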

Linear SVM with L1 regularization (sparseness)

Best params: {'svm__C': 0.1}

Best params: {'svm__C': 1}

Best params: {'svm__C': 1}

Best params: {'svm__C': 0.1}

Best params: {'svm__C': 0.1}

Accuracy: 0.92 (+/- 0.05) done in 114.99 sec

[[ 329 55]

[ 76 1089]]

Linear SVM with L1 regularization (sparseness) and PCA

Best params: {'pca__n_components': 3, 'svm__C': 0.1}

Best params: {'pca__n_components': 16, 'svm__C': 0.1}

Best params: {'pca__n_components': 16, 'svm__C': 0.1}

Best params: {'pca__n_components': 3, 'svm__C': 10}

Best params: {'pca__n_components': 8, 'svm__C': 10}

Accuracy: 0.90 (+/- 0.04) done in 85.30 sec

[[ 327 57]

[ 98 1067]]

Linear SVM with L2 regularization

Best params: {'svm__C': 0.01}

Best params: {'svm__C': 0.01}

Best params: {'svm__C': 0.1}

Best params: {'svm__C': 0.01}

Best params: {'svm__C': 0.01}

Accuracy: 0.91 (+/- 0.05) done in 6.15 sec

[[ 339 45]

[ 89 1076]]

Linear SVM with L2 regularization and PCA

Best params: {'pca__n_components': 3, 'svm__C': 0.1}

Best params: {'pca__n_components': 16, 'svm__C': 0.1}

Best params: {'pca__n_components': 16, 'svm__C': 1}

Best params: {'pca__n_components': 3, 'svm__C': 1}

Best params: {'pca__n_components': 8, 'svm__C': 1}

Accuracy: 0.90 (+/- 0.04) done in 83.35 sec

[[ 322 62]

[ 95 1070]]

Radial basis function SVM

Best params: {'svm__C': 10, 'svm__gamma': 0.1}

Best params: {'svm__C': 10, 'svm__gamma': 0.1}

Best params: {'svm__C': 100, 'svm__gamma': 0.01}

Best params: {'svm__C': 100, 'svm__gamma': 0.01}

Best params: {'svm__C': 100, 'svm__gamma': 1}

Accuracy: 0.91 (+/- 0.06) done in 40.30 sec

[[ 334 50]

[ 87 1078]]

Radial basis function SVM with PCA

Best params: {'pca__n_components': 3, 'svm__C': 10, 'svm__gamma': 0.1}

Best params: {'pca__n_components': 3, 'svm__C': 10, 'svm__gamma': 0.1}

Best params: {'pca__n_components': 3, 'svm__C': 100, 'svm__gamma': 0.01}

Best params: {'pca__n_components': 3, 'svm__C': 100, 'svm__gamma': 0.01}

Best params: {'pca__n_components': 16, 'svm__C': 100, 'svm__gamma': 1}

Accuracy: 0.91 (+/- 0.06) done in 59.58 sec

[[ 336 48]

[ 85 1080]]

Polynomial SVM

Best params: {'svm__degree': 5, 'svm__C': 1}

Best params: {'svm__degree': 5, 'svm__C': 1}

Best params: {'svm__degree': 5, 'svm__C': 1}

Best params: {'svm__degree': 5, 'svm__C': 1}

Best params: {'svm__degree': 2, 'svm__C': 1000}

Accuracy: 0.91 (+/- 0.06) done in 3.73 sec

[[ 344 40]

[ 97 1068]]

Polynomial SVM with PCA

Best params: {'svm__degree': 3, 'pca__n_components': 3, 'svm__C': 10}

Best params: {'svm__degree': 3, 'pca__n_components': 3, 'svm__C': 10}

Best params: {'svm__degree': 3, 'pca__n_components': 8, 'svm__C': 1000}

Best params: {'svm__degree': 3, 'pca__n_components': 5, 'svm__C': 100}

Best params: {'svm__degree': 3, 'pca__n_components': 8, 'svm__C': 1000}

Accuracy: 0.92 (+/- 0.06) done in 168.67 sec

[[ 325 59]

[ 70 1095]]

Naive Bayes

Accuracy: 0.82 (+/- 0.13) done in 0.99 sec

[[372 12]

[264 901]]

Naive Bayes with PCA

Best params: {'pca__n_components': 3}

Best params: {'pca__n_components': 16}

Best params: {'pca__n_components': 3}

Best params: {'pca__n_components': 3}

Best params: {'pca__n_components': 10}

Accuracy: 0.88 (+/- 0.05) done in 7.11 sec

[[ 346 38]

[ 144 1021]]

K-means clustering

Accuracy: 0.78 done in 0.34 sec

[[379 3]

[328 801]]

K-means clustering with PCA

Accuracy: 0.79 done in 0.08 sec

[[382 2]

[331 834]]

Hierarchical clustering

Accuracy: 0.74 done in 1.83 sec

[[377 5]

[386 743]]

Hierarchical clustering with PCA

Accuracy: 0.86 done in 1.79 sec

[[375 7]

[207 922]]

Neural network (1 hidden layer with 1 neuron)

Accuracy: 0.92 (+/- 0.05) done in 35.68 sec

[[ 322 60]

[ 57 1072]]

Neural network (1 hidden layer with 1 neuron) with PCA

Accuracy: 0.92 (+/- 0.04) done in 73.69 sec

[[ 338 44]

[ 74 1055]]

Neural network (1 hidden layer with 17 neurons)

Accuracy: 0.92 (+/- 0.05) done in 239.92 sec

[[ 320 62]

[ 62 1067]]

Neural network (2 hidden layers with 10 and 5 neurons)

Accuracy: 0.91 (+/- 0.05) done in 43.27 sec

[[ 330 52]

[ 80 1049]]

Neural network (2 hidden layers with 20 and 5 neurons)

Accuracy: 0.91 (+/- 0.06) done in 56.41 sec

[[ 321 61]

[ 73 1056]]

Classification conclusion

Among the supervised learning algorithms, Naive Bayes performed worse than the SVMs (82% accuracy); with PCA its results improved to 88%. The SVMs and ANNs reached 92% accuracy. I was not able to improve classification accuracy by using more neurons in the hidden layers or more hidden layers. For unsupervised learning, the best accuracy, 86%, was achieved using hierarchical clustering with PCA. However, the heuristic described earlier was used to map clusters to actual labels, so the solution was not purely unsupervised.

Best accuracy results / without PCA / with PCA
Linear SVM / 0.92 (+/- 0.05) / 0.90 (+/- 0.04)
Polynomial Kernel SVM / 0.91 (+/- 0.06) / 0.92 (+/- 0.06)
Radial Basis Kernel SVM / 0.91 (+/- 0.06) / 0.91 (+/- 0.06)
Naive Bayes / 0.82 (+/- 0.13) / 0.88 (+/- 0.05)
K-means clustering / 0.78 / 0.79
Hierarchical clustering / 0.74 / 0.86
ANN / 0.92 (+/- 0.05) / 0.92 (+/- 0.04)

All results show the accuracy averaged over 5 folds and its standard deviation (+/- std), except for the clustering algorithms, as I was not sure whether it makes sense to implement cross-validation for clustering.

Repetition Counting

Combining two ideas from the research papers, I proposed a way of identifying repetitions.

The Microsoft research paper[1] projects the 3D acceleration signal (x, y, z) onto its first principal component and then counts the autocorrelation peaks of this signal above a certain threshold; however, that requires further signal processing. As stated earlier, they counted the number of repetitions exactly for 77% of all exercises, within 1 repetition for 93% and within 2 repetitions for 97%. However, they cover many exercises but not the bench press.

The Berkeley research paper[2] uses a Hidden Markov Model to predict “hidden states” of the signal, which correspond to the peak, middle part and valley of the signal. They then smooth the state sequence and count the number of transitions into the starting hidden state. They report a 16% miscount error for the bench press exercise.

I combined these ideas: first I project the signal onto its first principal component, just as the Microsoft paper does, and then I segment the part of the signal classified as exercise (the output of the classification) into peaks, middle parts and valleys using percentile thresholding. Everything below the 33rd percentile is a valley, everything above the 66th percentile is a peak and everything in between is the middle. The middle states are ignored, and the number of transitions from the starting state to the opposite state and back is counted, similarly to the Berkeley paper[2].

Using my proposed technique, which combines classification, PCA and percentile thresholding, I achieved an 18.6% miscount for the bench press exercise.
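The counting step described above can be sketched as follows for a single exercise segment. This is an illustration under stated assumptions rather than the project's actual code: the function name is hypothetical, the input is assumed to be an (n_samples, 3) acceleration array that the classifier has already marked as exercise, and the demo signal is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

def count_repetitions(acceleration):
    """Count repetitions in an (n_samples, 3) exercise segment."""
    # Project the 3D signal onto its first principal component.
    projected = PCA(n_components=1).fit_transform(acceleration).ravel()
    low, high = np.percentile(projected, [33, 66])

    # Label samples: -1 valley, +1 peak; middle samples (0) are dropped.
    states = np.where(projected < low, -1, np.where(projected > high, 1, 0))
    states = states[states != 0]
    if states.size == 0:
        return 0

    # Collapse runs of identical states, e.g. [1, 1, -1, -1, 1] -> [1, -1, 1].
    runs = states[np.concatenate(([True], np.diff(states) != 0))]
    # One repetition = leaving the starting state and coming back to it.
    return (len(runs) - 1) // 2

# Synthetic segment: 20 s at 50 Hz, 10 full movement cycles starting at an extreme.
t = np.arange(0, 20, 1 / 50)
movement = np.cos(2 * np.pi * 0.5 * t)
demo = np.column_stack([movement, 0.3 * movement, 0.1 * movement])
reps = count_repetitions(demo)
```

Note that the sign of a principal component is arbitrary, so the starting state may come out as either a peak or a valley; counting returns to the starting state makes the result independent of that sign.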

Repetition Counting results

Here I present the best results for each type of classifier along with two extreme approaches as baselines. One extreme is to assume that the entire signal corresponds to exercising; let's call it the Naive approach. The opposite extreme is to assume we know exactly when the person was exercising; let's call this approach Ground Truth.

The testing protocol was as follows. For each recording, consisting of 2 sets of data (left and right arm device), I trained the classifier on all other data using the parameters obtained from cross-validation (the parameters most often chosen as best in the classification step), counted the repetition error on the held-out 2 sets of data, and then computed the root mean squared error and the mean error over all recordings.

The table shows the root mean squared error (RMSE) and the mean error (ME) of the repetition miscount for the best classifier of each type. The mean error represents the number of repetitions miscounted on average by the classifier.
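One way to express this leave-one-recording-out protocol with scikit-learn is sketched below on synthetic data. The array recording_id (one id per window, shared by the left and right arm data of a recording) is a hypothetical name, and the linear SVM with C=0.1 only stands in for whichever classifier and parameters are being tested.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-in: 120 windows from 6 recordings, 10 features each.
X = rng.normal(size=(120, 10))
y = np.where(X[:, 0] > 0, 1, -1)
recording_id = np.repeat(np.arange(6), 20)

held_out_predictions = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=recording_id):
    # Parameters are fixed beforehand, not tuned on the held-out recording.
    model = Pipeline([('scaler', MinMaxScaler()), ('svm', LinearSVC(C=0.1))])
    model.fit(X[train_idx], y[train_idx])
    rec = int(recording_id[test_idx][0])
    held_out_predictions[rec] = model.predict(X[test_idx])
```

The predicted exercise windows of each held-out recording would then be passed to the repetition counter and compared with the true count.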

Classifier / Mean error / Root mean squared error
Naive approach / 5.26 / 7.07
Ground truth / 1.02 / 1.93
Naive Bayes with PCA / 1.82 / 2.73
Linear SVM / 1.35 / 2.12
Polynomial Kernel SVM / 1.38 / 2.18
Kmeans with PCA / 3.26 / 3.26
Hierarchical clustering / 3.37 / 4.27
ANN / 1.75 / 2.83

The average number of repetitions in a recording is 7.26.
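For reference, the two error measures and a relative miscount rate can be computed as in the sketch below. It assumes that the mean error is the mean absolute miscount per recording, which the text does not state explicitly, and the counts are made up for illustration.

```python
import numpy as np

def miscount_errors(true_counts, predicted_counts):
    """Return (mean absolute error, root mean squared error) of repetition counts."""
    diff = np.asarray(predicted_counts, float) - np.asarray(true_counts, float)
    return float(np.mean(np.abs(diff))), float(np.sqrt(np.mean(diff ** 2)))

# Made-up counts for four recordings, not the project's data.
true_counts = [8, 6, 10, 7]
predicted_counts = [8, 5, 12, 7]
me, rmse = miscount_errors(true_counts, predicted_counts)

# Relative miscount: mean error divided by the average number of repetitions.
miscount_rate = me / np.mean(true_counts)
```

The RMSE penalizes large miscounts more heavily than the mean error, so a classifier with a few badly counted recordings shows a larger gap between the two values.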

Repetition Counting Conclusion

Using the ground truth, the repetition detection algorithm would have a mean error of 1.02 and an RMSE of 1.93. The best result I achieved using classification was with the linear SVM: with a mean error of 1.35 and an RMSE of 2.12, it was close to the result obtained using the ground truth.

As there were on average 7.26 repetitions in one recording, the miscount rate for the linear SVM is 1.35 / 7.26 = 18.6%, which is close to the 16% miscount achieved by the Berkeley research paper[2].

I was surprised that, even though the neural networks had similar accuracy in the classification step, they did not come close to the linear SVM results here, despite many attempts. This may have been caused by the testing protocol, or just by my inability to fine-tune the neural networks using the TensorFlow framework.

Conclusion

In this project I implemented feature extraction from accelerometer and gyroscope signals based on the research papers I found on this topic; built and compared linear and kernel SVMs, Naive Bayes, artificial neural networks, K-means and hierarchical clustering for the task of exercise detection (classification); experimented with the use of PCA; and proposed an algorithm for repetition detection that combines ideas from two papers.

I enjoyed using the scikit-learn Python data mining library and found it easy to use and play around with. It provides a wide variety of tools for preprocessing and cross-validation, and many implemented classifiers that work out of the box or require only slight effort to make them work.

In contrast, I struggled with the TensorFlow machine learning library from Google, as it was much more low-level and less intuitive and fun to use. I had to implement a lot of basic functionality myself (cross-validation, data preprocessing, ...) and I did not find the results worth the effort. It is quite possible that I just did not use this tool correctly, but it was pretty vast and overwhelming.

To compare my results with those of the papers I used as a source of inspiration: I achieved 92% accuracy in bench press exercise detection, compared to 90% overall accuracy for many exercises in the Berkeley paper[2]. The Microsoft paper[1] did not deal with the bench press but reported 100% precision for exercise/non-exercise segmentation. For repetition counting, I achieved an 18.6% miscount rate, compared to Berkeley's[2] 16% miscount rate for the same exercise.

References

[1]

[2] http://www.cs.berkeley.edu/~kenghao/publications/freeweight_ubicomp2007.PDF

Appendix

Source files are included in the zip file.