Latin Rhythms Classification on MIDI Files

by Arturo Camacho

Abstract

This paper describes an algorithm that classifies songs in MIDI format into music styles. Although the principle we use is general enough to cover many styles, we restrict our work to three of the most popular Latin American styles: Salsa, Merengue, and Cumbia. The reason is that these styles have well-defined patterns for some instruments, among them the bass and the conga, and our algorithm exploits those patterns. We studied the patterns that those instruments play and created templates for them. To classify a song, we compare it against each template, select the template that most strongly matches the input song, and declare that the song belongs to the style that template represents. One problem with this approach is that the patterns within a style are not unique: many variations of a style exist. Furthermore, within a single song an instrument does not always play the same pattern. Here we describe a method to deal with such variations and to classify songs using degrees of membership to styles.

Introduction

The goal of this project is to classify music according to its style (rhythm). The rhythms to be classified are Salsa, Merengue and Cumbia, which are some of the most common Latin rhythms. The songs to be analyzed will come in the form of MIDI files. The reason for choosing this format is that doing classification on pure audio is very difficult, at least for the approach we will use in this project, which is based on the position of the notes within a measure.

The first approach will be based on a feature that usually differs among the rhythms: the bass line. To identify the bass line, we will use the General MIDI (GM) specification, in which the programs that correspond to bass sounds are predefined. Then we will analyze the conga, a percussion instrument common to the three rhythms. It is also easy to extract because its sounds come in predefined keys on a predefined channel. Finally, we will use information about the tempo. However, tempo serves only to distinguish Merengue from the others, because Salsa and Cumbia tempos typically fall into the same range.

The techniques to classify the songs will be based on template matching, a measure of similarity between the input data and predefined templates for each of the styles. We will start with fixed templates and then move to more flexible structures that allow variation within the styles. Finally, we will discuss details of the implementation.

Feature Extraction

To classify a song into a category we will extract from the song two of the most characteristic instruments in Latin music: the bass and the congas. The bass is easy to extract. Every MIDI channel has a program assigned (from 0 to 127), and the General MIDI standard establishes which instrument corresponds to each program, so we just have to look for the programs that correspond to basses. These are programs 32 to 39. Therefore, to extract the bass line we extract the notes from the MIDI channels that have one of those programs assigned. Something similar occurs with the congas. They are located in keys 62, 63, and 64 of the standard percussion channel (channel 10). Thus, we only have to extract the notes corresponding to these keys in that channel.
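The extraction rule above can be sketched as follows. This is only an illustration under stated assumptions: the `Note` record and the `channel_programs` mapping are hypothetical stand-ins for whatever a MIDI parser produces, and skipping the percussion channel when looking for the bass is an extra precaution that the text does not mention.

```python
from dataclasses import dataclass

BASS_PROGRAMS = range(32, 40)         # General MIDI programs 32..39, counted from 0
CONGA_KEYS = frozenset({62, 63, 64})  # conga keys named in the text
PERCUSSION_CHANNEL = 10               # counted from 1, as in the text


@dataclass(frozen=True)
class Note:
    """Hypothetical note event; not tied to any particular MIDI library."""
    channel: int   # 1..16
    key: int       # MIDI key number, 0..127
    onset: float   # onset time in beats (quarter notes) from the start


def extract_bass(notes, channel_programs):
    """Keep notes on channels whose assigned program is a GM bass.

    channel_programs maps a channel number to its program number.
    The percussion channel is skipped (an assumption of this sketch).
    """
    return [
        n for n in notes
        if n.channel != PERCUSSION_CHANNEL
        and channel_programs.get(n.channel, -1) in BASS_PROGRAMS
    ]


def extract_congas(notes):
    """Keep notes with keys 62-64 on the percussion channel."""
    return [
        n for n in notes
        if n.channel == PERCUSSION_CHANNEL and n.key in CONGA_KEYS
    ]
```

The sketch assumes one program per channel for the whole song; a file that changes programs in the middle of a channel would need the mapping to be tracked over time.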

Template Matching

Once we have obtained the notes used to recognize a song, we have to define templates for the patterns that those instruments play in each of the categories to be recognized. The templates for the bass are shown in figure 1. As can be seen, the template for each category is one two-beat measure long. This is not always the case for Salsa. However, in such cases the difference is so slight that we can still obtain good results using the two-beat template.

Figure 1. Bass templates for Merengue, Cumbia, and Salsa. Bass templates for these rhythms emphasize distinct parts of the measure, and most songs follow these templates very closely. This is what makes the bass a good feature for classification.

Intuitively, if a bass plays the pattern given in one of the templates throughout the whole song, it should be very likely that the song belongs to that category. However, this is not always true. The other instruments could be playing things that do not belong to that category, so that the overall result is a different category. In other words, if we analyze only one instrument from a song that contains many of them, we cannot be sure about the decision we make. An example is some types of Flamenco: many Flamenco songs have a bass line similar to the bass line of Salsa. Nevertheless, this is the approach we are going to follow. We will extract and analyze only a few instruments (the ones that have the best-defined patterns) and will base our decision on them. The reason is that for some instruments the playing styles vary so widely that it is difficult to define a template.

If an instrument does not perfectly match any of the templates throughout the whole song (which occurs most of the time), we would still like to know whether it belongs to one of those categories. It may well belong to one: it is very unlikely that a template is matched perfectly throughout a complete song, even when it is clear to a person who knows music that the instrument is playing that style.

Therefore, we have to come up with a way to measure how similar the pattern an instrument is playing is to a template. After that, intuitively, we could establish a threshold above which we can claim that an instrument is playing a particular style of music. The measure we propose is the following. Start by analyzing the similarity of a single measure and come up with a score. Then, to obtain the overall score of the whole song, sum the scores obtained for each of the measures.

When the template and the input data have the same number of notes in a given measure, it seems natural to give a score based on the distances between the notes in the input and those in the template. However, it is not clear what to do when fewer or more notes are played in the input than in the template. Besides, this is not the natural way we perceive music (at least not for musically trained people). When we listen to music we usually have in mind a “minimum common divisor” of the rhythm. Our mind quantizes each sound, relating it to the closest of these units (see figure 2). In Latin music, every beat is usually divided into four parts, and every note played between two of these divisions is perceived as if it had been intended for the nearest one but was played slightly late or early. We follow the way our brain simplifies music and place each note in the input and in the template into one of the four parts (from now on called bins) that divide the beat.

Figure 2. Quantization. Notes that fall between minimal common divisions are interpreted as equivalent to the nearest bin center.
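As a rough illustration of this quantization, the sketch below maps an onset time, given in beats, to the nearest of the four bins per beat and groups the result by two-beat measure. The function names and the convention of measuring onsets in beats from the start of the song are assumptions made for the example; bins are numbered from 0 here, so index 4 is the 5th bin of the text.

```python
import math

BINS_PER_BEAT = 4
BEATS_PER_MEASURE = 2
BINS_PER_MEASURE = BINS_PER_BEAT * BEATS_PER_MEASURE  # eight bins


def quantize(onset_beats):
    """Return (measure index, bin index) of the bin nearest to the onset.

    Both indices start at 0. A note played slightly before a downbeat
    is assigned to the first bin of the following measure.
    """
    nearest = math.floor(onset_beats * BINS_PER_BEAT + 0.5)
    return divmod(nearest, BINS_PER_MEASURE)


def active_bins_by_measure(onsets):
    """Group quantized onsets into a dict: measure index -> set of bins."""
    measures = {}
    for t in onsets:
        measure, bin_index = quantize(t)
        measures.setdefault(measure, set()).add(bin_index)
    return measures
```

For example, onsets at 1.0 and 1.52 beats fall into bins 4 and 6 of measure 0, that is, the 5th and 7th bins in the numbering used in the text.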

To compute the similarity of an input and a template in a given measure, we compare the values of each of the bins in the input and the template. If a bin is activated in both the input and the template, or if it is deactivated in both, we increment by one the score for the respective category. On the other hand, if a bin is activated in the input but not in the template, or vice versa, we do not increment the score. As each of our templates consists of a two-beat measure, we have a total of eight bins in a measure, and therefore a maximum score of eight can be obtained for each measure. The total score for the whole song is the sum of the scores obtained for each of the measures.

Figure 3. Template Matching. The score is increased only if the input and the template have the same state (activated or deactivated). If they differ, the score stays unchanged.
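The per-measure comparison can be sketched as below. Each measure is represented as the set of its activated bins, numbered 1 to 8 as in the text. The template used in the usage note, with notes in the 5th and 7th bins, is an assumption taken from the bins the text names for the Cumbia bass; the actual templates are those of figure 1.

```python
N_BINS = 8  # two beats of four bins each


def measure_score(input_bins, template_bins, n_bins=N_BINS):
    """Count the bins whose state (activated or not) agrees with the template."""
    return sum(
        (b in input_bins) == (b in template_bins)
        for b in range(1, n_bins + 1)
    )


def song_score(measures, template_bins):
    """Sum the per-measure scores over the whole song."""
    return sum(measure_score(m, template_bins) for m in measures)
```

With the template {5, 7}, a measure that activates exactly bins 5 and 7 scores 8, a measure that also activates bin 1 scores 7, and an empty measure scores 6, because the six bins that are silent in the template still agree.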

It is important to realize that this is a linear system. Instead of comparing the bins of every measure against the template and accumulating the score, we could build a histogram of the bins over the whole song and compare it against the template only at the end. In that case, for each of the eight bins we only need to add its frequency f to the score if the bin is activated in the template, or add its complement m-f, where m is the number of measures in the song, if the bin is not activated in the template.
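A small check of this equivalence, under the same representation as before (each measure is the set of its activated bins, numbered 1 to 8), might look like the following. The function names and the example template are illustrative assumptions.

```python
from collections import Counter

N_BINS = 8


def per_measure_total(measures, template_bins):
    """Accumulate the agreement score measure by measure."""
    return sum(
        sum((b in m) == (b in template_bins) for b in range(1, N_BINS + 1))
        for m in measures
    )


def bin_histogram(measures):
    """Count in how many measures each bin is activated."""
    hist = Counter()
    for m in measures:
        hist.update(m)
    return hist


def histogram_total(hist, n_measures, template_bins):
    """Add f for bins active in the template and m - f for the others."""
    return sum(
        hist[b] if b in template_bins else n_measures - hist[b]
        for b in range(1, N_BINS + 1)
    )
```

For the four measures {5, 7}, {1, 5, 7}, {5}, and {4, 7} and the template {5, 7}, both routes give a total of 28, and the histogram route needs only one pass over the song per instrument regardless of how many templates are compared afterwards.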

Clearly, with this template approach, songs in which the instruments play patterns very similar to the template most of the time will have a very high score for that style. The results we obtained showed that the templates we used were different enough to produce a significant difference in the scores, so that a song could be classified just by taking the category with the highest score. Unfortunately, templates for styles are not unique. There can be slight differences between the templates for a style, and if they are not taken into consideration our method could perform very badly.

Variations to Templates

As we mentioned when we started discussing template matching, styles often have slight variations that make it necessary either (1) to define more than one template for a style or (2) to define a more flexible structure that accepts variations within a style. The first option is optimal because we could make as many templates as we want to match all the variations we know. However, that would be computationally too expensive.

We came up with a way to implement the second option that produces results almost as good as the first, while requiring only a few extra computations. There is no single rule of thumb; the rule depends on the case. For example, in Salsa and Cumbia it is not always true that the first note of the measure is not played. Actually, in most Cumbias the first note is played, but in some of them it is not. Such a note therefore does not constitute a criterion for deciding whether or not a song is a Cumbia, so we can simply ignore it in our analysis. This is not the case for the note in the fifth bin (the first note in the template). This note is almost always played.

This idea suggests the following method. Define a vector v = [v1, v2, …, v8] such that vi ∈ {-1, 0, 1} for i ∈ {1, 2, …, 8}. If there is a note in the i-th bin of the template and that note is strictly required (as in the 5th and 7th bins of the bass line for Cumbia), set vi = 1. If it is just optional (i.e., it cannot be used as a criterion, as in the 1st bin of the bass line for Cumbia), set vi = 0. If there is no note in the i-th bin of the template (as in the 2nd, 3rd, 4th, 6th, and 8th bins of the bass line for Cumbia), set vi = -1, which means that the score decreases by one if that note is played. In the Cumbia bass line case we would get a score of 2 in the best case and a score of -5 in the worst case. We can add a bias of 6 and therefore keep the score in the range [0,12] (actually [1,12], with the free gift of the 1st bin). This criterion was used in all the templates and helped to expand the variations recognized in each of the styles.

Figure 4. Templates with optional notes. Required notes have a value of +1, so that they contribute positively to the score if they are present in the input. Optional notes have a value of zero: whether or not they are present makes no difference. The rest of the bins have a value of -1, which decreases the score if notes are present in the input.
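For the Cumbia bass line, the weight vector described in the text and the best and worst single-measure scores can be reproduced with the short sketch below. The function name and the list representation (one frequency per bin, in bin order) are assumptions for illustration, and the bias is left out so that the raw scores can be compared directly with the values given in the text.

```python
# Weights for bins 1..8 of the Cumbia bass line, as described in the text:
# 5th and 7th bins required (+1), 1st bin optional (0), all others absent (-1).
CUMBIA_BASS_WEIGHTS = [0, -1, -1, -1, 1, -1, 1, -1]


def weighted_score(frequencies, weights):
    """Dot product of per-bin frequencies with the template weights."""
    return sum(w * f for w, f in zip(weights, frequencies))


# One measure, with frequencies equal to 0 or 1 for each bin.
best = weighted_score([0, 0, 0, 0, 1, 0, 1, 0], CUMBIA_BASS_WEIGHTS)
worst = weighted_score([1, 1, 1, 1, 0, 1, 0, 1], CUMBIA_BASS_WEIGHTS)
```

Here `best` is 2 and `worst` is -5, matching the text, and playing the optional 1st bin leaves the best score unchanged. Because the score is a dot product, the same function can be applied to the histogram of a whole song.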

Another difficulty we met was related to the simplification we made by using only two-beat measures. Latin music measures are intrinsically four-beat based most of the time. We made the simplification for two reasons. First, the first and second halves of the four-beat measure are usually so similar that it made sense to treat them as equal. Second, the first and second halves of the four-beat template can appear swapped (i.e., in some songs the first one comes second and vice versa). Furthermore, this order can change within a song. A typical case is the common change of “clave” in Salsa. Some Salsas start with the 3-2 clave and change to the 2-3 clave in more rhythmic parts.

To solve this problem of mapping a four-beat measure onto a two-beat measure, we observed the following. If a part of the pattern is played in only one half of the four-beat pattern, then after collapsing the four-beat measures into two-beat measures that part should appear in only half of the two-beat measures of the song (see figure 5). Therefore, as the number of times that part of the pattern appears moves away from the mean (half the number of two-beat measures), the similarity with the template decreases, and so should the score. This can be done by replacing the frequency of that bin with its distance from the mean and setting its value in the vector to -1. Thus, as the distance from the mean increases, the score decreases.

Figure 5. Collapsing four-beat measures into two-beat measures. Moving the notes of bins 9 to 16 into bins 1 to 8 makes the two lower notes (bins 12 and 13) appear in bins 4 and 5 with half the frequency of the rest of the notes.
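One way to sketch this adjustment is shown below. Bins that are expected in only half of the two-beat measures contribute minus their distance from the mean, and all other bins contribute their weight times their frequency. The four-beat pattern used in the usage note (notes in bins 7, 12, 13, and 15) is hypothetical, chosen only so that bins 12 and 13 fold onto bins 4 and 5 as in figure 5; the function names are likewise assumptions.

```python
def collapse(hist16):
    """Fold bins 9..16 of a four-beat histogram onto bins 1..8."""
    return [hist16[i] + hist16[i + 8] for i in range(8)]


def score_with_half_bins(hist8, weights, half_bins, n_measures):
    """Weighted score in which half-frequency bins are penalized by
    their distance from the mean, n_measures / 2.

    hist8 and weights are lists ordered by bin (1..8); half_bins holds
    the 1-based bins expected in half of the two-beat measures;
    n_measures is the number of two-beat measures in the song.
    """
    mean = n_measures / 2
    total = 0.0
    for b, (f, w) in enumerate(zip(hist8, weights), start=1):
        if b in half_bins:
            total -= abs(f - mean)   # frequency replaced by distance, weight -1
        else:
            total += w * f
    return total
```

With ten four-beat measures of the hypothetical pattern, the collapsed histogram has 20 hits in bin 7 and 10 hits each in bins 4 and 5. Taking bin 7 as required and bins 4 and 5 as half-frequency bins gives a score of 20 over the 20 two-beat measures, and the score drops as the counts in bins 4 and 5 move away from 10.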

The last problem we have to solve is intrinsically related to the indefinite pitch of percussion sounds. For example, in the conga line of the Cumbia the tradition is to play a high-pitched sound on the first upbeat (3rd bin) and a lower-pitched sound on the second upbeat (7th bin). However, as we have three conga sounds (mute, open high, and open low), each with a different pitch, there are three ways to meet this requirement. One way to solve the problem would be to define three templates, but as we have discussed before, this is computationally expensive. As the conga template has only one sound at a time, playing only one note at a time subject to the defined pitch constraints should produce a perfect score. The problem is that, as there are two correct conga sounds for each of the bins (e.g., mute and open high for the 3rd bin, and open high and open low for the 7th bin), we do not know which one to choose. We simply bias the system in favor of the style (Cumbia) by selecting as the correct sound the one that was played the most. If only that sound was played and not the other, the score will be maximal. If the other sound was also played, the score should decrease, because according to the template only one of them should be played.

Tempo Criteria

The last feature that we consider is one of the simplest but most important: the tempo. The tempo (i.e., speed) of a specific style falls within an approximate range. For example, Salsas and Cumbias usually have a tempo between 70 and 105 quarter notes per minute. Merengues, however, range from 120 to 160 quarter notes per minute. Therefore, this criterion can serve to distinguish Merengue from the rest, but it cannot distinguish Salsa from Cumbia.
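A minimal sketch of this criterion, reading both ranges as quarter notes per minute, is given below. Returning a set of candidate styles and treating the range limits as inclusive are choices made for the example; the text gives only approximate ranges.

```python
SALSA_CUMBIA_TEMPO = (70, 105)   # quarter notes per minute
MERENGUE_TEMPO = (120, 160)      # quarter notes per minute


def styles_by_tempo(tempo):
    """Return the styles whose usual tempo range contains the tempo."""
    low, high = MERENGUE_TEMPO
    if low <= tempo <= high:
        return {"Merengue"}
    low, high = SALSA_CUMBIA_TEMPO
    if low <= tempo <= high:
        return {"Salsa", "Cumbia"}
    return set()   # outside both ranges: tempo gives no evidence
```

A tempo of 140 leaves only Merengue as a candidate, while a tempo of 90 leaves both Salsa and Cumbia, which then have to be separated by the bass and conga patterns.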

Implementation

In our implementation, we used the ideas described here for the conga but not for the bass. The reason is that we initially started classifying the songs based only on the bass, and a simpler approach gave good enough results, so we did not implement the more complex approach for the bass. We had to use it for the conga, though, because the simple approach would not work for it.

For the classification based on the bass, we just took the histogram and selected the two bins with the highest counts. If they matched the bins of one of the templates, we classified the song as belonging to that style; if not, the song was left unclassified. Figures 6, 7, and 8 show that most Merengues have maxima in the 1st and 5th bins, most Salsas have maxima in the 4th and 7th bins, and most Cumbias have maxima in the 5th and 7th bins. Recall that for Salsa and Cumbia we did not consider the 1st bin as a criterion for classification.
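The two-maximum rule could be sketched as follows. The text does not spell out how the 1st bin is ignored or how ties are handled, so this sketch makes its own assumptions: for Salsa and Cumbia the two maxima are taken over bins 2 to 8 only, ties go to the earlier bin, and the function returns every style whose peak bins match, leaving any ambiguity to another criterion such as tempo.

```python
# Peak bins per style as listed in the text, and the bins each style ignores.
BASS_PEAKS = {"Merengue": {1, 5}, "Salsa": {4, 7}, "Cumbia": {5, 7}}
IGNORED_BINS = {"Merengue": set(), "Salsa": {1}, "Cumbia": {1}}


def top_two_bins(hist, ignored):
    """Return the two 1-based bins with the highest counts, skipping ignored bins."""
    candidates = [b for b in range(1, len(hist) + 1) if b not in ignored]
    ranked = sorted(candidates, key=lambda b: hist[b - 1], reverse=True)
    return set(ranked[:2])


def matching_styles(hist):
    """List the styles whose peak bins equal the two maxima of the histogram.

    hist is a list of eight counts ordered by bin. An empty list means
    the song is left unclassified.
    """
    return [
        style for style, peaks in BASS_PEAKS.items()
        if top_two_bins(hist, IGNORED_BINS[style]) == peaks
    ]
```

For a histogram such as [40, 0, 0, 2, 38, 0, 1, 0] the sketch returns only Merengue, and for [10, 0, 0, 35, 3, 0, 33, 1] only Salsa. A Cumbia whose 1st bin is played in most measures can match both Merengue and Cumbia under this reading.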