Using mothur to cluster sequences
Author & Date:
L. Higgins, 11/2011, updated from A. Lipus 01/2010
Purpose:
Mothur is a command-line computer program for analyzing sequence data from microbial communities. Most frequently, we use mothur to cluster sequences – that is, group them into collections of related sequences, called genotypes. It’s pretty much the first step in any sort of phylogenetic or genotype-based community analysis.
Procedure:
You start with a fasta file of clean sequences. Fasta is a very common format for conveying sequence information in a lot of bioinformatics applications. It looks like this:
>SequenceName1
gaatgacctatgatcc....
>SequenceName2
gaatggcctttgatcc....
>SequenceName3
taatgacctatgatcg....
Note that the case doesn’t matter (i.e., GAATC is the same as gaatc). Each sequence name is marked with a leading greater-than symbol (>) and separated from the sequence it represents by a hard return. Many programs (e.g., Sequencher, MacClade, MEGA) are capable of exporting sequences in this format. Also, there are a number of online converters that can take sequences in one format and translate them into a different one; just search around.
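If you want to double-check a fasta file before going any further, the format described above is simple enough to read with a few lines of Python. The following is only a minimal sketch: the function name read_fasta and the toy sequences are made up for illustration, and it assumes nothing beyond the name-line and sequence-line layout shown above.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new name line closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # A sequence may wrap over several lines; case carries no
            # meaning, so everything is normalized to lowercase.
            chunks.append(line.lower())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Toy input (hypothetical sequences, mixed case, one wrapped sequence).
example = """>SequenceName1
GAATGACC
tatgatcc
>SequenceName2
gaatggcctttgatcc
""".splitlines()

records = read_fasta(example)
for name, seq in records:
    print(name, len(seq), seq)
```

To run it on a real file, pass the open file instead of the toy text, e.g. read_fasta(open("ProjectNameDATE.fasta")).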
Assemble all of the sequences you’re interested in into one fasta file and name it something descriptive, like “ProjectNameDATE.fasta”. Make sure there are no spaces in the filename. Save it into a clearly named folder – like “ProjectNameDATE”.
Before we can cluster them, we need to align the sequences using the algorithm MUSCLE (stands for MUltiple Sequence Comparison by Log-Expectation). We use a software program called MEGA for that task (and for many other tasks as well). MEGA is a Windows-only application, so open up the VMware Windows environment, located in the Applications folder. Then, copy and paste your fasta file onto the Windows system.
Open the file by starting up MEGA, selecting Align > Edit/Build Alignment > Create/Build New Alignment, then selecting your file. You should see your color-coded sequences listed, and they should be slightly (or severely) misaligned. To align them, click on the flexing arm icon and select “align DNA”. Leave all of the settings as they are, then click compute. After a few seconds, you should have a nice aligned set of sequences. If the alignment looks fishy, don’t continue – it will only mess up the clustering algorithm, and you’ll have wasted an afternoon. If it looks right, export the alignment by doing Data > Export Alignment > FASTA format. Save it as something like “ProjectNameDATE_Aligned”. MEGA will give the file a .fas suffix, which the commands below (and some other programs) don’t expect. Just go into the folder it’s saved in and change the suffix to .fasta. Then, copy and paste the file back into your ProjectNameDATE folder back in the Mac system, and close out of VMware.
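One quick way to catch a fishy alignment before clustering is to confirm that every aligned sequence has the same length (gaps included), since that is what an alignment should produce. The sketch below is an optional check, not part of MEGA or mothur; the function names aligned_lengths and fasta_name_for and the toy sequences are hypothetical.

```python
from pathlib import Path


def aligned_lengths(fasta_text):
    """Return {sequence name: aligned length, gaps included} for FASTA text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def fasta_name_for(path):
    """Return the same path with a .fasta suffix in place of .fas."""
    return Path(path).with_suffix(".fasta")


# Toy alignment in which sequence "c" is one column short.
toy = ">a\ngaat-gacc\n>b\ngaatggacc\n>c\ngaatgacc\n"
lengths = aligned_lengths(toy)
is_aligned = len(set(lengths.values())) == 1
print(lengths, "aligned" if is_aligned else "NOT aligned")
print(fasta_name_for("ProjectNameDATE_Aligned.fas"))
```

On a real export you would read the text with Path("ProjectNameDATE_Aligned.fas").read_text(), and the rename itself could be done with Path(old).rename(fasta_name_for(old)). Equal lengths only show that the file is formatted as an alignment; you still need to eyeball it in MEGA to judge whether the alignment is any good.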
Now you’re ready to run mothur. Open up Terminal, type “cd” (stands for “change directory” -- “directory” being a fancy computer science name for “folder”) followed by a space, then find your ProjectNameDATE folder and drag the entire folder into the Terminal window. Hit enter. Now, go to the Applications folder, drag the mothur icon into Terminal, and hit enter. You’ll see some text about the author and citation information for the program, and this:
mothur >
You tell the program what to do by typing commands after the prompt (the > symbol). First off, we need to tell the program to calculate the distances between each pair of sequences in your file. It is the resulting list of distances between sequences that mothur will actually use to group the sequences into clusters of closely-related sequences. The command to do this is called dist.seqs. After mothur >, type the command exactly as follows:
mothur > dist.seqs(fasta=ProjectNameDATE_Aligned.fasta, output=lt)
Then hit enter. As you can see, the result is a new file called “ProjectNameDATE_Aligned.phylip.dist”, saved in your ProjectNameDATE folder. Phylip is another sequence file format, and if you open up this file in TextWrangler (which is not necessary, but you can try it just for the purpose of illustration), you’ll see a lower-triangular matrix, with the first column listing all of your sequences, and then what looks like a bunch of small decimal numbers. Like a matrix of driving distances between cities in a road atlas, this tells you how “far apart” any given pair of sequences is. Back in Terminal, you should once again see “mothur >”, which means that the program is awaiting your next command.
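If you would rather inspect the distances programmatically than squint at them in TextWrangler, the lower-triangular layout can be read with a short Python sketch like the one below. It assumes the usual layout for this kind of file: a first line giving the number of sequences, then one row per sequence holding its name followed by its distances to every sequence listed above it. The function name read_lower_triangle and the toy numbers are made up for illustration.

```python
def read_lower_triangle(text):
    """Parse a lower-triangular distance matrix into {(name_a, name_b): distance}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].split()[0])  # assumed: first line is the sequence count
    names, distances = [], {}
    for row in lines[1:count + 1]:
        fields = row.split()
        name, values = fields[0], [float(v) for v in fields[1:]]
        # Row i holds the distances to the i sequences listed before it.
        for earlier, value in zip(names, values):
            distances[(earlier, name)] = value
        names.append(name)
    return names, distances


# Toy matrix with invented distances for three sequences.
toy = """3
SequenceName1
SequenceName2\t0.0125
SequenceName3\t0.0250\t0.0375
"""
names, distances = read_lower_triangle(toy)
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

Just like looking up two cities in the road atlas, distances[("SequenceName1", "SequenceName3")] gives the distance for that pair. The sketch assumes sequence names contain no spaces.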
Before we can cluster the sequences, we need to have mothur read off the distances it just calculated. One would think that the program might remember these previous calculations, but no; that’s just not how computers work. Type the following command, then hit enter:
mothur > read.dist(phylip=ProjectNameDATE_Aligned.phylip.dist)
You should see a progress bar that rapidly fills up and, once again, the mothur > that indicates the program is waiting for your next command. Now, we need to have mothur cluster the sequences based on the distances we just calculated. The command to do this is, logically enough, called “cluster”. Execute the following command:
mothur > cluster(method=average)
What you’ve just done is told the program to cluster your sequences using the average neighbor method. There are two other methods one can use as well: nearest neighbor, and furthest neighbor. Furthest neighbor is the most stringent method (i.e., results in more clusters, each containing a few very similar sequences), and it is the default, so if you just type the command cluster() and hit enter, it will cluster your sequences using this method. On the other extreme, nearest neighbor is the least stringent and will tend to create fewer, larger clusters that each contain sequences that are not quite as similar to each other. Average neighbor, which is the method I prefer, is a sort of middle ground between the other two. If you have some free time, it’s quite instructive to try clustering your sequences using each of the three methods and then comparing the results. You can do this using the commands cluster(method=furthest), cluster(method=average), and cluster(method=nearest).
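To get a feel for why the three methods disagree, it can help to see them side by side on a tiny made-up example. The Python sketch below is a toy illustration of the nearest, average, and furthest neighbor rules, not mothur’s actual implementation (mothur handles rounding, ties, and large datasets in its own way). The function name cluster_at_cutoff, the four sequence names, and all of the distances are invented.

```python
from itertools import combinations


def cluster_at_cutoff(names, dist, cutoff, method):
    """Merge clusters until no pair is within the cutoff under the chosen rule."""
    linkage = {
        "nearest": min,    # distance between the closest members
        "furthest": max,   # distance between the most distant members
        "average": lambda values: sum(values) / len(values),
    }[method]

    def between(a, b):
        return linkage([dist[frozenset((x, y))] for x in a for y in b])

    clusters = [(name,) for name in names]
    while len(clusters) > 1:
        # Find the two clusters that are currently closest to each other.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        if between(a, b) > cutoff:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return sorted(sorted(c) for c in clusters)


# Invented pairwise distances for four hypothetical sequences.
pairs = {
    ("A", "B"): 0.010, ("A", "C"): 0.034, ("A", "D"): 0.036,
    ("B", "C"): 0.025, ("B", "D"): 0.028, ("C", "D"): 0.040,
}
dist = {frozenset(pair): d for pair, d in pairs.items()}
names = ["A", "B", "C", "D"]

results = {
    method: cluster_at_cutoff(names, dist, 0.03, method)
    for method in ("nearest", "average", "furthest")
}
for method, clusters in results.items():
    print(method, len(clusters), clusters)
```

With these toy numbers and a 0.03 cutoff, nearest neighbor lumps all four sequences into one cluster, average neighbor gives two (A, B, and C together, D alone), and furthest neighbor gives three (A and B together, C and D each alone), which matches the ordering of stringency described above.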
What the program spits out is three new files:
ProjectNameDATE_Aligned.phylip.an.sabund,
ProjectNameDATE_Aligned.phylip.an.rabund, and
ProjectNameDATE_Aligned.phylip.an.list.
The .an. part indicates that you used the average neighbor method. The suffixes .sabund, .rabund, and .list indicate the output format. We’ll use the .list file to look at our clusters.
Go into your ProjectNameDATE folder and add a .txt suffix to ProjectNameDATE_Aligned.phylip.an.list. Then, open Excel, select File > Import... > Text File, navigate to your .list.txt file, and click “Get Data”. In the Text Import Wizard box, just click “Finish”.
In your spreadsheet, going down column A, you’ll see something like “unique, 0, 0.01, 0.02, 0.03”, etc. These numbers indicate the cutoff levels for sequence clusters. So, unique means that a cluster is defined by all sequences that are identical along every single base in the sequence. Ones with a 0% cutoff differ by at most a couple of bases, but are less than 0.5% different from each other. 0.01 indicates less than 1% difference, and so on. Column B tells you how many clusters were recovered at each cutoff level, and then the remaining columns are filled with the clusters themselves. One cell contains one cluster, and each sequence in the cluster is listed, and they’re separated by commas. It’s not the most elegant output format, but it’s all we’ve got.
So, if for example you’re interested in finding out how many clusters were recovered at a ≥97% sequence similarity cutoff, go to the row corresponding to 0.03 (for 3%), and read off in column B that there were X different clusters. Then, you can look at the cells to the right to see which sequences clustered together.
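If Excel chokes on a large .list file, the same lookup can be sketched in Python. The example below assumes the layout described above: one tab-separated row per cutoff, holding the cutoff label, the number of clusters, and then one field per cluster with its sequence names separated by commas. The function name read_list_file and the toy rows are hypothetical; rows whose second field is not a whole number (such as a header row, if your file has one) are simply skipped.

```python
def read_list_file(text):
    """Parse .list-style text into {cutoff label: list of clusters}."""
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skip blank lines and any header row
        label, count = fields[0], int(fields[1])
        # Each remaining field is one cluster; its members are comma-separated.
        table[label] = [cell.split(",") for cell in fields[2:2 + count]]
    return table


# Toy rows with invented sequence names.
toy = (
    "unique\t4\tSeq1\tSeq2\tSeq3\tSeq4\n"
    "0.01\t3\tSeq1,Seq2\tSeq3\tSeq4\n"
    "0.03\t2\tSeq1,Seq2,Seq3\tSeq4\n"
)
table = read_list_file(toy)
clusters_97 = table["0.03"]  # the 0.03 row is the 97% similarity cutoff
print(len(clusters_97), "clusters at 0.03")
for members in clusters_97:
    print(len(members), members)
```

For a real file, replace the toy text with open("ProjectNameDATE_Aligned.phylip.an.list").read(); no .txt suffix is needed for this route.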
SPECIAL NOTE:
What we’ve just done is only a very small portion of the different things you can do in mothur. If you want to learn more about the different mothur commands, how to modify them, or more about the theory behind what we’ve done, go to the mothur wiki manual at http://www.mothur.org/wiki/Mothur_manual. The commands we’ve just used are listed under the OTU-based approaches tab.