Complementary coevolution between paired nucleotides

The scripts for the complementary coevolution analysis are written in Python, C and R. They test for associations between sites predicted to be base-paired and sites that show significant degrees of complementary coevolution.

System and software requirements:

-  This analysis requires a computer cluster and MPI libraries

-  Python (http://www.python.org/getit/) must be installed

-  The HYPHY package should be installed on the system

Step 1: Preparing the input files

Source: http://web.cbio.uct.ac.za/~brejnev/downloads/ComputationalTools/Preparation_of_input_alignment_and_tree_files.zip

(1)  Detecting recombination:

Use RDP4 (available at http://darwin.uvigo.es/rdp/rdp.html) to detect recombination within your sequence alignment. After the analysis completes, choose the option “Save distributed alignment (with recombinant regions separated)”. This removes the recombinant regions from the sequences in which they were detected and appends them to the bottom of the alignment as separate sequences.

(2)  Renaming sequences:

Place the “Editing_Seq_Names.py” script in a folder together with all the distributed alignments obtained in (1) and run the script. It renames all sequences contained in each alignment and renames the alignment files by replacing “.fas” with “E.fas”. This avoids sequence names containing special characters, which would cause PhyML and HYPHY to crash.
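The renaming step can be sketched as follows. This is a minimal illustration only, not the code of “Editing_Seq_Names.py”: the function names, the “S<index>_” prefix and the choice to replace every character other than letters, digits and underscores are all assumptions made for the example.

```python
import re
from pathlib import Path


def sanitize_name(name, index):
    """Return a name made only of letters, digits and underscores.

    The running index prefix is an assumption of this sketch; it keeps names
    unique even if two cleaned names would otherwise collide.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())
    return f"S{index}_{cleaned}"


def rename_fasta_text(text):
    """Rewrite every FASTA header line in `text`; sequence lines are unchanged."""
    out = []
    index = 0
    for line in text.splitlines():
        if line.startswith(">"):
            index += 1
            out.append(">" + sanitize_name(line[1:], index))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def edited_filename(path):
    """Map e.g. 'aln.fas' to 'alnE.fas', following the naming described above."""
    p = Path(path)
    return str(p.with_name(p.stem + "E" + p.suffix))
```

For example, rename_fasta_text(">seq|1 (x)\nACGT\n") yields a header that contains no characters other than letters, digits and underscores.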

(3)  Generating recombination-free alignments:

For each of the renamed distributed alignments, edit the “Split_Alignment_Draw_ML_Trees.py” script to specify the distributed alignment name, the number of sequences in the original alignment (before recombination detection) and the length of the alignment. Then run the script, which splits the alignment into recombination-free sub-alignments (in PHYLIP format) and builds a maximum likelihood tree for each sub-alignment.

N.B. Run one distributed alignment at a time, and each time set aside the generated sub-alignments and trees.

Step 2: Run the coevolution script in HYPHY

Source: http://web.cbio.uct.ac.za/~brejnev/downloads/ComputationalTools/Running_the_coevolution_script_on_the_computer_cluster.zip

Here, the sub-alignments and ML trees obtained from each distributed alignment are used separately.

(1)  Place these sub-alignments and their corresponding ML trees in the directory where “Coevolution_script.c” is located, then run the “mk_submission_sh.py” script. It generates indexed Python scripts (Submission{x}.py) and a shell script, “array.sh”, which is used to run all the Submission{x}.py scripts as an array.

(2)  Run the shell script “array.sh” to submit all the submission scripts created. For each sub-alignment and its corresponding tree, “Coevolution_script.c” generates output files containing p-values and λ values for every pair of sites that are at most 100 nucleotides apart (λ>1 indicates a tendency towards complementary coevolution, while λ<1 indicates a tendency towards non-complementary coevolution). Because the analysis is computationally intensive, it is highly recommended to use the parallel version of HYPHY on a computer cluster. Make sure the paths to the executable programs are up to date.

N.B. For each run, move the obtained output files into a folder bearing the name of the original distributed alignment.
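To make the scope of Step 2 concrete, the sketch below enumerates the site pairs that fall within the 100-nucleotide window and labels a λ value according to the rule given in (2). It is an illustration only and is not taken from “Coevolution_script.c”: the function names, the 1-based site numbering and the reading of “at most 100 nucleotides apart” as j - i <= 100 are assumptions.

```python
def site_pairs(alignment_length, max_distance=100):
    """Yield 1-based site pairs (i, j) with i < j and j - i <= max_distance."""
    for i in range(1, alignment_length + 1):
        last = min(alignment_length, i + max_distance)
        for j in range(i + 1, last + 1):
            yield (i, j)


def classify_lambda(lam):
    """Label a lambda estimate using the thresholds stated in the text."""
    if lam > 1:
        return "complementary"
    if lam < 1:
        return "non-complementary"
    # The text does not describe lambda == 1; this label is an assumption.
    return "neither"
```

Under these assumptions a 200-site sub-alignment gives 14950 pairs to test, which illustrates why a parallel run on a cluster is recommended.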

Step 3: Run the test for association

Source: http://web.cbio.uct.ac.za/~brejnev/downloads/Computational_Tools/Test_for_association_between_coevolving_and_base-paired_sites.zip

(1)  In the “Pairing_Vs_Coevolution.py” script, edit the path to the output directory for every distributed alignment, and specify the path to the base-pairing files from NASP.

Run the script to perform the association test.
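
The idea behind an association test of this kind can be sketched as a 2x2 contingency table of tested site pairs (base-paired or not, coevolving or not). The text does not state which statistical test “Pairing_Vs_Coevolution.py” applies, so the one-sided Fisher's exact test used here is an assumption chosen for illustration, as are the function names and the representation of pairs as (i, j) tuples. It requires Python 3.8 or later for math.comb.

```python
from math import comb


def contingency_table(tested_pairs, base_paired, coevolving):
    """Count tested pairs into a 2x2 table, returned as (a, b, c, d).

    a: base-paired and coevolving      b: base-paired, not coevolving
    c: unpaired and coevolving         d: unpaired, not coevolving
    """
    a = b = c = d = 0
    for pair in tested_pairs:
        paired = pair in base_paired
        coevo = pair in coevolving
        if paired and coevo:
            a += 1
        elif paired:
            b += 1
        elif coevo:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value: P(X >= a) with all margins fixed.

    Assumption of this sketch: the real script may use a different test.
    """
    row1 = a + b          # number of base-paired pairs
    col1 = a + c          # number of coevolving pairs
    n = a + b + c + d     # number of tested pairs
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)
```

A small p-value from such a test would suggest that coevolving pairs are over-represented among base-paired sites; the pair sets themselves would come from the Step 2 output files and the NASP base-pairing files.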