Supplementary Materials for:

Genetic impact of vaccination on breakthrough HIV-1 sequences from the Step trial

Morgane Rolland1*, Sodsai Tovanabutra2*, Allan C. deCamp3*, Nicole Frahm3*, Peter B. Gilbert3, Eric Sanders-Buell2, Laura Heath1, Craig A. Magaret3, Meera Bose2, Andrea Bradfield2, Annemarie O’Sullivan2, Jacqueline Crossler2, Teresa Jones2, Marty Nau2, Kim Wong1, Hong Zhao1, Dana N. Raugi1, Stephanie Sorensen1, Julia N. Stoddard1, Brandon S. Maust1, Wenjie Deng1, John Hural3, Sheri Dubey5, Nelson L. Michael2, John Shiver5, Lawrence Corey3, Fusheng Li3, Steve G. Self3, Jerome Kim2, Susan Buchbinder4, Danilo R. Casimiro5, Michael N. Robertson5, Ann Duerr3, M. Juliana McElrath3, Francine E. McCutchan2¶, and James I. Mullins1§

*contributed equally

Supplementary methods.

Laboratory methods

Plasma specimens. HIV-1 testing was done at day 1, weeks 12, 30 and 52, and every 6 months thereafter through year 4 of trial participation. Specimens were screened with an immunoassay and HIV-1 infection was confirmed with a Western blot and a plasma viral RNA assay. Earlier samples were tested to validate the timing of HIV-1 infection. After HIV-1 diagnosis, study participants were followed at weeks 1, 2, 8, 12, 26, 52 and 78. Plasma specimens from the first available RNA positive specimen were used for HIV-1 sequence amplification.

HIV-1 near full-length genome (nflg) sequencing. This study was designed to characterize HIV-1 viruses from volunteers infected before the unblinding date of October 17, 2007. Plasma specimens collected at the time of HIV-1 diagnosis were obtained from 88 study volunteers. Viral RNA extracted from plasma (QIAamp Viral RNA Kit, Qiagen) was the template for cDNA synthesis. cDNA synthesis was done using SuperScript III Reverse Transcriptase (Invitrogen) at the University of Washington (UW), or ThermoScript™ RT (Invitrogen) at the Military HIV Research Program (MHRP). Oligo dT was used to prime cDNA synthesis at UW, and HIV-specific primers, either JL68R (5’-CTTCTTCCTGCCATAGGAGATGCCTAAG-3’) or UNINEF-7’ (5’-GCACTCAAGGCAAGCTTTATTGAGGCTT-3’), were used to prime cDNA synthesis at MHRP. nflg or half genomes were amplified by nested PCR following endpoint-dilution of cDNA templates (Expand polymerase) according to methods described in Rousseau et al.1. PCR products derived from single amplifiable viral genome templates were gel purified and sequenced directly2,3. Sequences have been deposited in GenBank under accession numbers JF320002–JF320643.

IFN-γ ELISpot Assays. IFN-γ ELISpot assays were performed on frozen peripheral blood mononuclear cells (PBMC) obtained 4 weeks after the second vaccination from all 27 vaccine recipients with available cells, using overlapping 15-mer peptides matched to the vaccine inserts. Assays were conducted with 10^5 cells/well as previously described4. Peptides were tested in a minipool format with each minipool containing 5 consecutive peptides; peptides corresponding to positive wells in the minipools were then tested individually. To be scored as positive, a response had to be greater than three times the mean background and at least 50 spot-forming cells per million (SFC/M) after background subtraction.
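The two-part positivity criterion can be sketched in a few lines of Python. This is only an illustration: the function name, argument names and the assumption that all readings are already expressed in SFC/M are ours, not part of the assay protocol.

```python
def is_positive_response(well_sfc_per_million, background_sfc_per_million,
                         fold_threshold=3.0, min_net_sfc=50.0):
    """Return True if a well meets both positivity criteria.

    well_sfc_per_million: reading for the peptide-stimulated well (SFC/M).
    background_sfc_per_million: readings of the background wells (SFC/M).
    Assumption: every value is already normalized to spots per million cells.
    """
    mean_background = (sum(background_sfc_per_million)
                       / len(background_sfc_per_million))
    net_response = well_sfc_per_million - mean_background
    # Criterion 1: more than three times the mean background.
    exceeds_fold = well_sfc_per_million > fold_threshold * mean_background
    # Criterion 2: at least 50 SFC/M after background subtraction.
    exceeds_floor = net_response >= min_net_sfc
    return exceeds_fold and exceeds_floor
```

Both conditions must hold: a well with a very low background can pass the fold criterion and still fail the 50 SFC/M floor.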

Sequence Analyses. nflg nucleotide sequences were error-corrected in Sequencher (Gene Codes Corporation). All sequences from a given volunteer were then aligned manually using the MacClade program (version 4.08)5 and each mutation rechecked in Sequencher. Sequences were normally accepted if they contained 5 ambiguous base calls. Each intra-host alignment was screened for phylogenetically informative sites, which are identified by removing all private mutations (mutations occurring only once) from the set of sequences using the InSites algorithm (http://indra.mullins.microbiol.washington.edu/cgi-bin/InSites/index.cgi). The amplification strategy was to obtain 5 to 10 nflg sequences per specimen, depending on intra-host sequence variation assessments: if 5 or more phylogenetically-informative sites were found in the first 5 nflg sequences generated per individual, then ~5 additional nflg were amplified and sequenced. There was no evidence of population stratification among the samples according to the lab where they were processed (Mean distance 0.055 at MHRP and 0.058 at UW, p = 0.62), or according to the number of sequences that we obtained from each individual (n = 3 to 14). Three test samples were evaluated in both labs and returned indistinguishable results. We found no evidence that there was a bias due to the time of infection, when categorizing individuals as seronegative or as in early HIV-1 infection (2-tailed Fisher’s exact test comparing vaccine and placebo: p = 0.17).
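As a rough illustration of the screening rule, the sketch below counts alignment columns that remain variable once private mutations are set aside, then applies the threshold of 5 informative sites. It is not the InSites implementation; the function names are hypothetical, and we assume that only unambiguous bases (A, C, G, T) are counted and that a site is informative when at least two different bases each occur in two or more sequences.

```python
from collections import Counter


def informative_sites(aligned_seqs):
    """Return 0-based columns that stay variable after dropping singletons."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in aligned_seqs)
        # Keep only bases seen in two or more sequences (non-private states).
        shared_states = [base for base, n in counts.items()
                         if base in "ACGT" and n >= 2]
        if len(shared_states) >= 2:
            sites.append(col)
    return sites


def needs_additional_genomes(first_five_seqs, threshold=5):
    """Apply the amplification rule: 5 or more informative sites."""
    return len(informative_sites(first_five_seqs)) >= threshold
```

For example, in the toy alignment `["AAGT", "AAGT", "ACGA", "ACGT", "AAGT"]` only the second column is informative; the single A in the last column is a private mutation and is ignored.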

Coding sequences were extracted from alignments using Gene Cutter (http://www.hiv.lanl.gov), and multisubject alignments were created using MUSCLE6, as implemented in Seaview7. Codon-based alignments of individual genes along with the translated protein sequence output were manually refined as needed in MacClade.

Phylogenetic Analyses. Maximum-likelihood phylogenetic trees were reconstructed by estimating the GTR + I + G nucleotide substitution model using PhyML (version 3.0)8 implemented in DIVEIN9 (http://indra.mullins.microbiol.washington.edu/cgi-bin/DIVEIN). Gag, Pol and Nef trees included all founder sequence(s) for all volunteers as well as the MRKAd5 HIV-1 Gag/Pol/Nef (CAM-1 gag, IIIB pol, JRFL nef) vaccine, HXB2 and CON_B04 sequences. Based on nucleotide sequences from gag, pol and nef, pairwise diversity and tree-based divergence measures were calculated from the most-recent common ancestor sequence, the Step vaccine insert sequences, HXB2 and the HIV-1 B consensus 2004.

Based on phylogenetic analyses and inspection of nflg sequences, we estimated the number of founder variants for each subject and deduced the consensus sequences that corresponded to each founder virus (Supplementary Table 1). To identify multiple founders, we required that each group of at least two sequences carry polymorphisms shared within the group but not with the remaining group(s). The number of such shared polymorphisms ranged from 1 to 4 in the current study. A gag tree based on all nucleotide sequences (Fig. 1) is shown along with an env nucleotide tree based on all volunteer-derived env nucleotide sequences and circulating sequences from the US, Canada and Peru, where all infections in our dataset occurred (Supplementary Fig. 1).

CTL Epitope Prediction. Known HIV-1 epitopes were included and potential CTL epitopes were predicted using 2 methods: NetMHC10 (http://www.cbs.dtu.dk/services/NetMHC/) and Epipred11 (http://atom.research.microsoft.com/bio/epipred.aspx/). NetMHC predicts binding of peptides to 4-digit HLA alleles; the software discriminates on the basis of quantitative peptide MHC binding data and discerns strong and weak binders. We accepted known epitopes reported at the Los Alamos National Laboratory HIV database (HIVDB) as well as HIVDB-variant epitopes that had identical HXB2 coordinates and were strong or weak binders in the reference and founder sequences based on each individual’s HLA. Epipred identifies known and potential HIV-1 CTL epitope motifs using 2-digit HLA information. In addition to the known HLA-restricted epitopes previously reported at LANL, we accepted all epitope motifs with a posterior probability of >0.8. HLA-specific epitopes were predicted in all HIV-1 proteins derived from the volunteers’ sequences and in the corresponding consensus founder sequences, based on each individual’s HLA genotype. Protein sequences from 3 volunteers were excluded from this analysis: those corresponding to one individual infected with a non-B subtype virus (CRF02_AG recombinant), the sole female infected volunteer in the study group, and one individual for whom HLA-genotyping information was not available. A separate set of analyses was done while excluding one or the other of two subjects with genetically very similar viruses. Results were the same whether or not both sequences were included, so results from analyses in which both subjects’ sequences were used are reported. Epitopes were also identified in the MRKAd5 HIV-1 Gag/Pol/Nef vaccine sequences and in all proteins from HXB2 and from the subtype B 2004 consensus sequence (CON_B04) (available at the HIVDB, http://www.hiv.lanl.gov).

Sieve analyses. Two types of sieve analyses were performed: ‘global’ sieve analyses are based on summary measures of distances between founder sequences and a reference sequence (e.g., the MRK Ad5 insert) to compare vaccine and placebo-recipients, whereas ‘local’ sieve analyses scan the proteome and evaluate amino acid (AA) sites individually or as a set of sites (e.g., AA sites spanning an epitope’s length) that discriminate whether a founder sequence is from the vaccine or placebo group.

Statistical methods for global summary distance sieve analysis. Each founder sequence was assigned a value summarizing its distance to a reference sequence. The MRKAd5 HIV-1 Gag/Pol/Nef insert sequences were the main reference sequences. HXB2 and CON_B04 were also used as references, because they enabled analyses for HIV-1 regions outside the Gag, Pol, and Nef proteins represented in the MRKAd5 vaccine construct. For each distance measure defined below, its distribution was compared between infected vaccine and infected placebo recipients. Hypothesis testing for these comparisons was performed using either the consensus founder variant(s) for each subject or all individual sequences.

Whole-Protein Tree-based Sieve Analysis. Tree-based distances between all protein sequences were calculated using an HIV-specific substitution model of protein evolution 12. This HIV-1-specific-10% inter-subject similarity scoring matrix (HIV-10) up-weights substitutions depending on the evolutionary cost they incur and is available in PhyML8 within HyPhy13. Tree-based distances were extracted from the trees using the NewickTermBranch algorithm (http://indra.mullins.microbiol.washington.edu/perlscript/docs/NewickTermBranch.html). For each individual, the average of the distances between the MRKAd5 HIV-1 sequence and each founder sequence was computed, and these average distances were compared between the vaccine and placebo groups using a Wilcoxon/Mann-Whitney test. To address the issue of controlling false positive errors in the presence of multiplicity of analyses due to multiple insert proteins, we first assessed significance of the results for Gag-Pol-Nef combined, and, if significant, considered the p-values for the component proteins Gag, Pol, and Nef.
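The two-stage handling of multiplicity can be pictured as a simple gatekeeping rule: component p-values are only examined when the combined Gag-Pol-Nef test is significant. The sketch below is illustrative only; the function name is hypothetical and the 0.05 significance level is an assumption, since no threshold is stated here.

```python
def gatekept_component_pvalues(p_combined, p_components, alpha=0.05):
    """Return the component p-values only if the combined test passes.

    p_combined: p-value for Gag-Pol-Nef combined.
    p_components: dict such as {"Gag": ..., "Pol": ..., "Nef": ...}.
    alpha: significance level (assumed here, not taken from the text).
    """
    if p_combined >= alpha:
        # Combined test not significant: components are not interpreted.
        return {}
    return dict(p_components)
```

An empty result signals that the per-protein comparisons should not be interpreted for that analysis.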

Global sieve effects restricted to CTL epitopes from founder variants. We considered measures of distance between the consensus founder sequence(s) of a subject and a reference sequence. Genetic distances were calculated using the HIV-10 model of evolution as described above12.

Two T cell epitope-based distance measures were used: the ‘CTL epitope’ and ‘K-mer’ distances. The ‘CTL epitope distance’ is defined in three steps. First, T cell epitopes (among linear peptides of length 8, 9, 10, or 11) were identified as described above in volunteers’ and reference sequences, based on each individual’s HLA type. Second, we focused on the subset of epitopes that were shared (i.e., had the same HXB2 positions) in the volunteers’ sequence(s) and the reference sequence (with at most 2 AA differences). Third, we computed all pairwise distances between these epitopes using the HIV-10 evolutionary model in PhyML. The CTL epitope distance was then defined for each subject as the average of the different epitope-specific pairwise distances. If there were no known or highly likely epitopes in either sequence, then the distance could not be defined and the subject's information was not used. A sieve effect on CTL epitope distances is interpreted in terms of greater epitope distances among AA changes that preserve epitopes. Results for these distances are reported in Fig. 2.
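The three steps can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: epitopes are assumed to be supplied as dictionaries keyed by hypothetical (HXB2 start, HXB2 end) tuples, and the default pairwise distance is the fraction of mismatched residues, which is only a stand-in for the HIV-10 model distances computed in PhyML.

```python
def mismatch_fraction(peptide_a, peptide_b):
    """Stand-in pairwise distance: fraction of differing residues."""
    return sum(x != y for x, y in zip(peptide_a, peptide_b)) / len(peptide_a)


def ctl_epitope_distance(founder_epitopes, reference_epitopes,
                         pair_distance=mismatch_fraction, max_differences=2):
    """Average distance over epitopes shared by founder and reference.

    Both arguments map (hxb2_start, hxb2_end) -> peptide string.
    Returns None when no shared epitope exists (distance undefined).
    """
    distances = []
    for coords, ref_peptide in reference_epitopes.items():
        founder_peptide = founder_epitopes.get(coords)
        if founder_peptide is None or len(founder_peptide) != len(ref_peptide):
            continue  # not an epitope at the same HXB2 positions
        n_diff = sum(x != y for x, y in zip(founder_peptide, ref_peptide))
        if n_diff > max_differences:
            continue  # more than 2 AA differences: not treated as shared
        distances.append(pair_distance(founder_peptide, ref_peptide))
    if not distances:
        return None
    return sum(distances) / len(distances)
```

Returning `None` mirrors the rule that a subject without shared epitopes contributes no distance; a model-based distance can be substituted through the `pair_distance` argument.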

If an epitope in a founder sequence is mutated relative to the vaccine-insert sequence such that the mutated form is no longer recognized as an epitope, this peptide is not included in the CTL epitope distances. Thus, we developed a second measure to capture effects where founder sequences have mutations that preclude their identification as epitopes: the ‘K-mer distance’, or ‘percent epitope mismatch distance’, which is based on epitopes in the reference sequence, regardless of whether corresponding peptides in the volunteers’ sequence(s) may be epitopes.

Using Epipred, the first step in computing the percent epitope mismatch distance is to compute the nonparametric maximum likelihood estimate (NPMLE) of the number of peptides shared between the reference sequence and the consensus founder sequence(s), defined as the sum of estimated epitope-probabilities across all 8, 9, 10, 11-mers in the reference sequence that are exactly matched in the founder sequence(s). Then, the distance is the NPMLE of the percent of mismatched peptides, defined as one minus the ratio of the NPMLE of the number of shared peptides (computed in the first step) and the NPMLE of the number of peptides in the reference sequence. The denominator NPMLE is computed as the sum across all 8, 9, 10, 11-mers in the reference sequence of the estimated epitope-probabilities. Hence, K-mer distances account for the whole distribution of peptides rather than the limited set of epitopes with a cut-off of 0.8 probability. Because the NetMHC software returns results of ‘non-binder’, ‘weak binder’, or ‘strong binder’, we defined the distance as the estimated percent of mismatched epitopes, the latter defined as the number of weak or strong binding 8, 9, 10, or 11-mers in the reference sequence that mismatch the corresponding peptide in the founder sequence(s).
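A minimal sketch of the percent epitope mismatch distance for one consensus founder sequence is given below. It assumes that the founder sequence is aligned to the reference without gaps, so that the same index denotes the same position, and that `epitope_probability` is a hypothetical user-supplied function returning the estimated epitope probability of a peptide (for example, one derived from Epipred output); neither assumption comes from the original software.

```python
def kmer_mismatch_distance(reference, founder, epitope_probability,
                           lengths=(8, 9, 10, 11)):
    """One minus the probability-weighted share of reference k-mers
    that are exactly matched in the founder sequence."""
    total_weight = 0.0   # estimated number of epitopes in the reference
    shared_weight = 0.0  # estimated number also present in the founder
    for k in lengths:
        for start in range(len(reference) - k + 1):
            peptide = reference[start:start + k]
            weight = epitope_probability(peptide)
            total_weight += weight
            if founder[start:start + k] == peptide:
                shared_weight += weight
    if total_weight == 0:
        return None  # no predicted epitope mass in the reference
    return 1.0 - shared_weight / total_weight
```

With NetMHC-style output, the same function can be used by passing a probability function that returns 1 for weak or strong binders and 0 for non-binders.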

For each distance measure, a Wilcoxon rank sum test (equivalently the Mann-Whitney test) with exact 2-sided p-value was used to test for a different distribution in the majority-consensus sequence summary measures between the infected vaccine and placebo groups.
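For illustration, an exact 2-sided rank sum p-value can be obtained by enumerating every way of assigning the pooled ranks to the two groups. The sketch below uses midranks for ties and defines the 2-sided p-value as the share of assignments whose rank sum deviates from its expectation at least as much as the observed one; that definition and the function names are our assumptions, and full enumeration is only practical for small groups. An established statistics package would normally be used for the trial-sized comparison.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def exact_rank_sum_pvalue(vaccine, placebo):
    """Exact 2-sided Wilcoxon rank sum p-value by full enumeration."""
    ranks = midranks(list(vaccine) + list(placebo))
    n_vaccine, n_total = len(vaccine), len(ranks)
    expected = n_vaccine * (n_total + 1) / 2
    observed = abs(sum(ranks[:n_vaccine]) - expected)
    extreme = total = 0
    for subset in combinations(range(n_total), n_vaccine):
        total += 1
        deviation = abs(sum(ranks[i] for i in subset) - expected)
        if deviation >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With three fully separated values per group, only 2 of the 20 possible assignments are as extreme as the observed one, giving p = 0.1.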

Global sieve effects restricted to CTL epitopes in all individual sequences. Parallel to the above distances, we defined two epitope-based distances that account for all individual founder sequences. The CTL epitope distance was defined as above, except that in the first step, T cell epitopes were identified in all of the subject's founder sequences in addition to the reference sequence. The K-mer distance is defined as above, except that the numerator of the ratio is the estimated number of peptides shared between the reference sequence and all of the subject’s founder sequences. Hypothesis testing was the same as described above.

Local sieve analysis statistical methods. Individual sites and K-mers were evaluated as signatures wherein there are different AA distributions for vaccine versus placebo recipients. As the following analyses did not take HLA or predicted epitope sites into account, they were conducted on the 66 subjects used for the analyses above plus the one individual for whom HLA data was not available.

Signature positions in founder variants. For each AA site in Gag, Pol, and Nef, we compared the rate of AA mismatch to the MRKAd5 insert residue found in the infected vaccine group to this rate in the infected placebo group. The same analysis was done for Env with HXB2 as reference. We used the t-statistic numerator-type statistic of Gilbert, Wu, and Jobes14, and their permutation procedure, to compute an unadjusted p-value for each position. Positions with insufficient AA variability to potentially discover a significant result were screened out using Tarone’s procedure15, which reduced the number of sites for analysis to 97 (from 542) for Gag; 76 (from 885) for Pol; 82 (from 236) for Nef; and 322 (from 957) for Env.
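The per-site comparison can be pictured with a simplified permutation test. The sketch below uses the absolute difference in mismatch rates between groups as its statistic, which is a stand-in and not the statistic of Gilbert, Wu, and Jobes; Tarone’s screening step is omitted. Function names, the number of permutations and the fixed random seed are illustrative assumptions.

```python
import random


def mismatch_rate(residues, insert_residue):
    """Fraction of founder residues that differ from the insert residue."""
    return sum(r != insert_residue for r in residues) / len(residues)


def site_permutation_pvalue(vaccine_residues, placebo_residues, insert_residue,
                            n_permutations=2000, seed=0):
    """Unadjusted permutation p-value for one AA site."""
    rng = random.Random(seed)
    flags = [r != insert_residue
             for r in list(vaccine_residues) + list(placebo_residues)]
    n_vaccine = len(vaccine_residues)
    n_placebo = len(flags) - n_vaccine

    def rate_difference(labels):
        return abs(sum(labels[:n_vaccine]) / n_vaccine
                   - sum(labels[n_vaccine:]) / n_placebo)

    observed = rate_difference(flags)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(flags)  # reassign treatment labels at random
        if rate_difference(flags) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Sites where nearly all founders match the insert cannot yield a small p-value under any labeling, which is the motivation for screening them out before testing.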