Resolving discrepancy between nucleotides and amino acids in deep-level arthropod phylogenomics: differentiating serine codons in 21-amino-acid models

Andreas Zwick¹, Jerome C Regier, Derrick J Zwickl

PMID: 23185239
PMCID: PMC3502419
DOI: 10.1371/journal.pone.0047450

Andreas Zwick et al. PLoS One. 2012;7(11):e47450. doi: 10.1371/journal.pone.0047450. Epub 2012 Nov 20.

Affiliation

¹ Department of Entomology, State Museum of Natural History, Stuttgart, Germany. andreas.zwick@smns-bw.de

Abstract

Background: In a previous study of higher-level arthropod phylogeny, analyses of nucleotide sequences from 62 protein-coding nuclear genes for 80 panarthropod species yielded significantly higher bootstrap support for selected nodes than did amino acids. This study investigates the cause of that discrepancy.

Methodology/principal findings: The hypothesis is tested that failure to distinguish the serine residues encoded by two disjunct clusters of codons (TCN, AGY) in amino acid analyses leads to this discrepancy. In one test, the two clusters of serine codons (Ser1, Ser2) are conceptually translated as separate amino acids. Analysis of the resulting 21-amino-acid data matrix shows striking increases in bootstrap support, in some cases matching that in nucleotide analyses. In a second approach, nucleotide and 20-amino-acid data sets are artificially altered through targeted deletions, modifications, and replacements, revealing the pivotal contributions of distinct Ser1 and Ser2 codons. We confirm that previous methods of coding nonsynonymous nucleotide change are robust and computationally efficient by introducing two new degeneracy coding methods. We demonstrate for degeneracy coding that neither compositional heterogeneity at the level of nucleotides nor codon usage bias between Ser1 and Ser2 clusters of codons (or their separately coded amino acids) is a major source of non-phylogenetic signal.
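
As a rough illustration of the 21-amino-acid recoding described above, the following Python sketch translates a coding sequence with the standard genetic code but emits a separate symbol for serine encoded by AGY codons. The function name `translate_21aa`, the use of `Z` for Ser2 (borrowed from the S/Z label in Figure 2), and the `X` fallback for ambiguous or unrecognized codons are assumptions made for illustration; this is not the authors' implementation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_21aa(seq, ser2_symbol="Z"):
    """Translate in frame 1, keeping Ser1 (TCN) as 'S' and Ser2 (AGY) as ser2_symbol.

    The symbol for Ser2 is a hypothetical choice; any unused letter would do.
    Alignment gaps ('---') are passed through as '-', and codons that are
    not in the table (for example, ones with ambiguity codes) become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    residues = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            residues.append("-")
            continue
        aa = CODON_TABLE.get(codon, "X")
        if aa == "S" and codon.startswith("AG"):
            aa = ser2_symbol  # AGT/AGC serine is coded as a 21st state
        residues.append(aa)
    return "".join(residues)


print(translate_21aa("TCTAGCATGAGT"))
```

A matrix produced this way has 21 character states, so it would need a substitution model that accepts the extra state; how the 21-state JTT model is built is described in the paper itself and is not reproduced here.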

Conclusions: The incongruity in support between amino-acid and nucleotide analyses of the aforementioned arthropod data set is resolved by showing that "standard" 20-amino-acid analyses yield lower node support specifically when serine provides crucial signal. Separate coding of Ser1 and Ser2 residues yields support commensurate with that found by degenerated nucleotides, without introducing phylogenetic artifacts. While exclusion of all serine data leads to reduced support for serine-sensitive nodes, these nodes are still recovered in the ML topology, indicating that the enhanced signal from Ser1 and Ser2 is not qualitatively different from that of the other amino acids.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Deep-level arthropod relationships based on six analytical approaches.**
Aligned sequences from 75 arthropods and five outgroup species for 62 nuclear protein-coding genes were analyzed under the likelihood criterion using six strategies: *20AA-JTT*, a 20-amino-acid JTT model; *21AA-JTT*, a 21-amino-acid JTT model; *codon*, a codon model; *degen1*; *degen8*; and *noLRall1nt2*. These strategies are described in the Data Set Encoding section of Materials and Methods. Numbers of species representing terminal taxa are in parentheses. Bootstrap percentages (BP) are on internal branches (*20AA*, *21AA*, *codon*, *degen1*, *degen8*, and *noLRall1nt2*; see figure key for order). Six nodes with a major increase in their bootstrap support from *20AA-JTT* to *21AA-JTT* are identified with filled circles. A more complete listing of results can be found in Table S1.

**Figure 2. Plot of average ECM codon substitution rates for synonymous, intra-serine (S/Z), and the most frequent nonsynonymous substitutions.**
Individual codon rates are summarized through averaging for each respective amino acid (synonymous) or change between amino acids (synonymous SER (S/Z), nonsynonymous).

**Figure 3. Compositional distance trees (Euclidean distances) for six data sets – nucleotide composition for *nt123* data set, degenerated nucleotide composition for *degen1* data sets with and without serine, codon composition for *codon* data set, and amino acid composition for *20AA* and *21AA* data sets.**
Bootstrap percentages >50% are displayed and indicate the strength of the compositional signal at particular nodes. The sum of all branch lengths reflects the total amount of compositional heterogeneity in the data set.
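
To make the notion of a Euclidean compositional distance concrete, here is a minimal Python sketch that computes pairwise distances between per-taxon character frequency vectors. The function names, the toy sequences, and the default nucleotide alphabet are hypothetical; the sketch covers only the distance step and does not reproduce the tree building or bootstrapping behind the figure.

```python
import math
from collections import Counter
from itertools import combinations


def composition(seq, alphabet="ACGT"):
    """Return the relative frequency of each alphabet character in seq.

    Characters outside the alphabet (gaps, ambiguity codes) are ignored.
    """
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[ch] / total for ch in alphabet]


def compositional_distances(seqs, alphabet="ACGT"):
    """Map each pair of taxon names to the Euclidean distance between compositions."""
    freqs = {name: composition(seq, alphabet) for name, seq in seqs.items()}
    return {
        (a, b): math.dist(freqs[a], freqs[b])  # math.dist requires Python 3.8+
        for a, b in combinations(sorted(freqs), 2)
    }


# Toy input, invented for illustration only.
toy = {"taxonA": "AAAA", "taxonB": "CCCC", "taxonC": "AACC"}
for pair, d in compositional_distances(toy).items():
    print(pair, round(d, 3))
```

Passing a 20- or 21-letter amino acid alphabet instead of `"ACGT"` gives the analogous amino acid composition distances; codon composition would require counting triplets rather than single characters.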

**Figure 4. Compositional distance tree (Euclidean distances) based on the codon composition of a data set that is restricted to *co-Ser* codons.**
Bootstrap percentages >50% are displayed and indicate the strength of the compositional signal at particular nodes. The sum of all branch lengths reflects the total amount of compositional heterogeneity in the data set.

**Figure 5. Summary of the six key nodes that are recovered in all maximum likelihood topologies from *degen1* analyses of five nucleotide data sets with and without modifications (including deletions) of serine codons, along with their bootstrap values.**
The complete topologies are condensed to illustrate that all six higher-level nodes under investigation are recovered by each of five data sets: 1. *Ser1→Ser2* data set, in which *Ser1* codons (TCN) in the *degen1* data set are artificially changed to *Ser2* (AGY); 2. *Ser2→Ser1* data set, in which all *Ser2* codons in the *degen1* data set are artificially changed to *Ser1*; 3. *noSer1noSer2*, in which all *Ser1* and *Ser2* codons in the *degen1* data set are artificially changed to NNN; 4. *no change to Ser*, in which the *degen1* data set is analyzed as is; 5. *degenFS2*, in which all *Phe* (TTY) and *Ser2* (AGY) codons in the *degen1* data set are artificially changed to NNN.
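
The serine-codon modifications listed in this caption can be sketched in Python as a codon-wise recoding of an in-frame sequence. The function name `recode_serine` and the mode strings are hypothetical labels chosen here; the sketch assumes input in which serine codons appear as TCN or AGY (or their undegenerated forms) and is not the authors' own script.

```python
SER1, SER2, MASK = "TCN", "AGY", "NNN"
MODES = {"ser1_to_ser2", "ser2_to_ser1", "no_ser1_no_ser2", "degen_fs2", "no_change"}


def _is_ser1(codon):
    # TCN family: TCT, TCC, TCA, TCG, or the degenerated form TCN.
    return len(codon) == 3 and codon[:2] == "TC"


def _is_ser2(codon):
    # AGY family: AGT, AGC, or the degenerated form AGY.
    return len(codon) == 3 and codon[:2] == "AG" and codon[2] in "TCY"


def _is_phe(codon):
    # TTY family: TTT, TTC, or the degenerated form TTY.
    return len(codon) == 3 and codon[:2] == "TT" and codon[2] in "TCY"


def recode_serine(seq, mode):
    """Apply one of the caption's serine modifications codon by codon (frame 1)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    seq = seq.upper()
    out = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if mode == "ser1_to_ser2" and _is_ser1(codon):
            codon = SER2
        elif mode == "ser2_to_ser1" and _is_ser2(codon):
            codon = SER1
        elif mode == "no_ser1_no_ser2" and (_is_ser1(codon) or _is_ser2(codon)):
            codon = MASK
        elif mode == "degen_fs2" and (_is_phe(codon) or _is_ser2(codon)):
            codon = MASK
        out.append(codon)
    return "".join(out)


example = "TCNAGYTTYATG"  # Ser1, Ser2, Phe, Met in degenerated form
for mode in sorted(MODES):
    print(mode, recode_serine(example, mode))
```

Keeping each modification as a separate mode mirrors the five data sets in the caption, so the same alignment can be recoded five ways and analyzed under otherwise identical settings.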
