PeerJ. 2020 Apr 7;8:e8699.
doi: 10.7717/peerj.8699. eCollection 2020.

ECuADOR-Easy Curation of Angiosperm Duplicated Organellar Regions, a tool for cleaning and curating plastomes assembled from next generation sequencing pipelines

Angelo D Armijos Carrion et al. PeerJ. 2020.

Abstract

Background: With the rapid increase in genomic resources offered by Next-Generation Sequencing (NGS) and the availability of free online genomic databases, efficient and standardized metadata curation approaches have become increasingly critical for the post-processing stages of biological data. In organelle-based studies using circular chloroplast genome datasets in particular, the main structural regions are assembled in random order and orientation, which severely limits our ability to generate "ready-to-align" datasets for phylogenetic reconstruction at both small and large taxonomic scales. In addition, current practices discard the most variable regions of the genomes to facilitate the alignment of the remaining coding regions. No software is currently available to perform curation to such a degree through simple detection, organization and positioning of the main plastome regions, so the process remains time-consuming and error-prone. Here we introduce ECuADOR, a fast and user-friendly Perl script specifically designed to automate the detection and reorganization of newly assembled plastomes obtained from any available source (NGS, Sanger sequencing or assembler output).

Methods: ECuADOR uses a sliding-window approach to detect long repeated sequences in draft sequences, identifies the inverted repeat regions (IRs) even in the presence of artifactual breaks or sequencing errors, and automates the rearrangement of the sequence into the widely used LSC-IRb-SSC-IRa order. This facilitates rapid post-editing steps such as the creation of genome alignments, detection of variable regions, SNP detection and phylogenomic analyses.
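To make the idea in the Methods concrete, the following is a minimal Python sketch of seed-based inverted repeat detection followed by rotation into the LSC-IRb-SSC-IRa order. It is an illustration only, not ECuADOR's actual Perl implementation: the function names (`revcomp`, `find_inverted_repeat`, `reorder_plastome`) and the window size `k` are hypothetical, and the sketch assumes exact repeats, an uppercase ACGT sequence, and a linearization cut that falls outside both IR copies. It does not handle sequencing errors, IRs split by the cut, or strand orientation, all of which the abstract says the real tool addresses.

```python
from typing import Optional, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
_RC_TABLE = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_RC_TABLE)[::-1]


def find_inverted_repeat(seq: str, k: int = 12) -> Optional[Tuple[int, int, int]]:
    """Return (first_start, second_start, length) of the longest exact
    inverted repeat found from k-mer seeds, or None if no seed matches.

    Hypothetical helper: windows advance in steps of k, so repeats shorter
    than 2*k - 1 may be missed.
    """
    n = len(seq)
    # Index every k-mer position so reverse-complement seeds can be looked up.
    index = {}
    for j in range(n - k + 1):
        index.setdefault(seq[j:j + k], []).append(j)

    best = None
    for i in range(0, n - k + 1, k):
        for j in index.get(revcomp(seq[i:i + k]), ()):
            if j < i + k:  # keep only a second arm that follows the first
                continue
            a_start, a_end, b_start, b_end = i, i + k, j, j + k
            # Extend outward: first arm to the left, second arm to the right.
            while (a_start > 0 and b_end < n
                   and _COMPLEMENT.get(seq[a_start - 1]) == seq[b_end]):
                a_start -= 1
                b_end += 1
            # Extend inward: first arm to the right, second arm to the left.
            while (a_end < b_start
                   and _COMPLEMENT.get(seq[a_end]) == seq[b_start - 1]):
                a_end += 1
                b_start -= 1
            length = a_end - a_start
            if best is None or length > best[2]:
                best = (a_start, b_start, length)
    return best


def reorder_plastome(seq: str, k: int = 12) -> str:
    """Rotate a linearized circular plastome so that it starts with the
    larger single-copy region, giving the order LSC-IR-SSC-IR."""
    hit = find_inverted_repeat(seq, k)
    if hit is None:
        raise ValueError("no inverted repeat found")
    a_start, b_start, length = hit
    inner = b_start - (a_start + length)   # region between the two arms
    outer = len(seq) - 2 * length - inner  # region wrapping around the cut
    # The larger single-copy region is taken to be the LSC.
    start = b_start + length if outer >= inner else a_start + length
    return seq[start:] + seq[:start]
```

Calling `reorder_plastome(draft)` on a draft sequence returns a rotated copy of the same circular molecule. Taking the longest exact repeat as the IR pair is a simplification; tolerating mismatches would require an approximate extension step in place of the exact comparisons used here.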

Results: ECuADOR was successfully tested on plant families throughout the angiosperm phylogeny by curating 161 chloroplast datasets. ECuADOR first identified and reordered the central regions (LSC-IRb-SSC-IRa) for each dataset and then produced a new annotation for the chloroplast sequences. The process took less than 20 min with a maximum memory requirement of 150 MB and an accuracy of over 99%.

Conclusions: ECuADOR is the sole de novo, one-step recognition and reordering tool that facilitates the post-processing analysis of extranuclear genomes from NGS data. The program is available at https://github.com/BiodivGenomic/ECuADOR/.

Keywords: Automated workflow; Bioinformatics; NGS; Phylogenomics; Plastome; Sliding window.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Schematic representation of plastomes and effects of the artifactual linearization during the assembly process.
IRa and IRb, Inverted Repeats; LSC, Large Single Copy region; SSC, Small Single Copy region. (A) Circular representation showing the potential cuts (numbered black arrows) during assembly (approximate positions); green arrow: conventional start of the plastome sequence (resulting in the structure LSC–IRb–SSC–IRa); usual approximate sizes for each region are indicated. (B) Linear representations of a circular plastome, cut according to the black arrows in (A); line numbers according to (A). Note that IRs are split into three fragments in configurations 5 and 9.
Figure 2. Flow chart of ECuADOR.
Figure 3. Phylogenetic tree constructed with 161 cpDNAs, using a fast likelihood-based method (aLRT SH-like) as implemented in PhyML (Guindon et al., 2010).
Numbers on nodes indicate probability values. Families highlighted in red show an inconsistency found in the placement of Ranunculaceae (Ranunculus macranthus), which groups together with Piperaceae, Dioscoreaceae and Chloranthaceae.
Figure 4. Accuracy of ECuADOR in retrieving the correct IR locations in plastomes of decreasing quality.
Vertical axis: percentage of simulations where correct (grey) or incorrect (yellow) IR locations were retrieved. Horizontal axis: percentage of mismatching positions introduced in the IR sequences of the Arabidopsis thaliana reference plastome sequence (NC_000932) for 1,000 simulations. Values in the lower part show the total assigned variation in base pairs for each set. Red values below the bars show the average error in base pairs for the positioning of the uncertain annotation.

