Database (Oxford). 2023 Aug 11:2023:baad053.
doi: 10.1093/database/baad053.

The MetaGens algorithm for metagenomic database lossy compression and subject alignment

Gustavo Henrique Cervi et al. Database (Oxford). 2023.

Abstract

Advances in genetic sequencing techniques have led to the production of large volumes of data. Extracting genetic material from a sample is one of the early steps of a metagenomic study. As these processes evolved, analysis of the sequenced data made it possible to discover etiological agents and, as a corollary, to diagnose infections. One of the biggest challenges of the technique is the huge volume of data generated with each new technology. This work introduces an algorithm that may reduce the data volume, allowing faster DNA matching against reference databases. Using techniques such as lossy compression and a substitution matrix, it is possible to match nucleotide sequences without losing the subject. The lossy compression exploits the nature of DNA mutations, insertions and deletions, and the possibility that different sequences belong to the same subject. The algorithm can reduce the overall size of the database to 15% of its original size and, depending on the parameters, to as little as 5%. Although it is the same as the other platforms, the match algorithm is more sensitive because it ignores transitions and transversions, resulting in a faster way to obtain diagnostic results. In a first experiment, the algorithm ran 10 times faster than BLAST while maintaining high sensitivity. This performance gain can be extended by combining it with techniques already used in other studies, such as hash tables. Database URL: https://github.com/ghc4/metagens.
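To make the idea of a substitution-tolerant lossy encoding concrete, the sketch below collapses each base to its chemical class (purine or pyrimidine), so that transitions vanish from the encoded sequence. This is an assumption chosen purely for illustration: the abstract does not specify the MetaGens encoding, and the function names `reduce_to_classes` and `classify_substitution` are hypothetical.

```python
# Hypothetical illustration only: NOT the MetaGens encoding, which the
# abstract does not describe in detail.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def reduce_to_classes(seq: str) -> str:
    """Lossy encoding: purines become 'R', pyrimidines 'Y', anything else 'N'."""
    encoded = []
    for base in seq.upper():
        if base in PURINES:
            encoded.append("R")
        elif base in PYRIMIDINES:
            encoded.append("Y")
        else:
            encoded.append("N")  # ambiguous or unknown base
    return "".join(encoded)


def classify_substitution(a: str, b: str) -> str:
    """Label a pair of aligned bases as match, transition or transversion."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected bases from A, C, G, T")
    if a == b:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    # Two reads that differ only by transitions encode identically.
    print(reduce_to_classes("ACGT"), reduce_to_classes("GCAT"))
    print(classify_substitution("A", "G"), classify_substitution("A", "C"))
```

Under this reduced alphabet, two sequences that differ only by transitions compare as equal, which is one way a lossy representation can trade exactness for tolerance to common mutations.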

Conflict of interest statement

None declared.

Figures

Figure 1. Three main steps of a metagenomic pipeline: filtering, aligning and annotating. Source: the authors.
Figure 2. Evolution of the GenBank database, maintained by the NCBI, showing an increase in the curve of genomic data deposits, which results in computational difficulties in data processing. Source: NCBI statistics webpage.
Figure 3. Chromatogram with multiple peaks per base (low-quality data). Source: Roswell Park Comprehensive Cancer Center.
Figure 4. Chromatogram indicating a good-quality data sequence. Source: U-M Biomedical Research Core Facilities.
Figure 5. Two basic types of mutations: in transitions there is an exchange of bases of the same class (purine or pyrimidine), while in transversions there is an exchange of bases of different classes. Source: the authors.
Figure 6. In the alignment of genetic sequences, mismatches (blue), indels (green) and matches (red) can occur. The occurrence of these events does not necessarily mean that the sequences are different subjects. Source: the authors (sample on NCBI BLAST).
Figure 7. Dynamic programming illustrated by an m × n matrix, where the highlighted path represents the optimal alignment between the sequences. Source: the authors, based on (25).
Figure 8. Hypothetical reads. The bases in red are used as a distance marker (green line). The result, in blue, makes up the sequence identity. Source: the authors.
Figure 9. Example of a possible wave and its frequency. SARS-CoV-2 reference sample read from the NCBI Sequence Read Archive (SRA): ERR4329467.
Figure 10. The indel problem. Source: the authors.
Figure 11. Partial result of the algorithm. Source: the authors.
Figure 12. Graphical interface of the MetaGens software showing the quality control panel. It is a user-friendly interface that allows the quality control parameters to be specified. Source: the authors.
Figure 13. Graphical interface of the MetaGens software showing the database filter panel. Filtering the reference sequences reduces the size of the database and, with it, the analysis time. Source: the authors.
Figure 14. Graphical interface of the MetaGens software showing the database query panel. On the query screen, it is possible to follow the progress of the alignment. The table shows the subjects under analysis and their matches. Source: the authors.
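
The dynamic programming matrix described in the Figure 7 caption can be sketched as a classic global alignment score (Needleman-Wunsch). This is a generic textbook formulation, not the MetaGens matcher; the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the optimal global alignment score of s and t.

    Generic Needleman-Wunsch sketch with assumed scoring values;
    it fills an (m+1) x (n+1) matrix row by row.
    """
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]

    # First column and row: aligning a prefix against nothing costs gaps.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Diagonal move: align s[i-1] with t[j-1] (match or mismatch).
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Vertical and horizontal moves: introduce a gap (an indel).
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    # The bottom-right cell holds the score of the optimal path.
    return score[m][n]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Tracing back from the bottom-right cell along the moves that produced each maximum recovers the highlighted path, that is, the alignment itself.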

References

    1. Chen K. and Pachter L. (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput. Biol., 1, 106–112.
    2. Editorial (2009) Metagenomics versus Moore’s law. Nat. Methods, 6, 623.
    3. Kakirde K.S., Parsley L.C. and Liles M.R. (2010) Size does matter: application-driven approaches for soil metagenomics. Soil Biol. Biochem., 42, 1911–1923.
    4. Compeau P. (2015) Bioinformatics Algorithms, Vol. I, 2nd edn. Active Learning Publishers, La Jolla, CA.
    5. Chiu C.Y. and Miller S.A. (2019) Clinical metagenomics. Nat. Rev. Genet., 20, 341–355.