Erythrogene: a database for in-depth analysis of the extensive variation in 36 blood group systems in the 1000 Genomes Project

Mattias Möller¹, Magnus Jöud¹,², Jill R Storry¹,², Martin L Olsson¹,²

Affiliations

¹ Hematology and Transfusion Medicine, Department of Laboratory Medicine, Lund University, Lund, Sweden.
² Department of Clinical Immunology and Transfusion Medicine, Laboratory Medicine, Office of Medical Service, Lund, Sweden.

PMID: 29296939
PMCID: PMC5737168
DOI: 10.1182/bloodadvances.2016001867

Blood Adv. 2016 Dec 16;1(3):240-249. doi: 10.1182/bloodadvances.2016001867. eCollection 2016 Dec 27.

Abstract

Blood group genotyping has recently developed into a clinical tool to improve compatibility of blood transfusions and management of pregnancies. Next-generation sequencing (NGS) is rapidly moving toward routine practice for patient and donor typing and has the potential to remedy some of the limitations of currently used platforms. However, a large-scale investigation into the blood group genotypes obtained by NGS in a multiethnic cohort is lacking. The 1000 Genomes Project provides information on genome variation among 2504 individuals representing 26 populations worldwide. We extracted their NGS data for all 36 blood group systems to a custom-designed database. In total, 210 412 alleles from 43 blood group-related genes were imported and curated. Matching algorithms were developed to compare them to blood group variants identified to date. Of the 1241 nonsynonymous variants identified in the coding regions, 241 are known blood group polymorphisms. Interestingly, 357 of the remaining 1000 variants are predicted to occur on extracellular portions of 31 different blood group-carrying proteins, and some may represent undiscovered antigens. Of the alleles analyzed, 1504 were not previously described. The ABO/GBGT1/FUT2/FUT3 and GYPB/GYPC genes showed the highest degree of variation per kilobase coding sequence, and ACKR1 variants had the most skewed distribution across 5 continental superpopulations in the dataset. Results were exported to an online search engine, www.erythrogene.com, which presents data according to the allele nomenclature developed for clinical reporting by the International Society of Blood Transfusion. The established database deepens our knowledge of blood group polymorphism globally and provides a long-sought platform for future research.

Conflict of interest statement

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Figures

**Figure 1.**
**Graphic working model of the project**. To create a database of blood group gene alleles, we imported data from the 1000G project, reference data sources, and LRG into a SQL database. Data were curated carefully with several automated and manual validating procedures. This process also discovered previously undetected errors in the reference data. A matching procedure integrated the data sources, allowing for a searchable interface to the database. The database is available via a Web interface, Erythrogene. dbRBC, Blood Group Antigen Mutation Database; GRCh37, Genome Reference Consortium human genome build 37; ISBT, International Society of Blood Transfusion; LRG, Locus Reference Genomic.

**Figure 2.**
**Processing of variants**. An overview of the processing of variants after they have been imported into the database. Firstly, variants found in multiallelic sites (n = 312) were split up and treated as separate individual variants. A multiallelic site is a specific locus in the genome with ≥2 alternate sequences observed in addition to the reference sequence. Secondly, all variants were remapped from the GRCh37 assembly to their corresponding LRG reference. Discrepancies between these references were found at 101 variant locations, so to handle the transition to the LRG reference correctly, additional variants were generated to reflect these differences. Thirdly, accurate descriptions according to HGVS nomenclature were generated at the DNA level. All variants were then classified according to terms defined by Sequence Ontology v2.5, and alleles were generated. 1000 Genomes, 1000 Genomes Project; GRCh37, Genome Reference Consortium human genome build 37; HGVS, Human Genome Variation Society; LRG, Locus Reference Genomic. The Sequence Ontology set of terms and relationships is used to describe features and attributes of biological sequences.
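
The first step in this caption, splitting a multiallelic site into separate variants, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Variant` type, the `split_multiallelic` function, and the example coordinates are hypothetical, and the sketch assumes a VCF-style record in which alternate alleles are comma-separated. A real pipeline would also need to recode genotypes and normalize the resulting variants, which is omitted here.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """One alternate allele at one position (hypothetical illustrative type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def split_multiallelic(chrom: str, pos: int, ref: str, alts: str) -> List[Variant]:
    """Split a site with comma-separated ALT alleles into one Variant per ALT."""
    return [Variant(chrom, pos, ref, alt) for alt in alts.split(",")]


# Made-up site with two alternate alleles; the coordinates are not real data.
for variant in split_multiallelic("9", 1000, "G", "A,T"):
    print(variant)
```

Each resulting record carries exactly one alternate allele, so later steps such as remapping and HGVS description can treat every variant independently.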

**Figure 3.**
**Matching of 1000 Genomes data with reference alleles**. (A-C) Matching 1000G data with reference allele data for 43 blood group genes. (A) Total number of alleles and (B) unique alleles in the 1000G that were successfully matched or not matched to the entries in any of the reference data. A majority of all alleles but only a minor proportion of the unique alleles were matched, suggesting that common alleles were successfully matched while rare alleles were not. (C) Total number of nonsynonymous CDS variants that were found in the reference data. Only a minor proportion of the variants were listed in the reference data. (D-E) Number of alleles and variants in the reference allele data that could be matched to 1000G data. (D) Total number of alleles in the reference data that could be matched or not matched to 1000G data. Not matchable alleles are alleles that contain unmatchable variants. (E) Total number of variants in the reference data that could be matched or not matched to 1000G data. Not matchable variants are rearrangements, hybrid alleles, and indistinctly defined variants.
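
The contrast between panels A and B (most alleles matched, but few unique alleles matched) can be made concrete with a small sketch. This is an illustration under stated assumptions, not the matching algorithm used for Erythrogene: an allele is modeled simply as the set of variants it carries, a match is an exact set lookup, and the function name `match_summary`, the variant labels, and the allele names are all invented for the example.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

Allele = FrozenSet[str]  # assumption: an allele is the set of variants it carries


def match_summary(observed: List[Allele], reference: Dict[Allele, str]) -> Tuple[float, float]:
    """Return (fraction of all alleles matched, fraction of unique alleles matched)."""
    counts = Counter(observed)
    matched_total = sum(n for allele, n in counts.items() if allele in reference)
    matched_unique = sum(1 for allele in counts if allele in reference)
    return matched_total / len(observed), matched_unique / len(counts)


# Invented reference table: two known alleles, one of them the reference sequence.
reference = {
    frozenset(): "EXAMPLE*01",
    frozenset({"c.10A>G"}): "EXAMPLE*02",
}

# Invented observations: two common known alleles and two rare unknown ones.
observed = (
    [frozenset()] * 6
    + [frozenset({"c.10A>G"})] * 2
    + [frozenset({"c.99C>T"})]
    + [frozenset({"c.10A>G", "c.200G>A"})]
)

total_frac, unique_frac = match_summary(observed, reference)
print(f"matched: {total_frac:.0%} of all alleles, {unique_frac:.0%} of unique alleles")
# prints "matched: 80% of all alleles, 50% of unique alleles"
```

Because common alleles are counted once per carrier in the total but only once in the unique tally, a few well-characterized common alleles dominate the first fraction while rare unmatched alleles dominate the second.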
