Statistical phasing of 150,119 sequenced genomes in the UK Biobank
- PMID: 36450278
- PMCID: PMC9892698
- DOI: 10.1016/j.ajhg.2022.11.008
Abstract
The first release of UK Biobank whole-genome sequence data contains 150,119 genomes. We present an open-source pipeline for filtering, phasing, and indexing these genomes on the cloud-based UK Biobank Research Analysis Platform. This pipeline makes it possible to apply haplotype-based methods to UK Biobank whole-genome sequence data. The pipeline uses BCFtools for marker filtering, Beagle for genotype phasing, and Tabix for VCF indexing. We used the pipeline to phase 406 million single-nucleotide variants on chromosomes 1-22 and X at a cost of £2,309. The maximum time required to process a chromosome was 2.6 days. In order to assess phase accuracy, we modified the pipeline to exclude trio parents. We observed a switch error rate of 0.0016 on chromosome 20 in the White British trio offspring. If we exclude markers with nonmajor allele frequency < 0.1% after phasing, this switch error rate decreases by 80% to 0.00032.
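The switch error rate quoted above can be illustrated with a small sketch. The abstract does not give the authors' exact computation, so the code below assumes a commonly used definition: among an individual's consecutive heterozygous genotypes, count the positions where the inferred phase flips relative to the true phase, and divide by the number of consecutive heterozygous pairs. The function names and the toy data are hypothetical and are not taken from the pipeline.

```python
from typing import Iterable, List, Sequence, Tuple

# A phased genotype: (allele on haplotype 1, allele on haplotype 2).
Genotype = Tuple[int, int]


def switch_error_counts(
    truth: Sequence[Genotype], inferred: Sequence[Genotype]
) -> Tuple[int, int]:
    """Return (switch errors, switch opportunities) for one individual.

    Assumed definition (not confirmed by the abstract): a switch error is a
    change in phase orientation between consecutive heterozygous markers.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same markers")

    orientations: List[int] = []
    for (t1, t2), (i1, i2) in zip(truth, inferred):
        if t1 == t2:
            continue  # homozygous markers carry no phase information
        if (i1, i2) == (t1, t2):
            orientations.append(0)  # same orientation as the truth
        elif (i1, i2) == (t2, t1):
            orientations.append(1)  # flipped orientation
        else:
            raise ValueError("inferred genotype differs from true genotype")

    # Each adjacent pair of heterozygous markers is one switch opportunity.
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    opportunities = max(len(orientations) - 1, 0)
    return switches, opportunities


def switch_error_rate(
    samples: Iterable[Tuple[Sequence[Genotype], Sequence[Genotype]]]
) -> float:
    """Pool switch errors and opportunities over individuals."""
    total_switches = 0
    total_opportunities = 0
    for truth, inferred in samples:
        s, o = switch_error_counts(truth, inferred)
        total_switches += s
        total_opportunities += o
    if total_opportunities == 0:
        raise ValueError("no heterozygous marker pairs to evaluate")
    return total_switches / total_opportunities


# Toy example (made-up data): one phase flip between the 2nd and 3rd
# heterozygous markers, with 3 switch opportunities in total.
toy_truth = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
toy_inferred = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]
toy_counts = switch_error_counts(toy_truth, toy_inferred)
toy_rate = switch_error_rate([(toy_truth, toy_inferred)])
```

In a trio-based evaluation such as the one described, the "truth" phase for an offspring would come from the parental genotypes, which is why the trio parents are excluded before phasing. The reported figures are also internally consistent: an 80% reduction from 0.0016 gives 0.0016 × 0.2 = 0.00032.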
Keywords: UK Biobank; genotype phasing; haplotype; haplotype phasing.
Copyright © 2022 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of interests: The authors declare no competing interests.