BMC Bioinformatics. 2021 Apr 1;22(1):174. doi: 10.1186/s12859-021-04096-6.

HARVESTMAN: a framework for hierarchical feature learning and selection from whole genome sequencing data

Trevor S Frisby et al. BMC Bioinformatics. 2021.

Abstract

Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as poor scalability, yielding models that are inaccurate or that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which further degrades predictive performance. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.

Results: We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier.

Conclusion: HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
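The abstract does not spell out the ILP formulation, so the following is only a minimal sketch of hierarchical feature selection posed as an integer linear program. It assumes a hypothetical per-node relevance score (for example, mutual information with the label), a feature budget k, and a constraint that a node and its direct parent in the knowledge graph are not both selected; the toy graph, the scores, and the use of PuLP as a solver are all assumptions for illustration, not HARVESTMAN's exact model.

```python
# Hypothetical sketch: hierarchical feature selection as an ILP.
# Assumptions: per-node relevance scores, a budget k, and a rule that a node
# and its direct child are not both selected. Not the paper's exact formulation.
import pulp

# Toy knowledge graph: parent node -> list of child nodes.
children = {
    "gene_BRCA1": ["variant_1", "variant_2"],
    "gene_TP53": ["variant_3"],
}
nodes = set(children) | {c for cs in children.values() for c in cs}
relevance = {"gene_BRCA1": 0.30, "variant_1": 0.20, "variant_2": 0.05,
             "gene_TP53": 0.10, "variant_3": 0.25}
k = 2  # feature budget

prob = pulp.LpProblem("hierarchical_feature_selection", pulp.LpMaximize)
x = {n: pulp.LpVariable(f"x_{n}", cat="Binary") for n in nodes}

# Objective: maximize total relevance of the selected nodes.
prob += pulp.lpSum(relevance[n] * x[n] for n in nodes)

# Budget: select at most k features.
prob += pulp.lpSum(x.values()) <= k

# Hierarchy: never select both a node and one of its direct children.
for parent, kids in children.items():
    for kid in kids:
        prob += x[parent] + x[kid] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = sorted(n for n in nodes if x[n].value() == 1)
print(selected)  # e.g., ['gene_BRCA1', 'variant_3']
```

Because the selection is the solution of an integer program, the chosen subset is provably optimal with respect to whatever objective and constraints are encoded, which is what the "optimally" claim in the conclusion refers to.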

Keywords: Feature selection; Hierarchical feature spaces; Integer linear programming; Knowledge graphs; Machine learning.

Conflict of interest statement

C.K. is co-founder of Ocean Genomics, Inc. G.M. is VP of software engineering at Ocean Genomics, Inc.

Figures

Fig. 1
Harvestman can solve large problem instances in hours. Timing for graph construction and feature selection using Harvestman, SHSEL, and an MI threshold with access to varying numbers of CPUs on the 1000 Genomes data. The initial graph consisted of 23,393,068 nodes. This figure was generated using Matplotlib version 3.2.1
Fig. 2
The knowledge graph is more informative than raw SNPs. AUC as a function of feature count for five year survival (top) and five year disease free survival (bottom) obtained with logistic regression (left), random forest (middle), and SVM (right). Each point in each panel corresponds to a knowledge graph that has been filtered by one of four MI thresholds (0.125, 0.1, 0.075, and 0.05). The points are ordered left to right by decreasing MI threshold. Harvestman selects different numbers of features (x-axis) depending on the input graph. Comparisons are made against models trained on binary encodings of SNPs that passed those same thresholds. This figure was generated using Matplotlib version 3.2.1
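The caption refers to pruning the knowledge graph by mutual information (MI) thresholds before selection. A minimal sketch of such a filter over a binary feature matrix is given below, using scikit-learn's mutual_info_classif and the 0.05 cutoff (one of the four thresholds named above); the data are synthetic and the procedure is illustrative rather than HARVESTMAN's exact pipeline.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic binary feature matrix (samples x knowledge-graph nodes) and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 500))
y = rng.integers(0, 2, size=100)

# Estimate MI between each node's binary feature and the label.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Keep only nodes whose MI clears the threshold; the rest of the graph is pruned.
threshold = 0.05
kept = np.flatnonzero(mi >= threshold)
X_filtered = X[:, kept]
```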
Fig. 3
Harvestman is more parsimonious than SHSEL, meaning it selects fewer features than SHSEL without sacrificing model AUC. AUC as a function of feature counts for five year survival (top) and five year disease free survival (bottom) obtained with logistic regression (left), random forest (middle), and SVM (right). In each, Harvestman is applied to complete knowledge graphs with MI thresholds 0.125, 0.1, 0.075, and 0.05, moving left to right. This figure was generated using Matplotlib version 3.2.1
Fig. 4
Harvestman selects fewer redundant features than other methods as graph size increases. The average absolute pairwise correlation of 1000 randomly sampled selected features for each selection algorithm for both five year survival (left) and five year disease free survival (right). Error bars denote standard errors over ten runs with different train-test permutations. This figure was generated using Matplotlib version 3.2.1
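For reference, the redundancy measure described in the caption (average absolute pairwise correlation over a random sample of selected features) can be computed roughly as follows; the feature matrix here is synthetic and purely illustrative of the metric, not a reproduction of the paper's analysis.

```python
import numpy as np

# Synthetic matrix of selected binary features (samples x selected features).
rng = np.random.default_rng(0)
selected = rng.integers(0, 2, size=(200, 1000)).astype(float)

# Feature-by-feature correlation matrix.
corr = np.corrcoef(selected, rowvar=False)

# Average absolute pairwise correlation, excluding the diagonal (self-correlation).
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
redundancy = np.abs(off_diag).mean()
```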
Fig. 5
Harvestman’s knowledge graph and variant encoding scheme. The knowledge graph is composed of the genomic hierarchy (blue boxes) and GO hierarchy (orange boxes). Binary vectors at the genomic hierarchy leaf nodes are determined directly from DNA sequences (shown by green bars, variants in sequence shown by red boxes). Binary vectors at parent nodes are computed by taking the logical OR of their child nodes or directly from the DNA sequence. A GO threshold is determined for each GO term from variant sequences related to its connected gene nodes. We use this threshold to determine a binary vector that reflects whether each sample falls above or below the threshold. This figure was generated using Matplotlib version 3.2.1 and OmniGraffle
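A compact sketch of the encoding described above: leaf nodes carry per-sample binary variant indicators, a parent node's vector is the logical OR of its children, and a GO-term node is binarized against a per-term threshold. The variable names, the toy vectors, and the count-then-threshold rule for the GO term are assumptions made for illustration; the caption does not specify exactly how the threshold is derived.

```python
import numpy as np

# Hypothetical leaf encodings: entry i = 1 if sample i carries that variant.
variant_1 = np.array([1, 0, 0, 1], dtype=bool)
variant_2 = np.array([0, 1, 0, 0], dtype=bool)

# Parent (e.g., gene) node: logical OR of its children's vectors,
# i.e., "this sample carries some variant within this region".
gene = np.logical_or.reduce([variant_1, variant_2])   # [ True  True False  True]

# GO-term node (illustrative rule): count variants across the term's connected
# genes per sample, then binarize at a per-term threshold.
per_sample_counts = np.array([3, 0, 5, 1])
go_threshold = 2
go_term = per_sample_counts > go_threshold            # [ True False  True False]
```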
