J Dairy Sci. 2021 Jul;104(7):8107-8121.
doi: 10.3168/jds.2020-19861. Epub 2021 Apr 15.

Evaluating the performance of machine learning methods and variable selection methods for predicting difficult-to-measure traits in Holstein dairy cattle using milk infrared spectral data

Free article

Lucio F M Mota et al. J Dairy Sci. 2021 Jul.
Abstract

Fourier-transform infrared (FTIR) spectroscopy is a powerful high-throughput phenotyping tool for predicting traits that are expensive and difficult to measure in dairy cattle. Calibration equations are often developed using standard methods, such as partial least squares (PLS) regression. Methods that employ penalization, rank reduction, and variable selection, and that can model the nonlinear relations between phenotype and FTIR spectra, might offer improvements in predictive ability and model robustness. This study aimed to compare the predictive ability of 2 machine learning methods, namely random forest (RF) and gradient boosting machine (GBM), and of penalized regression against PLS regression for predicting 3 phenotypes differing in terms of biological meaning and relationships with milk composition (i.e., phenotypes measurable directly and not directly in milk, reflecting different biological processes that can be captured using milk spectra) in Holstein-Friesian cattle under 2 cross-validation scenarios. The data set comprised phenotypic information from 471 Holstein-Friesian cows, and 3 target phenotypes were evaluated: (1) body condition score (BCS), (2) blood β-hydroxybutyrate (BHB, mmol/L), and (3) κ-casein expressed as a percentage of nitrogen (κ-CN, % N). The data set was split under 2 cross-validation scenarios: samples-out random, in which the population was randomly split into 10 folds (8 folds for training and 1 fold each for validation and testing); and herd/date-out, in which the population was randomly assigned to training (70% of herds), validation (10%), and testing (20% of herds) based on the herd and date on which the samples were collected. A random grid search for hyperparameter optimization was performed using the training subset, and the validation set was used to estimate the generalization of the prediction error. The trained model was then used to assess the final prediction in the testing subset.
The grid search for penalized regression showed that the elastic net (EN) was the best regularization, with an increase in predictive ability of 5%. The performance of PLS (standard model) was compared against 2 machine learning techniques and penalized regression using 2 cross-validation scenarios. Machine learning methods showed a greater predictive ability for BCS (0.63 for GBM and 0.61 for RF), BHB (0.80 for GBM and 0.79 for RF), and κ-CN (0.81 for GBM and 0.80 for RF) in samples-out cross-validation. In herd/date-out cross-validation, these values were 0.58 (GBM and RF) for BCS, 0.73 (GBM and RF) for BHB, and 0.77 (GBM and RF) for κ-CN. The GBM model tended to outperform the other methods in predictive ability, by around 4%, 1%, and 7% relative to EN, RF, and PLS, respectively. The prediction accuracies of the GBM and RF models were similar, and differed statistically from that of the PLS model in samples-out random cross-validation. Although machine learning techniques outperformed PLS in herd/date-out cross-validation, no significant differences in predictive ability were observed because of the large standard deviation of the predictions. Overall, GBM achieved the highest accuracy of FTIR-based prediction of the different phenotypic traits across the cross-validation scenarios. These results indicate that GBM is a promising method for obtaining more accurate FTIR-based predictions for different phenotypes in dairy cattle.
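
The two data-splitting scenarios described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the herd labels, the number of herds (20), and the rule that the validation fold is the one following the test fold are all assumptions made for illustration; only the 471 cows, the 10 folds, and the 70/10/20 proportions come from the abstract.

```python
import random
from collections import defaultdict


def samples_out_folds(n_samples, n_folds=10, seed=0):
    """Randomly assign sample indices to n_folds folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def rotate_roles(folds, test_fold):
    """One fold for testing, the next for validation (assumed rule),
    and the remaining folds for training."""
    k = len(folds)
    val_fold = (test_fold + 1) % k
    train = [i for f, fold in enumerate(folds)
             if f not in (test_fold, val_fold) for i in fold]
    return train, folds[val_fold], folds[test_fold]


def herd_date_out_split(groups, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole herd/date groups to training, validation, or testing,
    so that no group contributes samples to more than one subset."""
    rng = random.Random(seed)
    unique = sorted(set(groups))
    rng.shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    role = {}
    for pos, g in enumerate(unique):
        if pos < n_train:
            role[g] = "train"
        elif pos < n_train + n_val:
            role[g] = "val"
        else:
            role[g] = "test"
    split = defaultdict(list)
    for idx, g in enumerate(groups):
        split[role[g]].append(idx)
    return split["train"], split["val"], split["test"]


# Samples-out random: 471 cows, 10 folds, 8/1/1 roles.
folds = samples_out_folds(471)
train, val, test = rotate_roles(folds, test_fold=0)

# Herd/date-out: hypothetical labels for 20 herds.
groups = [f"herd{i % 20}" for i in range(471)]
h_train, h_val, h_test = herd_date_out_split(groups)
```

The point of the second split is that samples from a held-out herd never appear in training, which is why it is usually the harder and more realistic test of a calibration equation than random sample-level folds.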

Keywords: dairy cattle; gradient boosting machine; milk spectra; phenotypic prediction.
