Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods
Journal article
Li, Bo, Zhang, Nanxi, Wang, You-Gan, George, Andrew W., Reverter, Antonio and Li, Yutao. (2018). Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Frontiers in Genetics. 9, p. Article 237. https://doi.org/10.3389/fgene.2018.00237
Authors | Li, Bo, Zhang, Nanxi, Wang, You-Gan, George, Andrew W., Reverter, Antonio and Li, Yutao |
---|---|
Abstract | The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has not previously been explored. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) the identification of top 400, 1,000, and 3,000 ranked SNPs; (b) using the subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) all the SNPs. We found that RF and especially GBM are efficient methods in identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. In comparison to the estimate of prediction accuracy of GEBVs from using all SNPs (0.43), the 3,000 top SNPs identified by RF (0.42) and GBM (0.46) had similar values to those of the whole SNP panel. The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29). Of the three methods, RF and GBM consistently outperformed XgBoost in genomic prediction accuracy. |
Keywords | machine learning methods; single nucleotide polymorphisms; genomic prediction; breeding values; beef cattle; live weight |
Year | 2018 |
Journal | Frontiers in Genetics |
Journal citation | 9, p. Article 237 |
Publisher | Frontiers Media S.A. |
ISSN | 1664-8021 |
Digital Object Identifier (DOI) | https://doi.org/10.3389/fgene.2018.00237 |
PubMed ID | 30023001 |
Scopus EID | 2-s2.0-85049652826 |
PubMed Central ID | PMC6039760 |
Open access | Published as ‘gold’ (paid) open access |
Page range | 1-20 |
Funder | Shandong Provincial Education Department; Shandong Provincial Science and Technology Development Program |
Publisher's version | License: CC BY 4.0; File access level: Open |
Output status | Published |
Publication dates | |
Online | 04 Jul 2018 |
Publication process dates | |
Accepted | 14 Jun 2018 |
Deposited | 06 Nov 2023 |
Grant ID | J16LN14; 2014GGX101044 |
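
The workflow summarised in the abstract (rank SNPs, keep the top k, build a GRM from that subset) can be illustrated with a minimal sketch. This is not the authors' code. It makes several assumptions that are not in the record: genotypes are coded 0/1/2, the GRM uses the VanRaden method-1 formula, the data are simulated, and the importance score is a simple absolute covariance with the phenotype standing in for the RF, GBM or XgBoost variable-importance rankings. The function names `top_k_snps` and `grm_from_subset` are hypothetical.

```python
import numpy as np


def top_k_snps(importance, k):
    """Return the indices of the k highest-scoring SNPs, best first."""
    order = np.argsort(importance)[::-1]
    return order[:k]


def grm_from_subset(genotypes, snp_idx):
    """Build a genomic relationship matrix from the selected SNP columns.

    Assumption: genotypes are coded 0/1/2 and the VanRaden method-1
    formula G = ZZ' / (2 * sum(p * (1 - p))) is used. The abstract does
    not state which GRM formula the authors applied.
    """
    M = genotypes[:, snp_idx].astype(float)
    p = M.mean(axis=0) / 2.0          # allele frequency per selected SNP
    Z = M - 2.0 * p                   # centre each SNP column
    denom = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / denom


# Simulated toy data (assumption: far smaller than the 2,093 x 38,082 study data).
rng = np.random.default_rng(0)
n_animals, n_snps, k = 50, 500, 40
genotypes = rng.binomial(2, 0.3, size=(n_animals, n_snps))
phenotype = rng.normal(size=n_animals)

# Stand-in importance score: absolute covariance between each SNP and the
# phenotype. In the study this ranking comes from RF, GBM or XgBoost instead.
centred_geno = genotypes - genotypes.mean(axis=0)
centred_pheno = phenotype - phenotype.mean()
importance = np.abs(centred_geno.T @ centred_pheno)

idx = top_k_snps(importance, k)
G = grm_from_subset(genotypes, idx)
print(G.shape)
```

The resulting matrix `G` would then be passed to a mixed-model (GBLUP-style) solver to estimate GEBVs; that step is omitted here because the record does not describe the software used.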
https://acuresearchbank.acu.edu.au/item/8zy4y/genomic-prediction-of-breeding-values-using-a-subset-of-snps-identified-by-three-machine-learning-methods
Download files
Publisher's version
OA_Li_2018_Genomic_prediction_of_breeding_values_using.pdf | |
License: CC BY 4.0 | |
File access level: Open |