The different machine learning-based methods available in the package are presented in the table below. The first column gives the name of the method, as it should be passed to the prediction_method argument of the predict_trait_MET_cv() function used in Step 2.
For each method, the table indicates which data subset(s) (i.e., sub-samplings of the features) are fitted on the training set. Some models, such as stacking ensembles, combine several base learners that are individually fitted on different data subsets.
The type of predictive modeling approach used to fit each data subset (e.g., tree-based methods such as gradient boosted trees or random forests, support vector machines, multilayer perceptrons) is indicated by the method name prefix (xgb = gradient boosted trees, rf = random forests, DL = multilayer perceptrons) and, for the stacking models, noted below the table.
For example, the model stacking_reg_1 uses two SVM models as base learners: one fitted on the training data sub-sampled for marker predictors, and the other fitted on the training data sub-sampled for environmental predictors. A meta-learner (a LASSO model) then determines the weight given to the predictions of each base learner. The final model therefore stacks two base models fitted on the same training set, but sub-sampled with different predictor variables (see the code sketch after the table).
The suffix reg indicates that the method is intended for regression tasks.
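For instance, a method is selected in Step 2 as shown below. This is a minimal sketch: met_data, the trait name and the other argument values are placeholders, and argument names besides prediction_method should be checked against the documentation of your installed version of the package.

```r
# Minimal sketch of Step 2: choosing a prediction method by name.
# `met_data` stands for a METData object created in Step 1; the trait
# name and the other argument values are placeholders.
library(learnMET)

res_cv <- predict_trait_MET_cv(
  METData           = met_data,
  trait             = "yield",       # placeholder trait name
  prediction_method = "xgb_reg_1",   # any name from the first column below
  cv_type           = "cv1",         # cross-validation scenario
  seed              = 100
)
```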

The data subsets are numbered as follows:

(1) Genomic PCs derived from the genotype matrix + environmental predictor variables
(2) Genomic PCs derived from the genomic relationship matrix + environmental predictor variables
(3) All SNP predictor variables + environmental predictor variables
(4) Molecular marker predictor variables only
(5) Environmental predictor variables only
(6) GxE interaction dataset: QTLs with environmental variables
(7) GxE interaction dataset: principal components (from the genotype matrix) with environmental variables

Name of the prediction_method in step 2   Data subset(s) fitted
xgb_reg_1        (1)
xgb_reg_2        (2)
xgb_reg_3        (3)
rf_reg_1         (1)
rf_reg_2         (2)
rf_reg_3         (3)
stacking_reg_1   (4) + (5), combined in a stacking model
stacking_reg_2   (4) + (5) + (6), combined in a stacking model
stacking_reg_3   (4) + (5) + (7), combined in a stacking model
DL_reg_1         (1)
DL_reg_2         (2)
DL_reg_3         (3)

For the stacking methods, each listed subset is fitted by a separate support vector machine base learner (with a linear, radial basis function (RBF), or polynomial kernel).
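To make the stacking logic concrete, the sketch below reproduces the idea behind stacking_reg_1 on simulated data: two SVM base learners are fitted on disjoint predictor subsets (markers vs. environment), and a LASSO meta-learner learns the weight of each base learner from out-of-fold predictions. This is an illustration under stated assumptions, using e1071 and glmnet as stand-ins; it is not learnMET's internal implementation, which is built on tidymodels (Kuhn and Wickham 2020).

```r
library(e1071)   # svm() base learners
library(glmnet)  # cv.glmnet() for the LASSO meta-learner

set.seed(1)
n    <- 200
geno <- matrix(rnorm(n * 50), n, 50)  # placeholder marker predictors
env  <- matrix(rnorm(n * 10), n, 10)  # placeholder environmental predictors
y    <- rnorm(n)                      # placeholder phenotype

# Out-of-fold predictions of the two base learners (5-fold CV),
# so the meta-learner is trained on predictions not seen during fitting
folds <- sample(rep(1:5, length.out = n))
oof   <- matrix(NA_real_, n, 2,
                dimnames = list(NULL, c("svm_markers", "svm_env")))
for (k in 1:5) {
  tr    <- folds != k
  fit_g <- svm(x = geno[tr, ], y = y[tr], kernel = "radial")
  fit_e <- svm(x = env[tr, ],  y = y[tr], kernel = "radial")
  oof[!tr, "svm_markers"] <- predict(fit_g, geno[!tr, ])
  oof[!tr, "svm_env"]     <- predict(fit_e, env[!tr, ])
}

# LASSO meta-learner: non-negative weights over the two base learners
meta <- cv.glmnet(oof, y, alpha = 1, lower.limits = 0)
coef(meta, s = "lambda.min")
```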

References

The machine learning pipelines are implemented with the tidymodels framework (Kuhn and Wickham 2020). The methods build on gradient boosted trees (Friedman 2001; Chen et al. 2015), random forests (Breiman 2001), and stacked ensembles in the spirit of the Super Learner (Van der Laan, Polley, and Hubbard 2007).

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2015. “Xgboost: Extreme Gradient Boosting.” R Package Version 0.4-2 1 (4): 1–4.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29 (5): 1189–1232.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).