Predicts a trait from SNP and environmental data, using a machine learning method selected by the user. Use this function to assess predictive ability under a cross-validation scheme chosen by the user.

predict_trait_MET_cv(
  METData,
  trait,
  prediction_method,
  lat_lon_included = FALSE,
  year_included = FALSE,
  cv_type = "cv0",
  cv0_type = "leave-one-environment-out",
  nb_folds_cv1 = 5,
  repeats_cv1 = 50,
  nb_folds_cv2 = 5,
  repeats_cv2 = 50,
  include_env_predictors = TRUE,
  list_env_predictors = NULL,
  use_selected_markers = FALSE,
  list_selected_markers_manual = NULL,
  seed = NULL,
  save_splits = FALSE,
  save_processing = FALSE,
  path_folder,
  save_model = FALSE,
  ...
)

Arguments

METData

list An object created by create_METData(), the initial function of the package.

trait

character Name of the trait to predict.

prediction_method

character Name of the predictive model to use. Options are currently xgb_reg_1, xgb_reg_2, xgb_reg_3 (gradient boosted trees), DL_reg_1, DL_reg_2, DL_reg_3 (multilayer perceptrons), stacking_reg_1, stacking_reg_2, stacking_reg_3 (stacked models), and rf_reg_1, rf_reg_2, rf_reg_3 (random forests).

lat_lon_included

logical indicating whether longitude and latitude data should be used as numeric predictors. Default is FALSE.

year_included

logical indicating whether the year factor should be used as a predictor variable. Default is FALSE.

cv_type

character One of cv0 (prediction of new environments), cv00 (prediction of new genotypes in new environments), cv1 (prediction of new genotypes) or cv2 (prediction of incomplete field trials). Default is cv0.

cv0_type

character One of leave-one-environment-out, leave-one-site-out, leave-one-year-out or forward-prediction. Default is leave-one-environment-out.
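
To illustrate what a leave-one-environment-out scheme means, here is a minimal base R sketch. It is not the package's internal code: the table `pheno` and its columns `geno_ID` and `IDenv` are hypothetical, and an environment is assumed here to be a location-by-year combination.

```r
# Hypothetical phenotype table: 3 genotypes observed in 4 environments
# (an environment is taken here to be a location x year combination).
pheno <- expand.grid(
  geno_ID = c("G1", "G2", "G3"),
  IDenv = c("LocA_2019", "LocA_2020", "LocB_2019", "LocB_2020"),
  stringsAsFactors = FALSE
)

# One partition per environment: the held-out environment is the test set,
# all remaining environments form the training set.
envs <- unique(pheno$IDenv)
loeo_splits <- lapply(envs, function(env) {
  list(
    test_env = env,
    training = pheno[pheno$IDenv != env, ],
    test = pheno[pheno$IDenv == env, ]
  )
})
names(loeo_splits) <- envs

length(loeo_splits)  # one partition per environment
```

Under the same assumptions, leave-one-site-out and leave-one-year-out would hold out all rows sharing a location or a year instead of a single environment.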

nb_folds_cv1

numeric Number of folds used in the CV1 scheme. Default is 5.

repeats_cv1

numeric Number of repeats in the CV1 scheme. Default is 50.
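
As a rough illustration of how folds and repeats combine, the base R sketch below builds repeated k-fold partitions over genotypes, as one might for a cv1-type scheme (prediction of new genotypes). It is one common construction, not necessarily the one used inside the package; the genotype identifiers are hypothetical and the number of repeats is reduced to keep the output small.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

genotypes <- paste0("G", 1:20)  # hypothetical genotype identifiers
nb_folds <- 5
repeats <- 2  # kept small here; the documented default is 50

partitions <- list()
for (r in seq_len(repeats)) {
  # Shuffle fold labels so that each repeat yields a different assignment
  fold_id <- sample(rep(seq_len(nb_folds), length.out = length(genotypes)))
  for (k in seq_len(nb_folds)) {
    partitions[[length(partitions) + 1]] <- list(
      rep = r,
      fold = k,
      test_genotypes = genotypes[fold_id == k],
      training_genotypes = genotypes[fold_id != k]
    )
  }
}

length(partitions)  # nb_folds * repeats partitions
```

With this construction, the defaults of 5 folds and 50 repeats would give 250 training/test partitions, so the repeats argument has a direct effect on run time.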

nb_folds_cv2

numeric Number of folds used in the CV2 scheme. Default is 5.

repeats_cv2

numeric Number of repeats in the CV2 scheme. Default is 50.

include_env_predictors

logical indicating whether environmental covariates characterizing each environment should be used in predictions. Default is TRUE.

list_env_predictors

character Vector containing the names of the environmental predictors that should be used in predictions. Default is NULL, in which case all environmental predictors included in the env_data table of the METData object are used.

use_selected_markers

logical indicating whether to use, as predictor variables, a subset of markers identified via single-environment GWAS or from the table of marker effects obtained via Elastic Net, when main genetic effects are modeled with principal components. Default is FALSE.
If use_selected_markers is TRUE and list_selected_markers_manual is NULL, the select_markers() function is called in the pipeline. For more details, see select_markers().

seed

integer Seed value. Default is NULL, in which case a random seed is generated.

save_splits

logical indicating whether the train/test splits should be saved. Default is FALSE.

save_processing

logical indicating whether the processing steps obtained from the get_splits_processed_with_method() functions should be saved in a .RDS object. Default is FALSE.

path_folder

character Full path of the folder where the .RDS object and plots generated during the analysis should be saved (do not add a trailing slash after the name of the last folder). Default is NULL.

save_model

logical indicating whether the fitted model for each training-test partition should be saved. Default is FALSE. Note that some models (e.g. stacked models) can require a large amount of memory.

...

Arguments passed to the get_splits_processed_with_method() function.

Value

A list object of class met_cv with the following items:

list_results_cv

list of res_fitted_split elements. The length of this list corresponds to the number of training/test set partitions.

seed_used

integer Seed used to generate the cross-validation splits.

cv_type

character Type of cross-validation scheme used.
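
A call and a first look at the returned object might be sketched as follows. Only argument names from the usage above are used; the values are illustrative. The trait name "yield", the object `my_METData` (assumed to have been created beforehand with create_METData()), the output folder, and the package name learnMET used for the namespace check are assumptions of this sketch. The call is skipped if the package or the object is not available.

```r
# Arguments taken from the usage of predict_trait_MET_cv(); values are illustrative.
cv_args <- list(
  trait = "yield",                                      # hypothetical trait name
  prediction_method = "xgb_reg_1",
  cv_type = "cv0",
  cv0_type = "leave-one-environment-out",
  include_env_predictors = TRUE,
  seed = 100,
  path_folder = file.path(tempdir(), "met_cv_results")  # hypothetical folder, no trailing slash
)

# `my_METData` is a hypothetical object assumed to come from create_METData().
if (requireNamespace("learnMET", quietly = TRUE) && exists("my_METData")) {
  res <- do.call(
    learnMET::predict_trait_MET_cv,
    c(list(METData = my_METData), cv_args)
  )

  length(res$list_results_cv)  # number of training/test partitions
  res$seed_used                # seed used to generate the splits
  res$cv_type                  # cross-validation scheme
}
```

Setting seed explicitly, rather than leaving it NULL, makes it possible to regenerate the same splits in a later run.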

Author

Cathy C. Westhues cathy.jubin@uni-goettingen.de