Cross-validation procedure for phenotypic prediction of crop varieties.

Implements trait prediction from SNP and environmental data, using a prediction method selected among several machine learning approaches. Use this function to assess predictive ability under a cross-validation scheme chosen by the user.

predict_trait_MET_cv(
  METData,
  trait,
  prediction_method,
  lat_lon_included = F,
  year_included = F,
  cv_type = "cv0",
  cv0_type = "leave-one-environment-out",
  nb_folds_cv1 = 5,
  repeats_cv1 = 50,
  nb_folds_cv2 = 5,
  repeats_cv2 = 50,
  include_env_predictors = T,
  list_env_predictors = NULL,
  use_selected_markers = F,
  list_selected_markers_manual = NULL,
  seed = NULL,
  save_splits = F,
  save_processing = F,
  path_folder,
  save_model = F,
  ...
)

Arguments

METData	`list` An object created by the initial function of the package `create_METData()`.
trait	`character` Name of the trait to predict.
prediction_method	`character` Name of the predictive model to use. Options are currently `xgb_reg_1` (gradient boosted trees), `xgb_reg_2`, `xgb_reg_3`, `DL_reg_1` (multilayer perceptrons), `DL_reg_2`, `DL_reg_3`, `stacking_reg_1` (stacked models), `stacking_reg_2`, `stacking_reg_3`, `rf_reg_1`, `rf_reg_2`, `rf_reg_3`.
lat_lon_included	`logical` indicates if longitude and latitude data should be used as numeric predictors. Default is `FALSE`.
year_included	`logical` indicates if year factor should be used as predictor variable. Default is `FALSE`.
cv_type	A `character`, one of `cv0` (prediction of new environments), `cv00` (prediction of new genotypes in new environments), `cv1` (prediction of new genotypes) or `cv2` (prediction of incomplete field trials). Default is `cv0`.
cv0_type	A `character`, one of `leave-one-environment-out`, `leave-one-site-out`, `leave-one-year-out` or `forward-prediction`. Default is `leave-one-environment-out`.
nb_folds_cv1	A `numeric` Number of folds used in the CV1 scheme. Default is 5.
repeats_cv1	A `numeric` Number of repeats in the CV1 scheme. Default is 50.
nb_folds_cv2	A `numeric` Number of folds used in the CV2 scheme. Default is 5.
repeats_cv2	A `numeric` Number of repeats in the CV2 scheme. Default is 50.
include_env_predictors	A `logical` indicating whether environmental covariates characterizing each environment should be used in predictions. Default is `TRUE`.
list_env_predictors	A `character` vector containing the names of the environmental predictors which should be used in predictions. By default `NULL`: all environmental predictors included in the env_data table of the `METData` object will be used.
use_selected_markers	A `logical` indicating whether to use, as predictor variables, a subset of markers identified via single-environment GWAS or from the table of marker effects obtained via Elastic Net, when main genetic effects are modeled with principal components. If `use_selected_markers` is `TRUE` and `list_selected_markers_manual` is `NULL`, the `select_markers()` function is called in the pipeline. For more details, see `select_markers()`. Default is `FALSE`.
seed	`integer` Seed value. Default is `NULL`, in which case a random seed is generated.
save_splits	A `logical` indicating whether the train/test splits should be saved. Default is `FALSE`.
save_processing	A `logical` indicating whether the processing steps obtained from the `get_splits_processed_with_method()` function should be saved in a .RDS object. Default is `FALSE`.
path_folder	A `character` indicating the full path where the .RDS object and the plots generated during the analysis should be saved (do not add a trailing slash after the name of the last folder). Default is `NULL`.
save_model	A `logical` indicating whether the fitted model for each training-test partition should be saved. Default is `FALSE`. Note that some models (e.g. stacked models) can require a large amount of memory.
...	Arguments passed to the `get_splits_processed_with_method()` function.
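
The schemes selected through `cv_type` and `cv0_type` can be pictured with a small base R sketch. This is not the package's internal implementation: the toy data frame, its column names (`genotype`, `environment`) and the two helper functions are assumptions made for illustration only.

```r
# Toy multi-environment trial: 6 genotypes observed in 3 environments.
# Column names and helper functions are hypothetical, for illustration only.
pheno <- expand.grid(
  genotype = paste0("G", 1:6),
  environment = c("SiteA_2019", "SiteA_2020", "SiteB_2020"),
  stringsAsFactors = FALSE
)

# cv0, leave-one-environment-out: each environment is held out once
# as the test set, and all other environments form the training set.
splits_cv0_loeo <- function(data) {
  envs <- unique(data$environment)
  splits <- lapply(envs, function(e) {
    list(
      training = which(data$environment != e),
      test = which(data$environment == e)
    )
  })
  names(splits) <- envs
  splits
}

# cv1: genotypes (not single observations) are assigned to folds, so the
# test genotypes never appear in the training set. The random assignment
# is repeated `repeats` times.
splits_cv1 <- function(data, nb_folds, repeats, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  genos <- unique(data$genotype)
  splits <- list()
  for (r in seq_len(repeats)) {
    fold_id <- sample(rep(seq_len(nb_folds), length.out = length(genos)))
    for (k in seq_len(nb_folds)) {
      test_genos <- genos[fold_id == k]
      splits[[length(splits) + 1]] <- list(
        training = which(!data$genotype %in% test_genos),
        test = which(data$genotype %in% test_genos)
      )
    }
  }
  splits
}

cv0_splits <- splits_cv0_loeo(pheno)
cv1_splits <- splits_cv1(pheno, nb_folds = 3, repeats = 2, seed = 42)

length(cv0_splits)  # one split per environment
length(cv1_splits)  # nb_folds * repeats splits
```

In this sketch the number of partitions is the number of environments for leave-one-environment-out, and `nb_folds * repeats` for the repeated k-fold scheme.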

Value

A list object of class met_cv with the following items:

list_results_cv: list of res_fitted_split elements. The length of this list corresponds to the number of training/test set partitions.
seed_used: integer Seed used to generate the cross-validation splits.
cv_type: character Cross-validation scheme used.
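
A call might look like the following sketch, which uses only the arguments and return items documented on this page. The wrapper name `run_cv0_example`, the trait name `"yield"`, the seed value and the arguments `met_data` and `out_dir` are hypothetical; `met_data` is assumed to be an object returned by `create_METData()`.

```r
# Hypothetical wrapper around predict_trait_MET_cv().
# `met_data`: assumed to be an object created by create_METData().
# `out_dir`: output folder, given without a trailing slash.
# "yield" is a placeholder trait name, not taken from the documentation.
run_cv0_example <- function(met_data, out_dir) {
  res <- predict_trait_MET_cv(
    METData = met_data,
    trait = "yield",
    prediction_method = "xgb_reg_1",
    cv_type = "cv0",
    cv0_type = "leave-one-environment-out",
    include_env_predictors = TRUE,
    seed = 100,
    path_folder = out_dir
  )
  # Summarise the items described in the Value section.
  list(
    n_splits = length(res$list_results_cv),
    seed_used = res$seed_used,
    cv_type = res$cv_type
  )
}
```

Fixing `seed` makes the generated splits reproducible across runs, which helps when comparing several values of `prediction_method` on the same partitions.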

Author

Cathy C. Westhues cathy.jubin@uni-goettingen.de