Predict a trait from SNP and environmental data, using a machine learning method selected by the user.

This function should be used to assess the predictive ability according to a cross-validation scheme determined by the user.

predict_trait_MET(
  METData_training,
  METData_new,
  trait,
  prediction_method,
  use_selected_markers = FALSE,
  list_selected_markers_manual = NULL,
  lat_lon_included = FALSE,
  year_included = FALSE,
  include_env_predictors = TRUE,
  list_env_predictors = NULL,
  seed = NULL,
  save_processing = TRUE,
  path_folder,
  save_model = FALSE,
  ...
)

Arguments

METData_training

list An object created by the function create_METData() that contains the training set.

METData_new

list An object created by the function create_METData() that contains the test set (no phenotypic observations).

trait

character Name of the trait to predict. An ordinal trait should be encoded as integer.

prediction_method

character Name of the predictive model to use. Options are currently xgb_reg_1 (gradient boosted trees), xgb_reg_2, xgb_reg_3, DL_reg_1 (multilayer perceptrons), DL_reg_2, DL_reg_3, stacking_reg_1 (stacked models), stacking_reg_2, stacking_reg_3, rf_reg_1, rf_reg_2, rf_reg_3.

use_selected_markers

logical Whether to use a subset of markers obtained from a previous step (see function select_markers()). Default is FALSE.

lat_lon_included

logical Whether longitude and latitude data should be used as numeric predictors. Default is FALSE.

year_included

logical Whether the year factor should be used as a predictor variable. Default is FALSE.

include_env_predictors

logical Whether environmental covariates characterizing each environment should be used in predictions. Default is TRUE.

list_env_predictors

character vector Names of the environmental predictors to use in predictions. Default is NULL, in which case all environmental predictors included in the env_data table of the METData object are used.

seed

integer Seed value. Default is NULL, in which case a random seed is generated.

save_processing

logical Whether the processing steps obtained from the processing_train_test_split() or processing_train_test_split_kernel() functions should be saved in a .RDS object. Default is TRUE.

path_folder

character Full path of the folder where the .RDS object and the plots generated during the analysis should be saved (do not add a slash after the name of the last folder).

save_model

logical Whether the fitted model for each training-test partition should be saved. Default is FALSE. Note that some models (e.g. stacked models) can require a large amount of memory.

...

Arguments passed to the processing_train_test_split(), processing_train_test_split_kernel(), reg_fitting_train_test_split(), reg_fitting_train_test_split_kernel() functions.

cv_type

character One of cv0 (prediction of new environments), cv00 (prediction of new genotypes in new environments), cv1 (prediction of new genotypes) or cv2 (prediction of incomplete field trials). Default is cv0.

cv0_type

character One of leave-one-environment-out, leave-one-site-out, leave-one-year-out, forward-prediction. Default is leave-one-environment-out.

nb_folds_cv1

numeric Number of folds used in the CV1 scheme. Default is 5.

repeats_cv1

numeric Number of repeats in the CV1 scheme. Default is 50.

nb_folds_cv2

numeric Number of folds used in the CV2 scheme. Default is 5.

repeats_cv2

numeric Number of repeats in the CV2 scheme. Default is 50.
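
The leave-one-environment-out, leave-one-site-out and leave-one-year-out options of cv0_type differ only in which records are held out together. The sketch below illustrates this in base R on a toy table. It is not the package's own splitting code: the column names (geno_ID, location, year, IDenv), the helper leave_one_group_out() and the definition of an environment as one location-year combination are all assumptions made for the example.

```r
# Toy table of field-trial records; all names here are illustrative only.
trials <- data.frame(
  geno_ID  = rep(c("G1", "G2", "G3"), times = 4),
  location = rep(c("LocA", "LocA", "LocB", "LocB"), each = 3),
  year     = rep(c(2019, 2020, 2019, 2020), each = 3)
)

# Assumption: an environment is one location-year combination.
trials$IDenv <- paste(trials$location, trials$year, sep = "_")

# Hypothetical helper: build one training/test partition per level of the
# grouping column. The held-out group is the test set, the rest is training.
leave_one_group_out <- function(data, group_col) {
  groups <- unique(data[[group_col]])
  splits <- lapply(groups, function(g) {
    held_out <- data[[group_col]] == g
    list(training = data[!held_out, ], test = data[held_out, ])
  })
  names(splits) <- as.character(groups)
  splits
}

splits_env  <- leave_one_group_out(trials, "IDenv")     # leave-one-environment-out
splits_site <- leave_one_group_out(trials, "location")  # leave-one-site-out
splits_year <- leave_one_group_out(trials, "year")      # leave-one-year-out

length(splits_env)                 # one partition per environment
nrow(splits_year[["2019"]]$test)   # records observed in 2019
```

With this toy table, holding out an environment removes three records from the training set, whereas holding out a site or a year removes six, so the three options are increasingly demanding tests of extrapolation.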

Value

A list object of class met_cv with the following items:

list_results_cv

list of res_fitted_split elements. Each element holds the detailed prediction results for one split of the data.

seed_used

integer Seed used to generate the cross-validation splits.

Author

Cathy C. Westhues cathy.jubin@uni-goettingen.de
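
Examples

Several arguments carry constraints stated under Arguments: prediction_method must be one of the listed names, path_folder must not end with a slash, and an ordinal trait should be encoded as integer. The sketch below shows one way to check these in base R before calling the function. The helpers check_method() and strip_trailing_slash(), the toy lodging variable and the trait name "yield" are hypothetical and not part of the package. The final call is left commented out because it needs METData objects that are not created here.

```r
# Method names copied from the prediction_method entry under Arguments.
allowed_methods <- c(
  "xgb_reg_1", "xgb_reg_2", "xgb_reg_3",
  "DL_reg_1", "DL_reg_2", "DL_reg_3",
  "stacking_reg_1", "stacking_reg_2", "stacking_reg_3",
  "rf_reg_1", "rf_reg_2", "rf_reg_3"
)

# Hypothetical helper: stop early on an unknown method name.
check_method <- function(method) {
  if (!is.character(method) || length(method) != 1 ||
      !method %in% allowed_methods) {
    stop("Unknown prediction_method: ", method)
  }
  method
}

# Hypothetical helper: drop any trailing slash from a folder path.
strip_trailing_slash <- function(path) sub("/+$", "", path)

# An ordinal trait stored as an ordered factor, recoded as integer scores
# that follow the order of the levels (low = 1, medium = 2, high = 3).
lodging <- factor(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"), ordered = TRUE)
lodging_int <- as.integer(lodging)

method  <- check_method("xgb_reg_1")
out_dir <- strip_trailing_slash(file.path(tempdir(), "met_results/"))

# With METData objects at hand, the call could then look like this
# ("yield" is a placeholder trait name):
# res <- predict_trait_MET(
#   METData_training = METData_training,
#   METData_new = METData_new,
#   trait = "yield",
#   prediction_method = method,
#   seed = 100,
#   path_folder = out_dir
# )
```

Passing an explicit seed, as in the commented call, makes a run repeatable; leaving it NULL lets the function generate one.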