Phenotypic prediction of unobserved data. — predict_trait

Implements trait prediction from SNP and environmental data, with a choice of prediction methods among machine learning approaches.

This function should be used to assess the predictive ability according to a cross-validation scheme determined by the user.

predict_trait_MET(
  METData_training,
  METData_new,
  trait,
  prediction_method,
  use_selected_markers = F,
  list_selected_markers_manual = NULL,
  lat_lon_included = F,
  year_included = F,
  include_env_predictors = T,
  list_env_predictors = NULL,
  seed = NULL,
  save_processing = T,
  path_folder,
  save_model = F,
  ...
)

Arguments

METData_training	`list` An object created by the function `create_METData()` that contains the training set.
METData_new	`list` An object created by the function `create_METData()` that contains the test set (no phenotypic observations).
trait	`character` Name of the trait to predict. An ordinal trait should be encoded as `integer`.
prediction_method	`character` specifying the predictive model to use. Options are currently `xgb_reg_1` (gradient boosted trees), `xgb_reg_2`, `xgb_reg_3`, `DL_reg_1` (multilayer perceptrons), `DL_reg_2`, `DL_reg_3`, `stacking_reg_1` (stacked models), `stacking_reg_2`, `stacking_reg_3`, `rf_reg_1`, `rf_reg_2`, `rf_reg_3`.
use_selected_markers	`logical` indicating whether to use a subset of markers obtained from a previous step (see function `select_markers()`).
lat_lon_included	`logical` indicating whether longitude and latitude data should be used as numeric predictors. Default is `FALSE`.
year_included	`logical` indicating whether the year factor should be used as a predictor variable. Default is `FALSE`.
include_env_predictors	A `logical` indicating whether environmental covariates characterizing each environment should be used in predictions.
list_env_predictors	A `character` vector containing the names of the environmental predictors which should be used in predictions. By default `NULL`: all environmental predictors included in the env_data table of the `METData` object will be used.
seed	`integer` Seed value. Default is `NULL`, in which case a random seed is generated.
save_processing	`logical` indicating whether the processing steps obtained from the `processing_train_test_split()` or `processing_train_test_split_kernel()` functions should be saved in a .RDS object. Default is `TRUE`.
path_folder	`character` indicating the full path of the folder where the .RDS object and plots generated during the analysis should be saved (do not add a slash after the name of the last folder).
save_model	`logical` indicating whether the fitted model for each training-test partition should be saved. Default is `FALSE`. Note that some models (e.g. stacked models) can require a large amount of memory.
...	Arguments passed to the `processing_train_test_split()`, `processing_train_test_split_kernel()`, `reg_fitting_train_test_split()`, `reg_fitting_train_test_split_kernel()` functions.
cv_type	A `character` with one out of `cv0` (prediction of new environments), `cv00` (prediction of new genotypes in new environments), `cv1` (prediction of new genotypes) or `cv2` (prediction of incomplete field trials). Default is `cv0`.
cv0_type	A `character` with one out of `leave-one-environment-out`, `leave-one-site-out`, `leave-one-year-out`, `forward-prediction`. Default is `leave-one-environment-out`.
nb_folds_cv1	A `numeric` Number of folds used in the CV1 scheme. Default is 5.
repeats_cv1	A `numeric` Number of repeats in the CV1 scheme. Default is 50.
nb_folds_cv2	A `numeric` Number of folds used in the CV2 scheme. Default is 5.
repeats_cv2	A `numeric` Number of repeats in the CV2 scheme. Default is 50.
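
To make the `leave-one-environment-out` option of `cv0_type` concrete, the following base R sketch builds one training/test split per environment from a toy table. The table and its column names (`geno_ID`, `IDenv`, `yield`) are hypothetical, and this is only an illustration of the idea, not the package's own splitting code.

```r
# Toy phenotypic table; data and column names are hypothetical.
pheno <- data.frame(
  geno_ID = rep(c("G1", "G2", "G3"), times = 3),
  IDenv   = rep(c("SiteA_2020", "SiteB_2020", "SiteA_2021"), each = 3),
  yield   = c(5.1, 4.8, 5.6, 6.0, 5.7, 6.3, 4.9, 4.5, 5.2)
)

# One split per environment: the held-out environment is the test set,
# all remaining environments form the training set.
envs <- unique(pheno$IDenv)
splits <- lapply(envs, function(env) {
  list(
    training = pheno[pheno$IDenv != env, ],
    test     = pheno[pheno$IDenv == env, ]
  )
})
names(splits) <- envs

# Number of training and test rows in each split.
sapply(splits, function(s) c(n_train = nrow(s$training), n_test = nrow(s$test)))
```

Under the same logic, `leave-one-site-out` or `leave-one-year-out` would presumably hold out all environments sharing a site or a year instead of a single environment.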

Value

A list object of class met_cv with the following items:

list_results_cv: list of res_fitted_split elements. Detailed prediction results for each split of the data within each element of this list.
seed_used: integer Seed used to generate the cross-validation splits.
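
Examples

A minimal sketch of a call, using only arguments listed in the usage above. It assumes that the package providing `predict_trait_MET()` is attached and that the two `METData` objects were created beforehand with `create_METData()`. The wrapper name `run_prediction`, the trait name `"yield"` and the seed value are hypothetical placeholders.

```r
# Hypothetical wrapper: nothing is fitted until the function is called
# with two METData objects and an existing output folder.
run_prediction <- function(METData_train, METData_test, out_dir) {
  predict_trait_MET(
    METData_training       = METData_train,
    METData_new            = METData_test,
    trait                  = "yield",      # placeholder trait name
    prediction_method      = "xgb_reg_1",  # gradient boosted trees
    use_selected_markers   = FALSE,
    lat_lon_included       = FALSE,
    year_included          = FALSE,
    include_env_predictors = TRUE,
    list_env_predictors    = NULL,         # NULL: use all environmental predictors
    seed                   = 100,          # fixed seed, arbitrary value
    save_processing        = TRUE,
    path_folder            = out_dir,      # no trailing slash
    save_model             = FALSE
  )
}
```

A call such as `res <- run_prediction(my_training_set, my_new_set, "/path/to/output")` would then return the object described under Value; the object names and the path here are placeholders as well.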

Author

Cathy C. Westhues cathy.jubin@uni-goettingen.de