Step 1: Specifying input data and processing parameters

First, we create an object of class METData with the function create_METData(). The user must provide genotypic and phenotypic data, basic information about the field experiments (at least longitude and latitude), and optionally environmental covariates, if available. These input data are checked, and warning messages are issued if they are not correctly formatted.
In this example, we use an indica rice dataset from Monteverde et al. (2019), which is included in the package as a “toy dataset”. This multi-year dataset of rice trials contains phenotypic data (four traits), genotypic data, and environmental data for a panel of indica genotypes evaluated across three years at a single location. More information on the datasets is available via ?pheno_indica, ?geno_indica, ?map_indica, ?climate_variables_indica, and ?info_environments_indica.
In this case, environmental covariates by growth stage are directly available and can be used in predictions. These data should be passed to create_METData() via the argument climate_variables. Hence, there is no need to retrieve daily weather data with the package, and the argument compute_climatic_ECs is set to FALSE.
Do not forget to indicate where the plots of clustering analyses should be saved using the argument path_to_save.

library(learnMET)

data("geno_indica")
data("map_indica")
data("pheno_indica")
data("info_environments_indica")
data("climate_variables_indica")

METdata_indica <-
  create_METData(
    geno = geno_indica,
    pheno = pheno_indica,
    climate_variables = climate_variables_indica,
    compute_climatic_ECs = FALSE,
    info_environments = info_environments_indica,
    map = map_indica,
    path_to_save = '~/learnMET_analyses/indica/xgb'
  )
# No soil covariates provided by the user.
# Clustering of env. data starts.
# Clustering of env. data done.

Step 2: Cross-validated model evaluation

Specifying processing parameters and the type of prediction model

The goal of predict_trait_MET_cv() is to assess a given prediction method using a certain type of cross-validation (CV) scenario on the complete training set. The CV schemes covered by the package correspond to those generally evaluated in the related literature on MET (Jarquín et al. (2014); Jarquín et al. (2017); Costa-Neto, Fritsche-Neto, and Crossa (2021)).

Here, we use the CV2 scheme: predicting the performance of genotypes in incomplete field trials. This means that the training set can contain phenotypic information from other genotypes tested in the same environment, or from the same genotype evaluated in other environments. We also define the number of folds, which determines the percentage of phenotypic observations included in each training set, as well as the number of repetitions.
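The CV2 idea can be sketched in base R without the package. This is only an illustration under stated assumptions: the toy data frame, its column names (genotype, environment) and the helper make_cv2_splits() are hypothetical and are not part of learnMET, whose internal splitting may differ. Observations (genotype-environment combinations) are randomly assigned to folds, so a test observation usually has relatives in the training set, either the same genotype in another environment or other genotypes in the same environment.

```r
# Toy MET layout: 6 genotypes x 3 environments (hypothetical names).
set.seed(100)
obs <- expand.grid(
  genotype = paste0("G", 1:6),
  environment = paste0("Env", 1:3),
  stringsAsFactors = FALSE
)

# Assumed CV2 sketch: single observations are randomly assigned to k folds,
# and the whole procedure is repeated several times.
make_cv2_splits <- function(data, nb_folds = 4, repeats = 2) {
  splits <- list()
  for (r in seq_len(repeats)) {
    # Balanced fold labels, shuffled across all observations.
    fold_id <- sample(rep(seq_len(nb_folds), length.out = nrow(data)))
    for (k in seq_len(nb_folds)) {
      splits[[length(splits) + 1]] <- list(
        training = data[fold_id != k, ],
        test = data[fold_id == k, ]
      )
    }
  }
  splits
}

cv2_splits <- make_cv2_splits(obs, nb_folds = 4, repeats = 2)
length(cv2_splits)  # nb_folds * repeats training/test pairs
```

With four folds, each training set holds roughly three quarters of the observations, which is how the number of folds controls the training set size.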

The function predict_trait_MET_cv() also lets the user select a subset of environmental variables from the METData$env_data object to be used in model fitting and predictions, via the argument list_env_predictors.

How does predict_trait_MET_cv() work? When predict_trait_MET_cv() is executed, a list of training/test splits is constructed according to the CV scheme chosen by the user. Each training set in each sub-element of this list is processed (e.g. standardization and removal of predictors with null variance).
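The kind of processing mentioned above can be illustrated with a small base R sketch. It is not the package's actual code: the predictor names are made up, and it assumes the usual practice of estimating the centering and scaling parameters on the training set and reusing them on the test set.

```r
set.seed(1)
# Hypothetical environmental predictors for one training/test split.
train <- data.frame(
  temp = rnorm(8, mean = 25, sd = 2),
  rain = rnorm(8, mean = 100, sd = 20),
  const = rep(3, 8)  # predictor with null variance
)
test <- data.frame(
  temp = rnorm(4, mean = 25, sd = 2),
  rain = rnorm(4, mean = 100, sd = 20),
  const = rep(3, 4)
)

# 1. Keep only predictors with non-zero variance in the training set.
keep <- names(train)[vapply(train, function(x) var(x) > 0, logical(1))]

# 2. Estimate centering and scaling parameters on the training set only.
mu <- colMeans(train[, keep, drop = FALSE])
sigma <- vapply(train[, keep, drop = FALSE], sd, numeric(1))

# 3. Apply the same transformation to the training and test sets.
train_std <- as.data.frame(scale(train[, keep, drop = FALSE], center = mu, scale = sigma))
test_std <- as.data.frame(scale(test[, keep, drop = FALSE], center = mu, scale = sigma))

keep
round(colMeans(train_std), 10)
```

Estimating the parameters on the training set alone keeps information from the test set out of model fitting.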

The function applies a nested CV to obtain an unbiased estimate of generalization performance: an inner CV loop is nested within an outer CV loop. The inner loop is used for model selection, i.e. hyperparameter tuning with Bayesian optimization, while the outer loop is used to evaluate model performance. Additionally, the user can specify a seed to make analyses reproducible. If no seed is provided, a random one is generated and reported in the results file.
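To make the nested structure concrete, here is a minimal base R sketch on simulated data. It deliberately swaps in simpler parts, which are assumptions and not what the package does: ridge regression in closed form stands in for the learner, and a plain grid search over lambda replaces Bayesian optimization. Only the inner/outer logic is the point.

```r
set.seed(100)
n <- 60
X <- matrix(rnorm(n * 5), nrow = n)
y <- as.numeric(X %*% c(1, -1, 0.5, 0, 0) + rnorm(n, sd = 0.5))

# Ridge regression in closed form (stand-in learner, no intercept).
fit_ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
rmse <- function(obs, pred) sqrt(mean((obs - as.numeric(pred))^2))

lambda_grid <- c(0.01, 0.1, 1, 10, 100)
outer_id <- sample(rep(1:4, length.out = n))

outer_results <- do.call(rbind, lapply(1:4, function(f) {
  X_tr <- X[outer_id != f, , drop = FALSE]
  y_tr <- y[outer_id != f]
  X_te <- X[outer_id == f, , drop = FALSE]
  y_te <- y[outer_id == f]

  # Inner loop: 3-fold CV on the outer training set only (model selection).
  inner_id <- sample(rep(1:3, length.out = nrow(X_tr)))
  inner_rmse <- vapply(lambda_grid, function(l) {
    mean(vapply(1:3, function(g) {
      beta <- fit_ridge(X_tr[inner_id != g, , drop = FALSE], y_tr[inner_id != g], l)
      rmse(y_tr[inner_id == g], X_tr[inner_id == g, , drop = FALSE] %*% beta)
    }, numeric(1)))
  }, numeric(1))
  best_lambda <- lambda_grid[which.min(inner_rmse)]

  # Outer loop: refit with the selected value, evaluate on the held-out fold.
  beta <- fit_ridge(X_tr, y_tr, best_lambda)
  data.frame(fold = f, best_lambda = best_lambda, rmse = rmse(y_te, X_te %*% beta))
}))

outer_results
mean(outer_results$rmse)
```

Because the held-out outer fold is never seen during tuning, the averaged outer error is not optimistically biased by the hyperparameter search.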
In predict_trait_MET_cv(), predictive ability is always calculated within the same environment (location–year combination), regardless of how the test sets are defined according to the different CV schemes.
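As an illustration of a within-environment metric, the sketch below computes one value per environment from a table of observed and predicted values. The data are simulated, the column and environment names are hypothetical, and using the Pearson correlation as the measure of predictive ability is an assumption made here for the example.

```r
set.seed(100)
# Simulated test-set predictions for three hypothetical environments.
predictions <- data.frame(
  environment = rep(c("Loc1_2010", "Loc1_2011", "Loc1_2012"), each = 20),
  observed = rnorm(60)
)
predictions$predicted <- 0.7 * predictions$observed + rnorm(60, sd = 0.5)

# Correlation computed separately within each environment.
cor_by_env <- vapply(
  split(predictions, predictions$environment),
  function(d) cor(d$observed, d$predicted),
  numeric(1)
)

cor_by_env
mean(cor_by_env)  # average over environments
```

Computing the metric within each environment keeps differences in environmental means from inflating the apparent accuracy.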

Let’s use the XGBoost algorithm (Chen and Guestrin (2016)) to predict a phenotypic trait!
We recommend using xgb_reg_1 or xgb_reg_2 as the prediction method, as these methods use principal components as genomic features instead of all SNPs as predictor variables. Hence, the model uses fewer features as input and is much faster.
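The dimension reduction behind this recommendation can be sketched with base R's prcomp(). The marker matrix below is simulated and is not geno_indica, and the preprocessing choices (centering, scaling, dropping monomorphic markers) are assumptions for the example rather than a description of the package internals.

```r
set.seed(100)
n_geno <- 50
n_snps <- 500
# Simulated marker matrix coded 0/1/2 (hypothetical data).
geno <- matrix(
  sample(0:2, n_geno * n_snps, replace = TRUE),
  nrow = n_geno,
  dimnames = list(paste0("G", seq_len(n_geno)), paste0("SNP", seq_len(n_snps)))
)
# Drop monomorphic markers, which cannot be scaled.
geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]

num_pcs <- 10
pca <- prcomp(geno, center = TRUE, scale. = TRUE)
# One row per genotype, one column per retained principal component.
pc_features <- pca$x[, seq_len(num_pcs), drop = FALSE]

dim(pc_features)
# Share of marker variance retained by the selected components.
sum(pca$sdev[seq_len(num_pcs)]^2) / sum(pca$sdev^2)
```

The model then receives num_pcs genomic columns instead of one column per SNP, which is where the speed-up comes from.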

Here we also indicate the number of principal components used as genomic predictor variables, specified via the argument num_pcs. All environmental predictors will be used (list_env_predictors is NULL by default).

rescv0_1 <- predict_trait_MET_cv(
  METData = METdata_indica,
  trait = 'GC',
  prediction_method = 'xgb_reg_1',
  use_selected_markers = FALSE,
  num_pcs = 80,
  lat_lon_included = FALSE,
  year_included = FALSE,
  cv_type = 'cv2',
  nb_folds_cv2 = 4,
  repeats_cv2 = 10,
  include_env_predictors = TRUE,
  save_processing = TRUE,
  seed = 100,
  path_folder = '~/INDICA/xgb/cv2'
)

Note that many of the data processing steps based on user-defined parameters, as well as the machine learning methods, rely on functions from the tidymodels (https://www.tidymodels.org/) collection of R packages (Kuhn and Wickham (2020)).

Step 3: Coming soon!

Step 4: Coming soon!

References

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.

Costa-Neto, Germano, Roberto Fritsche-Neto, and José Crossa. 2021. “Nonlinear Kernels, Dominance, and Envirotyping Data Increase the Accuracy of Genome-Based Prediction in Multi-Environment Trials.” Heredity 126 (1): 92–106.

Jarquín, Diego, José Crossa, Xavier Lacaze, Philippe Du Cheyron, Joëlle Daucourt, Josiane Lorgeou, François Piraux, et al. 2014. “A Reaction Norm Model for Genomic Selection Using High-Dimensional Genomic and Environmental Data.” Theoretical and Applied Genetics 127 (3): 595–607.

Jarquín, Diego, Cristiano Lemes da Silva, R Chris Gaynor, Jesse Poland, Allan Fritz, Reka Howard, Sarah Battenfield, and Jose Crossa. 2017. “Increasing Genomic-Enabled Prediction Accuracy by Modeling Genotype × Environment Interactions in Kansas Wheat.” The Plant Genome 10 (2): 1–15.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Evaluation of the performance of a XGBoost model using CV2 on a rice dataset for phenotypic prediction

Cathy Westhues

2022-09-29