Step 1: Specifying input data and processing parameters

First, we create an object of class with the function create_METData(). The user must provide as input data genotypic and phenotypic data, as well as basic information about the field experiments (e.g. longitude, latitude data at least), and possibly environmental covariates (if available). These input data are checked and warning messages are given as output if the data are not correctly formatted.
In this example, we use an indica rice dataset from Monteverde et al. (2019), which is implemented in the package as a “toy dataset”. From this study, a multi-year dataset of rice trials containing phenotypic data (four traits), genotypic and environmental data for a panel of indica genotypes across three years in a single location is available (more information on the datasets with ?pheno_indica,?geno_indica,?map_indica,?climate_variables_indica,?info_environments_indica).
In this case, environmental covariates by growth stage are directly available and can be used in predictions. These data should be provided as input in create_METData() using the argument climate_variables. Hence, there is no need to retrieve with the package any daily weather data (hence compute_climatic_ECs argument set as FALSE).
Do not forget to indicate where the plots of clustering analyses should be saved using the argument path_to_save.

library(learnMET)

data("geno_indica")
data("map_indica")
data("pheno_indica")
data("info_environments_indica")
data("climate_variables_indica")

METdata_indica <-
  learnMET::create_METData(
    geno = geno_indica,
    pheno = pheno_indica,
    climate_variables = climate_variables_indica,
    compute_climatic_ECs = F,
    info_environments = info_environments_indica,
    map = map_indica,
    path_to_save = '~/learnMET_analyses/indica/stacking'
  )
# No soil covariates provided by the user.
# Clustering of env. data starts.
# Clustering of env. data done.

The function print.summary.METData() gives an overview of the MET data created.

summary(METdata_indica)
# object of class 'METData' 
# --------------------------
# General information about the MET data 
# 
# No. of unique environments represented in the data: 3 
#        unique years represented in the data: 3 
#        unique locations represented in the data: 1 
# 
# Distribution of phenotypic observations according to year and location: 
#       
#        TreintayTres
#   2010          327
#   2011          327
#   2012          327
# No. of unique genotypes which are phenotyped
#  327 
# --------------------------
# Climate variables 
# Weather data extracted from NASAPOWER for main weather variables?
# NO
# No. of climate variables available: 54 
# --------------------------
# Soil variables 
# No. of soil variables available: 0 
# --------------------------
# Phenotypic data 
#    No. of traits:  3 
# 
#        GY             PHR               GC         
#  Min.   : 6523   Min.   :0.4257   Min.   :0.00000  
#  1st Qu.:10179   1st Qu.:0.5950   1st Qu.:0.02241  
#  Median :10825   Median :0.6097   Median :0.03450  
#  Mean   :10922   Mean   :0.6055   Mean   :0.03631  
#  3rd Qu.:11605   3rd Qu.:0.6220   3rd Qu.:0.04752  
#  Max.   :15388   Max.   :0.6816   Max.   :0.22306  
# --------------------------
# Genotypic data 
#    No. of markers : 92430 
# --------------------------
# Map data
#    No. of chromosomes     12 
# 
#    markers per chromosome 
#   
#     1     2     3     4     5     6     7     8     9    10    11    12 
# 11431 10435  8807 10943  5892  6326  5643  6365  7196  5390  8215  5787

Step 2: Cross-validated model evaluation

Specifying processing parameters and the type of prediction model

The goal of predict_trait_MET_cv() is to assess a given prediction method using a certain type of cross-validation (CV) scenario on the complete training set. The CV schemes covered by the package correspond to those generally evaluated in related literature on MET (Jarquı́n et al. (2014); Jarquı́n et al. (2017);Costa-Neto, Fritsche-Neto, and Crossa (2021)).
Here, we will use the CV0: predicting the performance of genotypes in untested environment(s), i.e. no phenotypic observations from these environment(s) included in the training set, and more specifically, the leave-one-year-out scenario.

The function predict_trait_MET_cv() also allows to specify a specific subset of environmental variables from the METData$env_data object to be used in model fitting and predictions via the argument list_env_predictors.

How does predict_trait_MET_cv() works? When predict_trait_MET_cv() is executed, a list of training/test splits is constructed according to the CV scheme chosen by the user. Each training set in each sub-element of this list is processed (e.g. standardization and removal of predictors with null variance.

The function applies a nested CV to obtain an unbiased generalization performance estimate, implying an inner loop CV nested in an outer CV. The inner loop is used for model selection, i.e. hyperparameter tuning with Bayesian optimization, while the outer loop is responsible for evaluation of model performance. Additionnally, there is a possibility for the user to specify a seed to allow reproducibility of analyses. If not provided, a random one is generated and provided in the results file.
In predict_trait_MET_cv(), predictive ability is always calculated within the same environment (location–year combination), regardless of how the test sets are defined according to the different CV schemes.

Let’s use a meta-learner to predict a phenotypic trait!

Model stacking corresponds to an ensemble method which exploits many well-working models on a classification or regression task by learning how to combine outputs of these base learners to create a new model. Combined models can be very different: for instance, K-nearest neighbors, support vector machines and random forest models, trained on resamples with different hyperparameters can be used jointly in a stacked generalization ensemble.

In the example below, we run a cross-validation of type CV0 with this method:

rescv0_1 <- predict_trait_MET_cv(
  METData = METdata_indica,
  trait = 'GC',
  prediction_method = 'stacking_reg_1',
  use_selected_markers = F,
  lat_lon_included = F,
  year_included = F,
  cv_type = 'cv0',
  cv0_type = 'leave-one-year-out',
  kernel_G = 'linear',
  include_env_predictors = T,
  save_processing  = T,
  seed = 100,
  path_folder = '~/INDICA/stacking/cv0'
)

Note that many methods for processing data based on user-defined parameters and machine learning-based methods are using functions from the tidymodels (https://www.tidymodels.org/) collection of R packages (Kuhn and Wickham (2020)). We use here in particular the package stacks (https://stacks.tidymodels.org/index.html), incorporated in the tidymodels framework to build these stacking models.

Extraction of results from the output

Let’s have a look now at the structure of the output rescv0_1.

The object of class list, met_cv contains 3 elements: list_results_cv, seed_used and cv_type.

list_results_cv

The first element list_results_cv is a list of res_fitted_split objects, and has a length equal to the number of training/test sets partitions, which is determined by the cross-validation scenario.

Above, we ran predict_trait_MET_cv() with a leave-one-year-out CV scheme, and the dataset contains three years of observations. Therefore, we expect the length of to be equal to 3:

cat(length(rescv0_1$list_results_cv))
# 3
cat(class(rescv0_1$list_results_cv[[1]]))
# res_fitted_split

Within each of this res_fitted_split object, 9 elements are provided:

# prediction_method ; parameters_collection_G ; parameters_collection_E ; predictions_df ; cor_pred_obs ; rmse_pred_obs ; training ; test ; vip

The first sub-element precises which prediction method was called:

cat(paste(rescv0_1 $list_results_cv[[1]]$prediction_method))
# stacking_reg_1

For example, to retrieve the data.frame containing predictions for the test set, we extract the predictions_df subelement (predictions in column named “.pred”):

predictions_year_2010 <- rescv0_1 $list_results_cv[[1]]$predictions_df
predictions_year_2011 <- rescv0_1 $list_results_cv[[2]]$predictions_df
predictions_year_2012 <- rescv0_1 $list_results_cv[[3]]$predictions_df

To directly look at the Pearson correlations between predicted and observed values for each environment, we extract the cor_pred_obs.

cor_year_2010 <- rescv0_1 $list_results_cv[[1]]$cor_pred_obs
cor_year_2011 <- rescv0_1 $list_results_cv[[2]]$cor_pred_obs
cor_year_2012 <- rescv0_1 $list_results_cv[[3]]$cor_pred_obs

head(cor_year_2010)
# # A tibble: 1 x 2
#   IDenv               COR
#   <chr>             <dbl>
# 1 TreintayTres_2010 0.340
head(cor_year_2011)
# # A tibble: 1 x 2
#   IDenv               COR
#   <chr>             <dbl>
# 1 TreintayTres_2011 0.399
head(cor_year_2012)
# # A tibble: 1 x 2
#   IDenv               COR
#   <chr>             <dbl>
# 1 TreintayTres_2012 0.372

To get the root mean square error between predicted and observed values for each environment, we extract the rmse_pred_obs.

rmse_year_2010 <- rescv0_1 $list_results_cv[[1]]$rmse_pred_obs
rmse_year_2011 <- rescv0_1 $list_results_cv[[2]]$rmse_pred_obs
rmse_year_2012 <- rescv0_1 $list_results_cv[[3]]$rmse_pred_obs
head(rmse_year_2010)
# [1] 1.583467
head(rmse_year_2011)
# [1] 2.490381
head(rmse_year_2012)
# [1] 2.314676

For a stacked prediction model, one can have a look at the coefficients assigned by the LASSO metalearner to the predictions obtained from the single base learners (here, support vector machine regression models) which respectively use genomic information:

head(rescv0_1 $list_results_cv[[3]]$parameters_collection_G)
#          member         cost      coef
# 1 svm_res_G_1_1 3.126802e-02        NA
# 2 svm_res_G_1_2 2.549168e+00        NA
# 3 svm_res_G_1_3 9.875661e-04 0.2918148
# 4 svm_res_G_1_4 3.368297e-01        NA
# 5 svm_res_G_1_5 1.382958e+01        NA
# 6 svm_res_G_1_6 6.927184e-03        NA

… and based on environmental information:

head(rescv0_1 $list_results_cv[[3]]$parameters_collection_E)
#          member        cost        sigma      coef
# 1 svm_res_E_1_1 0.648700227 2.924271e-08 0.0000000
# 2 svm_res_E_1_2 0.026507046 9.279021e-07 0.0000000
# 3 svm_res_E_1_3 2.398485546 7.941728e-08 0.0000000
# 4 svm_res_E_1_4 0.286690960 1.587980e-03 0.6687404
# 5 svm_res_E_1_5 0.007003367 5.721131e-01 0.0000000
# 6 svm_res_E_1_6 0.001157208 2.387000e-10 0.0000000

seed_used

To get the seed used for building the train/test partitions (especially useful for CV1 and CV2):

cat(rescv0_1 $seed_used)
# 100

cv_type

To get the type of CV used in the evaluation by predict_trait_MET_cv():

cat(rescv0_1 $cv_type)
# cv0_leave-one-year-out

Step 3: Comes soon!

Step 4: Comes soon!

References

AlKhalifah, Naser, Darwin A Campbell, Celeste M Falcon, Jack M Gardiner, Nathan D Miller, Maria Cinta Romay, Ramona Walls, et al. 2018. “Maize Genomes to Fields: 2014 and 2015 Field Season Genotype, Phenotype, Environment, and Inbred Ear Image Datasets.” BMC Research Notes 11 (1): 1–5.

Biecek, Przemyslaw. 2018. “DALEX: Explainers for Complex Predictive Models in r.” Journal of Machine Learning Research 19 (84): 1–5. https://jmlr.org/papers/v19/18-416.html.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Chen, Tianqi, and Carlos Guestrin. 2016. “Xgboost: A Scalable Tree Boosting System.” In Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, 785–94.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2015. “Xgboost: Extreme Gradient Boosting.” R Package Version 0.4-2 1 (4): 1–4.

Costa-Neto, Germano, Roberto Fritsche-Neto, and José Crossa. 2021. “Nonlinear Kernels, Dominance, and Envirotyping Data Increase the Accuracy of Genome-Based Prediction in Multi-Environment Trials.” Heredity 126 (1): 92–106.

De Los Campos, Gustavo, Paulino Pérez-Rodrı́guez, Matthieu Bogard, David Gouache, and José Crossa. 2020. “A Data-Driven Simulation Platform to Predict Cultivars’ Performances Under Uncertain Weather Conditions.” Nature Communications 11 (1): 1–10.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2019. “All Models Are Wrong, but Many Are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously.” J. Mach. Learn. Res. 20 (177): 1–81.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.

Hartigan, John A, and Manchek A Wong. 1979. “Algorithm AS 136: A k-Means Clustering Algorithm.” Journal of the Royal Statistical Society. Series c (Applied Statistics) 28 (1): 100–108.

Heslot, Nicolas, Deniz Akdemir, Mark E Sorrells, and Jean-Luc Jannink. 2014. “Integrating Environmental Covariates and Crop Modeling into the Genomic Selection Framework to Predict Genotype by Environment Interactions.” Theoretical and Applied Genetics 127 (2): 463–80.

Holzworth, Dean P, Neil I Huth, Peter G deVoil, Eric J Zurcher, Neville I Herrmann, Greg McLean, Karine Chenu, et al. 2014. “APSIM–Evolution Towards a New Generation of Agricultural Systems Simulation.” Environmental Modelling & Software 62: 327–50.

Jamieson, PD, MA Semenov, IR Brooking, and GS Francis. 1998. “Sirius: A Mechanistic Model of Wheat Response to Environmental Variation.” European Journal of Agronomy 8 (3-4): 161–79.

Jarquı́n, Diego, José Crossa, Xavier Lacaze, Philippe Du Cheyron, Joëlle Daucourt, Josiane Lorgeou, François Piraux, et al. 2014. “A Reaction Norm Model for Genomic Selection Using High-Dimensional Genomic and Environmental Data.” Theoretical and Applied Genetics 127 (3): 595–607.

Jarquı́n, Diego, Cristiano Lemes da Silva, R Chris Gaynor, Jesse Poland, Allan Fritz, Reka Howard, Sarah Battenfield, and Jose Crossa. 2017. “Increasing Genomic-Enabled Prediction Accuracy by Modeling Genotype$\times$ Environment Interactions in Kansas Wheat.” The Plant Genome 10 (2): 1–15.

Keating, Brian A, Peter S Carberry, Graeme L Hammer, Mervyn E Probert, Michael J Robertson, D Holzworth, Neil I Huth, et al. 2003. “An Overview of APSIM, a Model Designed for Farming Systems Simulation.” European Journal of Agronomy 18 (3-4): 267–88.

Kuhn, Max, Kjell Johnson, et al. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2021. Cluster: Cluster Analysis Basics and Extensions. https://CRAN.R-project.org/package=cluster.

McFarland, Bridget A, Naser AlKhalifah, Martin Bohn, Jessica Bubert, Edward S Buckler, Ignacio Ciampitti, Jode Edwards, et al. 2020. “Maize Genomes to Fields (G2f): 2014–2017 Field Seasons: Genotype, Phenotype, Climatic, Soil, and Inbred Ear Image Datasets.” BMC Research Notes 13 (1): 1–6.

Molnar, Christoph. 2022. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed. Available online, accessed on 30 March 2022. https://christophm.github.io/interpretable-ml-book.

Molnar, Christoph, Bernd Bischl, and Giuseppe Casalicchio. 2018. “Iml: An r Package for Interpretable Machine Learning.” JOSS 3 (26): 786. https://doi.org/10.21105/joss.00786.

Monteverde, Eliana, Lucı́a Gutierrez, Pedro Blanco, Fernando Pérez de Vida, Juan E Rosas, Victoria Bonnecarrère, Gastón Quero, and Susan McCouch. 2019. “Integrating Molecular Markers and Environmental Covariates to Interpret Genotype by Environment Interaction in Rice (Oryza Sativa l.) Grown in Subtropical Areas.” G3: Genes, Genomes, Genetics 9 (5): 1519–31.

Poggio, Laura, Luis M De Sousa, Niels H Batjes, Gerard Heuvelink, Bas Kempen, Eloi Ribeiro, and David Rossiter. 2021. “SoilGrids 2.0: Producing Soil Information for the Globe with Quantified Spatial Uncertainty.” Soil 7 (1): 217–40.

Rincent, Renaud, Marcos Malosetti, Behnam Ababaei, Gaetan Touzy, Agathe Mini, M Bogard, Pierre Martre, Jacques Le Gouis, and F van Eeuwijk. 2019. “Using Crop Growth Model Stress Covariates and AMMI Decomposition to Better Predict Genotype-by-Environment Interactions.” Theoretical and Applied Genetics 132 (12): 3399–3411.

Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53–65.

Sparks, Adam H. 2018. “Nasapower: A NASA POWER Global Meteorology, Surface Solar Energy and Climatology Data Client for r.” Journal of Open Source Software, 3 (30), 1035.

Technow, Frank, Carlos D Messina, L Radu Totir, and Mark Cooper. 2015. “Integrating Crop Growth Models with Whole Genome Prediction Through Approximate Bayesian Computation.” PloS One 10 (6): e0130855.

Van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

Westhues, Cathy C., Henner Simianer, and Timothy M. Beissinger. 2021. “learnMET: An r Package to Apply Machine Learning Methods for Genomic Prediction Using Multi-Environment Trial Data.” bioRxiv. https://doi.org/10.1101/2021.12.13.472185.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

Zotarelli, Lincoln, Michael D Dukes, Consuelo C Romero, Kati W Migliaccio, and Kelly T Morgan. 2010. “Step by Step Calculation of the Penman-Monteith Evapotranspiration (FAO-56 Method).” Institute of Food and Agricultural Sciences. University of Florida.

Evaluation of the model performance of a stacked model using CV0 on a rice dataset for phenotypic prediction

Cathy Westhues

2022-09-29