Processing of a split object to prepare the data for fitting an rf_reg_3 (random forest) regression model.

The function processes a split object (training and test sets) according to the configuration set by the user. For instance, genomic information is incorporated according to the option chosen by the user, and a list of specific environmental covariables to use can be provided.

A recipe is created with the recipes package to specify additional preprocessing steps, such as standardization based on the training set, with the same transformations then applied to the test set. Variables with zero variance are removed. If the year effect is included, it is converted to dummy variables.
The model is subsequently fitted on the training set with a random forest model (see function fit_cv_split.rf_reg_3()).
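The preprocessing described above can be sketched with the recipes package as follows. This is only an illustration on toy data: the column names, the data values and the order of the steps are assumptions, not the recipe actually built by rf_reg_3().

```r
library(recipes)

# Toy training and test sets (hypothetical data, not produced by the package)
training <- data.frame(
  yield    = c(5.1, 6.3, 5.8, 7.0, 6.1, 5.5),
  PC1      = c(-1.2, 0.4, 0.9, -0.3, 1.1, -0.9),
  rainfall = c(300, 420, 380, 510, 450, 330),
  constant = rep(1, 6),  # zero-variance column, expected to be dropped
  year     = factor(c("y2019", "y2019", "y2020", "y2020", "y2021", "y2021"))
)
test <- data.frame(
  yield    = c(6.0, 5.2),
  PC1      = c(0.2, -0.5),
  rainfall = c(400, 350),
  constant = c(1, 1),
  year     = factor(c("y2020", "y2021"), levels = levels(training$year))
)

# Declare the steps; nothing is estimated yet
rec <- recipe(yield ~ ., data = training)
rec <- step_zv(rec, all_predictors())                 # remove zero-variance variables
rec <- step_normalize(rec, all_numeric_predictors())  # center and scale numeric predictors
rec <- step_dummy(rec, year)                          # convert the year factor to dummy variables

# Estimate means and standard deviations on the training set only
prepped <- prep(rec, training = training)

# Apply the same, training-based transformations to both sets
train_processed <- bake(prepped, new_data = NULL)
test_processed  <- bake(prepped, new_data = test)
```

Because the centering and scaling parameters are estimated by prep() on the training set alone, the test set is transformed without using any of its own statistics.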

This prediction method can be very slow, depending on the number of SNP variables used.

new_rf_reg_3(
  split = NULL,
  trait = NULL,
  geno = NULL,
  env_predictors = NULL,
  info_environments = NULL,
  use_selected_markers = F,
  SNPs = NULL,
  include_env_predictors = T,
  list_env_predictors = NULL,
  lat_lon_included = F,
  year_included = F,
  ...
)

rf_reg_3(
  split,
  trait,
  geno,
  env_predictors,
  info_environments,
  use_selected_markers,
  SNPs,
  list_env_predictors,
  include_env_predictors,
  lat_lon_included,
  year_included,
  ...
)

validate_rf_reg_3(x, ...)

Arguments

split	An object of class `split`. A `split` object contains a training element and a test element.
trait	`character` Name of the trait to predict. An ordinal trait should be encoded as `integer`.
geno	`data.frame` It corresponds to a `geno` element within an object of class `METData`.
env_predictors	`data.frame` It corresponds to the `env_data` element within an object of class `METData`.
info_environments	`data.frame` It corresponds to the `info_environments` element within an object of class `METData`.
use_selected_markers	A `logical` indicating whether to use a subset of markers as predictor variables when main genetic effects are modeled with principal components. The markers are either identified via single-environment GWAS or taken from the table of marker effects obtained via Elastic Net. If `use_selected_markers` is `TRUE`, the `SNPs` argument should be provided. For more details, see `select_markers()`.
SNPs	A `data.frame` with the genotype matrix (individuals in rows and selected markers in columns) for SNPs selected via the `select_markers()` function. Optional argument, can remain as `NULL` if no single markers should be incorporated as predictor variables in analyses based on PCA decomposition.
include_env_predictors	A `logical` indicating whether environmental covariates characterizing each environment should be used in predictions.
list_env_predictors	A `character` vector containing the names of the environmental predictors which should be used in predictions. Default is `NULL`, in which case all environmental predictors included in the `env_data` table of the `METData` object are used.
lat_lon_included	`logical` indicates if longitude and latitude data should be used as numeric predictors. Default is `FALSE`.
year_included	`logical` indicates if year factor should be used as predictor variable. Default is `FALSE`.
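The default behaviour of `list_env_predictors` can be illustrated with a small base R sketch. The helper `select_env_predictors()` and the identifier column name `IDenv` are hypothetical and used for illustration only; they are not part of the documented interface.

```r
# Hypothetical helper: keep only the requested environmental covariables.
# `id_col` names the environment identifier column (assumed to be "IDenv").
select_env_predictors <- function(env_data,
                                  list_env_predictors = NULL,
                                  id_col = "IDenv") {
  available <- setdiff(names(env_data), id_col)
  if (is.null(list_env_predictors)) {
    # Default: use every environmental predictor in the table
    keep <- available
  } else {
    unknown <- setdiff(list_env_predictors, available)
    if (length(unknown) > 0) {
      stop("Unknown environmental predictors: ",
           paste(unknown, collapse = ", "))
    }
    keep <- list_env_predictors
  }
  env_data[, c(id_col, keep), drop = FALSE]
}

# Toy environmental table (hypothetical values)
env_data <- data.frame(
  IDenv    = c("loc1_2019", "loc2_2019"),
  rainfall = c(300, 420),
  mean_T   = c(18.2, 20.1),
  GDD      = c(1450, 1600)
)

all_vars <- select_env_predictors(env_data)
subset   <- select_env_predictors(env_data, c("rainfall", "GDD"))
```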

Value

A list object of class rf_reg_3 with the following items:

training: data.frame Training set after partial processing
test: data.frame Test set after partial processing
rec: A recipe object specifying the remaining preprocessing steps, which are applied when a model is fitted on the training set.
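The structure of the returned object can be mimicked with a minimal S3 sketch in base R. The names `new_toy_rf_reg()` and `validate_toy_rf_reg()` are hypothetical and only mirror the constructor/validator naming shown in the usage; the checks performed by the package's own validate_rf_reg_3() are not described here and may differ.

```r
# Low-level constructor (hypothetical): assembles the list and sets the class
new_toy_rf_reg <- function(training, test, rec) {
  structure(
    list(training = training, test = test, rec = rec),
    class = c("toy_rf_reg", "list")
  )
}

# Validator (hypothetical): checks that each element has the expected type
validate_toy_rf_reg <- function(x) {
  stopifnot(
    is.data.frame(x$training),
    is.data.frame(x$test),
    inherits(x$rec, "recipe")
  )
  x
}

train <- data.frame(yield = c(5.1, 6.3, 5.8), PC1 = c(-1.2, 0.4, 0.9))
test  <- data.frame(yield = 6.0, PC1 = 0.2)
# Placeholder standing in for a real recipes::recipe() object
rec   <- structure(list(), class = "recipe")

obj <- validate_toy_rf_reg(new_toy_rf_reg(train, test, rec))
names(obj)  # "training" "test" "rec"
```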

References

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, others (2019). “Welcome to the Tidyverse.” Journal of Open Source Software, 4(43), 1686.

Kuhn M, Wickham H (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org.