Create a multi-environment trials data object — create

This function combines all data sources (genotypic data, phenotypic data, information about the environments and, if available, environmental data) into a single data object of class METData.

new_create_METData(
  geno = NULL,
  map = NULL,
  pheno = NULL,
  info_environments = NULL,
  raw_weather_data = NULL,
  climate_variables = NULL,
  soil_variables = NULL,
  compute_climatic_ECs = FALSE,
  path_to_save = NULL,
  as_test_set = FALSE,
  get_public_soil_data = FALSE,
  ...
)

create_METData(
  geno = NULL,
  pheno = NULL,
  info_environments = NULL,
  map = NULL,
  climate_variables = NULL,
  compute_climatic_ECs = FALSE,
  soil_variables = NULL,
  raw_weather_data = NULL,
  path_to_save = NULL,
  ...
)

validate_create_METData(x, ...)

Arguments

geno	`numeric` genotype values stored in a `matrix` or `data.frame` which contains the geno_ID as row.names and markers as columns.
map	`data.frame` object with 3 columns: marker (`character`, marker names); chr (`numeric`, chromosome number); pos (`numeric`, marker position). The map object is not mandatory.
pheno	`data.frame` object with at least 4 columns: geno_ID (`character`, genotype identifiers); year (`numeric`, year of the observation); location (`character`, name of the location). From the fourth column on, each column is `numeric` and contains the phenotypic values of one trait observed in a Year x Location combination. Trait names should be provided as column names. The geno_ID values must be a subset of the row.names of the geno object.
info_environments	`data.frame` object with at least the following 4 columns: year (`numeric`, year label of the environment); location (`character`, name of the location); longitude (`numeric`, longitude of the environment); latitude (`numeric`, latitude of the environment). The planting.date and harvest.date columns are required only if weather data should be retrieved from NASA POWER by setting the argument `compute_climatic_ECs` to TRUE, or if raw weather data are provided: planting.date (`Date`, YYYY-MM-DD); harvest.date (`Date`, YYYY-MM-DD). An elevation column (`numeric`) is optional. The data.frame should contain one row per Year x Location combination used in pheno.
raw_weather_data	`data.frame` can be left as NULL if no daily weather datasets are available. Otherwise, the following columns are required (column names must be respected): longitude (`numeric`); latitude (`numeric`); year (`numeric`); location (`character`); YYYYMMDD (`Date`). The weather variables provided by the user must be a subset of the following, with column names given exactly as listed: T2M (`numeric`, daily mean temperature, °C); T2M_MIN (`numeric`, daily minimum temperature, °C); T2M_MAX (`numeric`, daily maximum temperature, °C); PRECTOTCORR (`numeric`, daily total precipitation, mm); RH2M (`numeric`, daily mean relative humidity, %); RH2M_MIN (`numeric`, daily minimum relative humidity, %); RH2M_MAX (`numeric`, daily maximum relative humidity, %); daily_solar_radiation (`numeric`, daily solar radiation, MJ/m^2/day); top_atmosphere_insolation (`numeric`, top-of-atmosphere insolation, MJ/m^2/day); T2MDEW (`numeric`, dew point, °C). Weather data do not have to be provided for ALL environments. If weather data are missing for some environments, they will be retrieved by the NASA POWER query.
climate_variables	`data.frame` can be left as NULL if no climate variables are provided as input. Otherwise, a `data.frame` should be provided with as many rows as the `info_environments` `data.frame`. Columns should be: year (`numeric`, year label); location (`character`, location name). Columns 3 onward should be numeric and contain the climate (weather-based) covariates. If climate_variables is provided, `compute_climatic_ECs` should be set to `FALSE`.
soil_variables	`data.frame` can be left as NULL if no soil variables are provided as input. Otherwise, a `data.frame` should be provided with as many rows as the `info_environments` `data.frame`. Columns should be: year (`numeric`, year label); location (`character`, location name). Columns 3 onward should be numeric and contain the soil-based environmental covariates.
compute_climatic_ECs	`logical` indicates if climatic covariates should be computed with the function. Default is `FALSE`. Set compute_climatic_ECs = `TRUE` to use weather data from NASA POWER, or if raw weather data are available and should be used. Field weather data may be provided for only some environments; weather data for the other environments present in the dataset will then be retrieved using the NASA POWER query.
path_to_save	Path where daily weather data (if retrieved) and plots based on k-means clustering are saved.
as_test_set	If using a prediction set (i.e. no phenotypic values for the new data to predict), should be set to TRUE. Default is FALSE.
get_public_soil_data	`logical` Indicates whether public soil data should be downloaded.
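
The input layouts described above can be sketched with toy data. This is a minimal base R illustration, not package code: all object values (genotype IDs, marker names, the location "Goettingen", coordinates, dates and yields) are invented, and the checks at the end only restate the consistency rules given in the argument descriptions.

```r
# Toy inputs shaped as described in the Arguments section (all values invented).
set.seed(1)

# geno: numeric matrix with genotype IDs as row names and markers as columns
geno <- matrix(
  as.numeric(sample(0:2, 4 * 3, replace = TRUE)),
  nrow = 4,
  dimnames = list(paste0("G", 1:4), paste0("M", 1:3))
)

# map: marker name, chromosome number, marker position
map <- data.frame(
  marker = colnames(geno),
  chr = c(1, 1, 2),
  pos = c(10.5, 42.0, 7.3),
  stringsAsFactors = FALSE
)

# pheno: geno_ID, year, location, then one numeric column per trait
pheno <- data.frame(
  geno_ID = rep(rownames(geno), times = 2),
  year = rep(c(2020, 2021), each = 4),
  location = "Goettingen",
  yield = round(rnorm(8, mean = 10), 2),
  stringsAsFactors = FALSE
)

# info_environments: one row per Year x Location combination
info_environments <- data.frame(
  year = c(2020, 2021),
  location = "Goettingen",
  longitude = 9.93,
  latitude = 51.53,
  planting.date = as.Date(c("2020-04-20", "2021-04-22")),
  harvest.date = as.Date(c("2020-10-01", "2021-10-05")),
  stringsAsFactors = FALSE
)

# Consistency rules stated in the argument descriptions
ids_ok <- all(pheno$geno_ID %in% rownames(geno))
n_env_pheno <- nrow(unique(pheno[, c("year", "location")]))
envs_ok <- n_env_pheno == nrow(info_environments)
stopifnot(ids_ok, envs_ok)
```

Running such checks before building the object makes mismatched identifiers or missing environments easier to spot than an error raised deep inside the pipeline.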

Value

A formatted list of class METData which contains the following elements:

geno: matrix with genotype values of phenotyped individuals.
map: data.frame with genetic map.
pheno: data.frame with phenotypic trait values.
compute_EC_by_geno: logical indicates if environmental covariates were required to be retrieved via the package by the user.
env_data: data.frame with the environmental covariates per environment
list_climatic_predictors: character with the names of the climatic predictor variables
list_soil_predictors: character with the names of the soil-based predictor variables
info_environments: data.frame contains basic information on each environment.
ECs_computed: logical subelement added in the output to indicate if the function get_ECs() was run within the pipeline.
climate_data_retrieved: logical subelement added in the output to indicate if NASAPOWER data were retrieved within the pipeline.
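
As a rough illustration of how the elements listed above might be used together, the sketch below builds a hand-made stand-in for a METData list and selects the climatic and soil covariates from env_data. The stand-in is not produced by the package: the covariate names (mean_temp, sum_precip, clay_pct), their values, the class vector, and the assumption that env_data columns are named after the entries of list_climatic_predictors and list_soil_predictors are all hypothetical.

```r
# Hand-built stand-in for a METData object (NOT created by the package).
# Only a few of the documented elements are mocked; covariate names are invented.
met <- structure(
  list(
    env_data = data.frame(
      year = c(2020, 2021),
      location = "Goettingen",
      mean_temp = c(15.2, 14.8),
      sum_precip = c(310, 355),
      clay_pct = c(22, 22),
      stringsAsFactors = FALSE
    ),
    list_climatic_predictors = c("mean_temp", "sum_precip"),
    list_soil_predictors = "clay_pct",
    ECs_computed = FALSE,
    climate_data_retrieved = FALSE
  ),
  class = c("METData", "list")  # assumed class vector
)

stopifnot(inherits(met, "METData"))

# Assumption: env_data columns are named after the predictor name vectors
climate_ecs <- met$env_data[, met$list_climatic_predictors, drop = FALSE]
soil_ecs <- met$env_data[, met$list_soil_predictors, drop = FALSE]

print(names(met))
print(colnames(climate_ecs))
```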

Author

Cathy C. Westhues cathy.jubin@uni-goettingen.de

Examples


data(geno_G2F)
data(pheno_G2F)
data(map_G2F)
data(info_environments_G2F)
data(soil_G2F)
# Create METData and get climate variables from NASAPOWER data & use soil variables
METdata_G2F <- create_METData(
  geno = geno_G2F,
  pheno = pheno_G2F,
  map = map_G2F,
  climate_variables = NULL,
  compute_climatic_ECs = TRUE,
  info_environments = info_environments_G2F,
  soil_variables = soil_G2F,
  path_to_save = "~/g2f_data"
)
#> No climate covariates provided by the user.
#> Warning: Coercing info_environments$planting.date to class 'POSIXct'.
#> Warning: Coercing info_environments$harvest.date to class 'POSIXct'.
#> Step 1: Processing/Retrieval of daily weather data starts!
#> Daily weather tables have been downloaded from NASA POWER for the required environments in a previous run, and are matching the environments ID/planting and harvest dates used in this analysis.
#>  These data will be used. 
#> Daily weather tables downloaded from NASA POWER for the required environments!
#> Step 1 is done!
#> Step 2: Aggregation of daily weather data into covariavate starts!
#> Step 2 is done!
#> Computation of environmental covariates is done.
#> Clustering of env. data starts.
#> Clustering of env. data done.
#> Soil and climate data will be included in the final METData object. 

data(geno_indica)
data(map_indica)
data(pheno_indica)
data(info_environments_indica)
data(climate_variables_indica)
METdata_indica <- create_METData(
  geno = geno_indica,
  pheno = pheno_indica,
  climate_variables = climate_variables_indica,
  compute_climatic_ECs = FALSE,
  info_environments = info_environments_indica,
  map = map_indica,
  path_to_save = "~/indica"
)
#> No soil covariates provided by the user.
#> Clustering of env. data starts.
#> Clustering of env. data done.

data(geno_japonica)
data(map_japonica)
data(pheno_japonica)
data(info_environments_japonica)
data(climate_variables_japonica)
METdata_japonica <- create_METData(
  geno = geno_japonica,
  pheno = pheno_japonica,
  climate_variables = climate_variables_japonica,
  compute_climatic_ECs = FALSE,
  info_environments = info_environments_japonica,
  map = map_japonica,
  path_to_save = "~/japonica"
)
#> No soil covariates provided by the user.
#> Clustering of env. data starts.
#> Clustering of env. data done.
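
None of the examples above supplies raw_weather_data. The following sketch shows what a small user-provided daily weather table could look like, based only on the column names given in the Arguments section. The location, coordinates, dates and measurements are invented, and the column check is a hypothetical pre-flight test written in base R, not a function of the package.

```r
# Toy daily weather table for one environment (all values invented).
dates <- seq(as.Date("2020-04-20"), as.Date("2020-04-24"), by = "day")

raw_weather_data <- data.frame(
  longitude = 9.93,
  latitude = 51.53,
  year = 2020,
  location = "Goettingen",
  YYYYMMDD = dates,
  T2M = c(9.1, 10.4, 12.0, 11.3, 8.7),
  T2M_MIN = c(4.2, 5.0, 6.8, 7.1, 3.9),
  T2M_MAX = c(14.3, 16.1, 18.2, 15.6, 12.8),
  PRECTOTCORR = c(0.0, 1.2, 0.0, 5.4, 0.3),
  stringsAsFactors = FALSE
)

# Column names taken from the raw_weather_data argument description
required_cols <- c("longitude", "latitude", "year", "location", "YYYYMMDD")
allowed_vars <- c(
  "T2M", "T2M_MIN", "T2M_MAX", "PRECTOTCORR",
  "RH2M", "RH2M_MIN", "RH2M_MAX",
  "daily_solar_radiation", "top_atmosphere_insolation", "T2MDEW"
)

# Hypothetical pre-flight check: required columns present, the remaining
# columns drawn from the allowed weather variables, dates stored as Date
weather_cols <- setdiff(colnames(raw_weather_data), required_cols)
stopifnot(
  all(required_cols %in% colnames(raw_weather_data)),
  all(weather_cols %in% allowed_vars),
  inherits(raw_weather_data$YYYYMMDD, "Date")
)

# Per the argument descriptions, such a table would be passed together with
# compute_climatic_ECs = TRUE, for example:
# create_METData(..., raw_weather_data = raw_weather_data,
#                compute_climatic_ECs = TRUE)
```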