Phenotypic data
The data are composed of 265 single cross hybrids from the maize breeding program of Embrapa Maize and Sorghum evaluated in eight combinations of trials/locations/years under irrigated trials (WW) and water stress (WS) conditions at two locations in Brazil (Janaúba—Minas Gerais and Teresina—Piauí) over two years (2010 and 2011). The hybrids were obtained from crosses between 188 inbred lines and two testers. The inbred lines belong to heterotic groups: dent (85 inbred lines), flint (86 inbred lines), and an additional group, referred to as group C (17 inbred lines), which is unrelated to the dent and flint origins. The two testers are inbred lines belonging to the flint (L3) and dent (L228-3) groups. Among the inbred lines, 120 were crossed with both testers, 52 were crossed with the L228-3 tester only, and 16 lines were crossed with the L3 tester only. Silva et al. (2020) evaluated the genetic diversity and heterotic groups in the same database. These authors showed the existence of subgroups within each heterotic group. Therefore, once these groups were not genetically well defined and the breeding program from Embrapa Maize was in the beginning, the effect of allelic substitution in both groups are assumed to be the same. More details on the experimental design and procedures can be found in Dias et al.13,30.
The experiment originally included 308 entries, but hybrids that were not present in all environments were also removed to evaluate the genomic prediction within each environment, resulting in a total of 265 hybrids for analysis. Each trial consisted of 308 maize single cross hybrids, randomly divided into six sets: sets 1–3 for crosses with L3 (61, 61, and 14 hybrids each), and sets 4–6 for crosses with L228-3 (80, 77, and 15 hybrids each). Four checks (commercial maize cultivars) were included in each set, and the experiment was designed in completely randomized blocks. Between trials, hybrids within each set remained the same, but hybrids and checks were randomly allocated into groups of plots within each set. This allocation varied between replicates of sets and between trials. The WS trials had three replications, except for the set containing 15 hybrids and the trials evaluated in 2010, which had two replications. All WW trials, except for the trial in 2011, had two replicates.
Two agronomic traits related to drought tolerance were analyzed: grain yield (GY) and female flowering time (FFT). GY was determined by weighing all grains in each plot, adjusted for 13% grain moisture, and converted to tons per hectare (t/ha), accounting for differences in plot sizes across trials. FFT was measured as the number of days from sowing until the stigmas appeared in 50% of the plants. A summary of means, standard deviations, and ranges of both evaluated traits are available in Table 1.
To conduct the analyses, hybrids considered as outliers were removed (i.e., hybrids that presented phenotypic values greater than 1.5 × interquartile range above the third quartile or below the first quartile) for the GY and FFT traits. The variations in predictive abilities among hybrids of T2, T1, and T0 are widely recognized31. However, the primary aim of our study was to compare different prediction methodologies in MET assays. In this study, there were 240 T2 hybrids and 68 T1 hybrids, with T2 hybrids had both parents evaluated in different hybrid combinations, while hybrids being single-cross hybrids sharing one parent with the tested hybrids. Given the realistic nature of our scenario, we have a limited and imbalanced distribution of these hybrid groups, making a fair comparison challenging. Consequently, we opted to construct a training set comprising T2 and T0 hybrids.
Statistical analysis of phenotypic data
To correct the phenotypic values for experimental design effects, each trial (WW and WS) and environment were analyzed independently to obtain the Best Linear Unbiased Estimator (eBLUEs) for each hybrid, for the two traits evaluated. The estimates were obtained based on the following model:
$${\varvec{y}} = 1\mu + \user2{ X}_{1} {\varvec{r}} + {\varvec{X}}_{2} {\varvec{s}} + {\varvec{X}}_{3} {\varvec{h}} + {\varvec{e}}$$
(1)
where \(\user2{y }\left( {n \times 1} \right)\) is the phenotype vector for \(f\) replicates, \(t\) sets of \(p\) hybrids, and \(n\) is the number of observations; \(\mu\) is the mean; \(\user2{r }\left( {f \times 1} \right)\) is the fixed effect vector of the replicates; \(\user2{s }\left( {t \times 1} \right)\) is the fixed effect vector of the sets; \({\varvec{h}}\) \(\left( {p \times 1} \right)\) is the fixed effect vector of the hybrids; and \(\user2{e }\left( {k \times 1} \right)\) is the residue vector, with \(\user2{ e} \sim ,MVN\left( {0,{\varvec{I}}\sigma_{e}^{2} } \right)\), where \({\varvec{I}}\) is an identity matrix of corresponding order, and \(\sigma_{e}^{2}\) the residual variance. \({\varvec{X}}_{{1\user2{ }}} \left( {k \times f} \right)\), \({\varvec{X}}_{{2\user2{ }}} \left( {k \times t} \right)\) e \({\varvec{X}}_{{3\user2{ }}} \left( {k \times p} \right)\) represents incidence matrices for their respective effects. The eBLUES of each environment were used in further analyses.
Genotypic data
A total of 57,294 Single Nucleotide Polymorphisms (SNPs) markers were obtained from 188 inbred lines, and two testers used as parents of the 265 single cross hybrids. The genotyping by sequencing (GBS) strategyare detailed in Dias et al.13. For the quality control, SNPs were discarded if: the minor allele frequency was smaller than 5%, more than 20% of missing genotypes were found, and/or there were more than 5% of heterozygous genotypes. After filtering, missing data were imputed using NPUTE. Then, for each SNP, the genotypes of the hybrids were inferred based on the genotype of their parents (inbred line and tester). The number of SNPs per chromosome ranged from 3121 (chromosome 10) to 7705 (chromosome 1), totalizing 47,127 markers.
Genomic relationship matrix
The additive and dominance genomic relationship matrices were constructed32 based on information from the SNPs using the package AGHmatrix33, following VanRaden34 and Vitezica et al., respectively.
Genomic prediction
Genomic predictions were performed using the Genomic Best Line Unbiased Prediction (GBLUP) method using the package AsReml v. 436. Two groups were considered: the first group comprised four environments under WW conditions, and the second included four environments under WS conditions. The linear model is described below:
$$\overline{\user2{y}} = \mu 1 + \user2{ Xb} + {\varvec{Z}}_{1} {\varvec{u}}_{{\varvec{a}}} + {\varvec{Z}}_{2} {\varvec{u}}_{{\varvec{d}}} + {\varvec{e}}$$
(2)
where \(\user2{\overline{y} }\left( {pq \times 1} \right)\) is the vector of eBLUES previously estimated for each environment with \(p\) hybrids and \(q\) environments;\(\mu\) is the mean; \(\user2{b }\left( {q \times 1} \right)\) is the vector of environmental effects (fixed); \({\varvec{u}}_{{\user2{a }}} \left( {pq \times 1} \right)\) is the vector of individual additive genetic values nested within environments (random), with \({\varvec{u}}_{{\varvec{a}}} \sim MVN\left( {0,\left[ {{\varvec{I}}_{{\varvec{q}}} \sigma_{{u_{a} }}^{2} + \rho_{a} \left( {{\varvec{J}}_{{\varvec{q}}} – {\varvec{I}}_{{\varvec{q}}} } \right)} \right] \otimes {\varvec{A}}} \right)\), where \({\varvec{A}}\) is the genomic relationship matrix between individuals for additive effects, \(\rho_{a}\) is the additive genetic correlation coefficient between environments, \({\varvec{I}}_{{\varvec{q}}} \user2{ }\left( {q \times q} \right)\) is an identity matrix, \({\varvec{J}}_{{\varvec{q}}} \user2{ }\left( {q \times q} \right)\) is a matrix of ones, and \(\otimes\) denotes the Kronecker product; \({\varvec{u}}_{{\varvec{d}}}\) \(\left( {pq \times 1} \right)\) is the vector of individual dominance genetic values nested within environments (random), with \({\varvec{u}}_{{\varvec{d}}} \sim MVN\left( {0,\left[ {{\varvec{I}}_{{\varvec{q}}} \sigma_{{u_{d} }}^{2} + \rho_{d} \left( {{\varvec{J}}_{{\varvec{q}}} – {\varvec{I}}_{{\varvec{q}}} } \right)} \right] \otimes {\varvec{D}}} \right)\), where \({\varvec{D}}\) is the genomic relationship matrix between individuals for dominance effects, \(\rho_{{\varvec{d}}}\) is the dominance correlation coefficient between environments; \({\varvec{e}}\) \(\left( {pq \times 1} \right)\) is the random residuals vector with \({\varvec{e}}\sim MVN\left( {0,{\varvec{I}}\sigma_{e}^{2} } \right)\). The capital letters \(\user2{X }\left( {pq \times q} \right),\user2{ Z}_{1} \left( {pq \times pq} \right)\) and \({\varvec{Z}}_{2} \user2{ }\left( {pq \times pq} \right)\) represent the incidence matrices for their respective effects, \(1\user2{ }\left( {pq \times 1} \right)\) is a vector of ones. The (co)variance components were obtained using the residual maximum likelihood method (REML)37.
Two alternative models were also used. The first for genomic prediction retained only additive effects by removing \({\varvec{u}}_{{\varvec{d}}}\) from Eq. (2). The second model was used to estimate the genetic parameters within each environment separately.
The significance of random effects was tested using the Likelihood Ratio Test (LRT)38, given by:
$$LRT = 2*\left( {LogL_{c} – LogL_{r} } \right)$$
(3)
where \(LogL_{c}\) is the logarithm of the likelihood function of the complete model (with all effects included), and \(LogL_{r}\) is the logarithm of the restricted likelihood function of the reduced model (without the effect under test). Effect significance was tested by LRT using the chi-square (X2) probability density function with a degree of freedom and significance level of 5%39.
The narrow-sense heritability (\({ }h^{2}\)), the proportion of variance explained by dominance effects (\(d^{2}\)), and the broad-sense heritability \(\left( {H^{2} } \right)\) for each trait were estimated following Falconer and Mackay 199635.
Machine learning
Similar to the previous topic, the trials were divided between WW and WS conditions, and the potential of regression trees (RT) was explored using the following three algorithms: bagging, random forest, and boosting22. Bagging (Bag) is a methodology that aims to reduce the RT variance22. In other words, it consists of obtaining D samples with available sampling replacement, thus obtaining D models \(\hat{f}^{1} \left( x \right), \hat{f}^{2} \left( x \right), \ldots , \hat{f}^{D} \left( x \right)\), and finally use the generated models to obtain an average, given by:
$$\hat{f}_{medio} \left( x \right) = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \hat{f}^{d} \left( x \right)$$
(4)
This decreases the variability obtained in the decision trees. The number of trees used in Bag is not a parameter that will result in overfitting of the model. In practice, a number of trees is used until the error has stabilized22. The number of trees sampled for Bag was set at 500 trees.
Random forest (RF) was proposed by HO40 and it is an improvement of Bag to avoid the high correlation of the trees and to improve the accuracy in the selection of individuals. RF changes only the number of predictor variables used in each split. That is, each time a split in a tree is considered, a random sample of \(m\) variables is chosen as candidates from the complete set of \(p\) variables. Hastie et al.21 suggest that the number of predictor variables used in each partition is equal to \(m = \frac{p}{3}\) for regression trees. The number of trees for the RF was set at 500.
Boosting uses RT by adjusting the residual of an initial model. The residual is updated with each tree that grows sequentially from the previous tree’s residual, and the response variable involves a combination of a large number of trees, such that:
$$\hat{f}\left( x \right) = \mathop \sum \limits_{b = 1}^{B} {\uplambda } \hat{f}^{b} \left( x \right)$$
(5)
The function \(\hat{f}\left( . \right)\) refers to the final tree combined with sequentially adjusted trees, and λ is the shrinkage parameter that controls the learning rate of the method. Furthermore, this method needs to be adjusted with several splits in each of the trees. This parameter controls the complexity of the Boost and is known as the depth. For Boosting, the number of trees sampled was 250, with a learning rate of 0.1 and a depth of 3.
To perform hybrid prediction for each environment based on MET dataset, we propose the incorporation of location and year information in which the experiments were carried out as factors in the data input file together with SNPs markers as predictors in machine learning methodologies. As a response variable, the eBLUEs previously estimated by Eq. (1) were used.
For the construction of the bagging and random forest models, the randomForest function from the package randomForest41 was used. Finally, the package’s gbm function gbm42 was used for boosting. All analyzes were implemented in the software R43.
Model validation
Genomic predictions were carried out following Burgueño et al.16, considering two different prediction problems, CV1 and CV2, which simulate two possible scenarios a breeder can face. In CV1, the ability of the algorithms to predict the performance of hybrids that have not yet been evaluated in any field trial was evaluated. Thus, predictions derived from the CV1 scenario are entirely based on phenotypic and genotypic records from other related hybrids. In CV2, the ability of the algorithms to predict the performance of hybrids using data collected in other environments was evaluated. It simulates the prediction problem found in incomplete MET trials. Here, information from related individuals is used, and the prediction can benefit from genetic relationships between hybrids and correlations between environments. Within the CV2 scenario, two different situations of data imbalance were evaluated. In the first, called CV2 (50%), the tested hybrids were not present in half of the environments, while in the second, called CV2 (25%), the tested hybrids were not present in only 25% of the environments. Table 2 provides a hypothetical representation of this CV1, CV2 (50%), and CV2 (25%) validation scheme.
To separate the training and validation sets, the k-folds procedure was used, considering \(k = 5\). The set of 265 hybrids was divided into five groups, with 80% of the hybrids considered as the training population, and the remaining 20% hybrids considered as the validation population. The hybrids were separated into sets proportionally containing all the crosses performed (Dent × Dent, Dent × Flint, Flint × Flint, C × Dent, C × Flint). The cross-validation process was performed separately for each trait, condition (WS or WW) and scenario (CV1, CV2-50% and CV2-25%) and was repeated five times to assess the predictive ability of the analyses.
The predictive ability within each environment for the conditions (WS and WW) was estimated by the Pearson correlation coefficient44 between the corrected phenotypic values (eBLUES) of Eq. (1) for each environment and the GEBVs predicted by each fitted method.
Ethics statement
The authors confirm that all methods were carried out by relevant guidelines in the method section. The authors also confirm that the handling of the plant materials used in the study complies with relevant institutional, national, and international guidelines and legislation.
Statement of handling of plants
The authors confirm that the appropriate permissions and/or licenses for collection of plant or seed specimens are taken.