Introduction

In this practice, we will show how to implement genomic selection/genomic prediction models. We will introduce rrBLUP, GBLUP, BayesA, and BayesB, as well as a cross-validation scheme.

To perform the analyses, we will need the following packages (if you do not have them installed, use the function install.packages("pack_name") to do so):
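For example, a one-time installation call for the packages loaded below could look like this (standard CRAN package names assumed):

# Run once if the packages are not yet installed
install.packages(c("AGHmatrix", "BGLR", "tidyverse"))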

rm(list=ls())
require(AGHmatrix)
## Loading required package: AGHmatrix
require(BGLR)
## Loading required package: BGLR
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
require(dplyr)
require(ggplot2)


Dataset

The dataset we will use was simulated with the coalescent theory implemented in the AlphaSimR package (Gaynor et al., 2021). It mimics an evaluation of 500 maize genotypes in four locations, with four simulated traits. In addition, a set of 3000 single nucleotide polymorphisms (SNPs) was randomly sampled across the 10 pairs of chromosomes.

Phenotypic data
We generated BLUEs (best linear unbiased estimates) for each trait, which can be loaded from Phenotypes.csv.

# Loading the dataset
Phenotypes = read.csv("Phenotypes.csv")

# Filtering for one environment and one trait
phenodata = Phenotypes[Phenotypes$Env == 1, c(1:3)]


Genomic data
The genomic data (3000 SNP markers) were coded as 0, 1, 2 for aa, Aa, AA, respectively. The dataset for all 500 genotypes can be loaded from Genotypes.csv.

# Loading the SNP data
Genotypes = as.matrix(read.csv("Genotypes.csv"))

# Genotype names
rownames(Genotypes)= phenodata$Genotype

# Genotypes
Genotypes[1:5,1:5]
##       X1_2 X1_7 X1_16 X1_19 X1_27
## 26662    2    0     2     0     2
## 30678    0    0     2     0     2
## 30129    2    0     2     0     2
## 28907    2    0     2     0     2
## 30522    2    0     2     0     2

Transforming Genotypes and Environment into factors

# As factor
phenodata$Genotype = phenodata$Genotype %>% as.factor
phenodata$Env = phenodata$Env %>% as.factor # Only one environment after filtering

Relationship matrix
Here, we will compose the additive relationship matrix (G matrix) following the propositions made by VanRaden (2008), as follows:

\[ G = {\frac{ZZ'}{2{\sum{p_i(1-p_i)}}}} \]

where \(Z\) is the matrix of marker genotypes centered by twice the allele frequency (\(2p_i\)), and \(p_i\) and \(1-p_i\) represent the frequencies of the \(A\) and \(a\) alleles at each locus. We divide by \(2\sum{p_i(1-p_i)}\) to scale \(G\) so that it is analogous to the numerator relationship matrix \(A\) (VanRaden, 2008).
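As a cross-check of the formula, the same kind of \(G\) can be assembled by hand from the marker matrix. This is a minimal sketch assuming a complete 0/1/2 matrix such as Genotypes (no MAF filtering or missing-data handling, so it will differ slightly from the AGHmatrix result below):

# Minimal sketch of VanRaden's G computed directly from the 0/1/2 marker matrix
p = colMeans(Genotypes)/2                         # frequency of the allele counted as 2
Z = scale(Genotypes, center = 2*p, scale = FALSE) # center each marker column by 2*p_i
G_manual = tcrossprod(Z)/(2*sum(p*(1-p)))         # ZZ' / (2*sum(p_i*(1-p_i)))
G_manual[1:4, 1:4]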

To compose the additive kernel (aka genomic relationship matrix), we will use the package AGHmatrix (Amadeu et al., 2016). This package uses the SNP information coded as 2, 1, 0 to create a relationship matrix between each pair of genotypes. It can also build other kernels, such as the dominance relationship matrix. In our case, we should pay attention to three parameters when we create the kernel:

Minor allele frequency (\(maf\)): This parameter is connected with the frequency of the alleles in the population. Markers whose minor allele is very rare carry little information and can add noise to the relationship estimates, so we can filter out those below a chosen frequency (a quick way to inspect allele frequencies and missing-data rates is sketched after this list).

Threshold: This parameter is connected with the quality of the SNP dataset. It represents the maximum proportion of missing data that we allow for both SNPs and genotypes.

Method: The method used to build the kernel. For additive kernels based on SNP data, the indicated method is the one from VanRaden (2008), as previously discussed.
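As mentioned above, these quantities can be inspected directly before building the kernel. A minimal sketch assuming the 0/1/2 coding of Genotypes, with NA marking missing calls:

# Quick inspection of minor allele frequency and missing data per SNP
p_snp = colMeans(Genotypes, na.rm = TRUE)/2  # frequency of the allele counted as 2
maf   = pmin(p_snp, 1 - p_snp)               # minor allele frequency per SNP
miss  = colMeans(is.na(Genotypes))           # proportion of missing calls per SNP
summary(maf)
summary(miss)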

# Additive matrix
GMat = AGHmatrix::Gmatrix(Genotypes, 
                          maf=0.05,          # Minor allele frequency
                          thresh = 0.8,      # threshold for missing data
                          method="VanRaden", # Kernel (i.e., additive, dominance, etc.)
                          missingValue = NA) # Missing data representation
## Initial data: 
##  Number of Individuals: 500 
##  Number of Markers: 3000 
## 
## Missing data check: 
##  Total SNPs: 3000 
##   0 SNPs dropped due to missing data threshold of 0.8 
##  Total of: 3000  SNPs 
## MAF check: 
##   545 SNPs dropped with MAF below 0.05 
##  Total: 2455  SNPs 
## Monomorphic check: 
##  No monomorphic SNPs 
## Summary check: 
##  Initial:  3000 SNPs 
##  Final:  2455  SNPs ( 545  SNPs removed) 
##  
## Completed! Time = 2.65  seconds


GBLUP

The first model implemented is a GBLUP in a Bayesian framework. For that, we will use the BGLR package (Pérez & Los Campos, 2014). In this package, we need an element named "ETA". In the ETA, we set the linear predictors of the model and their inputs, such as a relationship matrix (K) or marker data (X), together with the model specified for each term, such as BRR, RKHS, BayesA, BayesB, BayesC, etc.
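For instance, several linear predictors can be combined in a single ETA. The sketch below is illustrative only: W is a made-up covariate matrix (not part of this dataset) used just to show the syntax of a fixed-effect term next to the genomic kernel built above.

# Illustration only: an ETA with two linear predictors (not used in the models below)
W = matrix(rnorm(nrow(GMat)*2), ncol = 2)            # hypothetical fixed covariates
ETA2 = list(Fixed = list(X = W,    model = 'FIXED'), # fixed-effect term
            Add   = list(K = GMat, model = 'RKHS'))  # additive genomic kernel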

Statistical model

The GBLUP model was given as follows:

\[ y = 1u + Za + e \]

where \(y\) is the vector of BLUEs for the target trait; \(u\) represents the mean (fixed); \(a\) is the vector of additive genetic effects (assumed to be random), where \(a \sim N(0,G\sigma^2_a)\), \(G\) is the matrix that describes the genomic similarities between pairs of genotypes, and \(\sigma^2_a\) represents the additive genetic variance; and \(e\) is the vector of residuals (random), where \(e\sim N(0,I\sigma^2_{res})\), with \(\sigma^2_{res}\) being the residual variance. The matrix \(Z\) is the incidence matrix for \(a\).

###----------- Linear predictor for BGLR
ETA = list(A = list(K = GMat,         # K (relationship matrix)
                    model = 'RKHS'))  # Specify the model

###----------- Phenotypic data
Y = as.matrix(phenodata[,3])

###-----------  Model for trait 1
fmGBLUP = BGLR(y=Y,              # Phenotypic data
               ETA=ETA,          # Model ETA
               nIter=1200,       # Number of iterations for the model
               burnIn=120,       # Burnin iterations
               thin=5,           # Sampling throughout iterations
               verbose = FALSE)
#Mean estimated
fmGBLUP$mu
## [1] 155.4327
#Variance component for genic effect
fmGBLUP$ETA[[1]]$varU
## [1] 23.06351
#Variance component for the residual effect
fmGBLUP$varE
## [1] 43.71404
#Estimated breeding values (showing the first six)
head(fmGBLUP$yHat)
## [1] 167.9268 163.0976 161.4981 155.9664 162.0465 159.4487
#Correlation between observed phenotypes and estimated genetic values (model fit)
cor(as.matrix(phenodata$Trait1), fmGBLUP$ETA[[1]]$u, use = 'complete.obs')
##           [,1]
## [1,] 0.8522079
# Estimated breeding values (Genotype effect)
plot(fmGBLUP$ETA[[1]]$u, col=4, cex=.5, type='o', main='GBLUP')


rrBLUP

Statistical model

Ridge regression BLUP (rrBLUP) is a marker-effect model given as follows:

\[ y = 1u + Zm + e \]

where \(y\) is the vector of BLUEs for the target trait; \(u\) represents the mean (fixed); \(m\) is the vector of marker effects (assumed to be random), where \(m \sim N(0,I\sigma^2_m)\) and \(\sigma^2_m\) represents the marker genetic variance; and \(e\) is the vector of residuals (random), where \(e \sim N(0,I\sigma^2_{res})\), with \(\sigma^2_{res}\) being the residual variance. The matrix \(Z\) contains the SNP genotypes and acts as the incidence matrix for \(m\).

Now we can fit the rrBLUP model.

###----------- Linear predictor 
ETA = list(A = list(X = Genotypes,   # change from K (relationship matrix) to X (Markers) in ETA
                    model = 'BRR'))  # Specify the model
                                 
###----------- Phenotypic data
Y = as.matrix(phenodata[,3])

###----------- Model for trait 1
fmrrBLUP = BGLR(y=Y,             # Phenotypic data
               ETA=ETA,          # Model ETA
               nIter=1200,       # Number of iterations for the model
               burnIn=120,       # Burnin iterations
               thin=5,           # Sampling throughout iterations
               verbose = FALSE)
# Mean estimated
fmrrBLUP$mu
## [1] 149.1196
#Variance component for SNP effect
fmrrBLUP$ETA[[1]]$varB
## [1] 0.02493585
#Variance component for the residual effect
fmrrBLUP$varE
## [1] 43.91798
# SNP effects
length(fmrrBLUP$ETA[[1]]$b) #'b' slot
## [1] 3000
# Estimated effect values (SNP effect)

plot(fmrrBLUP$ETA[[1]]$b, col=4, cex=.5, type='o', main='rrBLUP')


Equivalence rrBLUP/GBLUP:

In the model above, we estimated the additive marker effects. From those, we can back-solve to obtain a genomic estimated breeding value (GEBV) for each genotype. The simplest way is to take the product of the SNP matrix and the vector of additive marker effects:

# Backsolve to GEBV values
GEBV = Genotypes %*% fmrrBLUP$ETA[[1]]$b 

# Data frame
GBLUP_rrBLUP = data.frame("Genotype" = phenodata$Genotype,
                          "GEBV_GBLUP" = fmGBLUP$ETA[[1]]$u,
                          "GEBV_rrBLUP" = GEBV)

# Estimated effect values (SNP effect)
ggplot(GBLUP_rrBLUP, aes(x = GEBV_GBLUP, y = GEBV_rrBLUP))+
  geom_point()
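To quantify the agreement shown in the plot, we can also compute the correlation between the two sets of predictions. They are expected to be very highly correlated, although not identical here, because the G matrix was built after MAF filtering while the back-solved GEBVs use all 3000 markers.

# Correlation between GBLUP and back-solved rrBLUP predictions
cor(GBLUP_rrBLUP$GEBV_GBLUP, GBLUP_rrBLUP$GEBV_rrBLUP)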


BayesA

###----------- Phenotypic data
Y = as.matrix(phenodata[,3])

###-----------  Model for trait 1
fmBA <- BGLR(y=Y,            # Phenotypic data
             ETA=list(list(X=Genotypes, model='BayesA')),          # Model ETA
             nIter=1200,       # Number of iterations for the model
             burnIn=120,       # Burnin iterations
             thin=5,           # Sampling throughout iterations
             verbose = FALSE)
# Mean estimated
fmBA$mu

# Variance component for the residual effect
fmBA$varE

# SNP effects
length(fmBA$ETA[[1]]$b) #'b' slot
plot((fmBA$ETA[[1]]$b),col=4, cex=.5, type='o', main='BayesA')
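As done for rrBLUP above, the BayesA marker effects can be back-solved into genotype-level predicted genetic values; a minimal sketch:

# Back-solving BayesA marker effects into genotype-level predictions
GEBV_BA = Genotypes %*% fmBA$ETA[[1]]$b
head(GEBV_BA)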


BayesB

###----------- Phenotypic data
Y = as.matrix(phenodata[,3])

###-----------  Model for trait 1
fmBB<-BGLR(y=Y,            # Phenotypic data
         ETA=list(A = list(X = Genotypes, model = 'BayesB')),          # Model ETA
         nIter=1200,       # Number of iterations for the model
         burnIn=120,       # Burnin iterations
         thin=5,           # Sampling throughout iterations
         verbose = FALSE)          
# Mean estimated
fmBB$mu
## [1] 155.5443
#Variance component for the residual effect
fmBB$varE
## [1] 40.49782
#Estimated SNP effects
head(fmBB$ETA[[1]]$b)
##         X1_2         X1_7        X1_16        X1_19        X1_27        X1_34 
##  0.062760956  0.052840926 -0.008042352  0.001796811  0.005559859  0.015022433
plot((fmBB$ETA[[1]]$b), col=4, cex=.5, type='o', main='BayesB')


RRBLUP vs. BayesB

# Filtering for one environment and a different trait (column 6)
phenodat0 = Phenotypes[Phenotypes$Env == 1, c(1,2,6)]

###----------- Phenotypic data
Y = as.matrix(phenodat0[,3])

###-----------  BayesB model for the selected trait
fmBB <- BGLR(y=Y,            # Phenotypic data
             ETA=list(list(X=Genotypes, model='BayesB')),          # Model ETA
             nIter=1200,       # Number of iterations for the model
             burnIn=120,       # Burnin iterations
             thin=5,           # Sampling throughout iterations
             verbose = FALSE)


###-----------  BRR (rrBLUP) model for the same trait
fmBRR <- BGLR(y=Y,            # Phenotypic data
             ETA=list(list(X=Genotypes, model='BRR')),          # Model ETA
             nIter=1200,       # Number of iterations for the model
             burnIn=120,       # Burnin iterations
             thin=5,           # Sampling throughout iterations
             verbose = FALSE)


# Plotting the effects
BayesB_rrBLUP = data.frame("effect_BayesB" = fmBB$ETA[[1]]$b,
                           "effect_rrBLUP" = fmBRR$ETA[[1]]$b)

ggplot(BayesB_rrBLUP, aes(x = effect_BayesB, y = effect_rrBLUP))+
  geom_point()
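The scatterplot can be complemented with the correlation between the two vectors of marker effects, which summarizes how differently the BayesB and BRR priors shrink the estimates.

# Correlation between BayesB and rrBLUP (BRR) marker effects
cor(BayesB_rrBLUP$effect_BayesB, BayesB_rrBLUP$effect_rrBLUP)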


Cross-validation

Using cross-validation to estimate predictive ability under the CV1 scheme

The cross-validation methods are divided into four different classes (sensu Jarquin et al., 2018), according to whether the genotypes and environments in the testing set were also observed in the training set, as follows:

Figure 1 - Cross-validation schemes.

Phenotypic information

The first step is to organize the phenotypic information and create the partitions for the cross-validation scheme (CV1). Generally, we divide the phenotypic dataset into two sets:

Testing set: Population whose phenotypes will be predicted based on the markers and on the information from the training set.

Training set: Population whose phenotypic and genomic information is used to calibrate the model.

###----------- Parameters for the folds
nReps = 5
nFolds = 5

###----------- Output for the estimated values
yHatCV=rep(NA,length(phenodata$Trait1))
pred = data.frame()

###----------- Model
# Phenodata
y = as.matrix(phenodata$Trait1)

# Time for the loop
set.seed(0928761) 
 for(Rep in 1:nReps){
   
 folds=sample(1:nFolds,size=length(phenodata$Trait1),replace=T)  
 
  for(i in 1:max(folds)){
    tst=which(folds==i)
    yNA=y
    yNA[tst]=NA
    # model
    fm=BGLR(y=yNA,
            ETA=list(list(K=GMat,model='RKHS')),
            nIter=1200,
            burnIn=120,
            verbose = FALSE)
    # Predicted values
    yHatCV[tst]=fm$yHat[tst]
  }
 # Predictive ability for each repetition
 pred=rbind(pred, 
            data.frame(Repetitions = Rep,
                       Cor=cor(yHatCV, y)))
}
 

## Mean for the correlation
mean(pred$Cor)
## [1] 0.3670201
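Besides the overall mean, the variation of the predictive ability across repetitions can be checked with a simple plot of the pred data frame built above; a minimal ggplot2 sketch:

# Predictive ability (CV1) per repetition
ggplot(pred, aes(x = factor(Repetitions), y = Cor)) +
  geom_point() +
  labs(x = 'Repetition', y = 'Predictive ability (CV1)')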


References

Gaynor, R. C., Gorjanc, G., & Hickey, J. M. (2021). AlphaSimR: An R package for breeding program simulations. G3, 11(2), jkaa017.
Jarquin, D., Howard, R., Xavier, A., & Das Choudhury, S. (2018). Increasing predictive ability by modeling interactions between environments, genotype and canopy coverage image data for soybeans. Agronomy, 8(4), 51.
Pérez, P., & Los Campos, G. de. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198(2), 483–495.
Pérez-Rodríguez, P., & Los Campos, G. de. (2022). Multitrait Bayesian shrinkage and variable selection models with the BGLR-R package. Genetics, 222(1), iyac112.
VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.
