Genotyping errors
genotyping_errors.Rmd
Error patterns
The pattern of genotyping errors will depend on the genotyping method and the software used to call SNPs. For SNP arrays, the clustering based on the intensity of the A vs a probe results in the chance of a true homozygote AA being erroneously scored as a heterozygote Aa being about as likely as the reverse. In contrast, for sequencing-based methods scoring a true heterozygote erroneously as a homozygote is much more likely than the reverse, especially with low coverage.
Because there is no one-size-fits-all error model for SNP genotypes,
sequoia
offers full flexibility in specifying the error
pattern. This is done via a 3x3 matrix (heterozygotes
Aa and aA are interchangeable) with
the probabilities of observing genotype \(y\) given true genotype \(x\).
Default model
The default error model is based on SNP array genotyping, and from version 2.0 onwards is defined as:
aa | Aa | AA | |
---|---|---|---|
aa | \((1-E/2)^2\) | \(E(1-E/2)\) | \((E/2)^2\) |
Aa | \(E/2\) | \(1-E\) | \(E/2\) |
AA | \((E/2)^2\) | \(E(1-E/2)\) | \((1-E/2)^2\) |
Which for an error rate of 1% translates to:
sequoia::ErrToM(Err=0.01)
## obs-0|act obs-1|act obs-2|act
## act-0 9.9e-01 0.009975 2.5e-05
## act-1 5.0e-03 0.990000 5.0e-03
## act-2 2.5e-05 0.009975 9.9e-01
where 0, 1 or 2 refers to the number of copies of the reference allele A.
When the true genotype is homozygote aa (top row), the probabilities are derived as follows:
aa: the probability that neither allele is observed incorrectly, i.e. \((1-E/2) \times (1-E/2)\)
Aa: the probability that the first allele is mis-observed (\(E/2\)) and the second allele is not (\(1-E/2\)), plus vice versa. Thus, \(P=2 \times (E/2) \times (1-E/2) = E (1-E/2)\).
AA: the probability that both alleles are observed incorrectly
Error model in previous versions
\((E/2)^2 \approx 0\) and \((1-E/2)^2 \approx 1-E\), as illustrated in
the example for Err=0.01
. This implies that the current
default error model does not differ as much from the simpler error model
in version 0.9 as may seem at first glance:
aa | Aa | AA | |
---|---|---|---|
aa | \(1-E\) | \(E\) | \(0\) |
Aa | \(E/2\) | \(1-E\) | \(E/2\) |
AA | \(0\) | \(E\) | \(1-E\) |
Using a value of \(0\) in the error matrix is not advised, as it puts an infinite amount of weight on the locus if it does happen after all (see also Cromwell’s rule.
Per-locus vs per-allele error rate
The error rate \(E\)
(Err
) in the previous part represents the per-locus error
rate; the per-allele error rate is \(E/2\). There appears to be no consistency
in the literature on whether the per-locus or per-allele error rate is
used, so it is worth making sure it is defined consistently throughout
your analysis pipeline. For pedigree reconstruction with
sequoia
, being off by a factor 2 should generally not have
large consequences.
If you prefer, you could easily re-define the error matrix in terms of a per-allele error rate (say \(\epsilon\)):
aa | Aa | AA | |
---|---|---|---|
aa | \((1-\epsilon)^2\) | \(2\epsilon(1-\epsilon)\) | \(\epsilon^2\) |
Aa | \(\epsilon\) | \(1-2\epsilon\) | \(\epsilon\) |
AA | \(\epsilon^2\) | \(2\epsilon(1-\epsilon)\) | \((1-\epsilon)^2\) |
This can be done via sequoia()
argument Err
or ErrFlavour
, as described next.
Changing the error model
via Err
You can pass a 3x3 matrix to Err
, for which the only
requirement is that rows must sum to 1. The homozygote for the reference
allele may be more likely to be mis-typed than the homozygote for the
alternative allele, the heterozygote may be mis-typed with different
probabilities for the two homozygotes, or whatever else is appropriate
for your dataset.
For example, you have the following counts from genotyping some samples several times:
## [,1] [,2] [,3]
## [1,] 14 3 5
## [2,] 4 28 9
## [3,] 0 4 72
A problem here is the cell with \(0\): by leaving this as is, it would imply that true A can never be observed as aa, even if we would genotype thousands of samples. That seems unlikely, so as a quick fix let’s pretend that if we had genotyped twice as many, we would have found 1 error:
ErrCount[3,1] <- 0.5
Then divide by rowsums, e.g. using sweep()
:
## [,1] [,2] [,3]
## [1,] 0.636 0.136 0.227
## [2,] 0.098 0.683 0.220
## [3,] 0.007 0.052 0.941
and use this as input for the sequoia()
analysis:
SeqOUT <- sequoia(GenoData, LifeHistData,
Err = ErrProbs)
In reality you probably want something a bit more sophisticated, e.g. relying on the known error pattern for your genotyping platform.
via ErrFlavour
If you want to run sequoia()
with several presumed error
rates, you would need to re-calculate the 3x3 error matrix several
times. You can do this yourself, or you can provide argument
ErrFlavour
with a function that turns a single number into
a 3x3 matrix.
For example, to re-define the error matrix in terms of a per-allele error rate:
ErFunc_allele <- function(Ea) {
matrix(c((1-Ea)^2, 2*Ea*(1-Ea), (Ea)^2,
Ea, 1-2*Ea, Ea,
(Ea)^2, 2*Ea*(1-Ea), (1-Ea)^2),
3,3, byrow=TRUE)
}
ErFunc_allele(0.01)
## [,1] [,2] [,3]
## [1,] 0.9801 0.0198 0.0001
## [2,] 0.0100 0.9800 0.0100
## [3,] 0.0001 0.0198 0.9801
ErFunc_allele(0.005)
## [,1] [,2] [,3]
## [1,] 0.990025 0.00995 0.000025
## [2,] 0.005000 0.99000 0.005000
## [3,] 0.000025 0.00995 0.990025
SeqOUT <- sequoia(GenoData, LifeHistData,
Err = 0.005,
ErrFlavour = ErFunc_allele)
Variation between SNPs
Currently a constant error rate across all SNPs is assumed. It is unclear whether setting this presumed rate to the arithmetic mean rate (the regular average) across SNPs gives the best performance for real world data. Anecdotal evidence suggests that using a somewhat higher presumed error rate increases performance (more correct assignments and/or fewer incorrect assignments), but this has not been thoroughly explored. It may partly be due to confusion between the per-locus and per-allele error rate, but also due to the very skewed nature of most distributions (many alleles with low error rate, and some with high error rate), resulting in a arithmetic mean that is lower than the median. For rates, the harmonic mean may be more appropriate.
Implementation
The calculation of likelihoods incorporating genotyping errors is
described in Huisman (2017), and can be explored using
CalcPairLL
.
Noteworthy is that the probabilistic estimate of an individual’s true genotype depends on its own observed genotype and its parents’ estimated true genotypes, but not on the genotypes of its offspring. Doing so would potentially create an unstable seesaw situation, where assignment of a parent is both affected by and does affect that parent’s estimated true genotype.
For dummy individuals, which do not have an observed genotype, the offspring genotypes do necessarily affect its estimated true genotype. To prevent the algorithm from seesawing marginal probabilities are calculated, excluding grandparents and/or one, two or all offspring from the calculation, as appropriate for a particular relationship likelihood (most importantly avoiding implicitly conditioning on oneself).
The main reason those marginal probabilities are not implemented for genotyped individuals, is because of their relatively large computational cost, with negligible effect on the pedigree outcome for moderately high quality SNP data (error rate < 1% and low missingness).1
Try out different values
Due to both the variation in error rate between SNPs and how
genotyping errors are implemented, it is usually a good idea to run
sequoia()
with a few different presumed genotyping error
rates. This allows you to make sure the results are robust against
slight mis-specifications of this parameter, and/or to optimise the
false negative and false positive assignment rates.
Err = 0
sequoia()
is not guaranteed to work with an assumed
error rate of \(0\), especially for
more complex pedigrees. The problem is that a new assignment considers
only the direct vicinity of the putative parent-offspring pair, not the
full pedigree. An assignment will have cascading effects on the
probabilities of the true, underlying genotypes of dummy or partially
genotyped individuals downstream in the pedigree, which may result in
impossibilities when Err=0
but genotyping errors are
present.