Skip to contents

Error patterns

The pattern of genotyping errors will depend on the genotyping method and the software used to call SNPs. For SNP arrays, the clustering based on the intensity of the A vs a probe results in the chance of a true homozygote AA being erroneously scored as a heterozygote Aa being about as likely as the reverse. In contrast, for sequencing-based methods scoring a true heterozygote erroneously as a homozygote is much more likely than the reverse, especially with low coverage.

Because there is no one-size-fits-all error model for SNP genotypes, sequoia offers full flexibility in specifying the error pattern. This is done via a 3x3 matrix (heterozygotes Aa and aA are interchangeable) with the probabilities of observing genotype \(y\) given true genotype \(x\).

Default model

The default error model is based on SNP array genotyping, and from version 2.0 onwards is defined as:

Probability of observed genotype (columns) conditional on actual genotype (rows) and per-locus error rate E
aa Aa AA
aa \((1-E/2)^2\) \(E(1-E/2)\) \((E/2)^2\)
Aa \(E/2\) \(1-E\) \(E/2\)
AA \((E/2)^2\) \(E(1-E/2)\) \((1-E/2)^2\)

Which for an error rate of 1% translates to:

sequoia::ErrToM(Err=0.01)
##       obs-0|act obs-1|act obs-2|act
## act-0   9.9e-01  0.009975   2.5e-05
## act-1   5.0e-03  0.990000   5.0e-03
## act-2   2.5e-05  0.009975   9.9e-01

where 0, 1 or 2 refers to the number of copies of the reference allele A.

When the true genotype is homozygote aa (top row), the probabilities are derived as follows:

  • aa: the probability that neither allele is observed incorrectly, i.e. \((1-E/2) \times (1-E/2)\)

  • Aa: the probability that the first allele is mis-observed (\(E/2\)) and the second allele is not (\(1-E/2\)), plus vice versa. Thus, \(P=2 \times (E/2) \times (1-E/2) = E (1-E/2)\).

  • AA: the probability that both alleles are observed incorrectly

Error model in previous versions

\((E/2)^2 \approx 0\) and \((1-E/2)^2 \approx 1-E\), as illustrated in the example for Err=0.01. This implies that the current default error model does not differ as much from the simpler error model in version 0.9 as may seem at first glance:

As previous table, for sequoia 0.9
aa Aa AA
aa \(1-E\) \(E\) \(0\)
Aa \(E/2\) \(1-E\) \(E/2\)
AA \(0\) \(E\) \(1-E\)

Using a value of \(0\) in the error matrix is not advised, as it puts an infinite amount of weight on the locus if it does happen after all (see also Cromwell’s rule.

Per-locus vs per-allele error rate

The error rate \(E\) (Err) in the previous part represents the per-locus error rate; the per-allele error rate is \(E/2\). There appears to be no consistency in the literature on whether the per-locus or per-allele error rate is used, so it is worth making sure it is defined consistently throughout your analysis pipeline. For pedigree reconstruction with sequoia, being off by a factor 2 should generally not have large consequences.

If you prefer, you could easily re-define the error matrix in terms of a per-allele error rate (say \(\epsilon\)):

Error matrix redefined with per-allele error rate \(\epsilon\)
aa Aa AA
aa \((1-\epsilon)^2\) \(2\epsilon(1-\epsilon)\) \(\epsilon^2\)
Aa \(\epsilon\) \(1-2\epsilon\) \(\epsilon\)
AA \(\epsilon^2\) \(2\epsilon(1-\epsilon)\) \((1-\epsilon)^2\)

This can be done via sequoia() argument Err or ErrFlavour, as described next.

Changing the error model

via Err

You can pass a 3x3 matrix to Err, for which the only requirement is that rows must sum to 1. The homozygote for the reference allele may be more likely to be mis-typed than the homozygote for the alternative allele, the heterozygote may be mis-typed with different probabilities for the two homozygotes, or whatever else is appropriate for your dataset.

For example, you have the following counts from genotyping some samples several times:

ErrCount <- matrix(c(14, 3, 5,
                     4, 28, 9,
                     0, 4, 72),
                   nrow=3, byrow=TRUE)
ErrCount
##      [,1] [,2] [,3]
## [1,]   14    3    5
## [2,]    4   28    9
## [3,]    0    4   72

A problem here is the cell with \(0\): by leaving this as is, it would imply that true A can never be observed as aa, even if we would genotype thousands of samples. That seems unlikely, so as a quick fix let’s pretend that if we had genotyped twice as many, we would have found 1 error:

ErrCount[3,1] <- 0.5

Then divide by rowsums, e.g. using sweep():

ErrProbs <- sweep(ErrCount, MARGIN=1, STATS=rowSums(ErrCount), FUN="/")
round(ErrProbs, 3)
##       [,1]  [,2]  [,3]
## [1,] 0.636 0.136 0.227
## [2,] 0.098 0.683 0.220
## [3,] 0.007 0.052 0.941

and use this as input for the sequoia() analysis:

SeqOUT <- sequoia(GenoData, LifeHistData,
                  Err = ErrProbs)

In reality you probably want something a bit more sophisticated, e.g. relying on the known error pattern for your genotyping platform.

via ErrFlavour

If you want to run sequoia() with several presumed error rates, you would need to re-calculate the 3x3 error matrix several times. You can do this yourself, or you can provide argument ErrFlavour with a function that turns a single number into a 3x3 matrix.

For example, to re-define the error matrix in terms of a per-allele error rate:

ErFunc_allele <- function(Ea) {
      matrix(c((1-Ea)^2, 2*Ea*(1-Ea), (Ea)^2,
              Ea, 1-2*Ea, Ea,
              (Ea)^2, 2*Ea*(1-Ea), (1-Ea)^2),
             3,3, byrow=TRUE)
}

ErFunc_allele(0.01)
##        [,1]   [,2]   [,3]
## [1,] 0.9801 0.0198 0.0001
## [2,] 0.0100 0.9800 0.0100
## [3,] 0.0001 0.0198 0.9801
ErFunc_allele(0.005)
##          [,1]    [,2]     [,3]
## [1,] 0.990025 0.00995 0.000025
## [2,] 0.005000 0.99000 0.005000
## [3,] 0.000025 0.00995 0.990025
SeqOUT <- sequoia(GenoData, LifeHistData,
                  Err = 0.005,
                  ErrFlavour = ErFunc_allele)

Variation between SNPs

Currently a constant error rate across all SNPs is assumed. It is unclear whether setting this presumed rate to the arithmetic mean rate (the regular average) across SNPs gives the best performance for real world data. Anecdotal evidence suggests that using a somewhat higher presumed error rate increases performance (more correct assignments and/or fewer incorrect assignments), but this has not been thoroughly explored. It may partly be due to confusion between the per-locus and per-allele error rate, but also due to the very skewed nature of most distributions (many alleles with low error rate, and some with high error rate), resulting in a arithmetic mean that is lower than the median. For rates, the harmonic mean may be more appropriate.

Implementation

The calculation of likelihoods incorporating genotyping errors is described in Huisman (2017), and can be explored using CalcPairLL.

Noteworthy is that the probabilistic estimate of an individual’s true genotype depends on its own observed genotype and its parents’ estimated true genotypes, but not on the genotypes of its offspring. Doing so would potentially create an unstable seesaw situation, where assignment of a parent is both affected by and does affect that parent’s estimated true genotype.

For dummy individuals, which do not have an observed genotype, the offspring genotypes do necessarily affect its estimated true genotype. To prevent the algorithm from seesawing marginal probabilities are calculated, excluding grandparents and/or one, two or all offspring from the calculation, as appropriate for a particular relationship likelihood (most importantly avoiding implicitly conditioning on oneself).

The main reason those marginal probabilities are not implemented for genotyped individuals, is because of their relatively large computational cost, with negligible effect on the pedigree outcome for moderately high quality SNP data (error rate < 1% and low missingness).1

Try out different values

Due to both the variation in error rate between SNPs and how genotyping errors are implemented, it is usually a good idea to run sequoia() with a few different presumed genotyping error rates. This allows you to make sure the results are robust against slight mis-specifications of this parameter, and/or to optimise the false negative and false positive assignment rates.

Err = 0

sequoia() is not guaranteed to work with an assumed error rate of \(0\), especially for more complex pedigrees. The problem is that a new assignment considers only the direct vicinity of the putative parent-offspring pair, not the full pedigree. An assignment will have cascading effects on the probabilities of the true, underlying genotypes of dummy or partially genotyped individuals downstream in the pedigree, which may result in impossibilities when Err=0 but genotyping errors are present.


References

Huisman, Jisca. 2017. “Pedigree Reconstruction from SNP Data: Parentage Assignment, Sibship Clustering and Beyond.” Molecular Ecology Resources 17 (5): 1009–24.