8.1 Error about/due to input data

For smaller datasets (say up to 1000 individuals) it may be useful to read in the data with read.table() (using header=TRUE or FALSE, as appropriate, and typically row.names=1), and inspect the data with table(as.matrix(mydata)) for odd entries that weren’t caught by CheckGeno() to give an informative error message. For very large datasets, specialist tools may be required to get the data in a standardised format.

8.2 Low assignment rate

See also key points.

In many cases, assignment rate can be boosted without increasing the number of SNPs: by assuming a different genotyping error rate, providing sex or birth year information on more individuals, or fine-tuning the ageprior. Sometimes reducing the number of SNPs may increase assignment rate, by removing those with the highest error rate and/or SNPs in close LD.

8.2.1 Not enough SNPs

As opposed to most other pedigree assignment programs, Sequoia does not rely on MCMC to explore many different pedigree possibilities, but instead sequentially assigns highly likely relationships, and expands the pedigree step by step. For a relationship to be highly likely, a substantial number of SNPs is necessary, larger than for MCMC methods: at least 100-200 for parentage assignment, or full sibling clustering in a monogamous population, and about 400-500 otherwise. More is not always better: with tens of thousands of SNPs many will unavoidably be in high LD (unless you work on a species with a very large genome and very high \(N_e\)), and the signal will get lost in the noise. Using more than 1000 SNPs is never necessary, and for a typical vertebrate genome there is little benefit of using more than 500 good SNPs.

This unfortunately also means that sequoia is unsuitable for species with a very small genome.

8.2.2 Genotyping errors

A few SNPs with a high (apparent) error rate may throw off pedigree reconstruction if their erroneous signal is not sufficiently offset by a large number of more accurate SNPs. Such SNPs can be identified with SnpStats(), using an existing pedigree or one from an preliminary round of parentage assignment. If there are clear outliers with regards to estimated error rate, it may be worth exploring pedigree reconstruction without these SNPs.

8.2.3 Lacking sex information

Currently Sequoia cannot handle sex-linked markers, and therefore cannot distinguish between maternal versus paternal relatives. Sometimes the sex of an individual can be inferred, if it forms a complementary parent-pair with an individual of known sex. This is also the manner in which the sex of dummy parents is determined; half-sibships sharing a parent of unknown sex are not currently implemented.

Pairs of relatives for which it is unclear whether they are maternal or paternal relatives, can be identified with GetMaybeRel().

8.2.4 Insufficient age information

A parent will only be assigned if it is known to be older than the individual with which it genetically forms a parent–offspring pair, or if a complementary co-parent is identified. Purely genetically, it is impossible to tell who is the parent in a parent–offspring pair. But once an individual has been assigned parents, any remaining individuals with which it forms a parent-offspring pair must be its offspring, and will be assigned as such. Thus, a high proportion of sampled parents may somewhat compensate for unknown birth years.

Providing minimum and maximum possible birth years can substantially increase assignment rate. The risk is that when intervals are taken too narrowly, parent–offspring pairs are flipped the wrong way around, which in turn may lead to other wrong assignments. This may especially be a problem in long-lived species which start breeding at an early age.

Age information will also help to distinguish between the three different types of second degree relatives (half siblings, grandparent – grand-offspring and full avuncular (aunt/uncle – niece/nephew). These are genetically indistinguishable, unless both individuals of the pair already have a parent assigned. In most species, there is limited overlap between the age-difference of half siblings versus grandparents, and only partial overlap of either with the age distribution of avuncular relationships.

8.2.5 Uninformative age prior

The ageprior used for full pedigree reconstruction is estimated from the birth year differences of parent-offspring pairs assigned during the initial parentage assignment. When only few parents are assigned, or when many birth years are unknown, this ageprior may not be very informative. If a large pedigree is available from the same or a similar population (e.g. based on observations and microsatellite paternity assignments), it can be useful to estimate the agepriors from that pedigree. For details, see the age vignette.

8.2.6 Mating system

When the population has a complex mating system, with overlapping generations and many double relatives, a large number of SNPs is needed to distinguish between various plausible alternatives. When the power of the SNP panel is insufficient to make the distinction, no assignment will be made.

When inbreeding and complex relationships (e.g. paternal half-sibling as well as maternal half-aunt) are rare, ignoring these typically increases assignment rate (Complex="simp"). Similarly, when polygamy is rare and not of particular interest, assuming a monogamous mating system typically increases assignment rate (Complex="mono"). However, these choices will risk erroneous assignments when complex relationships or polygamy, respectively, do occur.

8.3 Death year

The year of death forms an upper limit to when an individual could have reproduced, and can from version 2.5 onwards be used as input via the Year.last column in LifeHistData. Note that for many species, the last possible year in which offspring can be born may be after the year of death (e.g. males in many mammal species), or the year preceding the year of death (if a individual died between January 1st and the breeding season).

If it is unclear whether an individual died or emigrated, please do not provide a Year.last – emigrants can still be candidate parents, as they may reside just outside the study area boundaries, or return briefly and unseen during the breeding season. is unknown.

8.4 Accuracy

The real, true pedigree is normally unknown, but there are a few ways to infer the accuracy of specific pedigree links, or to estimate the average accuracy of assignments.

8.4.1 Field pedigree

If the genetically assigned mother matches the mother caring for the individual, or the plant from which the seed was collected, there is little reason to doubt the assignment. Similarly, when the genetically assigned father matches (one of) the observed mates of the mother, the assignment is most likely correct.

In rare other cases, the genetically assigned parent is impossible, for example because the assigned parent was not alive at the time of birth (for mothers) or conception (for fathers). The blame for such an erroneous assignment may be the pedigree reconstruction software (due to genotyping errors, or a bug in the code), but may also be due to sample mislabelling in the lab, or a case of mistaken identity in the field.

Tracking down the underlying cause of discrepancies between the field and genetic pedigree will be time consuming, but can be a valuable part of data quality control.

8.4.2 Genomic relatedness

When many thousands of SNPs are typed, it is possible to calculate the genomic relatedness (\(R_{grm}\)) between all pairs of individuals (see Relatedness). Due to the random nature of Mendelian inheritance there is always considerable scatter of genomic relatedness around pedigree relatedness, but when the pedigree relatedness is considerably higher (say \(R_{ped} - R_{grm} >0.2\)), this is often indicative of a pedigree error. Note however that most estimators of \(R_{grm}\) assume a large, panmictic, non-inbred population, and deviations from these assumptions may contribute to differences between \(R_{ped}\) and \(R_{grm}\).

8.4.3 Data simulation

Simulations as performed by ‘EstConf()’ do not tell which assignments may be incorrect, but do give an estimate of the overall number of incorrect assignments. The simulations are done presuming the inferred pedigree (or an existing pedigree) is the true pedigree, i.e. for a pedigree that is (hopefully) very close to the actual true pedigree. It relies on several simplifying assumptions, and will therefore always be an optimistic estimate of the actual accuracy.

8.5 Add new individuals

When new individuals have been genotyped, such as a new cohort of offspring, it is best to re-run the pedigree reconstruction for all genotyped individuals. This ensures that older siblings of the new offspring are identified, as well as any grandparents. A possible exception is when all candidate parents have been genotyped, although even then inclusion of their parents (the new individuals’ grandparents) may correct for any genotyping errors they may have.

8.6 Results differ from COLONY

Sequoia will typically produce a more conservative pedigree than COLONY, for various reasons:

  • COLONY only considers whether individuals are full siblings, half siblings, or unrelated, while sequoia also considers full- or half- avuncular (aunts and uncles and their nieces and nephews), and grand-parental. These alternative relationships are impossible among individuals born in the same year when generations do not overlap (args.AP=list(Discrete=TRUE)). If all samples are from a single birth year, this is automatically assumed.
  • sequoia also considers double relationships, e.g. paternal half-siblings that are also maternal half-cousins, with a relatedness inbetween full and half siblings. These may be erroneously assigned as full siblings by COLONY when present, but their consideration may cause true full siblings to be only assigned as half-siblings, especially when there is only a moderate amount of genetic data. To ignore double and inbred relationships, use Complex='simp'.
  • sequoia considers third degree relationships, such as full cousins. Again, these may result in false positives by COLONY, while their consideration may result in false negatives by sequoia. This can not be turned off.
  • When no parents of known sex can be assigned, there is no way in sequoia to differentiate between maternal and paternal half-siblings, and none are assigned (although they may be identified with GetMaybeRel(). In COLONY, candidate mothers and fathers can be specified, but this option is not available in sequoia (yet?).
  • When a relationship is uncertain, COLONY will include it in the output with a low confidence probability. sequoia only makes assignments of which it is very sure, as the sequential nature of the algorithm may otherwise result in a snowball effect of incorrect assignments.

You can get sequoia‘s ’opinion’ on a COLONY reconstructed pedigree (or any other pedigree) with CalcOHLLR(), which for every assigned parent gives the log-likelihood ratio between being parent-offspring with the focal individual, versus otherwise related. You may also use CalcPairLL() for a specific set of pairs of individuals to calculate their likelihoods to be parent-offspring, full siblings, … , and investigate how these likelihoods change when changing the ageprior or with Complex='simp'.