5 Running `sequoia()`

5.1 `Module` (`MaxSibIter`)

Most datasets initially contain imperfections. Pedigree reconstruction can be a useful tool in quality control, to find outliers with many Mendelian errors among e.g. known mother–offspring pairs (sample mislabeling?) or among SNPs (poor quality SNPs to be excluded).

To speed up iterative quality control, it is possible to run sequoia() only up to a certain point, use the output thus far to, as necessary, exclude SNPs, exclude/relabel samples, or adjust input parameters, and re-run again from the start or resume from the last point.

There are three possible intermediate stopping points, plus the final end point, each named after the last ‘Module’ that is run:

'pre': Input check
check that the genotype data, life history data and input parameters are in a valid format, create 'Specs' element of output list
'dup': check for duplicates
Check for (nearly) identical genotypes, and for duplicated IDs in the genotype and life history data
'par': parentage assignment
Assign genotyped parents to genotyped individuals. Includes call to MakeAgePrior() to estimate AgePriors based on the just-assigned parents.
'ped': full pedigree reconstruction
Cluster half- and full-siblings and assign each cluster a dummy-parent; assign grandparents to sibships and singletons.

5.1.1 `MaxSibIter`

Up to sequoia version 2.0, MaxSibIter was used to switch between these options, with numeric values

\(-9\) only input check
\(-1\) also check for duplicates
\(0\) also do parentage assignment
\(x>0\) also do at most \(x\) iterations of full pedigree reconstruction

The maximum number of iterations was always only intended as a safety net in case the total likelihood did not stabilise, but this has not been an issue since sequoia version 1.0 (as far as I am aware). Even for large, complex datasets (several thousand individuals with considerable close inbreeding) the total likelihood plateaus in 10–15 iterations. From version 2.1 onward, MaxSibIter\(=42\); if this number of iterations is reached it is a strong sign of problems with the data, e.g. that the SNPs are not informative enough to reconstruct the pedigree.

5.2 Input check

SNP datasets are typically too large to easily spot any problems by simply looking at the data in a spreadsheet. There are many tools available to deal with genotype data and do proper quality control; the function CheckGeno() merely checks that the data are in the correct format for sequoia, and that there are no SNPs or individuals with excessively many missing values.

The function SnpStats() can be used to count the number of Mendelian errors per SNP. It is not automatically called by sequoia(), but excluding SNPs with exceptionally high genotyping error rates is recommended to improve accuracy of the inferred pedigree.

5.3 Check for duplicates

The data may contain positive controls, as well as other intentional and unintentional duplicated samples, with or without life-history information. sequoia() searches the data for (near) identical genotypes, allowing for MaxMismatchDUP mismatches between the genotypes, which may or may not have the same individual ID. The threshold value depends on the presumed genotyping error rate, allele frequencies and number of SNPs, and is calculated by CalcMaxMismatch().

Not all pairs flagged as potential duplicate genotypes are necessarily actual duplicates: inbred individuals may be nearly indistinguishable from their parent(s), especially when the number of SNPs is limited.

This check will also return a vector of individuals included in the genotype data, but not in the life history data (NoLH). This is merely a service to the user; individuals without life history information can often be successfully included in the pedigree.

5.4 Parentage assignment

The number of pairs to be checked if they are parent and offspring is very large for even moderate numbers of individuals, e.g. 5,000 pairs for 100 individuals, and 2 million for 2,000 individuals. To speed up computation, three ‘sieves’ are applied sequentially to find candidate parent-offspring pairs, with decreasing ‘mesh size’

The number of SNPs at which the pair are opposing homozygotes must be less than MaxMismatchOH;
The log10-likelihood ratio (LLR) between being parent and offspring versus unrelated, not conditioning on any already assigned parents, must be equal to or greater than Tfilter;
The LLR between the pair being parent and offspring versus being otherwise related, calculated conditional on all already assigned relatives, must be equal to or greater than Tassign. This step filters out siblings, grandparents and aunts/uncles.

The older of the pair is assigned as parent of the younger. If it is unclear which is the older, or if it is unclear whether the parent is the mother or the father, no assignment is made, but these pairs can be found with GetMaybeRel() (see section Likely relatives). If there are multiple candidate parents of one or both (or unknown) sex(es), the parent pair or single parent resulting in the highest likelihood is assigned.

The heuristic sequential filtering approach makes parentage assignment quick, and usually takes only a few minutes, especially when setting CalcLLR=FALSE (do not re-calculate LLR parent-offspring vs next-most-likely relationship for all assigned parents, based on the final pedigree).

5.5 AgePrior

The assigned parents are used to update the ageprior, before returning the results to the user of continuing with full pedigree reconstruction. Any arguments to MakeAgePrior() can be passed via args.AP. For example, you can specify the maximum age of dams and sires:

SeqOUT.B <- sequoia(GenoM = SimGeno_example, Err = 0.005,
                    LifeHistData = LH_HSg5, Module="par", Plot=FALSE, quiet=TRUE,
                    args.AP = list(MaxAgeParent = c(3,2), Smooth=FALSE))
PlotAgePrior(SeqOUT.B$AgePriors)

This way, even if sampled parents are all age 1, siblings sharing an unsampled parent may still have an age difference of 1 or 2 years, and grandparent – grand-offspring pairs may have an age difference of 2-6 years.

5.6 Full reconstruction

During full pedigree reconstruction, assignment of all first and second degree links between all individuals is attempted, using the following steps within each iteration

Find pairs of likely full- and half-siblings, using filtering steps with decreasing ‘mesh size’ similar to parentage assignment
Cluster potential sibling pairs into sibships
Find and assign grandparent – grand-offspring pairs (iteration 3+)
Merge existing sibships
Replace dummy parents by genotyped individuals
For individuals without parent(s), find candidate real and dummy parents, and assign parent(-pairs)
For sibships without grandparent(s), find candidate real and dummy grandparents, and assign grandparent(-pairs) (iteration 2+)

The order of these steps, and the skipping of some steps in the first iteration(s), has been established by trial & error to minimise false positive assignments while maximising correct assignments across a wide range of datasets. When running sequoia(), you can keep track of the progress through the various steps by setting quiet = 'verbose'.

Full pedigree reconstruction may take from a few seconds to several hours, depending on the number of individuals without an already assigned parent, the proportion of individuals with unknown sex or birth year, the number of sibships that is being clustered and their degree of interconnection, and the number of SNPs and their error rate. Generally computation is faster if the data is more complete and more accurate.

5.6.1 Dummy Individuals

The ‘dummy’ parent assigned to each cluster of half-siblings is denoted by a 4-digit number with prefix F for females (F0001, F0002, …) and M for males (M0001, M0002, …). In the output, dummy individuals are appended at the bottom of the pedigree with their assigned parents, i.e. the sibship’s grandparents. In addition, all information for each dummy individual is given in dataframe DummyIDs, including its parents (again), its offspring, and the estimated birth year, as a point estimate (BY.est) and lower and upper bound of the 95% probability interval. This information is intended to make it easier to match dummy IDs to real IDs of observed but non-genotyped individuals (see also Compare pedigrees).

5.6.2 Total likelihood

The total likelihood provides a measure of how well the observed genotype data is explained by the currently inferred pedigree, the allele frequencies of the SNPs, the presumed genotyping error rate, and the assumption that founder genotypes were drawn from a genepool in Hardy-Weinberg equilibrium. It is always a negative number, and closer to zero indicates a better fit. When an asymptote is reached, typically in 5–10 iterations, either the algorithm is concluded (the default) or dependency on the age prior is increased (if UseAge = 'extra') and the algorithm continues until a new asymptote is reached.

If you set MaxSibIter and there is a large change in value between the second-last and last likelihood of output element TotLikSib, consider running the algorithm for more iterations (increase MaxSibIter).

5.7 Save output

There are various ways in which the output can be stored. This includes saving the seqoia list object, and optionally various other objects, in an .RData compressed file

save(SeqList, LHdata, Geno, file="Sequoia_output_date.RData")

which can be read back into R at a later point

load("Sequoia_output_date.RData")
# 'SeqList' and 'LHdata' will appear in R environment

The advantage is that all data is stored and can easily be manipulated when recalled. The disadvantage is that the file is not human-readable, and (to my knowledge) can only be opened by R.

Alternatively, the various dataframes and list elements can each be written to a text file in a designated folder. This can be done using write.table or write.csv, or using the wrapper writeSeq():

writeSeq(SeqList, GenoM = Geno, folder=paste("Sequoia_OUT", Sys.Date()))

The same function can also write the dataframes and list elements to an excel file (.xls or .xlsx), each to a separate sheet, using package xlsx:

writeSeq(SeqList, OutFormat="xls", file="Sequoia_OUT.xlsx")

Note that ‘GenoM’ is ignored, as a very large genotype matrix may result in a file that is too large for excel to open. If you have a genotype matrix of modest size, you can add it to the same excel file:

library(xlsx)
write.xlsx(Geno, file = "Sequoia_OUT.xlsx", sheetName="Genotypes",
      col.names=FALSE, row.names=TRUE, append=TRUE, showNA=FALSE)

The option append=TRUE ensures that the sheet is appended to the file, rather than the file entirely overwritten.

5 Running sequoia()

5.1 Module (MaxSibIter)

5.1.1 MaxSibIter