How to: RADseq data
Jisca Huisman ( jisca.huisman @ gmail.com )
16 August 2025
Background & Caveats
Sequoia was initially developed for SNP array data, which after typical quality control has a low rate of missing data (<1%) and a very low genotyping error rate (<0.1%). This is quite different from RADseq and similar sequence data, which typically have a high rate of missingness (often >10%) and a high rate of genotyping errors (>5%, see e.g. Bresadola et al. 2020).
Because of the difference in data accuracy and completeness, the optimal strategy for pedigree reconstruction differs. With SNP array data it is often very obvious if and how a sample pair are related, which allows the use of the fast hill-climbing algorithm implemented in sequoia, which relies on (almost) all assignments made in previous steps being correct. In contrast, for RADseq data the relationship between samples is often a lot less obvious. In such situations, algorithms that fairly randomly explore the huge number of possible pedigree configurations are more powerful. However, this is very time consuming, and (at least some years ago) was only implemented for parentage assignment (https://CRAN.R-project.org/package=MasterBayes ; archived), for sibship clustering and parentage assignment within a single cohort (https://www.zsl.org/about-zsl/resources/software/colony), or for idealised datasets (various theoretical papers, assuming no inbreeding and/or no missing data and/or discrete generations, etc.).
Substantial power can be gained by making use of information from neighbouring SNPs, rather than treating each SNP as independent, as sequoia, Colony, and most other programs do. However, this added complexity is likely to come with a substantial increase in computation time.
Please do let me know if you know of a good general pedigree reconstruction program for RADseq data, so that I can recommend it to users for whom sequoia does not work well!
So, while sequoia is not an ideal tool for RADseq data, it is often sufficient, and may sometimes be the only suitable tool available.
Keeping that in mind, there are various tips & tricks to get the most out of your data, listed below.
Data filtering
This step is crucial; simply using all genetic data as input for sequoia is almost guaranteed to give you very poor results.
Filter data based on:
- Autosomal SNPs only. Remove any SNPs that are heterozygous in the heterogametic sex (females in mammals, males in birds and some reptiles), but always homozygous in the other sex.
- Minor Allele Frequency (MAF). SNPs with very low MAF are not informative, as almost all individuals will be identical homozygotes. Moreover, with typical RADseq error rates, any heterozygotes or rare-allele homozygotes are almost as likely to be incorrect as true carriers of the rare allele. Therefore, inclusion of these SNPs may obfuscate the true pedigree signal, rather than strengthen it.
- Missingness. If a SNP is only scored for say 20% of individuals, only $0.2 \times 0.2 = 4\%$ of sample pairs are both scored for it, so for 96% of pairs the SNP carries no information at all.
- Linkage Disequilibrium (LD). By chance, pairs of individuals will be more related in some regions of their genomes than you’d expect from the pedigree, and less related in other regions. As an extreme example, if you have 50 SNPs in very close LD, and 30 other SNPs scattered throughout the genome, the signal from those 50 SNPs can overwhelm the signal from the rest of the genome, and bias the inference of how the samples are related. It is possible to filter SNPs based on LD even if the genetic map of the species is unknown.
- Genotyping error rate. If you have some estimate of the genotyping error rate, for example from samples genotyped twice, consider removing SNPs with the very worst error rates. There is no need to filter extremely harshly: even if all SNPs had exactly the same underlying rate, you would see a Poisson distribution of the error count per SNP with a long tail to the right.
- Deviation from HWE. Strong deviations from genotype proportions under Hardy-Weinberg Equilibrium suggest null alleles or other genotyping issues. Again, no need to filter extremely harshly, as variation in this metric is expected, especially in small inbred populations and/or due to selection.
There is no one-size-fits-all recommendation on thresholds to use, or the relative emphasis on the different filtering criteria – it depends on the number of SNPs you start with, the data quality, the genome size and chromosome number, and so on. It can be useful to first remove the very worst SNPs, then remove the very worst samples, and only then do the final filtering; a minimal filtering sketch is given below.
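As a rough illustration of the missingness, MAF and sex-linkage criteria, here is a minimal base-R sketch. It assumes a genotype matrix GenoM coded as 0/1/2 copies of the reference allele with -9 for missing values (the coding used by sequoia), a hypothetical Sex vector aligned with the rows of GenoM, and purely illustrative thresholds; dedicated filtering tools (e.g. PLINK or vcftools) offer many more options, including LD-based pruning.
# Per-SNP missingness and minor allele frequency, from a 0/1/2/-9 matrix
Missing <- apply(GenoM, 2, function(x) mean(x == -9))
MAF <- apply(GenoM, 2, function(x) {
  x <- x[x != -9]
  p <- sum(x) / (2 * length(x))   # frequency of the allele counted by '1' and '2'
  min(p, 1 - p)
})
# Illustrative thresholds only; tune these to your own dataset
SNPs_OK <- Missing < 0.2  &  MAF > 0.25
# Optional: flag putatively sex-linked SNPs (mammal example: sometimes
# heterozygous in females, but (almost) never in males)
pHet <- function(x) mean(x[x != -9] == 1)
Het_F <- apply(GenoM[Sex == 'F', ], 2, pHet)
Het_M <- apply(GenoM[Sex == 'M', ], 2, pHet)
SNPs_OK <- SNPs_OK  &  !(Het_F > 0.05 & Het_M < 0.01)
GenoM_filtered <- GenoM[, SNPs_OK]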
Aim for at least 100-200 SNPs for parentage assignment, and more for sibship clustering (say 400-500). Generally, the more complicated the pedigree, in terms of inbreeding, overlapping generations, unknown birth years, etc., the more SNPs you will need. But genome size and chromosome number put an upper limit on the number of semi-independent SNPs. Consequently, sequoia is unlikely to work well on inbred populations of species with a small genome.
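One way to gauge whether your SNP number and data quality are likely to be sufficient is to simulate genotype data for a pedigree resembling your study population, and check how well it can be reconstructed. A rough sketch is given below; MyPed and MyLH are hypothetical stand-ins for a plausible pedigree and life-history table, the error rate is arbitrary (see the next section for specifying a RADseq-like error pattern), and EstConf(), shown in the Code Examples further below, automates this over replicates.
# Sketch: simulate 300 SNPs for a plausible pedigree, then check how much of
# that pedigree sequoia() recovers
Geno_sim <- SimGeno(MyPed, nSnp = 300, CallRate = 0.7, SnpError = 0.05)
Seq_sim <- sequoia(Geno_sim, MyLH, Err = 0.05)
PedCompare(Ped1 = MyPed, Ped2 = Seq_sim$Pedigree)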
Genotyping error pattern
SNP array vs RADseq
By default, sequoia assumes that the chance that an allele is observed incorrectly is independent of whether the true genotype is homozygous or heterozygous. This is a reasonable assumption for SNP array data, where most SNPs that are strongly affected by allelic dropout are removed during quality control. Remaining errors largely arise from the clustering software used to translate the signal strength for each of the two alleles into three genotype classes. For each genotype, the probability to be correctly observed is $1-E$, with for the heterozygote an equal chance to be observed as either homozygote (each $E/2$), and for the homozygotes a much smaller chance to be observed as the other homozygote ($(E/2)^2$) than as heterozygote ($E - (E/2)^2$).
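As a quick numerical check of these expressions (plain arithmetic, using $E = 0.18$, the value used in the ErrToM() example further below):
E <- 0.18
c(correct = 1 - E,          # genotype observed correctly: 0.82
  het_hom = E - (E/2)^2,    # true homozygote observed as heterozygote: 0.1719
  hom_hom = (E/2)^2,        # true homozygote observed as the other homozygote: 0.0081
  hom_het = E/2)            # true heterozygote observed as either homozygote: 0.09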
In contrast, for RADseq the chance that a true heterozygote is observed as a homozygote is considerably higher than the reverse, due to much higher allelic dropout rates than for SNP arrays. In Bresadola et al. (2020), the per-allele genotyping error rate at homozygous sites is reported to be typically <1%, while the rate at heterozygous sites is around 5-10%, and can be much higher.
Implementation
All relevant functions in the package allow full flexibility with respect to the genotyping error pattern by letting you specify a vector with three genotyping error rates:
- hom|hom: the probability that a true homozygote is observed as the other homozygote;
- het|hom: the probability that a true homozygote is observed as heterozygote;
- hom|het: the probability that a true heterozygote is observed as the minor homozygote, which is equal to the probability it is observed as the major homozygote.
So, the only assumption made is that the two homozygotes are interchangeable with respect to genotyping error chances.
There are two functions to help with creating these vectors, ErrToM() (introduced in version 2.0 in 2020) and Err_RADseq() (introduced in version 3.0 in 2025). Both can return either a length-3 vector or a 3x3 matrix, both of which can be used as input to other functions. See the ?ErrToM helpfile for further details.
Code Examples
With a per-allele genotyping error rate at homozygous sites of $E_0 = 0.005$ and at heterozygous sites of $E_1 = 0.10$, the error matrix would be
Err_RADseq(E0=0.005, E1=0.10, Return='matrix')
## obs
## act 0 1 2
## 0 0.990025 0.00995 0.000025
## 1 0.090000 0.82000 0.090000
## 2 0.000025 0.00995 0.990025
where 0, 1 and 2 are the number of copies of the reference (minor) allele.
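These entries are consistent with each allele being miscalled independently, at rate $E_0$ when the true genotype is homozygous and $E_1$ when it is heterozygous. The small check below reconstructs the first two rows from those per-allele rates; this is an interpretation of the printed matrix, not an excerpt from the Err_RADseq() code.
E0 <- 0.005
E1 <- 0.10
rbind('act 0' = c((1-E0)^2, 2*E0*(1-E0), E0^2),              # 0.990025 0.00995 0.000025
      'act 1' = c(E1*(1-E1), (1-E1)^2 + E1^2, E1*(1-E1)))    # 0.09     0.82    0.09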
For comparison, the default SNP-array based pattern with the same hom|het rate (0.09, i.e. $E/2$ with $E = 0.18$) would be
ErrToM(0.09*2, Return='matrix')
## obs
## act 0 1 2
## 0 0.8200 0.1719 0.0081
## 1 0.0900 0.8200 0.0900
## 2 0.0081 0.1719 0.8200
and thus has a MUCH higher het|hom rate. Basically, with the RADseq pattern any observed heterozygote is probably correct (het|hom $\approx$ 0.01), but with the default pattern het|hom is always about twice as large as hom|het, and any observed heterozygote is quite likely to be wrong (het|hom $\approx$ 0.17).
You can try running sequoia() (or any other function) with several different error vectors, to explore the effect this has on assignments. Hopefully, the majority of assignments are not affected by the assumed genotyping error rate or any other input parameter, forming a solid core of your pedigree.
# plausible lower and upper bounds for the per-allele error rates
Err_RAD_low <- Err_RADseq(E0=0.002, E1=0.05)
Err_RAD_high <- Err_RADseq(E0=0.01, E1=0.15)
# SNP-array pattern with the same hom|het rate as Err_RAD_high
Err_chip <- ErrToM(0.15*(1-0.15)*2, Return='vector')

SeqOUT_lowErr <- sequoia(GenoM, LHdata, Err=Err_RAD_low)
SeqOUT_highErr <- sequoia(GenoM, LHdata, Err=Err_RAD_high)
SeqOUT_chipErr <- sequoia(GenoM, LHdata, Err=Err_chip)
You can also use EstConf() to explore the potential consequences of the actual (i.e., simulated) genotyping error rate being much higher or lower than assumed during pedigree reconstruction, or of assuming the wrong pattern:
# A: simulate with the high RADseq error rates, analyse assuming the low rates
EC_A <- EstConf(Ped_griffin, LH_griffin,
                args.sim = list(SnpError=Err_RAD_high, nSnp=200, CallRate=0.8),
                args.seq = list(Err=Err_RAD_low),
                nSim = 10, nCores = 5)
# B: simulate with the high RADseq error rates, analyse assuming the SNP-array pattern
EC_B <- EstConf(Ped_griffin, LH_griffin,
                args.sim = list(SnpError=Err_RAD_high, nSnp=200, CallRate=0.8),
                args.seq = list(Err=Err_chip),
                nSim = 10, nCores = 5)
# C: analyse with the same error rates and pattern as simulated
EC_C <- EstConf(Ped_griffin, LH_griffin,
                args.sim = list(SnpError=Err_RAD_high, nSnp=200, CallRate=0.8),
                args.seq = list(Err=Err_RAD_high),
                nSim = 10, nCores = 5)
Calculating the number of correct and incorrect assignments, averaged across replicates, and summed over dams and sires:
count_ARER <- function(EC) {
  # AR: assignments matching the true pedigree; ER: mismatching parents plus
  # parents assigned in the inferred pedigree only ('P2only')
  list(AR = mean(apply(EC$PedComp.fwd[, 'TT', 'Match', ], 1, sum)),
       ER = mean(apply(EC$PedComp.fwd[, 'TT', c('Mismatch','P2only'), ], 1, sum)))
}
count_ARER(EC_A) # RAD pattern, presumed rate lower than simulated
# $AR
# [1] 71.3
# $ER
# [1] 27.2
count_ARER(EC_B) # SNP chip pattern, similar/higher rate presumed than sim
# $AR
# [1] 250.8
# $ER
# [1] 47.8
count_ARER(EC_C) # presumed rate and pattern identical to simulated
# $AR
# [1] 266.3
# $ER
# [1] 8.2
Of course, in reality you likely will not know the actual genotyping error rates very accurately (if at all), but trying out different values in sequoia(), EstConf(), GetMaybeRel(), CalcOHLLR() and/or CalcPairLL() will give you a feeling for what might and might not be possible with your dataset.
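For example, one quick (hypothetical) check is whether the set of likely parent-offspring pairs flagged by GetMaybeRel() stays the same under the two RADseq error scenarios defined above:
Maybe_low <- GetMaybeRel(GenoM = GenoM, LifeHistData = LHdata, Err = Err_RAD_low, Module = 'par')
Maybe_high <- GetMaybeRel(GenoM = GenoM, LifeHistData = LHdata, Err = Err_RAD_high, Module = 'par')
# compare the number and identity of flagged pairs between the two runs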