Confidence Probabilities
EstConf.Rd
Estimate confidence probabilities ('backward') and assignment
error rates ('forward') per category (genotyped/dummy) by repeatedly
simulating genotype data from a reference pedigree using
SimGeno
, reconstruction a pedigree from this using
sequoia
, and counting the number of mismatches using
PedCompare
.
Arguments
- Pedigree
reference pedigree from which to simulate, dataframe with columns id-dam-sire. Additional columns are ignored.
- LifeHistData
dataframe with id, sex (1=female, 2=male, 3=unknown), birth year, and optionally BY.min - BY.max - YearLast.
- args.sim
list of arguments to pass to
SimGeno
, such asnSnp
(number of SNPs),SnpError
(genotyping error rate) andParMis
(proportion of non-genotyped parents). Set toNULL
to use all default values.- args.seq
list of arguments to pass to
sequoia
, such asModule
('par' or 'ped'),Err
(assumed genotyping error rate), andComplex
. May include (part of)SeqList
, a list of sequoia output (i.e. as a list-within-a-list). Set toNULL
to use all default values.- nSim
number of iterations of simulate - reconstruct - compare to perform, i.e. number of simulated datasets.
- nCores
number of computer cores to use. If
>1
, package parallel is used. Set to NULL to use all but one of the available cores, as detected byparallel::detectCores()
(using all cores tends to freeze up your computer). With large datasets, the amount of computer memory may be the limiting factor for the number of cores you can use.- quiet
suppress messages.
TRUE
runsSimGeno
andsequoia
quietly,'very'
also suppresses other messages and the iteration counter whennCores=1
(there is no iteration counter whennCores>1
).
Value
A list, with elements:
- ConfProb
See below
- PedErrors
See below
- Pedigree.reference
the pedigree from which data was simulated
- LifeHistData
- Pedigree.inferred
a list with for each iteration the inferred pedigree based on the simulated data
- SimSNPd
a list with for each iteration the IDs of the individuals simulated to have been genotyped
- PedComp.fwd
array with
Counts
from the 'forward'PedCompare
, from whichPedErrors
is calculated- RunParams
a list with the call to
EstConf
as a semi-nested list (args.sim, args.seq, nSim, nCores), as well as the default parameter values forSimGeno
andsequoia
.- RunTime
sequoia
runtime per simulation in seconds, as measured bysystem.time()['elapsed']
.
Dataframe ConfProb
has 7 columns:
- id.cat, dam.cat, sire.cat
Category of the focal individual, dam, and sire, in the pedigree inferred based on the simulated data. Coded as G=genotyped, D=dummy, X=none
- dam.conf
Probability that the dam is correct, given the categories of the assigned dam and sire (ignoring whether or not the sire is correct)
- sire.conf
as
dam.conf
, for the sire- pair.conf
Probability that both dam and sire are correct, given their categories
- N
Number of individuals per category-combination, across all
nSim
iterations
Array PedErrors
has three dimensions:
- class
FalseNeg
(atives): could have been assigned but was not (individual + parent both genotyped or dummyfiable; P1only inPedCompare
).FalsePos
(itives): no parent in reference pedigree, but one was assigned based on the simulated data (P2only)Mismatch
: different parents between the pedigrees
- cat
Category of individual + parent, as a two-letter code where the first letter indicates the focal individual and the second the parent; G=Genotyped, D=Dummy, T=Total
- parent
dam or sire
Details
The confidence probability is taken as the number of correct (matching) assignments, divided by all assignments made in the observed (inferred-from-simulated) pedigree. In contrast, the false negative & false positive assignment rates are proportions of the number of parents in the true (reference) pedigree. Each rate is calculated separatedly for dams & sires, and separately for each category (Genotyped/Dummy(fiable)/X (none)) of individual, parent and co-parent.
This function does not know which individuals in the actual Pedigree
are genotyped, so the confidence probabilities need to be added to the
Pedigree
as shown in the example at the bottom.
A confidence of \(1\) means all assignments on simulated data were correct for
that category-combination. It should be interpreted as (and perhaps modified
to) \(> 1 - 1/N\), where sample size N
is given in the last column
of the ConfProb
and PedErrors
dataframes in the output. The
same applies for a false negative/positive rate of \(0\) (i.e. to be
interpreted as \(< 1/N\)).
Assumptions
Because the actual true pedigree is (typically) unknown, the provided
reference pedigree is used as a stand-in and assumed to be the true
pedigree, with unrelated founders. It is also assumed that the probability
to be genotyped is equal for all parents; in each iteration, a new random
set of parents (proportion set by ParMis
) is mimicked to be
non-genotyped. In addition, SNPs are assumed to segregate independently.
An experimental version offering more fine-grained control is available at https://github.com/JiscaH/sequoiaExtra .
Object size
The size in Kb of the returned list can become pretty big, as each of the
inferred pedigrees is included. When running EstConf
many times for
a range of parameter values, it may be prudent to save the required summary
statistics for each run rather than the full output.
Errors
If you have a large pedigree and try to run this function on multiple cores, you may run into "Cannot allocate vector of size ..." errors or even unexpected crashes: there is not enough computer memory for each separate run. Try reducing `nCores`.
Examples
# estimate proportion of parents that are genotyped (= 1 - ParMis)
sumry_grif <- SummarySeq(SeqOUT_griffin, Plot=FALSE)
tmp <- apply(sumry_grif$ParentCount['Genotyped',,,],
MARGIN = c('parentSex', 'parentCat'), FUN = sum)
props <- sweep(tmp, MARGIN='parentCat', STATS = rowSums(tmp), FUN = '/')
1 - props[,'Genotyped'] / (props[,'Genotyped'] + props[,'Dummy'])
#> Dam Sire
#> 0.3867925 0.2020202
# Example for parentage assignment only
conf_grif <- EstConf(Pedigree = SeqOUT_griffin$Pedigree,
LifeHistData = SeqOUT_griffin$LifeHist,
args.sim = list(nSnp = 150, # no. in actual data, or what-if
SnpError = 5e-3, # best estimate, or what-if
CallRate=0.9, # from SnpStats()
ParMis=c(0.39, 0.20)), # calc'd above
args.seq = list(Err=5e-3, Module="par"), # as in real run
nSim = 1, # try-out, proper run >=20 (10 if huge pedigree)
nCores=1)
#> ℹ Simulating parentage assignment only ...
#> i= 1 20:38:51
# parent-pair confidence, per category (Genotyped/Dummy/None)
conf_grif$ConfProb
#> id.cat dam.cat sire.cat dam.conf sire.conf pair.conf N
#> 14 G G G 1 1 1 43
#> 15 G G X 1 NA NA 17
#> 17 G X G NA 1 NA 21
#> 18 G X X NA NA NA 54
# Proportion of true parents that was correctly assigned
1 - apply(conf_grif$PedErrors, MARGIN=c('cat','parent'), FUN=sum, na.rm=TRUE)
#> parent
#> cat dam sire
#> GG 0.9523810 0.9014085
#> GD 0.0000000 0.0000000
#> GT 0.6122449 0.6956522
#> DG 0.0000000 0.0000000
#> DD 0.0000000 0.0000000
#> DT 0.0000000 0.0000000
#> TT 0.4918033 0.5517241
# add columns with confidence probabilities to pedigree
# first add columns with category (G/D/X)
Ped.withConf <- getAssignCat(Pedigree = SeqOUT_griffin$Pedigree,
SNPd = SeqOUT_griffin$PedigreePar$id)
Ped.withConf <- merge(Ped.withConf, conf_grif$ConfProb, all.x=TRUE,
sort=FALSE) # (note: merge() messes up column order)
head(Ped.withConf[Ped.withConf$dam.cat=="G", ])
#> id.cat dam.cat sire.cat id dam sire LLRdam LLRsire
#> 30 G G G i110_2006_M i061_2004_F i073_2004_M 5.67 6.34
#> 31 G G G i025_2002_M i014_2001_F i018_2001_M 8.78 8.22
#> 32 G G G i136_2007_M i095_2005_F i089_2005_M 5.89 4.77
#> 33 G G G i145_2008_F i132_2007_F i127_2007_M 9.23 8.02
#> 34 G G G i079_2004_F i041_2003_F i046_2003_M 12.66 7.90
#> 35 G G G i080_2004_M i041_2003_F i039_2002_M 11.47 5.72
#> LLRpair OHdam OHsire MEpair dam.conf sire.conf pair.conf N
#> 30 15.51 0 0 0 1 1 1 43
#> 31 16.13 0 0 0 1 1 1 43
#> 32 14.83 0 0 0 1 1 1 43
#> 33 16.29 0 0 0 1 1 1 43
#> 34 17.56 0 0 0 1 1 1 43
#> 35 17.07 0 0 0 1 1 1 43
# save output summary
if (FALSE) {
conf_griff[['Note']] <- 'You could add a note'
saveRDS(conf_grif[c('ConfProb','PedComp.fwd','RunParams','RunTime','Note')],
file = 'conf_200SNPs_Err005_Callrate80.RDS')
}
## P(actual FS | inferred as FS) etc.
if (FALSE) {
PairL <- list()
for (i in 1:length(conf_grif$Pedigree.inferred)) { # nSim
cat(i, "\t")
PairL[[i]] <- ComparePairs(conf_grif$Pedigree.reference,
conf_grif$Pedigree.inferred[[i]],
GenBack=1, patmat=TRUE, ExcludeDummies = TRUE,
Return="Counts")
}
# P(actual relationship (Ped1) | inferred relationship (Ped2))
PairRel.prop.A <- plyr::laply(PairL, function(M)
sweep(M, MARGIN='Ped2', STATS=colSums(M), FUN="/"))
PairRel.prop <- apply(PairRel.prop.A, 2:3, mean, na.rm=TRUE) #avg across sims
round(PairRel.prop, 3)
# or: P(inferred relationship | actual relationship)
PairRel.prop2 <- plyr::laply(PairL, function(M)
sweep(M, MARGIN='Ped1', STATS=rowSums(M), FUN="/"))
}
if (FALSE) {
# confidence probability vs. sibship size
source('https://raw.githubusercontent.com/JiscaH/sequoiaExtra/main/conf_vs_sibsize.R')
conf_grif_nOff <- Conf_by_nOff(conf_grif)
conf_grif_nOff['conf',,'GD',]
conf_grif_nOff['N',,'GD',]
}