Check that the provided genotype matrix is in the correct format, and check for low call rate samples and SNPs.

Usage

CheckGeno(
  GenoM,
  quiet = FALSE,
  Plot = FALSE,
  Return = "GenoM",
  Strict = TRUE,
  DumPrefix = c("F0", "M0")
)

Arguments

GenoM

the genotype matrix.

quiet

suppress messages.

Plot

display the plots of SnpStats.

Return

either 'GenoM' to return the cleaned-up genotype matrix, or 'excl' to return a list with excluded SNPs and individuals (see Value).

Strict

If TRUE, exclude any individuals genotyped for <5% of SNPs, and any SNPs genotyped for <5% of individuals; this was the default up to version 2.4.1. Otherwise, the only exclusions are (very nearly) monomorphic SNPs, SNPs scored for fewer than 2 individuals, and individuals scored for fewer than 2 SNPs.

DumPrefix

length 2 vector of dummy ID prefixes; used to check that these do not occur among the IDs of genotyped individuals.
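
The DumPrefix check can be mimicked by hand before calling CheckGeno. The following is a minimal base R sketch, not the package's own code: the sample IDs are made up, and it assumes that a clash means a genotyped ID starting with one of the two prefixes.

```r
# Hypothetical check (illustration only, not part of sequoia):
# do any genotyped IDs start with one of the dummy prefixes?
ids <- c("A001", "F0012", "B17", "M0003")   # made-up sample IDs
DumPrefix <- c("F0", "M0")                  # default value from Usage

# one logical column per prefix, one row per ID
clash <- vapply(DumPrefix, function(p) startsWith(ids, p),
                logical(length(ids)))
clashing_ids <- ids[rowSums(clash) > 0]
clashing_ids
```

If any IDs are flagged, either rename those samples or pass a different DumPrefix.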

Value

If Return='excl', a list with the following elements, each included only if any are found:

ExcludedSNPs

SNPs scored for <10% of individuals; automatically excluded when running sequoia

ExcludedSnps-mono

monomorphic (fixed) SNPs; automatically excluded when running sequoia. This includes nearly-fixed SNPs with MAF \(= 1/(2N)\). Column numbers are *after* removal of ExcludedSNPs, if any.

ExcludedIndiv

Individuals scored for <5% of SNPs; these cannot be reliably included during pedigree reconstruction. Individual call rate is calculated after removal of 'Excluded SNPs'.

Snps-LowCallRate

SNPs scored for 10% to 50% of individuals; recommended to be filtered out

Indiv-LowCallRate

individuals scored for <50% of SNPs; recommended to be filtered out

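When Return='excl' the list is returned invisibly: the check is always run and any warnings or errors are always displayed, but the result is not printed unless it is assigned, and it may contain nothing if no SNPs or individuals are flagged.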
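
The SNP categories above can be approximated by hand in base R. The following is a minimal sketch, not the package's implementation: `call_rate_summary` is a hypothetical helper, the matrix `G` is made up, genotypes are assumed to be coded 0/1/2 with -9 for missing (the missing code used in the Examples), and the 10% and 50% cut-offs are passed as arguments rather than taken from the package. Nearly-fixed SNPs are not flagged here, only strictly monomorphic ones.

```r
# Hypothetical helper, for illustration only (not part of sequoia).
# G: individuals in rows, SNPs in columns; assumed coding 0/1/2, -9 = missing.
call_rate_summary <- function(G, snp_min = 0.10, snp_low = 0.50) {
  scored   <- G != -9
  snp_cr   <- colMeans(scored)        # proportion of individuals scored, per SNP
  n_scored <- colSums(scored)
  allele_n <- colSums(G * scored)     # missing entries contribute 0
  freq     <- ifelse(n_scored > 0, allele_n / (2 * n_scored), NA)
  maf      <- pmin(freq, 1 - freq)    # minor allele frequency
  list(
    SnpCallRate = snp_cr,
    MAF         = maf,
    excluded    = which(snp_cr < snp_min),
    low         = which(snp_cr >= snp_min & snp_cr < snp_low),
    mono        = which(!is.na(maf) & maf == 0)
  )
}

# made-up 4 x 4 genotype matrix
G <- matrix(c(0,  1, 2, -9,
              1,  1, 2, -9,
              2, -9, 2, -9,
              0,  0, 2,  1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("id", 1:4), paste0("snp", 1:4)))
res <- call_rate_summary(G)
res$low    # SNPs in the low call rate band
res$mono   # strictly monomorphic SNPs
```

Returning column indices rather than a filtered matrix mirrors the 'excl' style of output: nothing is dropped until the user decides to.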

Thresholds

Appropriate call rate thresholds for SNPs and individuals depend on the total number of SNPs, distribution of call rates, genotyping errors, and the proportion of candidate parents that are SNPd (sibship clustering is more prone to false positives). Note that filtering first on SNP call rate tends to keep more individuals in.
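
The effect of filter order can be explored on toy data. This is a minimal base R sketch under stated assumptions: the data are simulated so that half the SNPs have a poor call rate, which is deliberately extreme; `filter_snps` and `filter_indiv` are hypothetical helpers; and the 0.6 and 0.8 thresholds are the arbitrary ones used in the Examples, not recommendations.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n_ind <- 200
n_snp <- 100
# assumed toy setup: half the SNPs are scored with probability 0.3, half with 0.95
snp_p  <- rep(c(0.3, 0.95), each = n_snp / 2)
scored <- sapply(snp_p, function(p) runif(n_ind) < p)  # n_ind x n_snp, TRUE = scored
G <- ifelse(scored, 1, -9)                             # -9 = missing

# hypothetical helpers; which() drops NA/NaN comparisons on empty matrices
filter_snps <- function(G, min_cr)
  G[, which(colMeans(G != -9) >= min_cr), drop = FALSE]
filter_indiv <- function(G, min_cr)
  G[which(rowMeans(G != -9) >= min_cr), , drop = FALSE]

snp_first   <- filter_indiv(filter_snps(G, 0.6), 0.8)
indiv_first <- filter_snps(filter_indiv(G, 0.8), 0.6)

dim(snp_first)     # individuals and SNPs kept when filtering SNPs first
dim(indiv_first)   # individuals and SNPs kept when filtering individuals first
```

In a setup like this, removing the poorly scored SNPs first raises every individual's call rate before the individual filter is applied, so one would expect more individuals to pass. With real data the difference depends on how the missingness is distributed.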

See also

SnpStats to calculate SNP call rates; CalcOHLLR to count the number of SNPs scored in both focal individual and parent.

Examples

GenoM <- SimGeno(Ped_HSg5, nSnp=400, CallRate = runif(400, 0.2, 0.8))
# the quick way:
GenoM.checked <- CheckGeno(GenoM, Return="GenoM")
#> ! There are 178 SNPs scored for <50% of individuals
#>  There are  1000  individuals and  400  SNPs.

# the user supervised way:
Excl <- CheckGeno(GenoM, Return = "excl")
#> ! There are 178 SNPs scored for <50% of individuals
#>  There are  1000  individuals and  400  SNPs.
GenoM.orig <- GenoM   # make a 'backup' copy
if ("ExcludedSnps" %in% names(Excl))
  GenoM <- GenoM[, -Excl[["ExcludedSnps"]]]
if ("ExcludedSnps-mono" %in% names(Excl))
  GenoM <- GenoM[, -Excl[["ExcludedSnps-mono"]]]
if ("ExcludedIndiv" %in% names(Excl))
  GenoM <- GenoM[!rownames(GenoM) %in% Excl[["ExcludedIndiv"]], ]

# warning about SNPs scored for <50% of individuals?
# note: this is not necessarily a problem, and sometimes unavoidable.
SnpCallRate <- apply(GenoM, MARGIN=2,
                     FUN = function(x) sum(x!=-9)) / nrow(GenoM)
hist(SnpCallRate, breaks=50, col="grey")

GenoM <- GenoM[, SnpCallRate > 0.6]

# to filter out low call rate individuals: (also not necessarily a problem)
IndivCallRate <- apply(GenoM, MARGIN=1,
                       FUN = function(x) sum(x!=-9)) / ncol(GenoM)
hist(IndivCallRate, breaks=50, col="grey")

GoodSamples <- rownames(GenoM)[ IndivCallRate > 0.8]