Unbiased estimation of linkage disequilibrium from unphased data
AbstractLinkage disequilibrium is used to infer evolutionary history and to identify regions under selection or associated with a given trait. In each case, we require accurate estimates of linkage disequilibrium from sequencing data. Unphased data presents a challenge because the co-occurrence of alleles at different loci is ambiguous. Commonly used estimators for the common statistics r2 and D2 exhibit large and variable upward biases that complicate interpretation and comparison across cohorts. Here, we show how to find unbiased estimators for a wide range of two-locus statistics, including D2, for both single and multiple randomly mating populations. These provide accurate estimates over three orders of magnitude in LD. We also use these estimators to construct an estimator for r2 that is less biased than commonly used estimators, but nevertheless argue for using rather than r2 for population size estimates.