Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data.

BMC Bioinformatics 2013 Oct 2;14:289. Epub 2013 Oct 2.

Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Oestervoldgade 5-7, DK-1350, Copenhagen, Denmark.

Background: A number of statistics are used to detect natural selection from DNA sequencing data, including summaries of the frequency spectrum such as Tajima's D. These statistics are now often applied to Next Generation Sequencing (NGS) data. However, frequency spectra estimated from NGS data are strongly affected by low sequencing coverage: the inherent, technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions.
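For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Tajima's D calculation from an unfolded site frequency spectrum with known genotypes. It is not the authors' code; the function name `tajimas_d` and the list layout of `sfs` are assumptions made for illustration.

```python
from math import sqrt


def tajimas_d(sfs, n):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele is
    carried by exactly i of the n sampled chromosomes (i = 1 .. n-1).
    The layout of `sfs` is an illustrative assumption.
    """
    if n < 4:
        raise ValueError("need at least 4 chromosomes")
    if len(sfs) != n - 1:
        raise ValueError("sfs must have n - 1 entries")

    seg_sites = sum(sfs)  # S, number of segregating sites
    if seg_sites == 0:
        return float("nan")

    # pi: average number of pairwise differences
    pairs = n * (n - 1) / 2
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs

    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return float("nan")

    theta_w = seg_sites / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(variance)
```

A spectrum proportional to 1/i, the neutral expectation, gives D close to zero, while an excess of singletons gives a negative D. The sketch takes the spectrum as given, which is exactly the assumption that low-coverage data violates.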

Results: We have developed an approach that accommodates the uncertainty of the data when calculating site-frequency-based neutrality test statistics. A salient feature of this approach is that it implicitly handles varying sequencing depth and missing data, and it avoids the need to infer variable sites for the analysis, thereby avoiding the ascertainment problems introduced by a SNP discovery process.

Conclusion: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 Genomes Project. The method is implemented in a fast framework that enables researchers to perform these neutrality tests on a genome-wide scale.
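The abstract does not spell out the computation, so the following is only a rough sketch of how an empirical Bayes step of this kind could look. It assumes that per-site likelihoods of each derived allele count (0 to n) and a genome-wide prior spectrum are already available. All names (`posterior_allele_counts`, `expected_thetas`) are hypothetical and do not correspond to the authors' implementation.

```python
def posterior_allele_counts(site_likelihoods, prior):
    """Per-site posterior over derived allele counts 0 .. n.

    site_likelihoods[s][k] is the (assumed given) likelihood of the reads
    at site s if k of the n chromosomes carry the derived allele;
    prior[k] is a prior probability for count k, for example a
    genome-wide spectrum estimate.
    """
    posteriors = []
    for lik in site_likelihoods:
        weights = [l * p for l, p in zip(lik, prior)]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def expected_thetas(posteriors, n):
    """Posterior expectations of Watterson's theta and pi for a region.

    No site is ever called variable or invariable: each site contributes
    in proportion to its posterior probability of segregating.
    """
    a1 = sum(1 / i for i in range(1, n))
    pairs = n * (n - 1) / 2
    expected_seg = 0.0
    expected_pi = 0.0
    for post in posteriors:
        # Probability that the site segregates (count between 1 and n-1)
        expected_seg += sum(post[1:n])
        # Expected contribution to pairwise diversity
        expected_pi += sum(post[k] * k * (n - k) for k in range(1, n)) / pairs
    return expected_seg / a1, expected_pi
```

The difference between the two returned values is the numerator of Tajima's D. Because sites with little or no data fall back towards the prior, varying depth and missing data are absorbed without a separate SNP-calling step, which is the property the abstract highlights.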


Source
DOI: http://dx.doi.org/10.1186/1471-2105-14-289
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015034


