Publications by authors named "Gerton Lunter"

61 Publications

A unified haplotype-based method for accurate and comprehensive variant calling.

Nat Biotechnol 2021 Mar 29. Epub 2021 Mar 29.

MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.

Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase-called genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation.

Source
DOI: http://dx.doi.org/10.1038/s41587-021-00861-3
March 2021

Short and long-read genome sequencing methodologies for somatic variant detection; genomic analysis of a patient with diffuse large B-cell lymphoma.

Sci Rep 2021 Mar 19;11(1):6408. Epub 2021 Mar 19.

Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK.

Recent advances in throughput and accuracy mean that the Oxford Nanopore Technologies PromethION platform is now a viable solution for genome sequencing. Much of the validation of bioinformatic tools for this long-read data has focussed on calling germline variants (including structural variants). Somatic variants are outnumbered many-fold by germline variants and their detection is further complicated by the effects of tumour purity/subclonality. Here, we evaluate the extent to which Nanopore sequencing enables detection and analysis of somatic variation. We do this through sequencing tumour and germline genomes for a patient with diffuse B-cell lymphoma and comparing results with 150 bp short-read sequencing of the same samples. Calling germline single nucleotide variants (SNVs) from specific chromosomes of the long-read data achieved good specificity and sensitivity. However, results of somatic SNV calling highlight the need for the development of specialised joint calling algorithms. We find the comparative genome-wide performance of different tools varies significantly between structural variant types, and suggest long reads are especially advantageous for calling large somatic deletions and duplications. Finally, we highlight the utility of long reads for phasing clinically relevant variants, confirming that a somatic 1.6 Mb deletion and a p.(Arg249Met) mutation involving TP53 are oriented in trans.

Source
DOI: http://dx.doi.org/10.1038/s41598-021-85354-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979876
March 2021

Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes.

PLoS One 2021 2;16(3):e0247647. Epub 2021 Mar 2.

MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford, United Kingdom.

Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
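
As a rough illustration of the particle-filter idea mentioned in this abstract, the following Python sketch implements a plain bootstrap particle filter for a one-dimensional, discrete-time random walk observed with Gaussian noise. It is not the authors' method for continuous-time Markov jump processes, nor the auxiliary particle filter they generalise. The toy model, the function name and all parameters are assumptions chosen purely for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_sd=0.5, obs_sd=1.0, seed=0):
    """Return the filtered posterior mean after each observation.

    Toy model (an assumption, not taken from the paper):
        x_t = x_{t-1} + N(0, process_sd^2)
        y_t = x_t     + N(0, obs_sd^2)
    """
    rng = random.Random(seed)
    # Diffuse initial particle cloud.
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # Propagate each particle through the transition model.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # Weight by the (unnormalised) Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed; fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means
```

Calling `bootstrap_particle_filter([5.0] * 20)` returns one filtered mean per observation; with constant observations the estimates should settle close to the observed value.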

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0247647
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924771
March 2021

DeepC: predicting 3D genome folding using megabase-scale transfer learning.

Nat Methods 2020 11 12;17(11):1118-1124. Epub 2020 Oct 12.

MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.

Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.

Source
DOI: http://dx.doi.org/10.1038/s41592-020-0960-3
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610627
November 2020

Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization.

Comput Stat 2020 3;35(3):1319-1344. Epub 2019 Dec 3.

MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS UK.

Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model given observed data by alternating between taking expectations of sufficient statistics, and maximizing the expected log likelihood. For situations where sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which uses Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, Batch EM (BEM) and online EM (OEM), are parameterized by a "learning rate", and their efficiency depends strongly on this parameter. We propose an extension to the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need for specifying this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM methods with optimal learning rates when the model has many parameters. Finally, we use IOEM to fit two models to a financial time series. A Python implementation is available at https://github.com/luntergroup/IOEM.git.
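
To make the role of the learning rate concrete, here is a minimal Python sketch of online EM with a fixed learning rate. It is not IOEM and is not taken from the linked repository. The model (an equal-weight mixture of two unit-variance Gaussians with tractable sufficient statistics), the function name and the default parameters are assumptions for illustration only.

```python
import math


def online_em_two_means(stream, mu_init=(-1.0, 1.0), learning_rate=0.1):
    """Online EM for an equal-weight mixture of two unit-variance Gaussians.

    Only the two component means are estimated. The fixed `learning_rate`
    is the parameter that adaptive schemes would tune automatically.
    """
    mu = list(mu_init)
    # Running sufficient statistics per component:
    # expected membership count and expected sum of observations.
    count = [0.5, 0.5]
    total = [0.5 * mu[0], 0.5 * mu[1]]
    for x in stream:
        # E-step for a single observation: component responsibilities.
        lik = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
        norm = lik[0] + lik[1]
        resp = [value / norm for value in lik]
        for k in range(2):
            # Stochastic-approximation update of the sufficient statistics.
            count[k] = (1 - learning_rate) * count[k] + learning_rate * resp[k]
            total[k] = (1 - learning_rate) * total[k] + learning_rate * resp[k] * x
            # M-step: the mean that maximises the expected log likelihood.
            mu[k] = total[k] / count[k]
    return tuple(mu)
```

A larger learning rate forgets old observations faster but gives noisier estimates; a smaller one is smoother but slower to converge, which is the trade-off the abstract refers to.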

Source
DOI: http://dx.doi.org/10.1007/s00180-019-00937-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7382664
December 2019

Inferring B cell specificity for vaccines using a Bayesian mixture model.

BMC Genomics 2020 Feb 22;21(1):176. Epub 2020 Feb 22.

MRC Weatherall Institute of Molecular Medicine, University of Oxford, John Radcliffe Hospital, Oxford, UK.

Background: Vaccines have greatly reduced the burden of infectious disease, ranking in their impact on global health second only to clean water. Most vaccines confer protection by the production of antibodies with binding affinity for the antigen, which is the main effector function of B cells. This results in short term changes in the B cell receptor (BCR) repertoire when an immune response is launched, and long term changes when immunity is conferred. Analysis of antibodies in serum is usually used to evaluate vaccine response; however, this is limited, and investigation of the BCR repertoire provides far more detail for the analysis of vaccine response.

Results: Here, we introduce a novel Bayesian model to describe the observed distribution of BCR sequences and the pattern of sharing across time and between individuals, with the goal to identify vaccine-specific BCRs. We use data from two studies to assess the model and estimate that we can identify vaccine-specific BCRs with 69% sensitivity.

Conclusion: Our results demonstrate that statistical modelling can capture patterns associated with vaccine response and identify vaccine specific B cells in a range of different data sets. Additionally, the B cells we identify as vaccine specific show greater levels of sequence similarity than expected, suggesting that there are additional signals of vaccine response, not currently considered, which could improve the identification of vaccine specific B cells.

Source
DOI: http://dx.doi.org/10.1186/s12864-020-6571-7
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7036227
February 2020

Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination.

Proc Natl Acad Sci U S A 2019 11 21;116(45):22664-22672. Epub 2019 Oct 21.

Department of Pathology, Yale School of Medicine, New Haven, CT 06520;

In order to produce effective antibodies, B cells undergo rapid somatic hypermutation (SHM) and selection for binding affinity to antigen via a process called affinity maturation. The similarities between this process and evolution by natural selection have led many groups to use phylogenetic methods to characterize the development of immunological memory, vaccination, and other processes that depend on affinity maturation. However, these applications are limited by the fact that most phylogenetic models are designed to be applied to individual lineages comprising genetically diverse sequences, while B cell repertoires often consist of hundreds to thousands of separate low-diversity lineages. Further, several features of affinity maturation violate important assumptions in standard phylogenetic models. Here, we introduce a hierarchical phylogenetic framework that integrates information from all lineages in a repertoire to more precisely estimate model parameters while simultaneously incorporating the unique features of SHM. We demonstrate the power of this repertoire-wide approach by characterizing previously undescribed phenomena in affinity maturation. First, we find evidence consistent with age-related changes in SHM hot-spot targeting. Second, we identify a consistent relationship between increased tree length and signs of increased negative selection, apparent in the repertoires of recently vaccinated subjects and those without any known recent infections or vaccinations. This suggests that B cell lineages shift toward negative selection over time as a general feature of affinity maturation. Our study provides a framework for undertaking repertoire-wide phylogenetic testing of SHM hypotheses and provides a means of characterizing dynamics of mutation and selection during affinity maturation.

Source
DOI: http://dx.doi.org/10.1073/pnas.1906020116
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6842591
November 2019

Sequencing of human genomes with nanopore technology.

Nat Commun 2019 04 23;10(1):1869. Epub 2019 Apr 23.

Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK.

Whole-genome sequencing (WGS) is becoming widely used in clinical medicine in diagnostic contexts and to inform treatment choice. Here we evaluate the potential of the Oxford Nanopore Technologies (ONT) MinION long-read sequencer for routine WGS by sequencing the reference sample NA12878 and the genome of an individual with ataxia-pancytopenia syndrome and severe immune dysregulation. We develop and apply a novel reference panel-free analytical method to infer and then exploit phase information, which improves single-nucleotide variant (SNV) calling performance from otherwise modest levels. In the clinical sample, we identify and directly phase two non-synonymous de novo variants in SAMD9L (OMIM #159550), inferring that they lie on the same paternal haplotype. Whilst consensus SNV-calling error rates from ONT data remain substantially higher than those from short-read methods, we demonstrate the substantial benefits of analytical innovation. Ongoing improvements to base-calling and SNV-calling methodology must continue for nanopore sequencing to establish itself as a primary method for clinical WGS.

Source
DOI: http://dx.doi.org/10.1038/s41467-019-09637-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6478738
April 2019

An equivariant Bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs.

Bioinformatics 2019 07;35(13):2177-2184

Motivation: Convolutional neural networks (CNNs) have been tremendously successful in many contexts, particularly where training data are abundant and signal-to-noise ratios are large. However, when predicting noisily observed phenotypes from DNA sequence, each training instance is only weakly informative, and the amount of training data is often fundamentally limited, emphasizing the need for methods that make optimal use of training data and any structure inherent in the process.

Results: Here we show how to combine equivariant networks, a general mathematical framework for handling exact symmetries in CNNs, with Bayesian dropout, a version of Monte Carlo dropout suggested by a reinterpretation of dropout as a variational Bayesian approximation, to develop a model that exhibits exact reverse-complement symmetry and is more resistant to overtraining. We find that this model combines improved prediction consistency with better predictive accuracy compared to standard CNN implementations and state-of-the-art motif finders. We use our network to predict recombination hotspots from sequence, and identify binding motifs for the recombination-initiation protein PRDM9 previously unobserved in these data, which were recently validated by high-resolution assays. The network achieves a predictive accuracy comparable to that attainable by a direct assay of the H3K4me3 histone mark, a proxy for PRDM9 binding.

Availability And Implementation: https://github.com/luntergroup/EquivariantNetworks.

Supplementary Information: Supplementary data are available at Bioinformatics online.
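
The notion of exact reverse-complement symmetry can be illustrated with a small Python sketch. This is not the network from the paper: it is a much simpler motif scorer whose output is unchanged when the input sequence is reverse-complemented, obtained by taking the maximum over both strands. All function names and the match-count scoring are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def scan(seq, motif):
    """Best number of matching positions of `motif` along one strand.

    Assumes len(seq) >= len(motif).
    """
    width = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    )


def symmetric_score(seq, motif):
    """Score that is identical for `seq` and its reverse complement."""
    return max(scan(seq, motif), scan(reverse_complement(seq), motif))
```

Because reverse-complementing twice returns the original sequence, the two arguments of `max` simply swap when the input strand is flipped, so the score is invariant by construction rather than by training.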

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bty964
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6596897
July 2019

Haplotype matching in large cohorts using the Li and Stephens model.

Authors:
Gerton Lunter

Bioinformatics 2019 03;35(5):798-806

University of Oxford, Wellcome Centre for Human Genetics, Oxford, UK.

Motivation: The Li and Stephens model, which approximates the coalescent describing the pattern of variation in a population, underpins a range of key tools and results in genetics. Although highly efficient compared to the coalescent, standard implementations of this model still cannot deal with the very large reference cohorts that are starting to become available, and practical implementations use heuristics to achieve reasonable runtimes.

Results: Here I describe a new, exact algorithm ('fastLS') that implements the Li and Stephens model and achieves runtimes independent of the size of the reference cohort. Key to achieving this runtime is the use of the Burrows-Wheeler transform, allowing the algorithm to efficiently identify partial haplotype matches across a cohort. I show that the proposed data structure is very similar to, and generalizes, Durbin's positional Burrows-Wheeler transform.
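
For orientation, the sketch below shows the textbook forward recursion for a simple Li and Stephens copying model in Python. It is not fastLS and does not use the Burrows-Wheeler transform: its cost grows linearly with the number of reference haplotypes, which is exactly the dependence the abstract says fastLS avoids. The function name, the single per-site switch probability `rho` and the mismatch probability `mu` are illustrative assumptions.

```python
def li_stephens_likelihood(query, panel, rho=0.05, mu=0.01):
    """Likelihood of `query` under a simple haplotype-copying HMM.

    query: string of alleles, e.g. "0101"
    panel: non-empty list of reference haplotypes, each as long as `query`
    rho:   per-site probability of switching to a uniformly chosen haplotype
    mu:    per-site probability that the copied allele is observed incorrectly
    """
    n = len(panel)

    def emit(k, j):
        # Emission probability of the query allele when copying haplotype k.
        return 1.0 - mu if panel[k][j] == query[j] else mu

    # Start by copying each reference haplotype with equal probability.
    forward = [emit(k, 0) / n for k in range(n)]
    for j in range(1, len(query)):
        # Probability mass arriving at any one haplotype via a switch.
        switch = rho * sum(forward) / n
        forward = [
            emit(k, j) * ((1.0 - rho) * forward[k] + switch)
            for k in range(n)
        ]
    return sum(forward)
```

A query that closely matches some haplotype in the panel receives a higher likelihood than one that matches none, for example `li_stephens_likelihood("010", ["010", "111"])` versus `li_stephens_likelihood("010", ["101", "111"])`.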

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bty735
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6394399
March 2019

A high throughput screen for active human transposable elements.

BMC Genomics 2018 02 1;19(1):115. Epub 2018 Feb 1.

Wellcome Trust Centre for Human Genetics, Oxford, UK.

Background: Transposable elements (TEs) are mobile genetic sequences that randomly propagate within their host's genome. This mobility has the potential to affect gene transcription and cause disease. However, TEs are technically challenging to identify, which complicates efforts to assess the impact of TE insertions on disease. Here we present a targeted sequencing protocol and computational pipeline to identify polymorphic and novel TE insertions using next-generation sequencing: TE-NGS. The method simultaneously targets the three subfamilies that are responsible for the majority of recent TE activity (L1HS, AluYa5/8, and AluYb8/9) thereby obviating the need for multiple experiments and reducing the amount of input material required.

Results: Here we describe the laboratory protocol and detection algorithm, and a benchmark experiment for the reference genome NA12878. We demonstrate a substantial enrichment for on-target fragments, and high sensitivity and precision to both reference and NA12878-specific insertions. We report 17 previously unreported loci for this individual which are supported by orthogonal long-read evidence, and we identify 1470 polymorphic and novel TEs in 12 additional samples that were previously undocumented in databases of insertion polymorphisms.

Conclusions: We anticipate that future applications of TE-NGS alongside exome sequencing of patients with sporadic disease will reduce the number of unresolved cases, and improve estimates of the contribution of TEs to human genetic disease.

Source
DOI: http://dx.doi.org/10.1186/s12864-018-4485-4
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5796560
February 2018

Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN.

Wellcome Open Res 2016 Nov 25;1:20. Epub 2016 Nov 25.

Division of Genetics and Epidemiology, Institute of Cancer Research, London, UK.

Targeted next generation sequencing (NGS) panels are increasingly being used in clinical genomics to increase capacity, throughput and affordability of gene testing. Identifying whole exon deletions or duplications (termed exon copy number variants, 'exon CNVs') in exon-targeted NGS panels has proved challenging, particularly for single exon CNVs. We developed a tool for the Detection of Exon Copy Number variants (DECoN), which is optimised for analysis of exon-targeted NGS panels in the clinical setting. We evaluated DECoN performance using 96 samples with independently validated exon CNV data. We performed simulations to evaluate DECoN detection performance of single exon CNVs and to evaluate performance using different coverage levels and sample numbers. Finally, we implemented DECoN in a clinical laboratory that tests BRCA1 and BRCA2 with the TruSight Cancer Panel (TSCP). We used DECoN to analyse 1,919 samples, validating exon CNV detections by multiplex ligation-dependent probe amplification (MLPA). In the evaluation set, DECoN achieved 100% sensitivity and 99% specificity for BRCA exon CNVs, including identification of 8 single exon CNVs. DECoN also identified 14/15 exon CNVs in 8 other genes. Simulations of all possible BRCA single exon CNVs gave a mean sensitivity of 98% for deletions and 95% for duplications. DECoN performance remained excellent with different levels of coverage and sample numbers; sensitivity and specificity were >98% with the typical NGS run parameters. In the clinical pipeline, DECoN automatically analyses pools of 48 samples at a time, taking 24 minutes per pool, on average. DECoN detected 24 BRCA exon CNVs, of which 23 were confirmed by MLPA, giving a false discovery rate of 4%. Specificity was 99.7%. DECoN is a fast, accurate, exon CNV detection tool readily implementable in research and clinical NGS pipelines. It has high sensitivity and specificity and an acceptable false discovery rate. DECoN is freely available at www.icr.ac.uk/decon.
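
The false discovery rate quoted for the clinical pipeline follows directly from the counts given in the abstract (24 detections, 23 confirmed by MLPA). The short Python sketch below reproduces that arithmetic; the helper function is hypothetical and is not part of DECoN.

```python
def false_discovery_rate(confirmed, detected):
    """Fraction of detections that were not confirmed by validation."""
    if detected <= 0:
        raise ValueError("detected must be positive")
    return (detected - confirmed) / detected


# Counts taken from the abstract: 24 BRCA exon CNVs detected, 23 confirmed.
fdr = false_discovery_rate(confirmed=23, detected=24)
print(f"{fdr:.1%}")  # prints 4.2%, reported as 4% in the abstract
```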

Source
DOI: http://dx.doi.org/10.12688/wellcomeopenres.10069.1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409526
November 2016

A Phylogenetic Codon Substitution Model for Antibody Lineages.

Genetics 2017 05 17;206(1):417-427. Epub 2017 Mar 17.

Department of Zoology, University of Oxford, OX1 3PS, United Kingdom

Phylogenetic methods have shown promise in understanding the development of broadly neutralizing antibody lineages (bNAbs). However, the mutational process that generates these lineages, somatic hypermutation, is biased by hotspot motifs, which violates important assumptions in most phylogenetic substitution models. Here, we develop a modified GY94-type substitution model that partially accounts for this context dependency while preserving independence of sites during calculation. This model shows a substantially better fit to three well-characterized bNAb lineages than the standard GY94 model. We also demonstrate how our model can be used to test hypotheses concerning the roles of different hotspot and coldspot motifs in the evolution of B-cell lineages. Further, we explore the consequences of the idea that the number of hotspot motifs, and perhaps the mutation rate in general, is expected to decay over time in individual bNAb lineages.

Source
DOI: http://dx.doi.org/10.1534/genetics.116.196303
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5419485
May 2017

Erratum to: B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation.

Genome Med 2016 Aug 3;8(1):81. Epub 2016 Aug 3.

Oxford Vaccine Group, Department of Paediatrics, University of Oxford and the NIHR Oxford Biomedical Research Center, Oxford, OX3 7LE, UK.

Source
DOI: http://dx.doi.org/10.1186/s13073-016-0337-5
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4973058
August 2016

OpEx - a validated, automated pipeline optimised for clinical exome sequence analysis.

Sci Rep 2016 08 3;6:31029. Epub 2016 Aug 3.

The Institute of Cancer Research, London, Division of Genetics & Epidemiology, Sutton SM2 5NG, UK.

We present an easy-to-use, open-source Optimised Exome analysis tool, OpEx (http://icr.ac.uk/opex), that accurately detects small-scale variation, including indels, to clinical standards. We evaluated OpEx performance with an experimentally validated dataset (the ICR142 NGS validation series), a large 1000 exome dataset (the ICR1000 UK exome series), and a clinical proband-parent trio dataset. The performance of OpEx for high-quality base substitutions and short indels in both small and large datasets is excellent, with overall sensitivity of 95%, specificity of 97% and low false detection rate (FDR) of 3%. Depending on the individual performance requirements, the OpEx output allows one to optimise the inevitable trade-offs between sensitivity and specificity. For example, in the clinical setting one could permit a higher FDR and lower specificity to maximise sensitivity. In contexts where experimental validation is not possible, minimising the FDR and improving specificity may be a preferable trade-off for slightly lower sensitivity. OpEx is simple to install and use; the whole pipeline is run from a single command. OpEx is therefore well suited to the increasing number of research and clinical laboratories undertaking exome sequencing, particularly those without in-house dedicated bioinformatics expertise.

Source
DOI: http://dx.doi.org/10.1038/srep31029
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4971567
August 2016

B-cell repertoire dynamics after sequential hepatitis B vaccination and evidence for cross-reactive B-cell activation.

Genome Med 2016 06 16;8(1):68. Epub 2016 Jun 16.

Oxford Vaccine Group, Department of Paediatrics, University of Oxford and the NIHR Oxford Biomedical Research Center, Oxford, OX3 7LE, UK.

Background: A diverse B-cell repertoire is essential for recognition and response to infectious and vaccine antigens. High-throughput sequencing of B-cell receptor (BCR) genes can now be used to study the B-cell repertoire at great depth and may shed more light on B-cell responses than conventional immunological methods. Here, we use high-throughput BCR sequencing to provide novel insight into B-cell dynamics following a primary course of hepatitis B vaccination.

Methods: Nine vaccine-naïve participants were administered three doses of hepatitis B vaccine (months 0, 1, and 2 or 7). High-throughput Illumina sequencing of the total BCR repertoire was combined with targeted sequencing of sorted vaccine antigen-enriched B cells to analyze the longitudinal response of both the total and vaccine-specific repertoire after each vaccine. ELISpot was used to determine vaccine-specific cell numbers following each vaccine.

Results: Deconvoluting the vaccine-specific from total BCR repertoire showed that vaccine-specific sequence clusters comprised <0.1 % of total sequence clusters, and had certain stereotypic features. The vaccine-specific BCR sequence clusters were expanded after each of the three vaccine doses, despite no vaccine-specific B cells being detected by ELISpot after the first vaccine dose. These vaccine-specific BCR clusters detected after the first vaccine dose had distinct properties compared to those detected after subsequent doses; they were more mutated, present at low frequency even prior to vaccination, and appeared to be derived from more mature B cells.

Conclusions: These results demonstrate the high sensitivity of our vaccine-specific BCR analysis approach and suggest an alternative view of the B-cell response to novel antigens. In the response to the first vaccine dose, many vaccine-specific BCR clusters appeared to largely derive from previously activated cross-reactive B cells that have low affinity for the vaccine antigen, and subsequent doses were required to yield higher affinity B cells.

Source
DOI: http://dx.doi.org/10.1186/s13073-016-0322-z
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4910312
June 2016

Analysis of B Cell Repertoire Dynamics Following Hepatitis B Vaccination in Humans, and Enrichment of Vaccine-specific Antibody Sequences.

EBioMedicine 2015 Dec 24;2(12):2070-9. Epub 2015 Nov 24.

Oxford Vaccine Group, Department of Paediatrics, University of Oxford and the NIHR Oxford Biomedical Research Center, Oxford OX3 7LE, United Kingdom.

Generating a diverse B cell immunoglobulin repertoire is essential for protection against infection. The repertoire in humans can now be comprehensively measured by high-throughput sequencing. Using hepatitis B vaccination as a model, we determined how the total immunoglobulin sequence repertoire changes following antigen exposure in humans, and compared this to sequences from vaccine-specific sorted cells. Clonal sequence expansions were seen 7 days after vaccination, which correlated with vaccine-specific plasma cell numbers. These expansions caused an increase in mutation, and a decrease in diversity and complementarity-determining region 3 sequence length in the repertoire. We also saw an increase in sequence convergence between participants 14 and 21 days after vaccination, coinciding with an increase of vaccine-specific memory cells. These features allowed development of a model for in silico enrichment of vaccine-specific sequences from the total repertoire. Identifying antigen-specific sequences from total repertoire data could aid our understanding of B cell-driven immunity, and be used for disease diagnostics and vaccine evaluation.

Source
DOI: http://dx.doi.org/10.1016/j.ebiom.2015.11.034
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4703725
December 2015

The ICR1000 UK exome series: a resource of gene variation in an outbred population.

F1000Res 2015 22;4:883. Epub 2015 Sep 22.

Division of Genetics & Epidemiology, The Institute of Cancer Research, London, SM2 5NG, UK; Cancer Genetics Unit, Royal Marsden NHS Foundation Trust, London, SM2 5PT, UK.

To enhance knowledge of gene variation in outbred populations, and to provide a dataset with utility in research and clinical genomics, we performed exome sequencing of 1,000 UK individuals from the general population and applied a high-quality analysis pipeline that includes high sensitivity and specificity for indel detection. Each UK individual has, on average, 21,978 gene variants including 160 rare (0.1%) variants not present in any other individual in the series. These data provide a baseline expectation for gene variation in an outbred population. Summary data of all 295,391 variants we detected are included here and the individual exome sequences are available from the European Genome-phenome Archive as the ICR1000 UK exome series. Furthermore, samples and other phenotype and experimental data for these individuals are obtainable through application to the 1958 Birth Cohort committee.

Source
DOI: http://dx.doi.org/10.12688/f1000research.7049.1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4706061
February 2016

The Diversity and Molecular Evolution of B-Cell Receptors during Infection.

Mol Biol Evol 2016 05 22;33(5):1147-57. Epub 2016 Jan 22.

Department of Zoology, University of Oxford, Oxford, United Kingdom

B-cell receptors (BCRs) are membrane-bound immunoglobulins that recognize and bind foreign proteins (antigens). BCRs are formed through random somatic changes of germline DNA, creating a vast repertoire of unique sequences that enable individuals to recognize a diverse range of antigens. After encountering antigen for the first time, BCRs undergo a process of affinity maturation, whereby cycles of rapid somatic mutation and selection lead to improved antigen binding. This constitutes an accelerated evolutionary process that takes place over days or weeks. Next-generation sequencing of the gene regions that determine BCR binding has begun to reveal the diversity and dynamics of BCR repertoires in unprecedented detail. Although this new type of sequence data has the potential to revolutionize our understanding of infection dynamics, quantitative analysis is complicated by the unique biology and high diversity of BCR sequences. Models and concepts from molecular evolution and phylogenetics that have been applied successfully to rapidly evolving pathogen populations are increasingly being adopted to study BCR diversity and divergence within individuals. However, BCR dynamics may violate key assumptions of many standard evolutionary methods, as they do not descend from a single ancestor, and experience biased mutation. Here, we review the application of evolutionary models to BCR repertoires and discuss the issues we believe need to be addressed for this interdisciplinary field to flourish.

Source
DOI: http://dx.doi.org/10.1093/molbev/msw015
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4839220
May 2016

In-Depth Assessment of Within-Individual and Inter-Individual Variation in the B Cell Receptor Repertoire.

Front Immunol 2015 12;6:531. Epub 2015 Oct 12.

Oxford Vaccine Group, Department of Paediatrics, The NIHR Oxford Biomedical Research Center, University of Oxford, Oxford, UK.

High-throughput sequencing of the B cell receptor (BCR) repertoire can provide rapid characterization of the B cell response in a wide variety of applications in health, after vaccination and in infectious, inflammatory and immune-driven disease, and is starting to yield clinical applications. However, the interpretation of repertoire data is compromised by a lack of studies to assess the intra- and inter-individual variation in the BCR repertoire over time in healthy individuals. We applied a standardized isotype-specific BCR repertoire deep sequencing protocol to a single highly sampled participant, and then evaluated the method in 9 further participants to comprehensively describe such variation. We assessed total repertoire metrics of mutation, diversity, VJ gene usage and isotype subclass usage as well as tracking specific BCR sequence clusters. There was good assay reproducibility (both in PCR amplification and biological replicates), but we detected striking fluctuations in the repertoire over time that we hypothesize may be due to subclinical immune activation. Repertoire properties were unique for each individual, which could partly be explained by a decrease in IgG2 with age, and genetic differences at the immunoglobulin locus. There was a small repertoire of public clusters (0.5, 0.3, and 1.4% of total IgA, IgG, and IgM clusters, respectively), which was enriched for expanded clusters containing sequences with suspected specificity toward antigens that should have been historically encountered by all participants through prior immunization or infection. We thus provide baseline BCR repertoire information that can be used to inform future study design, and aid in interpretation of results from these studies. Furthermore, our results indicate that BCR repertoire studies could be used to track changes in the public repertoire in and between populations that might relate to population immunity against infectious diseases, and identify the characteristics of inflammatory and immunological diseases.

Source
DOI: http://dx.doi.org/10.3389/fimmu.2015.00531
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4601265
November 2015

CSN and CAVA: variant annotation tools for rapid, robust next-generation sequencing analysis in the clinical setting.

Genome Med 2015 Jul 28;7:76. Epub 2015 Jul 28.

Division of Genetics & Epidemiology, The Institute of Cancer Research, 15 Cotswold Road, London, SM2 5NG, UK.

Background: Next-generation sequencing (NGS) offers unprecedented opportunities to expand clinical genomics. It also presents challenges with respect to integration with data from other sequencing methods and historical data. Provision of consistent, clinically applicable variant annotation of NGS data has proved difficult, particularly of indels, an important variant class in clinical genomics. Annotation in relation to a reference genome sequence, the DNA strand of coding transcripts and potential alternative variant representations has not been well addressed. Here we present tools that address these challenges to provide rapid, standardized, clinically appropriate annotation of NGS data in line with existing clinical standards.

Methods: We developed a clinical sequencing nomenclature (CSN), a fixed variant annotation consistent with the principles of the Human Genome Variation Society (HGVS) guidelines, optimized for automated variant annotation of NGS data. To deliver high-throughput CSN annotation we created CAVA (Clinical Annotation of VAriants), a fast, lightweight tool designed for easy incorporation into NGS pipelines. CAVA allows transcript specification, appropriately accommodates the strand of a gene transcript and flags variants with alternative annotations to facilitate clinical interpretation and comparison with other datasets. We evaluated CAVA in exome data and a clinical BRCA1/BRCA2 gene testing pipeline.

Results: CAVA generated CSN calls for 10,313,034 variants in the ExAC database in 13.44 hours, and annotated the ICR1000 exome series in 6.5 hours. Evaluation of 731 different indels from a single individual revealed 92% had alternative representations in left aligned and right aligned data. Annotation of left aligned data, as performed by many annotation tools, would thus give clinically discrepant annotation for the 339 (46%) indels in genes transcribed from the forward DNA strand. By contrast, CAVA provides the correct clinical annotation for all indels. CAVA also flagged the 370 indels with alternative representations of a different functional class, which may profoundly influence clinical interpretation. CAVA annotation of 50 BRCA1/BRCA2 gene mutations from a clinical pipeline gave 100% concordance with Sanger data; only 8/25 BRCA2 mutations were correctly clinically annotated by other tools.

Conclusions: CAVA is a freely available tool that provides rapid, robust, high-throughput clinical annotation of NGS data, using a standardized clinical sequencing nomenclature.
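
The point in the Results about left aligned and right aligned indel representations can be illustrated with a small sketch. This is not CAVA's code: the function names `shift_deletion` and `apply_deletion` and the toy reference string are assumptions made purely for illustration. The sketch shows that a deletion inside a repeat can be written at several start positions that all produce the same sequence, which is why a tool has to pick one convention.

```python
def shift_deletion(ref: str, start: int, length: int, direction: str) -> int:
    """Return the leftmost or rightmost equivalent 0-based start of a deletion.

    Deleting ref[start:start+length] gives the same sequence as deleting a
    window shifted by one base whenever the base entering the window equals
    the base leaving it. (Hypothetical helper, for illustration only.)
    """
    if direction == "left":
        # Moving left: ref[start-1] enters, ref[start+length-1] leaves.
        while start > 0 and ref[start - 1] == ref[start + length - 1]:
            start -= 1
    elif direction == "right":
        # Moving right: ref[start+length] enters, ref[start] leaves.
        while start + length < len(ref) and ref[start] == ref[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the sequence obtained by deleting `length` bases at `start`."""
    return ref[:start] + ref[start + length:]


# Toy reference with a CA repeat; delete 2 bases starting at position 4.
ref = "GGCACACATT"
left = shift_deletion(ref, 4, 2, "left")
right = shift_deletion(ref, 4, 2, "right")
print(left, right)
print(apply_deletion(ref, left, 2) == apply_deletion(ref, right, 2))
```

Both placements describe the same resulting sequence, yet they report different coordinates. VCF-style normalisation conventionally uses the leftmost position, whereas HGVS places the variant at the most 3' position relative to the transcript, so for a forward-strand gene the two conventions select opposite ends of the repeat.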

Source
DOI: http://dx.doi.org/10.1186/s13073-015-0195-6
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551696
July 2015

Factors influencing success of clinical genome sequencing across a broad spectrum of disorders.

Nat Genet 2015 Jul 18;47(7):717-726. Epub 2015 May 18.

Institute of Physiology, Zurich Center for Integrative Human Physiology, University of Zurich, Zurich, Switzerland.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Source
DOI: http://dx.doi.org/10.1038/ng.3304
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4601524
July 2015

BCR repertoire sequencing: different patterns of B-cell activation after two Meningococcal vaccines.

Immunol Cell Biol 2015 Nov 15;93(10):885-95. Epub 2015 May 15.

Oxford Vaccine Group, Department of Paediatrics, University of Oxford and the NIHR Oxford Biomedical Research Center, Oxford, UK.

Next-generation sequencing was used to investigate the B-cell receptor heavy chain transcript repertoire of different B-cell subsets (naive, marginal zone (MZ), immunoglobulin M (IgM) memory and IgG memory) at baseline, and of plasma cells (PCs) 7 days following administration of serogroup ACWY meningococcal polysaccharide and protein-polysaccharide conjugate vaccines. Baseline B-cell subsets could be distinguished from each other using a small number of repertoire properties (clonality, mutation from germline and complementarity-determining region 3 (CDR3) length) that were conserved between individuals. However, analyzing the CDR3 amino-acid sequence (which is particularly important for antigen binding) of the baseline subsets showed few sequences shared between individuals. In contrast, day 7 PCs demonstrated nearly 10-fold greater sequence sharing between individuals than the baseline subsets, consistent with the PCs being induced by the vaccine antigen and sharing specificity for a more limited range of epitopes. By annotating PC sequences based on IgG subclass usage and mutation, and also comparing them with the sequences of the baseline cell subsets, we were able to identify different signatures after the polysaccharide and conjugate vaccines. PCs produced after conjugate vaccination were predominantly IgG1, and most related to IgG memory cells. In contrast, after polysaccharide vaccination, the PCs were predominantly IgG2, less mutated and were equally likely to be related to MZ, IgM memory or IgG memory cells. High-throughput B-cell repertoire sequencing thus provides a unique insight into patterns of B-cell activation not possible from more conventional measures of immunogenicity.

Source
DOI: http://dx.doi.org/10.1038/icb.2015.57
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551417
November 2015

High-throughput DNA Sequencing Identifies Novel Variants in Muscle-invasive Bladder Cancer Patients.

Bladder Cancer 2015 Apr 30;1(1):31-44. Epub 2015 Apr 30.

CRUK/MRC Oxford Institute for Radiation Oncology, Department of Oncology, University of Oxford, Oxford, UK.

Background: Germline mutations in DNA damage signalling and repair genes predispose individuals to cancer. Rare germline variants may also increase cancer risk and be predictive of outcomes following cancer treatments, but require high-throughput sequencing (HTS) for detection in large cohorts.

Objective: To use a dual indexing system on an HTS platform to detect novel variants in CtIP (RBBP8) which may be associated with clinical outcomes following radiotherapy treatment for bladder cancer.

Methods: All exons and flanking introns of CtIP were amplified from germline DNA from bladder cancer patients using seven primer pairs by automated long-range PCR. Amplicons were pooled, fragmented and ligated to adaptor sequences. One of 96 tag sequences was introduced at each end by PCR. Sequencing was performed on a single flow cell of an Illumina MiSeq. Reads were mapped by Stampy and variants called by Platypus. For phasing experiments, target regions were amplified and cloned for Sanger sequencing.

Results: Of 201 samples, 160 were successfully amplified. Eleven CtIP variants were called, within the exons and 15 bp adjacent intronic DNA, including eight known variants from the 1000 Genomes project, plus three previously unreported variants now confirmed by Sanger sequencing. In two individuals, phasing experiments showed two variants of interest to be on separate alleles, likely to result in stronger impairment of gene function.

Conclusions: We have demonstrated proof of principle for dual indexing on 160 samples on one MiSeq flow cell sequencing surface, and show that for the CtIP gene multiplexing of up to 720 samples would provide sufficient coverage to achieve >98% detection power for rare germline variation, reducing HTS costs substantially.

Source
DOI: http://dx.doi.org/10.3233/BLC-150007
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6218178
April 2015

scrm: efficiently simulating long sequences using the approximated coalescent with recombination.

Bioinformatics 2015 May 8;31(10):1680-2. Epub 2015 Jan 8.

Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK.

Motivation: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations.

Results: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btu861
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4426833
May 2015

Identification of antigen-specific B cell receptor sequences using public repertoire analysis.

J Immunol 2015 Jan 12;194(1):252-261. Epub 2014 Nov 12.

Oxford Vaccine Group, Department of Paediatrics, University of Oxford, and the NIHR Oxford Biomedical Research Centre, Oxford OX3 7LE, United Kingdom.

High-throughput sequencing allows detailed study of the BCR repertoire postimmunization, but it remains unclear to what extent the de novo identification of Ag-specific sequences from the total BCR repertoire is possible. A conjugate vaccine containing Haemophilus influenzae type b (Hib) and group C meningococcal polysaccharides, as well as tetanus toxoid (TT), was used to investigate the BCR repertoire of adult humans following immunization and to test the hypothesis that public or convergent repertoire analysis could identify Ag-specific sequences. A number of Ag-specific BCR sequences have been reported for Hib and TT, which made a vaccine containing these two Ags an ideal immunological stimulus. Analysis of identical CDR3 amino acid sequences that were shared by individuals in the postvaccine repertoire identified a number of known Hib-specific sequences but only one previously described TT sequence. The extension of this analysis to nonidentical, but highly similar, CDR3 amino acid sequences revealed a number of other TT-related sequences. The anti-Hib avidity index postvaccination strongly correlated with the relative frequency of Hib-specific sequences, indicating that the postvaccination public BCR repertoire may be related to more conventional measures of immunogenicity correlating with disease protection. Analysis of public BCR repertoire provided evidence of convergent BCR evolution in individuals exposed to the same Ags. If this finding is confirmed, the public repertoire could be used for rapid and direct identification of protective Ag-specific BCR sequences from peripheral blood.

Source
DOI: http://dx.doi.org/10.4049/jimmunol.1401405
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4272858
January 2015

8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage.

PLoS Genet 2014 Jul 24;10(7):e1004525. Epub 2014 Jul 24.

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom.

Ten years on from the finishing of the human reference genome sequence, it remains unclear what fraction of the human genome confers function, where this sequence resides, and how much is shared with other mammalian species. When addressing these questions, functional sequence has often been equated with pan-mammalian conserved sequence. However, functional elements that are short-lived, including those contributing to species-specific biology, will not leave a footprint of long-lasting negative selection. Here, we address these issues by identifying and characterising sequence that has been constrained with respect to insertions and deletions for pairs of eutherian genomes over a range of divergences. Within noncoding sequence, we find increasing amounts of mutually constrained sequence as species pairs become more closely related, indicating that noncoding constrained sequence turns over rapidly. We estimate that half of present-day noncoding constrained sequence has been gained or lost in approximately the last 130 million years (half-life in units of divergence time, d1/2 = 0.25-0.31). While enriched with ENCODE biochemical annotations, much of the short-lived constrained sequences we identify are not detected by models optimized for wider pan-mammalian conservation. Constrained DNase 1 hypersensitivity sites, promoters and untranslated regions have been more evolutionarily stable than long noncoding RNA loci which have turned over especially rapidly. By contrast, protein coding sequence has been highly stable, with an estimated half-life of over a billion years (d1/2 = 2.1-5.0). From extrapolations we estimate that 8.2% (7.1-9.2%) of the human genome is presently subject to negative selection and thus is likely to be functional, while only 2.2% has maintained constraint in both human and mouse since these species diverged. These results reveal that the evolutionary history of the human genome has been highly dynamic, particularly for its noncoding yet biologically functional fraction.

Source
DOI: http://dx.doi.org/10.1371/journal.pgen.1004525
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4109858
July 2014

Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications.

Nat Genet 2014 Aug 13;46(8):912-918. Epub 2014 Jul 13.

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Source
DOI: http://dx.doi.org/10.1038/ng.3036
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4753679
August 2014

Improved workflows for high throughput library preparation using the transposome-based Nextera system.

BMC Biotechnol 2013 Nov 20;13:104. Epub 2013 Nov 20.

Wellcome Trust Centre for Human Genetics, OX3 7BN Oxford, UK.

Background: The Nextera protocol, which utilises a transposome-based approach to create libraries for Illumina sequencing, requires pure DNA template, an accurate assessment of input concentration and a column clean-up that limits its applicability for high-throughput sample preparation. We addressed the identified limitations to develop a robust workflow that supports both rapid and high-throughput projects while also reducing reagent costs.

Results: We show that an initial bead-based normalisation step can remove the need for quantification and improves sample purity. A 75% cost reduction was achieved with a low-volume modified protocol which was tested over genomes with different GC content to demonstrate its robustness. Finally, we developed a custom set of index tags and primers which increase the number of samples that can simultaneously be sequenced on a single lane of an Illumina instrument.

Conclusions: We addressed the bottlenecks of Nextera library construction to produce a modified protocol which harnesses the full power of the Nextera kit and allows the reproducible construction of libraries on a high-throughput scale, reducing the associated cost of the kit.

Source
DOI: http://dx.doi.org/10.1186/1472-6750-13-104
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4222894
November 2013

GAT: a simulation framework for testing the association of genomic intervals.

Bioinformatics 2013 Aug 18;29(16):2046-8. Epub 2013 Jun 18.

MRC CGAT Programme and Functional Genomics Unit, MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK.

Motivation: A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, when interpreting ChIP-Seq or RNA-Seq data in functional terms. Because genome organization is complex, answering this question is non-trivial.

Summary: We present Genomic Association Test (GAT), a tool for estimating the significance of overlap between multiple sets of genomic intervals. GAT implements a null model that the two sets of intervals are placed independently of one another, but allows each set's density to depend on external variables, for example, isochore structure or chromosome identity. GAT estimates statistical significance based on simulation and controls for multiple tests using the false discovery rate.

Availability: GAT's source code, documentation and tutorials are available at http://code.google.com/p/genomic-association-tester.
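
The simulation idea described in the Summary can be sketched in a few lines. This is a deliberately simplified illustration and not GAT's implementation: it assumes a single contiguous workspace, places each interval uniformly at random, ignores the conditioning on isochores or chromosomes, and does not apply any false discovery rate correction. The names `overlap_bp` and `simulate_p_value` and the toy coordinates are hypothetical.

```python
import random


def overlap_bp(a, b):
    """Total bases shared by two lists of half-open (start, end) intervals.

    Assumes the intervals within each list do not overlap one another;
    otherwise shared bases would be counted more than once.
    """
    total = 0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total


def simulate_p_value(segments, annotations, workspace_len, n_sim=1000, seed=0):
    """Empirical p-value for the observed overlap under random placement.

    Null model (a simplification): each segment keeps its length and is
    placed uniformly within [0, workspace_len), independently of the
    annotations. Randomised segments may overlap each other here.
    """
    rng = random.Random(seed)
    observed = overlap_bp(segments, annotations)
    hits = 0
    for _ in range(n_sim):
        shuffled = []
        for s, e in segments:
            length = e - s
            new_start = rng.randint(0, workspace_len - length)
            shuffled.append((new_start, new_start + length))
        if overlap_bp(shuffled, annotations) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)


# Toy data: two segments sitting inside two annotated regions.
annotations = [(100, 200), (5000, 5100)]
segments = [(120, 180), (5010, 5090)]
observed, p = simulate_p_value(segments, annotations, workspace_len=100_000)
print(observed, p)
```

The add-one correction in the returned estimate is a common convention for simulation-based p-values; whether GAT uses the same estimator is not stated in the abstract.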

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/btt343
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3722528
August 2013