Publications by authors named "Victor Jongeneel"

41 Publications

Organizing and running bioinformatics hackathons within Africa: The H3ABioNet cloud computing experience.

AAS Open Res 2018 7;1. Epub 2019 Aug 7.

Computational Biology Division, Integrative Medical Biosciences, University of Cape Town, Cape Town, South Africa.

The need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Heredity and Health in Africa Consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous computing environments, and for the nurturing of technical expertise in workflow languages and containerization technologies. Building on the network's Standard Operating Procedures (SOPs) for common genomic analyses, H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon in 2016, with the purpose of translating those SOPs into analysis pipelines able to run on heterogeneous computing environments and meeting the needs of H3Africa research projects. This paper describes the preparations for this hackathon and reflects upon the lessons learned about its impact on building the technical and scientific expertise of African researchers. The workflows developed were made publicly available in GitHub repositories and deposited as container images on Quay.io.

Source
DOI: http://dx.doi.org/10.12688/aasopenres.12847.2
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7194140
August 2019

Knowledge-guided analysis of "omics" data using the KnowEnG cloud platform.

PLoS Biol 2020 01 23;18(1):e3000583. Epub 2020 Jan 23.

Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.

We present Knowledge Engine for Genomics (KnowEnG), a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis, and expression signature analysis. The system specializes in "knowledge-guided" data mining and machine learning algorithms, in which user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive "Knowledge Network." KnowEnG adheres to "FAIR" principles (findable, accessible, interoperable, and reusable): its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution, and are interoperable with other computing platforms. The analysis tools are made available through multiple access modes, including a web portal with specialized visualization modules. We demonstrate the KnowEnG system's potential value in democratization of advanced tools for the modern genomics era through several case studies that use its tools to recreate and expand upon the published analysis of cancer data sets.

Source
DOI: http://dx.doi.org/10.1371/journal.pbio.3000583
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6977717
January 2020

Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics.

BMC Bioinformatics 2018 Nov 29;19(1):457. Epub 2018 Nov 29.

Computational Biology Division, Department of Integrative Medical Biosciences, IDM, University of Cape Town, Cape Town, South Africa.

Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Heredity and Health in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging.

Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation. A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language framework (CWL) and two using Nextflow. All the workflows are containerized for improved portability and reproducibility using Docker, and are publicly available for use by members of the H3Africa consortium and the international research community.

Conclusion: The H3ABioNet workflows have been implemented to offer ease of use for the end user and high levels of reproducibility and portability, while following modern state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity and assist genomics research within Africa and serve to increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.

Source
DOI: http://dx.doi.org/10.1186/s12859-018-2446-1
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6264621
November 2018

Assessing computational genomics skills: Our experience in the H3ABioNet African bioinformatics network.

PLoS Comput Biol 2017 Jun 1;13(6):e1005419. Epub 2017 Jun 1.

Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa.

The H3ABioNet pan-African bioinformatics network, which is funded to support the Human Heredity and Health in Africa (H3Africa) program, has developed node-assessment exercises to gauge the ability of its participating research and service groups to analyze typical genome-wide datasets being generated by H3Africa research groups. We describe a framework for the assessment of computational genomics analysis skills, which includes standard operating procedures, training and test datasets, and a process for administering the exercise. We present the experiences of 3 research groups that have taken the exercise and the impact on their ability to manage complex projects. Finally, we discuss the reasons why many H3ABioNet nodes have declined so far to participate and potential strategies to encourage them to do so.

Source
DOI: http://dx.doi.org/10.1371/journal.pcbi.1005419
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5453403
June 2017

H3ABioNet, a sustainable pan-African bioinformatics network for human heredity and health in Africa.

Genome Res 2016 Feb 1;26(2):271-7. Epub 2015 Dec 1.

University of the Free State, Bloemfontein, South Africa 9300.

The application of genomics technologies to medicine and biomedical research is increasing in popularity, made possible by new high-throughput genotyping and sequencing technologies and improved data analysis capabilities. Some of the greatest genetic diversity among humans, animals, plants, and microbiota occurs in Africa, yet genomic research outputs from the continent are limited. The Human Heredity and Health in Africa (H3Africa) initiative was established to drive the development of genomic research for human health in Africa, and through recognition of the critical role of bioinformatics in this process, spurred the establishment of H3ABioNet, a pan-African bioinformatics network for H3Africa. The limitations in bioinformatics capacity on the continent have been a major contributory factor to the lack of notable outputs in high-throughput biology research. Although pockets of high-quality bioinformatics teams have existed previously, the majority of research institutions lack experienced faculty who can train and supervise bioinformatics students. H3ABioNet aims to address this dire need, specifically in the area of human genetics and genomics, but knock-on effects are ensuring this extends to other areas of bioinformatics. Here, we describe the emergence of genomics research and the development of bioinformatics in Africa through H3ABioNet.

Source
DOI: http://dx.doi.org/10.1101/gr.196295.115
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728379
February 2016

KnowEnG: a knowledge engine for genomics.

J Am Med Inform Assoc 2015 Nov 23;22(6):1115-9. Epub 2015 Jul 23.

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA; Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

We describe here the vision, motivations, and research plans of the National Institutes of Health Center for Excellence in Big Data Computing at the University of Illinois, Urbana-Champaign. The Center is organized around the construction of "Knowledge Engine for Genomics" (KnowEnG), an E-science framework for genomics where biomedical scientists will have access to powerful methods of data mining, network mining, and machine learning to extract knowledge out of genomics data. The scientist will come to KnowEnG with their own data sets in the form of spreadsheets and ask KnowEnG to analyze those data sets in the light of a massive knowledge base of community data sets called the "Knowledge Network" that will be at the heart of the system. The Center is undertaking discovery projects aimed at testing the utility of KnowEnG for transforming big data to knowledge. These projects span a broad range of biological enquiry, from pharmacogenomics (in collaboration with Mayo Clinic) to transcriptomics of human behavior.

Source
DOI: http://dx.doi.org/10.1093/jamia/ocv090
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009907
November 2015

Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort.

BMC Genomics 2012 Jun 15;13:241. Epub 2012 Jun 15.

Department of Medical Genetics, University of Lausanne, Lausanne, Switzerland.

Background: Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets.

Results: Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs.

Conclusion: Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits.
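As a rough illustration of the Gaussian-mixture idea mentioned in the Results (this is not the authors' method, whose details the abstract does not give), the sketch below fits a one-dimensional mixture by expectation-maximization to per-probe log2 intensity ratios and assigns each probe to a loss, normal, or gain component. The function names, the three-state model, the starting means, the variance floor, and the toy values are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, init_means, init_sd=0.2, n_iter=50, min_sd=0.05):
    """Fit a 1-D Gaussian mixture by EM, one component per starting mean.

    min_sd is a floor that keeps a tight cluster from collapsing to zero
    variance (a hypothetical safeguard, not taken from the paper).
    """
    k = len(init_means)
    means = list(init_means)
    sds = [init_sd] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and sd of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # component is empty; leave it unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    return weights, means, sds


def assign_state(x, weights, means, sds):
    """Index of the component with the highest weighted density at x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(dens)), key=dens.__getitem__)


# Toy log2 ratios (invented): four loss-like, six normal, three gain-like.
ratios = [-0.62, -0.58, -0.60, -0.55,
          0.02, -0.03, 0.00, 0.05, -0.01, 0.01,
          0.38, 0.42, 0.40]
weights, means, sds = fit_gmm_1d(ratios, init_means=[-0.5, 0.0, 0.4])
states = [assign_state(x, weights, means, sds) for x in ratios]
```

Here component 0 stands for copy loss, 1 for the normal two-copy state, and 2 for gain. A real caller would work on array intensities per individual and per region rather than on a single pooled list.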

Source
DOI: http://dx.doi.org/10.1186/1471-2164-13-241
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3464625
June 2012

Exome sequencing identifies recurrent somatic MAP2K1 and MAP2K2 mutations in melanoma.

Nat Genet 2011 Dec 25;44(2):133-9. Epub 2011 Dec 25.

Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland.

We performed exome sequencing to detect somatic mutations in protein-coding regions in seven melanoma cell lines and donor-matched germline cells. All melanoma samples had high numbers of somatic mutations, which showed the hallmark of UV-induced DNA repair. Such a hallmark was absent in tumor sample-specific mutations in two metastases derived from the same individual. Two melanomas with non-canonical BRAF mutations harbored gain-of-function MAP2K1 and MAP2K2 (MEK1 and MEK2, respectively) mutations, resulting in constitutive ERK phosphorylation and higher resistance to MEK inhibitors. Screening a larger cohort of individuals with melanoma revealed the presence of recurring somatic MAP2K1 and MAP2K2 mutations, which occurred at an overall frequency of 8%. Furthermore, missense and nonsense somatic mutations were frequently found in three candidate melanoma genes, FAT4, LRP1B and DSC1.

Source
DOI: http://dx.doi.org/10.1038/ng.1026
December 2011

Network-guided analysis of genes with altered somatic copy number and gene expression reveals pathways commonly perturbed in metastatic melanoma.

PLoS One 2011 Apr 8;6(4):e18369. Epub 2011 Apr 8.

Ludwig Institute for Cancer Research, Lausanne, Switzerland.

Cancer genomes frequently contain somatic copy number alterations (SCNA) that can significantly perturb the expression level of affected genes and thus disrupt pathways controlling normal growth. In melanoma, many studies have focussed on the copy number and gene expression levels of the BRAF, PTEN and MITF genes, but little has been done to identify new genes using these parameters at the genome-wide scale. Using karyotyping, SNP and CGH arrays, and RNA-seq, we have identified SCNA affecting gene expression ('SCNA-genes') in seven human metastatic melanoma cell lines. We showed that the combination of these techniques is useful to identify candidate genes potentially involved in tumorigenesis. Since few of these alterations were recurrent across our samples, we used a protein network-guided approach to determine whether any pathways were enriched in SCNA-genes in one or more samples. From this unbiased genome-wide analysis, we identified 28 significantly enriched pathway modules. Comparison with two large, independent melanoma SCNA datasets showed less than 10% overlap at the individual gene level, but network-guided analysis revealed 66% shared pathways, including all but three of the pathways identified in our data. Frequently altered pathways included WNT, cadherin signalling, angiogenesis and melanogenesis. Additionally, our results emphasize the potential of the EPHA3 and FRS2 gene products, involved in angiogenesis and migration, as possible therapeutic targets in melanoma. Our study demonstrates the utility of network-guided approaches, for both large and small datasets, to identify pathways recurrently perturbed in cancer.

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018369
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3072964
April 2011

EuroDia: a beta-cell gene expression resource.

Database (Oxford) 2010 Oct 12;2010:baq024. Epub 2010 Oct 12.

Vital-IT, SIB Swiss Institute of Bioinformatics, Genopode Building, CH-1015 Lausanne, Switzerland.

Type 2 diabetes mellitus (T2DM) is a major disease affecting nearly 280 million people worldwide. Whilst the pathophysiological mechanisms leading to disease are poorly understood, dysfunction of the insulin-producing pancreatic beta-cells is a key event for disease development. Monitoring the gene expression profiles of pancreatic beta-cells under several genetic or chemical perturbations has shed light on genes and pathways involved in T2DM. The EuroDia database has been established to build a unique collection of gene expression measurements performed on beta-cells of three organisms, namely human, mouse and rat. The Gene Expression Data Analysis Interface (GEDAI) has been developed to support this database. The quality of each dataset is assessed by a series of quality control procedures to detect putative hybridization outliers. The system integrates a web interface to several standard analysis functions from R/Bioconductor to identify differentially expressed genes and pathways. It also allows the combination of multiple experiments performed on different array platforms of the same technology. The design of this system enables each user to rapidly design a custom analysis pipeline and thus produce their own list of genes and pathways. Raw and normalized data can be downloaded for each experiment. The flexible engine of this database (GEDAI) is currently used to handle gene expression data from several laboratory-run projects dealing with different organisms and platforms. Database URL: http://eurodia.vital-it.ch.

Source
DOI: http://dx.doi.org/10.1093/database/baq024
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2963318
October 2010

Genome-wide analysis of cancer/testis gene expression.

Proc Natl Acad Sci U S A 2008 Dec 16;105(51):20422-7. Epub 2008 Dec 16.

Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, SPH2, 4th Floor, Boston, MA 02115, USA.

Cancer/Testis (CT) genes, normally expressed in germ line cells but also activated in a wide range of cancer types, often encode antigens that are immunogenic in cancer patients, and present potential for use as biomarkers and targets for immunotherapy. Using multiple in silico gene expression analysis technologies, including twice the number of expressed sequence tags used in previous studies, we have performed a comprehensive genome-wide survey of expression for a set of 153 previously described CT genes in normal and cancer expression libraries. We find that although they are generally highly expressed in testis, these genes exhibit heterogeneous gene expression profiles, allowing their classification into testis-restricted (39), testis/brain-restricted (14), and a testis-selective (85) group of genes that show additional expression in somatic tissues. The chromosomal distribution of these genes confirmed the previously observed dominance of X chromosome location, with CT-X genes being significantly more testis-restricted than non-X CT. Applying this core classification in a genome-wide survey we identified >30 CT candidate genes; 3 of them, PEPP-2, OTOA, and AKAP4, were confirmed as testis-restricted or testis-selective using RT-PCR, with variable expression frequencies observed in a panel of cancer cell lines. Our classification provides an objective ranking for potential CT genes, which is useful in guiding further identification and characterization of these potentially important diagnostic and therapeutic targets.

Source
DOI: http://dx.doi.org/10.1073/pnas.0810777105
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603434
December 2008

Identification of tumor-associated antigens by large-scale analysis of genes expressed in human colorectal cancer.

Cancer Immun 2008 Jun 27;8:11. Epub 2008 Jun 27.

National Center of Competence in Research, Molecular Oncology, ISREC, Ch. des Boveresses 155, 1066 Epalinges, Switzerland.

Despite the high prevalence of colon cancer in the world and the great interest in targeted anti-cancer therapy, only a few tumor-specific gene products have been identified that could serve as targets for the immunological treatment of colorectal cancers. The aim of our study was therefore to identify frequently expressed colon cancer-specific antigens. We performed a large-scale analysis of genes expressed in normal colon and colon cancer tissues isolated from colorectal cancer patients using massively parallel signature sequencing (MPSS). Candidates were additionally subjected to experimental evaluation by semi-quantitative RT-PCR on a cohort of colorectal cancer patients. From a pool of more than 6000 genes identified unambiguously in the analysis, we found 2124 genes that were selectively expressed in colon cancer tissue and 147 genes that were differentially expressed to a significant degree between normal and cancer cells. Differential expression of many genes was confirmed by RT-PCR on a cohort of patients. Although deregulated genes were involved in many different cellular pathways, we found that genes expressed in the extracellular space were significantly over-represented in colorectal cancer. Strikingly, we identified a transcript from a chromosome X-linked member of the human endogenous retrovirus (HERV) H family that was frequently and selectively expressed in colon cancer but not in normal tissues. Our data suggest that this sequence should be considered as a target of immunological interventions against colorectal cancer.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935784
June 2008

Genotypic features of lentivirus transgenic mice.

J Virol 2008 Jul 7;82(14):7111-9. Epub 2008 May 7.

Ecole Polytechnique Fédérale de Lausanne, School of Life Sciences, Lausanne CH-1015, Switzerland.

Lentivector-mediated transgenesis is increasingly used, whether for basic studies as an alternative to pronuclear injection of naked DNA or to test candidate gene therapy vectors. In an effort to characterize the genetic features of this approach, we first measured the frequency of germ line transmission of individual proviruses established by infection of fertilized mouse oocytes. Seventy integrants from 11 founder (G0) mice were passed to 111 first generation (G1) pups, for a total of 255 events corresponding to an average rate of transmission of 44%. This implies that integration had most often occurred at the one- or two-cell stage and that the degree of genotypic mosaicism in G0 mice obtained through this approach is generally minimal. Transmission analysis of eight individual proviruses in 13 G2 mice obtained by a G0-G1 cross revealed only 8% of proviral homozygosity, significantly below the 25% expected from purely Mendelian transmission, suggesting counter-selection due to interference with the functions of targeted loci. Mapping of 239 proviral integration sites in 49 founder animals revealed that about 60% resided within annotated genes, with a marked tendency for clustering in the middle of the transcribed region, and that integration was not influenced by the transcriptional orientation. Transcript levels of a set of arbitrarily chosen target genes were significantly higher in two-cell embryos than in embryonic stem cells or adult somatic cells, suggesting that, as previously noted in other settings, lentiviral vectors integrate preferentially into regions of the genome that are transcriptionally active or poised for activation.

Source
DOI: http://dx.doi.org/10.1128/JVI.00623-08
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2446968
July 2008

Vertebrate conserved non coding DNA regions have a high persistence length and a short persistence time.

BMC Genomics 2007 Oct 31;8:398. Epub 2007 Oct 31.

Computational Cancer Genomics Group, Swiss Institute of Bioinformatics, Lausanne, Switzerland.

Background: The comparison of complete genomes has revealed surprisingly large numbers of conserved non-protein-coding (CNC) DNA regions. However, the biological function of CNC remains elusive. CNC differ in two aspects from conserved protein-coding regions. They are not conserved across phylum boundaries, and they do not contain readily detectable sub-domains. Here we characterize the persistence length and time of CNC and conserved protein-coding regions in the vertebrate and insect lineages.

Results: The persistence length is the length of a genome region over which a certain level of sequence identity is consistently maintained. The persistence time is the evolutionary period during which a conserved region evolves under the same selective constraints. Our main findings are: (i) Insect genomes contain 1.60 times less conserved information than vertebrates; (ii) Vertebrate CNC have a higher persistence length than conserved coding regions or insect CNC; (iii) CNC have shorter persistence times as compared to conserved coding regions in both lineages.

Conclusion: Higher persistence length of vertebrate CNC indicates that the conserved information in vertebrates and insects is organized in functional elements of different lengths. These findings might be related to the higher morphological complexity of vertebrates and give clues about the structure of active CNC elements. Shorter persistence time might explain the previously puzzling observations of highly conserved CNC within each phylum, and of a lack of conservation between phyla. It suggests that CNC divergence might be a key factor in vertebrate evolution. Further evolutionary studies will help to relate individual CNC to specific developmental processes.

Source
DOI: http://dx.doi.org/10.1186/1471-2164-8-398
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211324
October 2007

Indexing strategies for rapid searches of short words in genome sequences.

PLoS One 2007 Jun 27;2(6):e579. Epub 2007 Jun 27.

Ludwig Institute for Cancer Research, Bâtiment Génopode, Université de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Bâtiment Génopode, Université de Lausanne, Lausanne, Switzerland.

Searching for matches between large collections of short (14-30 nucleotides) words and sequence databases comprising full genomes or transcriptomes is a common task in biological sequence analysis. We investigated the performance of simple indexing strategies for handling such tasks and developed two programs, fetchGWI and tagger, that index either the database or the query set. Either strategy outperforms megablast for searches with more than 10,000 probes. FetchGWI is shown to be a versatile tool for rapidly searching multiple genomes, whose performance is limited in most cases by the speed of access to the filesystem. We have made publicly available a Web interface for searching the human, mouse, and several other genomes and transcriptomes with oligonucleotide queries.
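The two strategies named above (index the database, or index the query set) can be sketched with ordinary hash tables. This is a toy illustration only: it is not how fetchGWI or tagger are implemented, and the function names, the fixed word length, and the example sequence are assumptions.

```python
from collections import defaultdict


def build_word_index(sequence, word_len):
    """Map every word of length word_len in sequence to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_len + 1):
        index[sequence[start:start + word_len]].append(start)
    return index


def search_indexed_database(index, queries):
    """Strategy 1: the database is indexed once; each query is a lookup."""
    return {q: index.get(q, []) for q in queries}


def search_indexed_queries(sequence, queries, word_len):
    """Strategy 2: the query set is indexed; the database is scanned once."""
    wanted = set(queries)
    hits = {q: [] for q in queries}
    for start in range(len(sequence) - word_len + 1):
        word = sequence[start:start + word_len]
        if word in wanted:
            hits[word].append(start)
    return hits


# Invented example: a 12-nucleotide "genome" and three 4-nucleotide probes.
genome = "ACGTACGTTACG"
probes = ["ACGT", "TTAC", "GGGG"]
by_database_index = search_indexed_database(build_word_index(genome, 4), probes)
by_query_index = search_indexed_queries(genome, probes, 4)
```

Both functions return the same mapping from probe to 0-based match positions. Indexing the database pays its cost once and suits repeated searches against the same genome, while indexing the queries avoids holding a genome-sized table in memory.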

Source
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000579
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1894650
June 2007

MyHits: improvements to an interactive resource for analyzing protein sequences.

Nucleic Acids Res 2007 Jul 1;35(Web Server issue):W433-7. Epub 2007 Jun 1.

Swiss Institute of Bioinformatics (SIB), Vital-IT Group, UNIL-Génopode, CH-1015 Lausanne, Switzerland.

The MyHits web site (http://myhits.isb-sib.ch) is an integrated service dedicated to the analysis of protein sequences. Since its first description in 2004, both the user interface and the back end of the server were improved. A number of tools (e.g. MAFFT, Jacop, Dotlet, Jalview, ESTScan) were added or updated to improve the usability of the service. The MySQL schema and its associated API were revamped and the database engine (HitKeeper) was separated from the web interface. This paper summarizes the current status of the server, with an emphasis on the new services.

Source
DOI: http://dx.doi.org/10.1093/nar/gkm352
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1933190
July 2007

Rapid evolution of cancer/testis genes on the X chromosome.

BMC Genomics 2007 May 23;8:129. Epub 2007 May 23.

Ludwig Institute for Cancer Research and Swiss Institute of Bioinformatics, Lausanne, Switzerland.

Background: Cancer/testis (CT) genes are normally expressed only in germ cells, but can be activated in the cancer state. This unusual property, together with the finding that many CT proteins elicit an antigenic response in cancer patients, has established a role for this class of genes as targets in immunotherapy regimes. Many families of CT genes have been identified in the human genome, but their biological function for the most part remains unclear. While it has been shown that some CT genes are under diversifying selection, this question has not been addressed before for the class as a whole.

Results: To shed more light on this interesting group of genes, we exploited the generation of a draft chimpanzee (Pan troglodytes) genomic sequence to examine CT genes in an organism that is closely related to human, and generated a high-quality, manually curated set of human:chimpanzee CT gene alignments. We find that the chimpanzee genome contains homologues to most of the human CT families, and that the genes are located on the same chromosome and at a similar copy number to those in human. Comparison of putative human:chimpanzee orthologues indicates that CT genes located on chromosome X are diverging faster and are undergoing stronger diversifying selection than those on the autosomes or than a set of control genes on either chromosome X or autosomes.

Conclusion: Given their high level of diversifying selection, we suggest that CT genes are primarily responsible for the observed rapid evolution of protein-coding genes on the X chromosome.

Source
DOI: http://dx.doi.org/10.1186/1471-2164-8-129
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890293
May 2007

Establishment of the epithelial-specific transcriptome of normal and malignant human breast cells based on MPSS and array expression data.

Breast Cancer Res 2006;8(5):R56.

Ludwig Institute for Cancer Research/University College London Breast Cancer Laboratory, 91 Riding House Street, London, W1W 7BS, UK.

Introduction: Diverse microarray and sequencing technologies have been widely used to characterise the molecular changes in malignant epithelial cells in breast cancers. Such gene expression studies to identify markers and targets in tumour cells are, however, compromised by the cellular heterogeneity of solid breast tumours and by the lack of appropriate counterparts representing normal breast epithelial cells.

Methods: Malignant neoplastic epithelial cells from primary breast cancers and luminal and myoepithelial cells from normal human breast tissue were isolated by immunomagnetic separation methods. Pools of RNA from highly enriched preparations of these cell types were subjected to expression profiling using massively parallel signature sequencing (MPSS) and four different genome-wide microarray platforms. Functionally related transcripts of the differential tumour epithelial transcriptome were used for gene set enrichment analysis to identify enrichment of luminal and myoepithelial type genes. Clinical pathological validation of a small number of genes was performed on tissue microarrays.

Results: MPSS identified 6,553 differentially expressed genes between the pool of normal luminal cells and that of primary tumours substantially enriched for epithelial cells, of which 98% were represented and 60% were confirmed by microarray profiling. Significant expression level changes between these two samples detected only by microarray technology were shown by 4,149 transcripts, resulting in a combined differential tumour epithelial transcriptome of 8,051 genes. Microarray gene signatures identified a comprehensive list of 907 and 955 transcripts whose expression differed between luminal epithelial cells and myoepithelial cells, respectively. Functional annotation and gene set enrichment analysis highlighted a group of genes related to skeletal development that were associated with the myoepithelial/basal cells and upregulated in the tumour sample. One of the most highly overexpressed genes in this category, that encoding periostin, was analysed immunohistochemically on breast cancer tissue microarrays and its expression in neoplastic cells correlated with poor outcome in a cohort of poor prognosis estrogen receptor-positive tumours.

Conclusion: Using highly enriched cell populations in combination with multiplatform gene expression profiling studies, a comprehensive analysis of molecular changes between the normal and malignant breast tissue was established. This study provides a basis for the identification of novel and potentially important targets for diagnosis, prognosis and therapy in breast cancer.

Source
DOI: http://dx.doi.org/10.1186/bcr1604
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1779497
January 2007

Similarities and differences of polyadenylation signals in human and fly.

BMC Genomics 2006 Jul 12;7:176. Epub 2006 Jul 12.

Swiss Institute of Bioinformatics, Batiment Genopode, UNIL, 1015 Lausanne, Switzerland.

Background: Cleavage of messenger RNA (mRNA) precursors is an essential step in mRNA maturation. The signal recognized by the cleavage enzyme complex has been characterized as an A-rich region upstream of the cleavage site containing a motif with consensus AAUAAA, followed by a U- or UG-rich region downstream of the cleavage site.

Results: We studied these signals using exhaustive databases of cleavage sites obtained from aligning raw expressed sequence tag (EST) sequences to genomic sequences in Homo sapiens and Drosophila melanogaster. These data show that the polyadenylation signal is highly conserved in human and fly. In addition, de novo motif searches generated a refined description of the U-rich downstream sequence element (DSE), which shows more divergence between the two species. These refined motifs are applied, within a Hidden Markov Model (HMM) framework, to predict mRNA cleavage sites.

Conclusion: We demonstrate that the DSE is a specific motif in both human and Drosophila. These findings shed light on the sequence correlates of a highly conserved biological process, and improve in silico prediction of 3' mRNA cleavage and polyadenylation sites.
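
As a rough illustration of the upstream signal described in this abstract, the sketch below scans a sequence for the AAUAAA hexamer in a window upstream of a known cleavage site. It is not the authors' HMM or motif-search method; the function name `find_pas_hexamers`, the 10 to 30 nt window, and the toy sequence are all assumptions made for the example.

```python
def find_pas_hexamers(seq, cleavage_pos, window=(10, 30), motif="AAUAAA"):
    """Return 0-based start positions of `motif` upstream of a cleavage site.

    A hit is reported when the motif starts between window[0] and window[1]
    nucleotides upstream of `cleavage_pos` (both bounds inclusive).
    The default window is an illustrative assumption, not a value from the paper.
    """
    # Normalise to upper-case RNA so DNA-style input (T instead of U) also works.
    rna = seq.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    nearest, farthest = window
    first = max(0, cleavage_pos - farthest)
    last = cleavage_pos - nearest
    hits = []
    for start in range(first, last + 1):
        if rna[start:start + len(motif)] == motif:
            hits.append(start)
    return hits


# Hypothetical toy transcript: hexamer starts at position 20,
# and the cleavage site lies 21 nt downstream of that start.
toy = "G" * 20 + "AAUAAA" + "C" * 15 + "GUGUGU"
print(find_pas_hexamers(toy, cleavage_pos=41))
```

A fixed-string scan like this only captures the canonical hexamer; the refined downstream motifs mentioned in the abstract would need a probabilistic model instead.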

Source
DOI: http://dx.doi.org/10.1186/1471-2164-7-176
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1574307
July 2006

Identification of a new cancer/testis gene family, CT47, among expressed multicopy genes on the human X chromosome.

Genes Chromosomes Cancer 2006 Apr;45(4):392-400

Weill Medical College of Cornell University, New York, New York 10021, USA.

Cancer/testis (CT) genes are normally expressed in germ cells only, yet are reactivated and expressed in some tumors. Of the approximately 40 CT genes or gene families identified to date, 20 are on the X chromosome and are present as multigene families, many with highly conserved members. This indicates that novel CT gene families may be identified by detecting duplicated expressed genes on chromosome X. By searching for transcript clusters that map to multiple locations on the chromosome, followed by in silico analysis of their gene expression profiles, we identified five novel gene families with testis-specific expression and >98% sequence identity among family members. The expression of these genes in normal tissues and various tumor cell lines and specimens was evaluated by qualitative and quantitative RT-PCR, and a novel CT gene family with at least 13 copies was identified on Xq24, designated as CT47. mRNA expression of CT47 was found mainly in the testes, with weak expression in the placenta. Brain tissue was the only positive somatic tissue tested, with an estimated CT47 transcript level 0.09% of that found in testis. Among the tumor specimens tested, CT47 expression was found in approximately 15% of lung cancer and esophageal cancer specimens, but not in colorectal cancer or breast cancer. The putative CT47 protein consists of 288 amino acid residues, with a C-terminus rich in alanine and glutamic acid. The only species other than human in which a gene homologous to CT47 has been detected is the chimpanzee, with the predicted protein showing approximately 80% identity in its carboxy terminal region.

Source
DOI: http://dx.doi.org/10.1002/gcc.20298
April 2006

Gene expression variation and expression quantitative trait mapping of human chromosome 21 genes.

Hum Mol Genet 2005 Dec 26;14(23):3741-9. Epub 2005 Oct 26.

Department of Genetic Medicine and Development, Geneva University Medical School, Geneva, Switzerland.

Inter-individual differences in gene expression are likely to account for an important fraction of phenotypic differences, including susceptibility to common disorders. Recent studies have shown extensive variation in gene expression levels in humans and other organisms, and that a fraction of this variation is under genetic control. We investigated the patterns of gene expression variation in a 25 Mb region of human chromosome 21, which has been associated with many Down syndrome (DS) phenotypes. Taqman real-time PCR was used to measure expression variation of 41 genes in lymphoblastoid cells of 40 unrelated individuals. For 25 genes found to be differentially expressed, additional analysis was performed in 10 CEPH families to determine heritabilities and map loci harboring regulatory variation. Seventy-six percent of the differentially expressed genes had significant heritabilities, and genomewide linkage analysis led to the identification of significant eQTLs for nine genes. Most eQTLs were in trans, with the best result (P = 7.46 × 10^-8) obtained for TMEM1 on chromosome 12q24.33. A cis-eQTL identified for CCT8 was validated by performing an association study in 60 individuals from the HapMap project. SNP rs965951 located within CCT8 was found to be significantly associated with its expression levels (P = 2.5 × 10^-5), confirming cis-regulatory variation. The results of our study provide a representative view of expression variation of chromosome 21 genes, identify loci involved in their regulation and suggest that genes, for which expression differences are significantly larger than 1.5-fold in control samples, are unlikely to be involved in DS-phenotypes present in all affected individuals.
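
One common way to picture a SNP-expression association like the one reported for CCT8 is a regression of expression level on allele dosage. The sketch below is a minimal illustration under that assumption: genotypes are coded as 0, 1, or 2 copies of one allele, the expression values are made up, and the function name `dosage_regression` is hypothetical. The abstract does not state which statistical procedure the authors used.

```python
from math import sqrt


def dosage_regression(genotypes, expression):
    """Least-squares fit of expression on allele dosage (0, 1, or 2).

    Returns (slope, intercept, pearson_r). Raises ValueError if the inputs
    differ in length, are too short, or either one has zero variance.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need two equal-length inputs with at least 3 samples")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotypes and expression must both vary")
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Made-up data: six individuals, two per genotype class.
dosages = [0, 0, 1, 1, 2, 2]
levels = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept, r = dosage_regression(dosages, levels)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

A real analysis would also attach a significance test to the slope and account for relatedness and multiple testing, which this sketch leaves out.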

Source
DOI: http://dx.doi.org/10.1093/hmg/ddi404
December 2005

Identification of CT46/HORMAD1, an immunogenic cancer/testis antigen encoding a putative meiosis-related protein.

Cancer Immun 2005 Jul 7;5. Epub 2005 Jul 7.

Weill Medical College, Cornell University, New York, NY 10021, USA.

Transcripts with ESTs derived exclusively or predominantly from testis, and not from other normal tissues, are likely to be products of genes with testis-restricted expression, and are thus potential cancer/testis (CT) antigen genes. A list of 371 genes with such characteristics was compiled by analyzing publicly available EST databases. RT-PCR analysis of normal and tumor tissues was performed to validate an initial selection of 20 of these genes. Several new CT and CT-like genes were identified. One of these, CT46/HORMAD1, is expressed strongly in testis and weakly in placenta; the highest level of expression in other tissues is <1% of testicular expression. The CT46/HORMAD1 gene was expressed in 31% (34/109) of the carcinomas examined, with 11% (12/109) showing expression levels >10% of the testicular level of expression. CT46/HORMAD1 is a single-copy gene on chromosome 1q21.3, encoding a putative protein of 394 aa. Conserved protein domain analysis identified a HORMA domain involved in chromatin binding. The CT46/HORMAD1 protein was found to be homologous to the prototype HORMA domain-containing protein, Hop1, a yeast meiosis-specific protein, as well as to asy1, a meiotic synaptic mutant protein in Arabidopsis thaliana.

July 2005

An atlas of human gene expression from massively parallel signature sequencing (MPSS).

Genome Res 2005 Jul;15(7):1007-14

Office of Information Technology, Ludwig Institute for Cancer Research, and Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.

We have used massively parallel signature sequencing (MPSS) to sample the transcriptomes of 32 normal human tissues to an unprecedented depth, thus documenting the patterns of expression of almost 20,000 genes with high sensitivity and specificity. The data confirm the widely held belief that differences in gene expression between cell and tissue types are largely determined by transcripts derived from a limited number of tissue-specific genes, rather than by combinations of more promiscuously expressed genes. Expression of a little more than half of all known human genes seems to account for both the common requirements and the specific functions of the tissues sampled. A classification of tissues based on patterns of gene expression largely reproduces classifications based on anatomical and biochemical properties. The unbiased sampling of the human transcriptome achieved by MPSS supports the idea that most human genes have been mapped, if not functionally characterized. This data set should prove useful for the identification of tissue-specific genes, for the study of global changes induced by pathological conditions, and for the definition of a minimal set of genes necessary for basic cell maintenance. The data are available on the Web at http://mpss.licr.org and http://sgb.lynxgen.com.

Source
DOI: http://dx.doi.org/10.1101/gr.4041005
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1172045
July 2005

Identification of cancer/testis-antigen genes by massively parallel signature sequencing.

Proc Natl Acad Sci U S A 2005 May 19;102(22):7940-5. Epub 2005 May 19.

Weill Medical College of Cornell University, New York, NY 10021, USA.

Massively parallel signature sequencing (MPSS) generates millions of short sequence tags corresponding to transcripts from a single RNA preparation. Most MPSS tags can be unambiguously assigned to genes, thereby generating a comprehensive expression profile of the tissue of origin. From the comparison of MPSS data from 32 normal human tissues, we identified 1,056 genes that are predominantly expressed in the testis. Further evaluation by using MPSS tags from cancer cell lines and EST data from a wide variety of tumors identified 202 of these genes as candidates for encoding cancer/testis (CT) antigens. Of these genes, the expression in normal tissues was assessed by RT-PCR in a subset of 166 intron-containing genes, and those with confirmed testis-predominant expression were further evaluated for their expression in 21 cancer cell lines. Thus, 20 CT or CT-like genes were identified, with several exhibiting expression in five or more of the cancer cell lines examined. One of these genes is a member of a CT gene family that we designated as CT45. The CT45 family comprises six highly similar (>98% cDNA identity) genes that are clustered in tandem within a 125-kb region on Xq26.3. CT45 was found to be frequently expressed in both cancer cell lines and lung cancer specimens. Thus, MPSS analysis has resulted in a significant extension of our knowledge of CT antigens, leading to the discovery of a distinctive X-linked CT-antigen gene family.

Source
DOI: http://dx.doi.org/10.1073/pnas.0502583102
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1142383
May 2005

Rapid and selective surveillance of Arabidopsis thaliana genome annotations with Centrifuge.

Bioinformatics 2005 Jun 7;21(12):2906-8. Epub 2005 Apr 7.

Gene Expression Laboratory, Plant Molecular Biology, University of Lausanne, Biology Building, 1015 Lausanne, Switzerland.

Centrifuge is a user-friendly system to simultaneously access Arabidopsis gene annotations and intra- and inter-organism sequence comparison data. The tool allows rapid retrieval of user-selected data for each annotated Arabidopsis gene providing, in any combination, data on the following features: predicted protein properties such as mass, pI, cellular location and transmembrane domains; SWISS-PROT annotations; Interpro domains; Gene Ontology records; verified transcription; BLAST matches to the proteomes of A. thaliana, Oryza sativa (rice), Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. The tool lends itself particularly well to the rapid analysis of contigs or of tens or hundreds of genes identified by high-throughput gene expression experiments. In these cases, a summary table of principal predicted protein features for all genes is given followed by more detailed reports for each individual gene. Centrifuge can also be used for single gene analysis or in a word search mode.

Availability: http://centrifuge.unil.ch/

Contact: edward.farmer@unil.ch.

Source
DOI: http://dx.doi.org/10.1093/bioinformatics/bti435
June 2005

Identification of the gonad-specific anion transporter SLCO6A1 as a cancer/testis (CT) antigen expressed in human lung cancer.

Cancer Immun 2004 Nov 17;4:13. Epub 2004 Nov 17.

Ludwig Institute for Cancer Research, New York Branch at Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, USA.

Serological analysis of recombinant cDNA expression libraries (SEREX) has led to the identification of many of the antigens recognized by the immune system of cancer patients, which are collectively referred to as the cancer immunome. We used SEREX to screen a testicular cDNA expression library with sera obtained from non-small cell lung cancer patients and isolated cDNA clones for 82 antigens. These included a total of 31 antigens previously identified by SEREX, and 51 that did not match entries in the Cancer Immunome Database and were considered newly identified antigens. Overall, the antigens comprised 62 known proteins and 20 uncharacterized gene products. Six antigens (NY-TLU-6, -37, -39, -57, -70, -75) were identified as putative cell surface proteins that are potential targets for monoclonal antibody-based immunotherapy. Of these, the gonad-specific anion transport protein SLCO6A1 (NY-TLU-57) was shown to be tissue-restricted. RT-PCR showed it to be expressed strongly only in normal testis, and weakly in spleen, brain, fetal brain, and placenta. In addition, NY-TLU-57 mRNA was found in lung tumor samples (5/10) and lung cancer cell lines (6/11), as well as bladder (5/12) and esophageal (5/12) tumor samples. These data suggest that SLCO6A1 is a putative cancer/testis (CT) cell surface antigen of potential utility as a target for antibody-based therapy for a variety of tumor types. The analysis also permits us to estimate the eventual size of the SEREX-defined cancer immunome at around 4000 genes. This emphasizes the importance of continued SEREX screening to define the cancer immunome.

November 2004

MyHits: a new interactive resource for protein annotation and domain identification.

Nucleic Acids Res 2004 Jul;32(Web Server issue):W332-5

Swiss Institute of Bioinformatics, CH-1066 Epalinges/Lausanne, Switzerland.

The MyHits web server (http://myhits.isb-sib.ch) is a new integrated service dedicated to the annotation of protein sequences and to the analysis of their domains and signatures. Guest users can use the system anonymously, with full access to (i) standard bioinformatics programs (e.g. PSI-BLAST, ClustalW, T-Coffee, Jalview); (ii) a large number of protein sequence databases, including standard (Swiss-Prot, TrEMBL) and locally developed databases (splice variants); (iii) databases of protein motifs (Prosite, Interpro); (iv) a precomputed list of matches ('hits') between the sequence and motif databases. All databases are updated on a weekly basis and the hit list is kept up to date incrementally. The MyHits server also includes a new collection of tools to generate graphical representations of pairwise and multiple sequence alignments including their annotated features. Free registration enables users to upload their own sequences and motifs to private databases. These are then made available through the same web interface and the same set of analytical tools. Registered users can manage their own sequences and annotations using only web tools and freeze their data in their private database for publication purposes.

Source
DOI: http://dx.doi.org/10.1093/nar/gkh479
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC441617
July 2004

Digital expression profiles of human endogenous retroviral families in normal and cancerous tissues.

Cancer Immun 2004 Feb 11;4. Epub 2004 Feb 11.

Ludwig Institute for Cancer Research, Office of Information Technology, Epalinges, Switzerland.

Human endogenous retroviruses (HERVs) are remnants of ancient retroviral infections that became fixed in the germ line DNA millions of years ago. The fact that humoral and cellular immune responses against HERV-encoded proteins have been identified in cancer patients suggests that these antigens might be used in cancer immunotherapy or diagnosis. We analyzed the digital expression patterns of the HERV-K (HML-2), -W, -H and -E families in normal and cancerous tissues. Thirty-one proviral members of the HERV-K family and one representative each for the other HERV families were used as probes to search human EST data. Matching of HERV proviruses to ESTs was HERV family-specific and the expression profiles of the HERV families distinct. The HERV-K family was expressed in normal tissues such as muscle, skin and brain, as well as in germ cell tumors and other cancerous tissues. HERV-H was the only family expressed in cancers of the intestine, bone marrow, bladder and cervix, and was more highly expressed than the other families in cancers of the stomach, colon and prostate. In contrast, HERV-W was predominantly expressed in normal placenta. Expression patterns were confirmed by MPSS (massively parallel signature sequencing) data where available. For the HERV-K family, we mapped most ESTs to their corresponding proviruses and assessed the coding capacities of the matched proviruses. This study shows that HERV families are more widely expressed than originally thought and that some members of the HERV-K and -H families could encode targets for cancer immunotherapy.

February 2004