Publications by authors named "Fawaz Ghali"

12 Publications

The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data.

Genome Biol 2018 Jan 31;19(1):12. Epub 2018 Jan 31.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

On behalf of The Human Proteome Organization (HUPO) Proteomics Standards Initiative, we introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box." We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available, providing functionalities such as file generation, file conversion, and data analysis. All the related documentation, including the detailed file format specifications and example files, is accessible at http://www.psidev.info/probam and at http://www.psidev.info/probed.

Source
DOI: http://dx.doi.org/10.1186/s13059-017-1377-x
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5793360
January 2018
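
Because proBed is described above as an extension of BED, its leading columns can be read with ordinary BED logic. The following is a minimal Python sketch, assuming a tab-separated line whose first 12 columns follow the standard BED12 layout; the proteomics-specific columns are kept as an untyped `extra` list because their names are not given in the abstract. The class name, function name and example line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bed12Record:
    chrom: str
    start: int                 # 0-based, inclusive
    end: int                   # 0-based, exclusive
    name: str
    score: int
    strand: str
    block_sizes: List[int]
    block_starts: List[int]    # offsets relative to `start`
    extra: List[str]           # any columns after the 12 BED columns

    def blocks(self) -> List[Tuple[int, int]]:
        """Return absolute (start, end) coordinates of each block."""
        return [
            (self.start + offset, self.start + offset + size)
            for offset, size in zip(self.block_starts, self.block_sizes)
        ]


def parse_bed12_line(line: str) -> Bed12Record:
    """Parse one tab-separated BED12-style line, keeping extra columns as-is."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")

    def int_list(text: str) -> List[int]:
        # BED lists are comma-separated and may carry a trailing comma.
        return [int(item) for item in text.split(",") if item]

    return Bed12Record(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        block_sizes=int_list(fields[10]),
        block_starts=int_list(fields[11]),
        extra=fields[12:],
    )


# Made-up example: a peptide mapped across two exons (not real data).
example_line = "\t".join([
    "chr1", "1000", "1090", "pep_1", "900", "+",
    "1000", "1090", "0", "2", "30,30", "0,60",
    "hypothetical_extra_column",
])
record = parse_bed12_line(example_line)
print(record.blocks())
```

The two blocks correspond to the two genomic segments covered by the peptide; the detailed meaning of the additional columns is defined in the format specification referenced in the abstract.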

The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics.

Mol Cell Proteomics 2017 Jul 17;16(7):1275-1285. Epub 2017 May 17.

Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK.

The first stable version of the Proteomics Standards Initiative mzIdentML open data standard (version 1.1) was published in 2012, capturing the outputs of peptide and protein identification software. In the intervening years, the standard has become well-supported in both commercial and open software, as well as a submission and download format for public repositories. Here we report a new release of mzIdentML (version 1.2) that is required to keep pace with emerging practice in proteome informatics. New features have been added to support: (1) scores associated with localization of modifications on peptides; (2) statistics performed at the level of peptides; (3) identification of cross-linked peptides; and (4) support for proteogenomics approaches. In addition, there is now improved support for the encoding of de novo sequencing of peptides, spectral library searches, and protein inference. As a key point, the underlying XML schema has only undergone very minor modifications to simplify as much as possible the transition from version 1.1 to version 1.2 for implementers, but there have been several notable updates to the format specification, implementation guidelines, controlled vocabularies and validation software. mzIdentML 1.2 can be described as backwards compatible, in that reading software designed for mzIdentML 1.1 should function in most cases without adaptation. We anticipate that these developments will provide a continued stable base for software teams working to implement the standard. All the related documentation is accessible at http://www.psidev.info/mzidentml.

Source
DOI: http://dx.doi.org/10.1074/mcp.M117.068429
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5500760
July 2017

Galaxy Integrated Omics: Web-based Standards-Compliant Workflows for Proteomics Informed by Transcriptomics.

Mol Cell Proteomics 2015 Nov 12;14(11):3087-93. Epub 2015 Aug 12.

School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, UK.

With the recent advent of RNA-seq technology the proteomics community has begun to generate sample-specific protein databases for peptide and protein identification, an approach we call proteomics informed by transcriptomics (PIT). This approach has gained a lot of interest, particularly among researchers who work with nonmodel organisms or with particularly dynamic proteomes such as those observed in developmental biology and host-pathogen studies. PIT has been shown to improve coverage of known proteins, and to reveal potential novel gene products. However, many groups are impeded in their use of PIT by the complexity of the required data analysis. Necessarily, this analysis requires complex integration of a number of different software tools from at least two different communities, and because PIT has a range of biological applications a single software pipeline is not suitable for all use cases. To overcome these problems, we have created GIO, a software system that uses the well-established Galaxy platform to make PIT analysis available to the typical bench scientist via a simple web interface. Within GIO we provide workflows for four common use cases: a standard search against a reference proteome; PIT protein identification without a reference genome; PIT protein identification using a genome guide; and PIT genome annotation. These workflows comprise individual tools that can be reconfigured and rearranged within the web interface to create new workflows to support additional use cases.

Source
DOI: http://dx.doi.org/10.1074/mcp.O115.048777
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4638048
November 2015

IPeak: An open source tool to combine results from multiple MS/MS search engines.

Proteomics 2015 Sep 6;15(17):2916-20. Epub 2015 Aug 6.

BGI-Shenzhen, Shenzhen, P. R. China.

Liquid chromatography coupled tandem mass spectrometry (LC-MS/MS) is an important technique for detecting peptides in proteomics studies. Here, we present an open source software tool, termed IPeak, a peptide identification pipeline that is designed to combine the Percolator post-processing algorithm and multi-search strategy to enhance the sensitivity of peptide identifications without compromising accuracy. IPeak provides a graphical user interface (GUI) as well as a command-line interface, which is implemented in Java and can work on all three major operating system platforms: Windows, Linux/Unix and OS X. IPeak has been designed to work with the mzIdentML standard from the Proteomics Standards Initiative (PSI) as an input and output, and has also been fully integrated into the associated mzidLibrary project, providing access to the overall pipeline, as well as modules for calling Percolator on individual search engine result files. The integration thus enables IPeak (and Percolator) to be used in conjunction with any software packages implementing the mzIdentML data standard. IPeak is freely available and can be downloaded under an Apache 2.0 license at https://code.google.com/p/mzidentml-lib/.

Source
DOI: http://dx.doi.org/10.1002/pmic.201400208
September 2015

A large-scale proteogenomics study of apicomplexan pathogens-Toxoplasma gondii and Neospora caninum.

Proteomics 2015 Aug 15;15(15):2618-28. Epub 2015 May 15.

Institute of Integrative Biology, University of Liverpool, Liverpool, Merseyside, UK.

Proteomics data can supplement genome annotation efforts, for example being used to confirm gene models or correct gene annotation errors. Here, we present a large-scale proteogenomics study of two important apicomplexan pathogens: Toxoplasma gondii and Neospora caninum. We queried proteomics data against a panel of official and alternate gene models generated directly from RNA-Seq data, using several newly generated and some previously published MS datasets for this meta-analysis. We identified a total of 201 996 and 39 953 peptide-spectrum matches for T. gondii and N. caninum, respectively, at a 1% peptide FDR threshold. This equated to the identification of 30 494 distinct peptide sequences and 2921 proteins (matches to official gene models) for T. gondii, and 8911 peptides/1273 proteins for N. caninum following stringent protein-level thresholding. We have also identified 289 and 140 loci for T. gondii and N. caninum, respectively, which mapped to RNA-Seq-derived gene models used in our analysis and apparently absent from the official annotation (release 10 from EuPathDB) of these species. We present several examples in our study where the RNA-Seq evidence can help in correction of the current gene model and can help in discovery of potential new genes. The findings of this study have been integrated into the EuPathDB. The data have been deposited to the ProteomeXchange with identifiers PXD000297 and PXD000298.

Source
DOI: http://dx.doi.org/10.1002/pmic.201400553
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4692086
August 2015

ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards.

Proteomics 2014 Dec 19;14(23-24):2731-41. Epub 2014 Nov 19.

Institute of Integrative Biology, University of Liverpool, Liverpool, UK.

The recent massive increase in capability for sequencing genomes is producing enormous advances in our understanding of biological systems. However, there is a bottleneck in genome annotation--determining the structure of all transcribed genes. Experimental data from MS studies can play a major role in confirming and correcting gene structure--proteogenomics. However, there are some technical and practical challenges to overcome, since proteogenomics requires pipelines comprising a complex set of interconnected modules as well as bespoke routines, for example in protein inference and statistics. We are introducing a complete, open source pipeline for proteogenomics, called ProteoAnnotator, which incorporates a graphical user interface and implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage. All steps are included as standalone modules with the mzIdentML library, allowing other groups to re-use the whole pipeline or constituent parts within other tools. We have developed new modules for pre-processing and combining multiple search databases, for performing peptide-level statistics on mzIdentML files, for scoring grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound and for mapping end results back onto the genome. ProteoAnnotator is available from http://www.proteoannotator.org/. All MS data have been deposited in the ProteomeXchange with identifiers PXD001042 and PXD001390 (http://proteomecentral.proteomexchange.org/dataset/PXD001042; http://proteomecentral.proteomexchange.org/dataset/PXD001390).

Source
DOI: http://dx.doi.org/10.1002/pmic.201400265
December 2014

A standardized framing for reporting protein identifications in mzIdentML 1.2.

Proteomics 2014 Nov 23;14(21-22):2389-99. Epub 2014 Sep 23.

AB SCIEX, Redwood City, CA, USA.

Inferring which protein species have been detected in bottom-up proteomics experiments has been a challenging problem for which solutions have been maturing over the past decade. While many inference approaches now function well in isolation, comparing and reconciling the results generated across different tools remains difficult. It presently stands as one of the greatest barriers in collaborative efforts such as the Human Proteome Project and public repositories such as the PRoteomics IDEntifications (PRIDE) database. Here we present a framework for reporting protein identifications that seeks to improve capabilities for comparing results generated by different inference tools. This framework standardizes the terminology for describing protein identification results, associated with the HUPO-Proteomics Standards Initiative (PSI) mzIdentML standard, while still allowing for differing methodologies to reach that final state. It is proposed that developers of software for reporting identification results will adopt this terminology in their outputs. While the new terminology does not require any changes to the core mzIdentML model, it represents a significant change in practice, and, as such, the rules will be released via a new version of the mzIdentML specification (version 1.2) so that consumers of files are able to determine whether the new guidelines have been adopted by export software.

Source
DOI: http://dx.doi.org/10.1002/pmic.201400080
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4384534
November 2014

The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience.

Mol Cell Proteomics 2014 Oct 30;13(10):2765-75. Epub 2014 Jun 30.

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, UK.

The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R. We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail. These range from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to make MS-based proteomics and metabolomics results available to a wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adapted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.

Source
DOI: http://dx.doi.org/10.1074/mcp.O113.036681
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189001
October 2014
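
To make the "simple tabular format" in the mzTab abstract above concrete, here is a minimal Python sketch of a reader. It assumes the tab-separated layout of the mzTab specification, in which each line starts with a three-letter prefix (MTD for metadata, PRH/PRT for the protein header and rows, PEH/PEP, PSH/PSM and SMH/SML for the other sections, COM for comments). The function name `read_mztab` and the sample content are hypothetical, and the sample omits many columns that a valid file would require.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Header prefix -> prefix of the data rows that use that header.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text: str) -> Tuple[Dict[str, str], Dict[str, List[Dict[str, str]]]]:
    """Split mzTab-style text into a metadata dict and per-section row dicts."""
    metadata: Dict[str, str] = {}
    headers: Dict[str, List[str]] = {}
    rows: Dict[str, List[Dict[str, str]]] = defaultdict(list)

    for line in text.splitlines():
        fields = line.split("\t")
        prefix = fields[0]
        if not line.strip() or prefix == "COM":
            continue                      # skip blank lines and comments
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = fields[1:]
        elif prefix in headers:
            # Pair each value with the column name from the section header.
            rows[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, dict(rows)


# Abbreviated, made-up sample (not a complete or valid mzTab file).
sample = "\n".join("\t".join(parts) for parts in [
    ["COM", "hypothetical example"],
    ["MTD", "mzTab-version", "1.0.0"],
    ["MTD", "description", "Made-up example"],
    ["PRH", "accession", "description", "best_search_engine_score[1]"],
    ["PRT", "P12345", "Example protein", "0.001"],
    ["PRT", "Q67890", "Another protein", "0.004"],
])

metadata, sections = read_mztab(sample)
print(metadata["mzTab-version"])
for protein in sections["PRT"]:
    print(protein["accession"], protein["best_search_engine_score[1]"])
```

Because every line carries its own section prefix, the same file also opens directly in a spreadsheet or in R, which is the use case the abstract emphasizes.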

Tools (Viewer, Library and Validator) that facilitate use of the peptide and protein identification standard format, termed mzIdentML.

Mol Cell Proteomics 2013 Nov 28;12(11):3026-35. Epub 2013 Jun 28.

Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, United Kingdom.

The Proteomics Standards Initiative has recently released the mzIdentML data standard for representing peptide and protein identification results, for example, created by a search engine. When a new standard format is produced, it is important that software tools are available that make it straightforward for laboratory scientists to use it routinely and for bioinformaticians to embed support in their own tools. Here we report the release of several open-source Java-based software packages based on mzIdentML: ProteoIDViewer, mzidLibrary, and mzidValidator. The ProteoIDViewer is a desktop application allowing users to visualize mzIdentML-formatted results originating from any appropriate identification software; it supports visualization of all the features of the mzIdentML format. The mzidLibrary is a software library containing routines for importing data from external search engines, post-processing identification data (such as false discovery rate calculations), combining results from multiple search engines, performing protein inference, setting identification thresholds, and exporting results from mzIdentML to plain text files. The mzidValidator is able to process files and report warnings or errors if files are not correctly formatted or contain some semantic error. We anticipate that these developments will simplify adoption of the new standard in proteomics laboratories and the integration of mzIdentML into other software tools. All three tools are freely available in the public domain.

Source
DOI: http://dx.doi.org/10.1074/mcp.O113.029777
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820921
November 2013
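
The false discovery rate calculations listed among the mzidLibrary routines above can be illustrated with a generic target-decoy sketch in Python. This is an assumption for illustration only: the abstract does not say how the library computes FDR, and the function name, input layout and scores below are hypothetical. The sketch ranks identifications by score, estimates FDR at each rank as decoys divided by targets, converts that to a q-value (the lowest FDR at that rank or any lower-scoring rank), and keeps targets under the threshold.

```python
from typing import List, Tuple


def accept_at_fdr(
    psms: List[Tuple[float, bool]], threshold: float = 0.01
) -> List[float]:
    """Return scores of target identifications accepted at the given FDR.

    Each input item is (score, is_decoy); higher scores are assumed better.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # Running FDR estimate at each rank: decoys seen / targets seen.
    fdrs: List[float] = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: minimum FDR at this rank or at any lower-scoring rank.
    qvalues = fdrs[:]
    for i in range(len(qvalues) - 2, -1, -1):
        qvalues[i] = min(qvalues[i], qvalues[i + 1])

    return [
        score
        for (score, is_decoy), q in zip(ranked, qvalues)
        if not is_decoy and q <= threshold
    ]


# Made-up scores: True marks a decoy hit.
example_psms = [
    (9.0, False), (8.0, False), (7.0, False),
    (6.0, True), (5.0, False), (4.0, True),
]
strict = accept_at_fdr(example_psms, threshold=0.01)
relaxed = accept_at_fdr(example_psms, threshold=0.25)
print(strict)
print(relaxed)
```

With the stricter threshold only the identifications scoring above the first decoy survive; relaxing the threshold admits the target at score 5.0 as well.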

A guide for integration of proteomic data standards into laboratory workflows.

Proteomics 2013 Feb 15;13(3-4):480-92. Epub 2013 Jan 15.

Centro Nacional de Biotecnología, CSIC, Madrid, Spain.

The development of the HUPO-Proteomics Standards Initiative standard data formats and Minimum Information About a Proteomics Experiment guidelines facilitates coordination within the scientific community. The data standards provide a framework to exchange and share data regardless of the source instrument or software. Nevertheless, there remains a view that Proteomics Standards Initiative standards are challenging to use and integrate into routine laboratory pipelines. In this article, we review the tools available for integrating the different data standards and building compliant software. These tools are focused on a range of different data types and support different scenarios, intended for software developers or end users, allowing the standards to be used in a straightforward manner.

Source
DOI: http://dx.doi.org/10.1002/pmic.201200268
February 2013

Software for analysing ion mobility mass spectrometry data to improve peptide identification.

Proteomics 2012 Jun;12(12):1912-6

Institute of Integrative Biology, University of Liverpool, Liverpool, UK.

The development of ion mobility (IM) MS instruments has the capability to provide an added dimension to peptide analysis pipelines in proteomics, but, as yet, there are few software tools available for analysing such data. IM can be used to provide additional separation of parent ions or product ions following fragmentation. In this work, we have created a set of software tools that are capable of converting three dimensional IM data generated from analysis of fragment ions into a variety of formats used in proteomics. We demonstrate that IM can be used to calculate the charge state of a fragment ion, demonstrating the potential to improve peptide identification by excluding non-informative ions from a database search. We also provide preliminary evidence of structural differences between b and y ions for certain peptide sequences but not others. All software tools and data sets are made available in the public domain at http://code.google.com/p/ion-mobility-ms-tools/.

Source
DOI: http://dx.doi.org/10.1002/pmic.201200029
June 2012

jmzIdentML API: A Java interface to the mzIdentML standard for peptide and protein identification data.

Proteomics 2012 Mar;12(6):790-4

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

We present a Java application programming interface (API), jmzIdentML, for the Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) mzIdentML standard for peptide and protein identification data. The API combines the power of Java Architecture for XML Binding (JAXB) and an XPath-based random-access indexer to allow a fast and efficient mapping of extensible markup language (XML) elements to Java objects. The internal references in the mzIdentML files are resolved in an on-demand manner, where the whole file is accessed as a random-access swap file, and only the relevant piece of XML is selected for mapping to its corresponding Java object. The API is highly efficient in its memory usage and can handle files of arbitrary sizes. The API follows the official release of the mzIdentML (version 1.1) specifications and is available in the public domain under a permissive licence at http://www.code.google.com/p/jmzidentml/.

Source
DOI: http://dx.doi.org/10.1002/pmic.201100577
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3933944
March 2012
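
The on-demand, random-access idea in the jmzIdentML abstract above can be sketched in a few lines of Python. This is not the jmzIdentML API and does not use JAXB or XPath: as a stand-in, it records the byte offset of each element with a regular expression and later seeks to that offset to parse only the requested snippet. The function names and the XML fragment are hypothetical, and the sketch assumes the indexed elements carry an `id` attribute, are not self-closing, and are not nested inside elements of the same name.

```python
import io
import re
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Tuple


def build_index(stream: BinaryIO, tag: str) -> Dict[str, Tuple[int, int]]:
    """Map each element id to the (byte offset, length) of its XML snippet."""
    stream.seek(0)
    # Simplification: a real indexer would scan in chunks instead of
    # reading the whole file once to build the index.
    data = stream.read()
    name = re.escape(tag.encode())
    pattern = re.compile(
        rb"<" + name + rb'\b[^>]*\bid="([^"]+)"[^>]*>.*?</' + name + rb">",
        re.DOTALL,
    )
    return {
        match.group(1).decode(): (match.start(), match.end() - match.start())
        for match in pattern.finditer(data)
    }


def load_element(
    stream: BinaryIO, index: Dict[str, Tuple[int, int]], element_id: str
) -> ET.Element:
    """Seek to one indexed element and parse only that snippet."""
    offset, length = index[element_id]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))


# Made-up fragment loosely modelled on mzIdentML element names.
document = b"""<SequenceCollection>
  <Peptide id="pep_1"><PeptideSequence>ELVISK</PeptideSequence></Peptide>
  <Peptide id="pep_2"><PeptideSequence>LIVESK</PeptideSequence></Peptide>
</SequenceCollection>"""

stream = io.BytesIO(document)
index = build_index(stream, "Peptide")
peptide = load_element(stream, index, "pep_2")
print(peptide.findtext("PeptideSequence"))
```

After the index is built, each lookup touches only the bytes of one element, which is the property the abstract credits for low memory usage on large files.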