Method Article
Clinical metaproteomics offers insights into the human microbiome and its contributions to disease. We harnessed the computational power of the Galaxy platform to develop a modular bioinformatics workflow that facilitates complex mass spectrometry-based metaproteomic analysis and characterization of diverse clinical sample types relevant to studies of disease.
Clinical metaproteomics reveals host-microbiome interactions underlying diseases. However, challenges to this approach exist. In particular, the characterization of microbial proteins present in low abundance relative to host proteins is difficult. Other significant challenges stem from the use of very large protein sequence databases, which reduces sensitivity and accuracy during peptide and protein identification from mass spectrometry data and complicates the retrieval of taxonomic and functional annotations and the statistical analysis. To address these problems, we present an integrated bioinformatics workflow for mass spectrometry-based metaproteomics that combines custom protein sequence database generation, peptide-spectrum match generation and verification, quantification, taxonomic and functional annotations, and statistical analysis. This workflow also offers characterization of human proteins (while prioritizing microbial proteins), thus offering insights into host-microbe dynamics in disease. The tools and workflow are deployed in the Galaxy ecosystem, enabling the development, optimization, and dissemination of these computational resources. We have applied this workflow for metaproteomic analysis of numerous clinical sample types, such as nasopharyngeal swabs and bronchoalveolar lavage fluid. Here, we demonstrate its utility via the analysis of residual fluid from cervical swabs. The complete workflow and accompanying training resources are accessible on the Galaxy Training Network to equip non-experts and experienced researchers with the necessary knowledge and tools to analyze their data.
Mass spectrometry (MS)-based metaproteomics identifies and quantifies microbial and human proteins from clinical samples. This approach provides a new understanding of microbiome responses to disease and uncovers potential mediators of host-microbiome interactions1,2. Although metaproteomic analysis of clinical samples can uncover the microbiome's interactions with its host environment, the field still faces many challenges. One main challenge is the relatively high abundance of host (human) proteins, which hampers the identification of lower-abundance microbial proteins. Moreover, MS-based metaproteomics depends on the use of very large protein sequence databases. These databases comprise the proteomes of microbes thought to be present in the sample, which can result in a database containing millions of sequences. Following the generation of tandem mass spectrometry (MS/MS) spectra from tryptically digested proteins, the MS/MS spectra are searched against these large protein sequence databases, matching a peptide sequence to each spectrum (peptide-spectrum match, or PSM). However, as the database grows, sensitivity decreases and the potential for false positives increases3. Additionally, conserved protein sequences across taxa and insufficient annotation of encoded proteins limit taxonomic and functional annotations for detected peptides and proteins4,5. We present a bioinformatics workflow for effective metaproteomic analysis of clinical samples that addresses many of these challenges and provides accessible software resources for researchers to investigate host-microbiome dynamics underlying human disease.
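To make the link between tryptic digestion and database size concrete, the following is a minimal sketch of in silico digestion, written for illustration and not taken from the workflow. It assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P; the function name and the `missed_cleavages` and `min_length` parameters are hypothetical, and search engines may be configured differently.

```python
import re

def tryptic_digest(protein, missed_cleavages=0, min_length=6):
    """Return peptides from an in silico tryptic digest of one protein.

    Assumed rule: cleave after K or R, except when the next residue is P.
    """
    # Zero-width split: after K/R (lookbehind), not before P (lookahead).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides

if __name__ == "__main__":
    for pep in tryptic_digest("AGEKPLMRSTVK", missed_cleavages=1, min_length=1):
        print(pep)
```

Every protein added to the database contributes its own set of candidate peptides, so the number of sequences each spectrum is compared against grows with the number of proteomes included.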
Clinical metaproteomics has been used to investigate diverse sample types, including feces and vaginal swabs, among others, to decipher pathogenic mechanisms in diseases and conditions6,7,8,9,10,11,12,13,14,15,16,17,18,19,20. Here, we use a metaproteomic bioinformatics workflow to analyze a subset of MS/MS data from Pap test fluid (PTF) samples from ovarian cancer (OVCA) and non-OVCA patients21. The software tools and workflow are accessible via the Galaxy platform, which streamlines the development and execution of complex clinical metaproteomic workflows22,23,24,25. Galaxy is an open-source platform designed for bioinformatics and computational biology. It provides a web-based environment for the use of open-source tools and workflows where academic researchers can perform and share complex data analyses. A thriving global community of software developers, data scientists, and end-users maintains the Galaxy ecosystem, including the Galaxy Training Network (GTN; https://training.galaxyproject.org/), which offers online and on-demand training resources22,23,24,25,26,27. Our workflow aims to reveal a new understanding of host-microbe dynamics in clinical samples as well as generate novel, well-characterized peptide targets of interest for developing targeted MS-based clinical assays for further study of clinical samples6,20,28. This manuscript focuses on the methodology of the clinical metaproteomics workflow. More detailed, beginner-friendly guides are provided in the GTN (https://training.galaxyproject.org/), which can be used in parallel with this manuscript by users seeking explanations not covered here. The Galaxy community has also authored numerous manuscripts to aid beginner users of the Galaxy platform20,21,22,23,24,25,26,27.
All supplementary tables (e.g., tool parameters) and figures (e.g., example plots) for this manuscript have been provided as separate files and are referenced accordingly. Current tool versions within Galaxy version 2.3.0 were used for this manuscript. Therefore, results may differ slightly depending on Galaxy and tool version updates. The Galaxy platform and its tools are open-source and can be used for academic research purposes.
MS/MS spectral data were obtained from de-identified residual PTF samples that were collected using procedures that followed institutional board-approved guidelines and regulations, as previously described21,29,30.
NOTE: Figure 1 provides an overview of the complete workflow, which consists of five modules. All inputs, outputs, and software tools are summarized in Supplementary Table 1.
Figure 1: Summary of Clinical Metaproteomics Workflow Modules Within Galaxy. The complete clinical metaproteomics workflow comprises five modules: Database Generation, Discovery, Verification, Quantification, and Data Interpretation. (A) The large comprehensive database includes protein sequences from microbial species thought to be present in the sample, humans, and common contaminants. The MetaNovo software tool directly matches MS/MS spectral data to peptides and infers proteins and their source organisms from the raw MS data and the large input protein sequence database, creating a reduced database33. The reduced database from MetaNovo is then merged with human and contaminant proteins to create the database for peptide discovery. (B) Two peptide identification programs, SearchGUI/PeptideShaker and MaxQuant, match MS/MS spectra to peptide sequences in the target-decoy protein database49. (C) Peptides identified by SearchGUI/PeptideShaker and MaxQuant are next verified using PepQuery2. PepQuery2 rigorously re-examines putatively identified microbial peptide sequences and their matched MS/MS spectra against other potential matches to the human host proteome and/or contaminants, thereby verifying high-confidence microbial matches40,41. Verified peptides are used to generate a verified protein sequence database that is then used for peptide and protein quantification. (D) MaxQuant42 searches MS/MS data against the verified protein sequence database and quantifies microbial peptides and inferred proteins along with human proteins. (E) Unipept45 and MSstatsTMT46 are used in the final step to annotate proteins with taxonomy and functional information (enzyme commission accessions) as well as to generate volcano and comparison plots.
1. TMT labeling and generation of MS/MS spectra
2. Module set up
NOTE: Button/menu selections are bolded. Example files, workflows, and tool parameters are accessible via Supplementary tables. More information on how to use Galaxy can be found on the GTN FAQs page (https://training.galaxyproject.org/training-material/faqs/galaxy/).
3. Module 1: Protein sequence database generation
NOTE: If a user wants to use the example inputs and workflow from Supplementary Table 2, be sure to follow the instructions in Section 2. For Module 1, import the input and workflow for DATABASE GENERATION. The output column of Supplementary Table 2 includes examples of completed output histories for reference. For all modules, the corresponding GTN tutorial can be found in Supplementary Table 3.
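As an illustration of the final step of database generation, in which the reduced MetaNovo database is merged with human and contaminant proteins, the sketch below merges FASTA text and keeps one entry per unique sequence. It is a simplified stand-in written for this explanation, not the implementation of the Galaxy tool "FASTA Merge Files and Filter Unique Sequences"; the function names and the keep-first-header policy are assumptions.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def merge_unique(*fasta_texts):
    """Merge FASTA texts, keeping the first header seen for each sequence.

    Assumption: duplicates are defined by identical sequence strings.
    """
    first_header = {}
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            first_header.setdefault(seq, header)
    # Dicts preserve insertion order, so output follows input order.
    return [(header, seq) for seq, header in first_header.items()]

if __name__ == "__main__":
    reduced = ">microbe_1\nMKT\nLLV\n>microbe_2\nGGG\n"      # toy data
    human_and_contaminants = ">human_1\nMKTLLV\n>contam_1\nPPP\n"  # toy data
    for header, seq in merge_unique(reduced, human_and_contaminants):
        print(f">{header}\n{seq}")
```

In this toy input, `human_1` has the same sequence as `microbe_1` and is therefore dropped, which shows why the order in which databases are merged can matter for how shared sequences are labeled.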
4. Module 2: Peptide discovery via database searching
NOTE: If a user wants to use the example inputs and workflow from Supplementary Table 2, be sure to follow the instructions in Section 2. For Module 2, import the input and workflow for DISCOVERY. For all modules, the corresponding GTN tutorial can be found in Supplementary Table 3. SearchGUI34,35,36 and PeptideShaker37 are separate programs but are treated here as a single peptide identification and processing program because they are used in tandem. For software compatibility, the MS/MS data sets are converted from RAW to MGF for SearchGUI/PeptideShaker using the msconvert tool (in the provided workflow). MaxQuant38 can process RAW files directly.
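The discovery searches run against a target-decoy protein database (Figure 1B). The sketch below illustrates the general target-decoy idea only: decoys are appended to the targets, and the false discovery rate (FDR) at a score threshold is estimated from the ratio of decoy to target PSMs. The reversed-sequence decoys, the `DECOY_` prefix, the score scale, and the function names are assumptions made for illustration; they are not the settings or algorithms used by FastaCLI, SearchGUI/PeptideShaker, or MaxQuant.

```python
def append_reversed_decoys(entries, prefix="DECOY_"):
    """Return the target entries followed by reversed-sequence decoys.

    `entries` is a list of (header, sequence) pairs. The prefix is hypothetical.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys

def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (#decoy PSMs) / (#target PSMs) at or above a score.

    `psms` is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

if __name__ == "__main__":
    # Toy PSM scores, not real search results.
    psms = [(95, False), (90, False), (88, True),
            (80, False), (70, True), (60, False)]
    for cutoff in (89, 80, 60):
        print(cutoff, round(fdr_at_threshold(psms, cutoff), 3))
```

Lowering the score cutoff admits more decoy matches, so the estimated FDR rises; a stricter cutoff trades sensitivity for confidence.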
5. Module 3: Verification of microbial peptides
NOTE: If a user wants to use the example inputs and workflow from Supplementary Table 2, be sure to follow the instructions in Section 2. For Module 3, import the input and workflow for VERIFICATION. For all modules, the corresponding GTN tutorial can be found in Supplementary Table 3.
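Verification asks whether a putative microbial peptide could instead be explained by the human host or by contaminants. PepQuery2 answers this at the level of spectra; the sketch below is a much simpler, sequence-only pre-check and is not PepQuery2's method. It flags candidate peptides that also occur in host or contaminant sequences, treating isoleucine and leucine as equivalent because they have the same mass. The function names and toy sequences are hypothetical.

```python
def normalize(sequence):
    """Upper-case a sequence and map I to L (isobaric residues)."""
    return sequence.upper().replace("I", "L")

def split_by_host_overlap(candidates, host_sequences):
    """Split candidate peptides into (microbe_only, shared_with_host).

    A peptide is 'shared' if it is a substring of any host or
    contaminant sequence after I/L normalization.
    """
    host = [normalize(seq) for seq in host_sequences]
    microbe_only, shared = [], []
    for peptide in candidates:
        target = normalize(peptide)
        if any(target in seq for seq in host):
            shared.append(peptide)
        else:
            microbe_only.append(peptide)
    return microbe_only, shared

if __name__ == "__main__":
    host = ["MKWVTFISLLLLFSSAYSRIVNEITEFAKTCV"]   # toy host sequence
    candidates = ["LVNELTEFAK", "AGTIDPK"]        # toy peptides
    microbe_only, shared = split_by_host_overlap(candidates, host)
    print("microbe-only:", microbe_only)
    print("shared with host:", shared)
```

A sequence-level check like this can only remove obvious host matches; it cannot judge whether a spectrum is better explained by a modified or variant host peptide, which is the kind of question spectrum-level verification is meant to address.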
6. Module 4: MaxQuant quantification
NOTE: If a user wants to use the example inputs and workflow from Supplementary Table 2, be sure to follow the instructions in Section 2. For Module 4, import the input and workflow for QUANTIFICATION. For all modules, the corresponding GTN tutorial can be found in Supplementary Table 3.
7. Module 5: Data interpretation
NOTE: If a user wants to use the example inputs and workflow from Supplementary Table 2, be sure to follow the instructions in Section 2. For Module 5, import the input and workflow for DATA INTERPRETATION. For all modules, the corresponding GTN tutorial can be found in Supplementary Table 3. Outputs from MaxQuant quantification in the previous module are used here for taxonomic and functional annotation with Unipept and for statistical analysis with MSstatsTMT. Unipept enables researchers to identify and quantify microorganisms within diverse environments and integrates with public databases (such as UniProt) to retrieve updated annotations. MSstatsTMT was designed for robust statistical analysis of mass spectrometry-based quantitative proteomics data acquired with TMT labeling.
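Because conserved sequences can occur in several taxa, a peptide is often assigned to the most specific taxon shared by all organisms that contain it. The sketch below shows that lowest-common-ancestor idea on toy lineages. It is an illustration of the general principle only, under the assumption that each lineage is an ordered root-to-leaf list; it does not call Unipept or reproduce its implementation, and the shortened lineages are illustrative.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Returns None if there are no lineages or they share no root.
    """
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage, walking rank by rank from the root.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca

if __name__ == "__main__":
    # Simplified toy lineages for organisms sharing one peptide.
    peptide_hits = [
        ["Bacteria", "Firmicutes", "Lactobacillus", "Lactobacillus crispatus"],
        ["Bacteria", "Firmicutes", "Lactobacillus", "Lactobacillus iners"],
    ]
    print(lowest_common_ancestor(peptide_hits))
```

A peptide found in two species of one genus is reported at genus level, while a peptide shared across phyla resolves only to a high rank, which is one reason conserved sequences limit taxonomic resolution.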
The general protocol described here was demonstrated on MS/MS files obtained from a subset of PTF samples21. Do et al.21 analyzed four MS/MS files from PTF samples that were collected following procedures described by Boylan et al.29 and Afiuni-Zadel et al.30. This workflow prioritizes microbial proteins but offers the flexibility for the characterization of human proteins in parallel with microbial proteins21...
Clinical metaproteomics research offers potential breakthroughs for clinical studies, but challenges in its implementation persist. The lower abundance of microbial proteins relative to the host proteins in most samples hinders the detection and characterization of non-host proteins6,10. Dependence on large protein sequence databases for accurate peptide and protein identification and quantification, along with complexities of taxonomically and functionally annot...
The authors declare no conflict of interest.
We thank Dr. Amy Skubitz and Dr. Kristin Boylan (University of Minnesota) for the pilot data sets and Dr. Paul Piehowski, Dr. Tao Liu, and Dr. Karin Rodland (Pacific Northwest National Laboratory (PNNL)) for their expertise in the collection and processing of the PTF samples and the generation of the TMT-labeled MS data used in this study. This project was funded in part by the Minnesota Ovarian Cancer Alliance (MOCA) and by the National Institutes of Health/National Cancer Institute Grant Numbers 5R01CA262153 (A.P.N.S.), 1R21CA267707 (P.D.J. and T.J.G.), and P30CA077598 (P.D.J. and T.J.G.).
Name | Company | Catalog Number | Comments |
Collapse Collection | GalaxyP | Galaxy Version 5.1.1 | Combines a dataset list collection into a single file (in the order of the list) |
Concatenate datasets | GalaxyP | Galaxy Version 0.1.1 | Concatenate files tail-to-head |
Cut | GalaxyP | Galaxy Version 1.0.2 | Cut (select) specified columns from a file |
FASTA Merge Files and Filter Unique Sequences | GalaxyP | Galaxy Version 1.2.0 | Concatenate FASTA database files together |
FastaCLI | GalaxyP | Galaxy Version 4.0.41+galaxy1 | Appends decoy sequences to FASTA files |
FASTA-to-Tabular | GalaxyP | Galaxy Version 1.1.0 | Convert FASTA-formatted sequences to TAB-delimited format |
Filter | GalaxyP | Galaxy Version 1.1.1 | Filter columns using simple expressions |
Filter Tabular | GalaxyP | Galaxy Version 3.3.0 | Filter a tabular file via line filters |
Galaxy Europe (EU) server | GalaxyP | https://usegalaxy.eu/ | |
Group | GalaxyP | Galaxy Version 2.1.4 | Group a file by a particular column and perform aggregate functions |
Identification Parameters | GalaxyP | Galaxy Version 4.0.41+galaxy1 | Set identification parameters for SearchGUI/PeptideShaker |
Learning Pathway: Clinical metaproteomics workflows within Galaxy | GalaxyP | https://training.galaxyproject.org/training-material/learning-pathways/clinical-metaproteomics.html | |
MaxQuant | GalaxyP | Galaxy Version 2.0.3.0+galaxy0 (Discovery module); Galaxy Version 1.6.17.0+galaxy4 (Quantification module) | Quantitative proteomics software package for analysis of large mass spectrometric data files |
MetaNovo | GalaxyP | Galaxy Version 1.9.4+galaxy4 | Search MS/MS data against a FASTA database (of known proteins) to produce a targeted database (of matched proteins) for mass spectrometry analysis |
msconvert | GalaxyP | Galaxy Version 3.0.20287.2 | Convert and/or filter mass spectrometry files |
MSstatsTMT | GalaxyP | Galaxy Version 2.0.0+galaxy1 | R-based package for detection of differentially abundant proteins in shotgun mass spectrometry-based proteomic experiments using tandem mass tag (TMT) labeling |
PepQuery2 | GalaxyP | Galaxy Version 2.0.2+galaxy0 | Peptide-centric search engine for identification and/or validating known and novel peptides of interest |
PeptideShaker | GalaxyP | Galaxy Version 2.0.33+galaxy1 | Interpret results from SearchGUI for protein identification |
Protein Database Downloader | GalaxyP | Galaxy Version 0.3.4 | Download specified protein sequences as a FASTA file |
Query Tabular | GalaxyP | Galaxy Version 3.3.0 | Load tabular files into a SQLite database |
Remove beginning | GalaxyP | Galaxy Version 1.0.0 | Remove the specified number of (header) lines from a file |
SearchGUI | GalaxyP | Galaxy Version 4.0.41+galaxy1 | Run search engines on MGF peak lists and prepare results for input to PeptideShaker |
Select | GalaxyP | Galaxy Version 1.0.4 | Select lines that match an expression |
Unipept | GalaxyP | Galaxy Version 4.5.1 | Retrieve UniProt entries and taxonomic information for tryptic peptides |
UniProt | GalaxyP | Galaxy Version 2.3.0 | Download proteome as an XML (UniProtXML) or FASTA file from UniProtKB |
Copyright © 2025 MyJoVE Corporation. All rights reserved