Method Article
* These authors contributed equally.
This analytical computational platform provides practical guidance for microbiologists, ecologists, and epidemiologists interested in bacterial population genomics. Specifically, the work presented here demonstrates how to perform: i) phylogeny-guided mapping of hierarchical genotypes; ii) frequency-based analysis of genotypes; iii) kinship and clonality analyses; and iv) identification of lineage-differentiating accessory loci.
Routine and systematic use of bacterial whole-genome sequencing (WGS) is enhancing the accuracy and resolution of epidemiological investigations carried out by Public Health laboratories and regulatory agencies. Large volumes of publicly available WGS data can be used to study pathogenic populations at a large scale. Recently, a freely available computational platform called ProkEvo was published to enable reproducible, automated, and scalable hierarchical-based population genomic analyses using bacterial WGS data. This implementation of ProkEvo demonstrated the importance of combining standard genotypic mapping of populations with mining of accessory genomic content for ecological inference. In particular, the work highlighted here used ProkEvo-derived outputs for population-scaled hierarchical analyses using the R programming language. The main objective was to provide a practical guide for microbiologists, ecologists, and epidemiologists by showing how to: i) use a phylogeny-guided mapping of hierarchical genotypes; ii) assess frequency distributions of genotypes as a proxy for ecological fitness; iii) determine kinship relationships and genetic diversity using specific genotypic classifications; and iv) map lineage differentiating accessory loci. To enhance reproducibility and portability, R markdown files were used to demonstrate the entire analytical approach. The example dataset contained genomic data from 2,365 isolates of the zoonotic foodborne pathogen Salmonella Newport. Phylogeny-anchored mapping of hierarchical genotypes (Serovar -> BAPS1 -> ST -> cgMLST) revealed the population genetic structure, highlighting sequence types (STs) as the keystone differentiating genotype. Across the three most dominant lineages, ST5 and ST118 shared a common ancestor more recently than with the highly clonal ST45 phylotype. ST-based differences were further highlighted by the distribution of accessory antimicrobial resistance (AMR) loci. 
Lastly, a phylogeny-anchored visualization was used to combine hierarchical genotypes and AMR content to reveal the kinship structure and lineage-specific genomic signatures. Combined, this analytical approach provides some guidelines for conducting heuristic bacterial population genomic analyses using pan-genomic information.
The increasing use of bacterial whole-genome sequencing (WGS) as a basis for routine surveillance and epidemiological inquiry by Public Health laboratories and regulatory agencies has substantially enhanced pathogen outbreak investigations1,2,3,4. As a consequence, large volumes of de-identified WGS data are now publicly available and can be used to study aspects of the population biology of pathogenic species at an unprecedented scale, including studies based on: population structures, genotype frequencies, and gene/allele frequencies across multiple reservoirs, geographical regions, and types of environments5. The most commonly used WGS-guided epidemiological inquiries are based on analyses using only the shared core-genomic content, where the shared (conserved) content alone is used for genotypic classification (e.g., variant calling), and these variants become the basis for epidemiological analysis and tracing1,2,6,7. Typically, bacterial core-genome-based genotyping is carried out with multi-locus sequence typing (MLST) approaches using seven to a few thousand loci8,9,10. These MLST-based strategies encompass mapping of pre-assembled or assembled genomic sequences onto highly curated databases, thereby combining allelic information into reproducible genotypic units for epidemiological and ecological analysis11,12. For instance, this MLST-based classification can generate genotypic information at two levels of resolution: lower-level sequence types (STs) or ST lineages (7 loci), and higher-level core-genome MLST (cgMLST) variants (~ 300-3,000 loci)10.
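To make the MLST idea above concrete, the following is a minimal sketch in base R of how seven allele calls can be collapsed into a single key and looked up in a profile table. The locus names are those of the seven-gene Salmonella scheme; the allele numbers, genome names, and ST labels (`ST_A`, `ST_B`) are invented for illustration and are not taken from any curated database.

```r
# Seven housekeeping loci of the Salmonella MLST scheme.
loci <- c("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

# Hypothetical profile database: one row per sequence type (allele numbers are invented).
profiles <- data.frame(
  ST   = c("ST_A", "ST_B"),
  aroC = c(1, 2), dnaN = c(1, 1), hemD = c(3, 3), hisD = c(4, 5),
  purE = c(2, 2), sucA = c(7, 7), thrA = c(1, 9),
  stringsAsFactors = FALSE
)

# Hypothetical allele calls for three genomes.
calls <- data.frame(
  id   = c("genome_1", "genome_2", "genome_3"),
  aroC = c(1, 2, 1), dnaN = c(1, 1, 1), hemD = c(3, 3, 3), hisD = c(4, 5, 4),
  purE = c(2, 2, 2), sucA = c(7, 7, 8), thrA = c(1, 9, 1),
  stringsAsFactors = FALSE
)

# Collapse the seven allele numbers of each row into a single text key.
profile_key <- function(df) do.call(paste, c(df[loci], sep = "-"))

# Look up each genome's key in the database; unmatched profiles stay NA.
calls$ST <- profiles$ST[match(profile_key(calls), profile_key(profiles))]
print(calls[c("id", "ST")])
```

In this toy example, `genome_3` differs from `ST_A` at a single locus and is therefore left unassigned (`NA`), which illustrates why exact-match typing is reproducible between laboratories but says nothing by itself about how closely two genotypes are related.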
MLST-based genotypic classification is computationally portable and highly reproducible between laboratories, making it widely accepted as an accurate sub-typing approach beneath the bacterial species level13,14. However, bacterial populations are structured with species-specific varying degrees of clonality (i.e., genotypic homogeneity), complex patterns of hierarchical kinship between genotypes15,16,17, and a wide range of variation in the distribution of accessory genomic content18,19. Thus, a more holistic approach goes beyond discrete classifications into MLST genotypes and incorporates the hierarchical relationships of genotypes at different scales of resolution, along with mapping of accessory genomic content onto genotypic classifications, which facilitates population-based inference18,20,21. Moreover, analyses can also focus on shared patterns of inheritance of accessory genomic loci among even distantly-related genotypes21,22. Overall, the combined approach enables agnostic interrogation of relationships between population structure and the distribution of specific genomic compositions (e.g., loci) among geospatial or environmental gradients. Such an approach can yield both fundamental and practical information about the ecological characteristics of specific populations that may, in turn, explain their tropism and dispersion patterns across reservoirs, such as food animals or humans.
This systems-based, hierarchical, population-oriented approach demands large volumes of WGS data to provide sufficient statistical power to predict distinguishable genomic signatures. Consequently, the approach requires a computational platform capable of processing many thousands of bacterial genomes at once. ProkEvo was recently developed as a freely available, automated, portable, and scalable bioinformatics platform that allows for integrative hierarchical-based bacterial population analyses, including pan-genomic mapping20. ProkEvo allows for the study of moderate-to-large scale bacterial datasets while providing a framework to generate testable epidemiological and ecological hypotheses and phenotypic predictions that can be customized by the user. This work complements that pipeline by providing a guide on how to use ProkEvo-derived output files as input for the analysis and interpretation of hierarchical population classifications and accessory genomic mining. The case study presented here used the population of the Salmonella enterica lineage I zoonotic serovar S. Newport as an example and was specifically aimed at providing practical guidelines for microbiologists, ecologists, and epidemiologists on how to: i) use an automated phylogeny-dependent approach to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. More broadly, this analytical approach provides a generalizable framework to perform a population-based genomic analysis at a scale that can be used to infer evolutionary and ecological patterns regardless of the targeted species.
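As an illustration of objective iii, one widely used measure of genotypic diversity is Simpson's index, computed as one minus the sum of squared variant proportions. The base R sketch below applies it to cgMLST variants within each ST. Whether this matches the exact computation used in the protocol script is an assumption, and the table, column names, and genotype labels are invented.

```r
# Hypothetical per-genome genotypes (ST and cgMLST labels are invented).
genotypes <- data.frame(
  st     = c("ST_A", "ST_A", "ST_A", "ST_A", "ST_B", "ST_B", "ST_B", "ST_B"),
  cgmlst = c("c1", "c1", "c1", "c1", "c2", "c3", "c4", "c5"),
  stringsAsFactors = FALSE
)

# Simpson's index of diversity: 1 - sum(p_i^2), where p_i is the
# proportion of genomes assigned to variant i.
simpson_diversity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# One value per ST: values near 0 indicate a highly clonal lineage.
diversity_by_st <- tapply(genotypes$cgmlst, genotypes$st, simpson_diversity)
print(diversity_by_st)
```

Here the invented `ST_A` contains a single cgMLST variant and scores 0 (fully clonal), whereas `ST_B` contains four equally frequent variants and scores 0.75.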
1. Prepare input files
NOTE: The protocol is available here (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). The protocol assumes that the researcher has used ProkEvo (or a comparable pipeline) to generate the necessary outputs, which are available in this Figshare repository (https://figshare.com/account/projects/116625/articles/15097503; login credentials are required, so the user must create a free account to access the files). Of note, ProkEvo automatically downloads genomic sequences from the NCBI-SRA repository and only requires as input a .txt file containing a list of genome identifications20; the one used for this work on S. Newport USA isolates is provided here (https://figshare.com/account/projects/116625/articles/15097503?file=29025729). Detailed information on how to install and use this bacterial genomics platform is available here (https://github.com/npavlovikj/ProkEvo/wiki/2.-Quick-start)20.
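Before submitting a genome list to a pipeline, it can help to check the .txt file for blank lines and duplicated identifiers. The base R sketch below assumes one identifier per line; the accessions are placeholders and the helper name `read_genome_ids` is hypothetical, not part of ProkEvo.

```r
# Write a small hypothetical ID list to a temporary file (placeholder accessions).
id_file <- tempfile(fileext = ".txt")
writeLines(c("SRR0000001", " SRR0000002 ", "", "SRR0000001"), id_file)

# Read one identifier per line, trim whitespace, and drop blanks and duplicates.
read_genome_ids <- function(path) {
  ids <- trimws(readLines(path, warn = FALSE))
  unique(ids[nzchar(ids)])
}

genome_ids <- read_genome_ids(id_file)
print(genome_ids)
```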
2. Download and install the statistical software and integrated development environment (IDE) application
3. Install and activate data science libraries
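A simple way to approach this step is to check which libraries are already available before installing anything. The sketch below uses only base R; the package list is a placeholder to be replaced with the libraries named in the protocol script, and `missing_packages` is a hypothetical helper.

```r
# Placeholder package list; substitute the libraries named in the protocol script.
required <- c("stats", "utils")

# Return the packages from `pkgs` that cannot be loaded in this R session.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  # Print the command rather than running it, so nothing is installed silently.
  message("Install with: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("All required packages are available.")
}
```

Printing the install command instead of executing it keeps the check free of side effects, which is convenient on shared or high-performance computing systems where installation may need a specific library path.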
4. Data entry and analysis
NOTE: Detailed information on each step of this analysis can be found in the available script (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). However, here are some important points to be considered:
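As an illustration only (not taken from the script), data entry for this kind of analysis typically means joining the per-genome output tables on a shared genome identifier and checking for missing genotype calls before filtering. The base R sketch below uses small invented tables; the column names (`id`, `st`, `baps1`, `serovar`, `cgmlst`) and the use of "-" as a placeholder for an unassigned ST are assumptions about the file layout.

```r
# Hypothetical stand-ins for per-genome output tables (column names are assumptions).
mlst  <- data.frame(id = c("g1", "g2", "g3"), st = c("5", "118", "-"),
                    stringsAsFactors = FALSE)
baps  <- data.frame(id = c("g1", "g2", "g3"), baps1 = c(1, 1, 2),
                    stringsAsFactors = FALSE)
sistr <- data.frame(id = c("g1", "g2"), serovar = c("Newport", "Newport"),
                    cgmlst = c("1001", "1002"), stringsAsFactors = FALSE)

# Recode the placeholder assumed for unassigned STs as a missing value.
mlst$st[mlst$st == "-"] <- NA

# Left-join on the genome identifier so that every genome in `mlst` is kept.
merged <- Reduce(function(x, y) merge(x, y, by = "id", all.x = TRUE),
                 list(mlst, baps, sistr))

# Count missing values per column before making any filtering decision.
print(colSums(is.na(merged)))
```

Using a left join (`all.x = TRUE`) rather than an inner join keeps genomes that lack a call in one of the tables, so that the amount of missing data is visible instead of being silently dropped.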
5. Conduct analyses and generate visualizations
NOTE: A detailed description of each step needed to produce all the analyses and visualizations can be found in the markdown file for this paper (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). The code for each figure is separated into chunks, and the entire script should be run sequentially. Additionally, the code for each main and supplementary figure is provided as a separate file (see Supplementary File 1 and Supplementary File 2). Here are some essential points (with snippets of code) to be considered while generating each main and supplementary figure.
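Two of the summaries that underlie such visualizations, the relative frequency of each ST and the proportion of genomes within each ST that carry a given AMR locus, can be sketched in base R as follows. This is a minimal example under stated assumptions and is not the code from the supplementary files: the data, the ST labels, the single `locus` column, and the 0.2 threshold for lumping minor STs are all invented.

```r
# Hypothetical per-genome table: ST label and presence (1) or absence (0) of one AMR locus.
dat <- data.frame(
  st    = c(rep("ST_A", 5), rep("ST_B", 3), "ST_C", "ST_D"),
  locus = c(1, 1, 1, 1, 0,  0, 0, 1,  1, 0),
  stringsAsFactors = FALSE
)

# Relative frequency of each ST in the dataset.
st_freq <- prop.table(table(dat$st))

# Lump STs below a chosen threshold into "Other" to keep plots readable.
threshold <- 0.2  # arbitrary cut-off for this toy example
minor <- names(st_freq)[st_freq < threshold]
dat$st_group <- ifelse(dat$st %in% minor, "Other", dat$st)

# Proportion of genomes per ST group, and proportion carrying the locus within each group.
group_freq <- prop.table(table(dat$st_group))
locus_prop <- tapply(dat$locus, dat$st_group, mean)

print(round(group_freq, 2))
print(round(locus_prop, 2))
```

Because the locus is coded as 0 or 1, the group-wise mean is the within-group carriage proportion, so the same pattern extends to a table with one column per locus.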
When the computational platform ProkEvo is used for population genomics analyses, the first step in bacterial WGS data mining consists of examining the hierarchical population structure in the context of a core-genome phylogeny (Figure 1). In the case of S. enterica lineage I, as exemplified by the S. Newport dataset, the population is hierarchically structured as follows: serovar (lowest level of resolution), BAPS1 sub-groups or haplotypes, ST lineages, and cg...
The utilization of a systems-based heuristic and hierarchical population structure analysis provides a framework to identify novel genomic signatures in bacterial datasets that have the potential to explain unique ecological and epidemiological patterns20. Additionally, the mapping of accessory genome data onto the population structure can be used to infer ancestrally-acquired and/or recently-derived traits that facilitate the spread of ST lineages or cgMLST variants across reservoirs
The authors have declared that no competing interests exist.
This work was supported by funding provided by the UNL-IANR Agricultural Research Division and the National Institute for Antimicrobial Resistance Research and Education and by the Nebraska Food for Health Center at the Food Science and Technology Department (UNL). This research could only be completed by utilizing the Holland Computing Center (HCC) at UNL, which receives support from the Nebraska Research Initiative. We are also thankful for having access, through the HCC, to resources provided by the Open Science Grid (OSG), which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work used the Pegasus Workflow Management Software which is funded by the National Science Foundation (grant #1664162).
Name | Company | Catalog Number | Comments
amr_data_filtered | https://figshare.com/account/projects/116625/articles/14829225?file=28758762 | |
amr_data_raw | https://figshare.com/account/projects/116625/articles/14829225?file=28547994 | |
baps_output | https://figshare.com/account/projects/116625/articles/14829225?file=28548003 | |
Core-genome phylogeny | https://figshare.com/account/projects/116625/articles/14829225?file=28548006 | |
genome_sra | https://figshare.com/account/projects/116625/articles/14829225?file=28639209 | |
Linux, Mac, or PC | any high-performance platform | |
mlst_output | https://figshare.com/account/projects/116625/articles/14829225?file=28547997 | |
sistr_output | https://figshare.com/account/projects/116625/articles/14829225?file=28548000 | |
Note: Figshare credentials are required to log in and access the files.
Copyright © 2025 MyJoVE Corporation. All rights reserved.