Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations

Natasha Pavlovikj; Joao Carlos Gomes-Neto; Andrew K. Benson

doi:10.3791/63115


Method Article


December 7th, 2021

Natasha Pavlovikj*¹, Joao Carlos Gomes-Neto*²,³, Andrew K. Benson²,³

¹Department of Computer Science and Engineering, University of Nebraska-Lincoln, ²Department of Food Science and Technology, University of Nebraska-Lincoln, ³Nebraska Food for Health Center, University of Nebraska-Lincoln

* These authors contributed equally.


Summary

This analytical computational platform provides practical guidance for microbiologists, ecologists, and epidemiologists interested in bacterial population genomics. Specifically, the work presented here demonstrates how to perform: i) phylogeny-guided mapping of hierarchical genotypes; ii) frequency-based analysis of genotypes; iii) kinship and clonality analyses; and iv) identification of lineage-differentiating accessory loci.

Abstract

Routine and systematic use of bacterial whole-genome sequencing (WGS) is enhancing the accuracy and resolution of epidemiological investigations carried out by public health laboratories and regulatory agencies. Large volumes of publicly available WGS data can be used to study pathogenic populations at a large scale. Recently, a freely available computational platform called ProkEvo was published to enable reproducible, automated, and scalable hierarchical-based population genomic analyses using bacterial WGS data. This implementation of ProkEvo demonstrated the importance of combining standard genotypic mapping of populations with mining of accessory genomic content for ecological inference. In particular, the work highlighted here used ProkEvo-derived outputs for population-scale hierarchical analyses using the R programming language. The main objective was to provide a practical guide for microbiologists, ecologists, and epidemiologists by showing how to: i) use a phylogeny-guided mapping of hierarchical genotypes; ii) assess frequency distributions of genotypes as a proxy for ecological fitness; iii) determine kinship relationships and genetic diversity using specific genotypic classifications; and iv) map lineage-differentiating accessory loci. To enhance reproducibility and portability, R markdown files were used to demonstrate the entire analytical approach. The example dataset contained genomic data from 2,365 isolates of the zoonotic foodborne pathogen Salmonella Newport. Phylogeny-anchored mapping of hierarchical genotypes (Serovar -> BAPS1 -> ST -> cgMLST) revealed the population genetic structure, highlighting sequence types (STs) as the keystone differentiating genotype. Across the three most dominant lineages, ST5 and ST118 shared a common ancestor more recently than with the highly clonal ST45 phylotype. ST-based differences were further highlighted by the distribution of accessory antimicrobial resistance (AMR) loci. Lastly, a phylogeny-anchored visualization was used to combine hierarchical genotypes and AMR content to reveal the kinship structure and lineage-specific genomic signatures. Combined, this analytical approach provides some guidelines for conducting heuristic bacterial population genomic analyses using pan-genomic information.

Introduction

The increasing use of bacterial whole-genome sequencing (WGS) as a basis for routine surveillance and epidemiological inquiry by public health laboratories and regulatory agencies has substantially enhanced pathogen outbreak investigations¹,²,³,⁴. As a consequence, large volumes of de-identified WGS data are now publicly available and can be used to study aspects of the population biology of pathogenic species at an unprecedented scale, including studies based on population structures, genotype frequencies, and gene/allele frequencies across multiple reservoirs, geographical regions, and types of environments⁵. The most commonly used WGS-guided epidemiological inquiries are based on analyses using only the shared core-genomic content, where the shared (conserved) content alone is used for genotypic classification (e.g., variant calling), and these variants become the basis for epidemiological analysis and tracing¹,²,⁶,⁷. Typically, bacterial core-genome-based genotyping is carried out with multi-locus sequence typing (MLST) approaches using seven to a few thousand loci⁸,⁹,¹⁰. These MLST-based strategies encompass mapping of pre-assembled or assembled genomic sequences onto highly curated databases, thereby combining allelic information into reproducible genotypic units for epidemiological and ecological analysis¹¹,¹². For instance, this MLST-based classification can generate genotypic information at two levels of resolution: lower-level sequence types (STs) or ST lineages (7 loci), and higher-level core-genome MLST (cgMLST) variants (~300-3,000 loci)¹⁰.

MLST-based genotypic classification is computationally portable and highly reproducible between laboratories, making it widely accepted as an accurate sub-typing approach beneath the bacterial species level¹³,¹⁴. However, bacterial populations are structured with species-specific varying degrees of clonality (i.e., genotypic homogeneity), complex patterns of hierarchical kinship between genotypes¹⁵,¹⁶,¹⁷, and a wide range of variation in the distribution of accessory genomic content¹⁸,¹⁹. Thus, a more holistic approach goes beyond discrete classifications into MLST genotypes and incorporates the hierarchical relationships of genotypes at different scales of resolution, along with mapping of accessory genomic content onto genotypic classifications, which facilitates population-based inference¹⁸,²⁰,²¹. Moreover, analyses can also focus on shared patterns of inheritance of accessory genomic loci among even distantly related genotypes²¹,²². Overall, the combined approach enables agnostic interrogation of relationships between population structure and the distribution of specific genomic compositions (e.g., loci) among geospatial or environmental gradients. Such an approach can yield both fundamental and practical information about the ecological characteristics of specific populations that may, in turn, explain their tropism and dispersion patterns across reservoirs, such as food animals or humans.

This systems-based hierarchical population-oriented approach demands large volumes of WGS data for sufficient statistical power to predict distinguishable genomic signatures. Consequently, the approach requires a computational platform capable of processing many thousands of bacterial genomes at once. ProkEvo was recently developed as a freely available, automated, portable, and scalable bioinformatics platform that allows for integrative hierarchical-based bacterial population analyses, including pan-genomic mapping²⁰. ProkEvo allows for the study of moderate-to-large scale bacterial datasets while providing a framework to generate testable and inferable epidemiological and ecological hypotheses and phenotypic predictions that can be customized by the user. This work complements that pipeline by providing a guide on how to use ProkEvo-derived output files as input for the analysis and interpretation of hierarchical population classifications and accessory genomic mining. The case study presented here used the population of Salmonella enterica lineage I zoonotic serovar S. Newport as an example and was specifically aimed at providing practical guidelines for microbiologists, ecologists, and epidemiologists on how to: i) use an automated phylogeny-dependent approach to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. More broadly, this analytical approach provides a generalizable framework to perform a population-based genomic analysis at a scale that can be used to infer evolutionary and ecological patterns regardless of the targeted species.


Protocol

1. Prepare input files

NOTE: The protocol is available here: https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code. The protocol assumes that the researcher has used ProkEvo (or a comparable pipeline) to generate the necessary outputs, which are available in this Figshare repository (https://figshare.com/account/projects/116625/articles/15097503; login credentials are required, so the user must create a free account to access the files). Of note, ProkEvo automatically downloads genomic sequences from the NCBI-SRA repository and only requires a .txt file containing a list of genome identifications as input²⁰; the one used for this work on S. Newport USA isolates is provided here (https://figshare.com/account/projects/116625/articles/15097503?file=29025729). Detailed information on how to install and use this bacterial genomics platform is available here (https://github.com/npavlovikj/ProkEvo/wiki/2.-Quick-start)²⁰.

Generate the core-genome phylogeny using FastTree²³ as previously described²⁰. FastTree is not part of the bioinformatics platform²⁰ and requires the Roary²⁴ core-genome alignment as an input file. The phylogeny file is named newport_phylogeny.tree (https://figshare.com/account/projects/116625/articles/15097503?file=29025690).
Generate SISTR²⁵ output containing the serovar classifications for Salmonella and the cgMLST variant calling data (sistr_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025699).
Generate the BAPS file with fastbaps²⁶,²⁷, containing the BAPS level 1-6 classification of genomes into sub-groups or haplotypes (fastbaps_partition_baps_prior_l6.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025684).
Generate MLST-based classification of genomes into STs using the MLST program (https://github.com/tseemann/mlst)²⁸ (salmonellast_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025696).
Generate ABRicate (https://github.com/tseemann/abricate)²⁹ output as a .csv file containing AMR loci mapped per genome (sabricate_resfinder_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025693).
NOTE: The user can turn off specific parts of the ProkEvo bioinformatics pipeline (check here for more information - https://github.com/npavlovikj/ProkEvo/wiki/4.2.-Remove-existing-bioinformatics-tool-from-ProkEvo). The analytical approach presented here provides guidelines for how to conduct a population-based analysis after the bioinformatics pipeline has been run.

2. Download and install the statistical software and integrated development environment (IDE) application

Download the most up-to-date freely available version of the R software for Linux, Mac, or PC³⁰. Follow the default installation steps.
Download the most up-to-date freely available version of the RStudio desktop IDE here³¹. Follow the default steps for installation.
NOTE: The next steps are included in the available script, which contains detailed information on how the code is used, and should be run sequentially to generate the outputs and figures presented in this work (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). The user may decide to use another programming language, such as Python, to conduct this analytical/statistical analysis. In that case, use the steps in the scripts as a framework to carry out the analysis.

3. Install and activate data science libraries

Install all data science libraries at once as a first step in the analysis. Avoid installing the libraries every time the script needs to be re-run. Use the function install.packages() for library installation. Alternatively, the user may click on the Packages tab inside of the IDE and automatically install the packages. The code used to install all needed libraries is presented here:
# Install Tidyverse
install.packages("tidyverse")
# Install skimr
install.packages("skimr")
# Install vegan
install.packages("vegan")
# Install forcats
install.packages("forcats")
# Install naniar
install.packages("naniar")
# Install ggpubr
install.packages("ggpubr")
# Install ggrepel
install.packages("ggrepel")
# Install reshape2
install.packages("reshape2")
# Install RColorBrewer
install.packages("RColorBrewer")
# Install ggtree
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")
# Installation of ggtree will prompt a question about installation - answer is "a" to install/update all dependencies
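One way to avoid re-installing libraries on every run is to install only the packages that are missing. The sketch below is a minimal illustration in base R; the helper name missing_packages() is hypothetical and not part of the published script, and the actual installation call is left commented out so that the snippet has no side effects.

```r
# Hypothetical helper (not from the published script): return the packages
# in `pkgs` that are not yet installed in the current R library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# CRAN packages used in this protocol (ggtree is installed via BiocManager).
cran_pkgs <- c("tidyverse", "skimr", "vegan", "forcats", "naniar",
               "ggpubr", "ggrepel", "reshape2", "RColorBrewer")

to_install <- missing_packages(cran_pkgs)
print(to_install)  # packages that would still need to be installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Running the check first keeps repeated executions of the script fast, because packages that are already present are skipped.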
Activate all the libraries or packages using the library() function at the beginning of the script, right after installation. Here is a demonstration on how to activate all necessary packages:
# Activate the libraries and packages
library(tidyverse)
library(skimr)
library(vegan)
library(forcats)
library(naniar)
library(ggtree)
library(ggpubr)
library(ggrepel)
library(reshape2)
library(RColorBrewer)
Suppress the output of the code used for library and package installation and activation by using {r, include = FALSE} in the code chunk, as follows:
``` {r, include = FALSE}
# Install Tidyverse
install.packages("tidyverse")
```
NOTE: This step is optional but avoids showing chunks of unnecessary code in the final html, doc, or pdf report.
For a brief description of the specific functions of all libraries along with some useful links to gather further information, refer to steps 3.4.1-3.4.11.
1. Tidyverse - use this collection of packages for data science tasks, including data entry, visualization, parsing and aggregation, and statistical modeling. Two commonly used packages in this collection are ggplot2 (data visualization) and dplyr (data wrangling and modeling)³².
2. skimr - use this package for generating summary statistics of data frames, including identification of missing values³³.
3. vegan - use this package for community ecology statistical analyses, such as calculating diversity-based statistics (e.g., alpha and beta-diversity)³⁴.
4. forcats - use this package to work with categorical variables such as re-ordering classifications. This package is part of the Tidyverse library³².
5. naniar - use this package to visualize the distribution of missing values across variables in a data frame, by using the vis_miss() function³⁵.
6. ggtree - use this package for the visualization of phylogenetic trees³⁶.
7. ggpubr - use this package to improve the quality of ggplot2-based visualizations³⁷.
8. ggrepel - use this package for text labeling inside of graphs³⁸.
9. reshape2 - use the melt() function from this package for the transformation of data frames from wide to long format³⁹.
10. RColorBrewer - use this package to manage colors in ggplot2-based visualizations⁴⁰.
11. Use the following basic functions for exploratory data analysis: head() to check the first observations in a data frame, tail() to check the last observations of a data frame, is.na() to flag missing values (wrap it in sum() to count them), dim() to check the number of rows and columns in a dataset, table() to count observations across a variable, and sum() to count the total number of observations or instances.
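As a brief illustration of the exploratory functions listed above, the sketch below applies them to a small toy data frame. The object genomes and all of its values are made up for demonstration and do not come from the S. Newport dataset.

```r
# Toy data frame standing in for an aggregated genotype table (values are made up).
genomes <- data.frame(
  id     = c("g1", "g2", "g3", "g4", "g5"),
  ST     = c(5, 45, 45, NA, 118),
  cgmlst = c(101, NA, 202, 303, NA)
)

head(genomes, 2)                    # first two observations
tail(genomes, 2)                    # last two observations
dim(genomes)                        # number of rows and columns
table(genomes$ST, useNA = "ifany")  # counts per ST, including NA

# is.na() returns TRUE/FALSE per cell; sum() turns that into a count.
n_missing_cells <- sum(is.na(genomes))
# complete.cases() flags rows without any NA, so its negation counts rows with NA.
n_rows_with_na <- sum(!complete.cases(genomes))
```

Counting missing cells and counting rows with at least one missing value answer different questions, so it can help to report both before deciding how to handle NAs.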

4. Data entry and analysis

NOTE: Detailed information on each step of this analysis can be found in the available script (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). However, here are some important points to be considered:

Do all genomic data entry, including all genotypic classifications (serovar, BAPS, ST, and cgMLST) using the read_csv() function.
Rename, create new variables, and select columns of interest from each dataset before multi-dataset aggregation.
Don't remove missing values from any independent dataset. Wait until all datasets are aggregated to modify or exclude missing values. If new variables are created for each dataset, then missing values are by default categorized into one of the newly generated classifications.
Check for erroneous characters such as hyphens or question marks and replace them with NA (Not Available). Do the same for missing values.
Aggregate data based on the hierarchical order of genotypes (serovar -> BAPS1 -> ST -> cgMLST), and by grouping based on the individual genome identifications.
Check for missing values using multiple strategies and deal with such inconsistencies explicitly. Only remove a genome or isolate from the data if the classification is unreliable. Otherwise, consider the analysis being done and remove NAs on a case-by-case basis.
NOTE: It is highly recommended to establish a strategy to deal with such values a priori. Avoid removing all genomes or isolates with missing values across any variables. For instance, a genome may have ST classification without having cgMLST variant number. In that case, the genome can still be used for the ST-based analysis.
Once all datasets are aggregated, assign them to a data frame name or object that can be used in multiple locations in the follow-up analysis, to avoid having to generate the same metadata file for every figure in the paper.
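The data-entry points above can be sketched with toy tables as follows. This is a minimal base R illustration under stated assumptions: the column names (id, ST, serovar, cgmlst), the placeholder characters, and the helper to_na() are hypothetical, the BAPS table is omitted for brevity, and the published script uses read_csv() and tidyverse verbs instead.

```r
# Toy inputs; column names and values are illustrative assumptions.
mlst <- data.frame(id = c("g1", "g2", "g3"), ST = c("5", "-", "45"),
                   stringsAsFactors = FALSE)
sistr <- data.frame(id = c("g1", "g2", "g3"),
                    serovar = c("Newport", "Newport", "?"),
                    cgmlst = c("101", "202", ""),
                    stringsAsFactors = FALSE)

# Hypothetical helper: replace placeholder characters and empty strings with NA.
to_na <- function(df, bad = c("-", "?", "")) {
  df[] <- lapply(df, function(x) { x[x %in% bad] <- NA; x })
  df
}

mlst <- to_na(mlst)
sistr <- to_na(sistr)

# Aggregate by genome identifier, keeping every genome (no NA removal yet).
meta <- merge(sistr, mlst, by = "id", all = TRUE)
meta <- meta[, c("id", "serovar", "ST", "cgmlst")]  # hierarchical column order

# Remove NAs per analysis: the ST-based subset keeps genomes lacking cgMLST.
st_subset <- meta[!is.na(meta$ST), ]
```

Here genome g3 stays in the ST-based subset even though its cgMLST call is missing, which mirrors the case-by-case handling of missing values recommended above.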

5. Conduct analyses and generate visualizations

NOTE: A detailed description of each step needed to produce all the analyses and visualizations can be found in the markdown file for this paper (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). The code for each figure is separated into chunks, and the entire script should be run sequentially. Additionally, the code for each main and supplementary figure is provided as a separate file (see Supplementary File 1 and Supplementary File 2). Here are some essential points (with snippets of code) to be considered while generating each main and supplementary figure.

Use ggtree to plot a phylogenetic tree along with genotypic information (Figure 1).
1. Optimize the ggtree figure size, including diameter and width of rings, by changing the numerical values inside of the xlim() and gheatmap(width = ) functions, respectively (see example code below).
  tree_plot <- ggtree(tree, layout = "circular") + xlim(-250, NA)
  figure_1 <- gheatmap(tree_plot, d4, offset=.0, width=20, colnames = FALSE)
  NOTE: For a more detailed comparison of programs that can be used for phylogenetic plotting, check this work²⁰. The work highlighted an attempt made to identify strategies to improve ggtree-based visualizations such as decreasing the dataset size, but branch lengths and tree topology were not as clearly discriminating as compared to phandango⁴¹.
2. Aggregate all metadata into as few categories as possible to facilitate the choice of coloring panel when plotting multiple layers of data with the phylogenetic tree (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_1.Rmd). Conduct the data aggregation based on the question of interest and domain knowledge.
Use a bar plot to assess relative frequencies (Figure 2).
1. Aggregate data for both ST lineages and cgMLST variants to facilitate visualizations. Choose an empirical or statistical threshold used for data aggregation, while considering the question being asked.
2. Inspect the frequency distribution of ST lineages to determine the cut-off, for example:
  st_dist <- d2 %>% group_by(ST) %>% # group by the ST column
  count() %>% # count the number of observations
  arrange(desc(n)) # arrange the counts in decreasing order
3. Aggregate minor (low-frequency) STs as shown in the example code below, in which STs that are not numbered as 5, 31, 45, 46, 118, 132, or 350 are grouped together as "Other STs". Use a similar code for cgMLST variants (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_2.Rmd).
  d2$st <- ifelse(d2$ST == 5, "ST5", # create a new ST column in which minor STs are aggregated as "Other STs"
  ifelse(d2$ST == 31, "ST31",
  ifelse(d2$ST == 45, "ST45",
  ifelse(d2$ST == 46, "ST46",
  ifelse(d2$ST == 118, "ST118",
  ifelse(d2$ST == 132, "ST132", ifelse(d2$ST == 350, "ST350", "Other STs")))))))
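When the list of major STs is not known in advance, the cut-off can be derived from the relative frequencies themselves. The sketch below is one possible base R variant of the aggregation step, using made-up ST calls and a hypothetical 15% threshold; neither the values nor the threshold come from the published analysis.

```r
# Toy ST calls (made up); NA marks a genome without an ST assignment.
st <- c(5, 5, 5, 45, 45, 118, 118, 31, 350, 2, NA)

# Relative frequency (%) of each ST among typed genomes, in decreasing order.
st_counts <- sort(table(st), decreasing = TRUE)
st_prop <- 100 * st_counts / sum(st_counts)

# Keep STs at or above a hypothetical 15% cut-off; group the rest.
threshold <- 15
major <- names(st_prop)[st_prop >= threshold]

# Untyped genomes stay NA instead of being counted as "Other STs".
st_label <- ifelse(is.na(st), NA,
                   ifelse(as.character(st) %in% major,
                          paste0("ST", st), "Other STs"))
table(st_label, useNA = "ifany")
```

Deriving the labels from a threshold keeps the grouping reproducible when the dataset is updated, although the cut-off itself still needs to be justified empirically or statistically.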
Use a nested approach to calculate the proportion of each ST lineage within each BAPS1 sub-group to identify STs that are ancestrally related (belong to the same BAPS1 sub-group) (Figure 3). The code below exemplifies how the ST-based proportion can be calculated across BAPS1 sub-groups (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_3.Rmd):
baps <- d2b %>% filter(serovar == "Newport") %>% # filter Newport serovars
select(baps_1, ST) %>% # select baps_1 and ST columns
mutate(ST = as.numeric(ST)) %>% # change ST column to numeric
drop_na(baps_1, ST) %>% # drop NAs
group_by(baps_1, ST) %>% # group by baps_1 and ST
summarise(n = n()) %>% # count observations
mutate(prop = n/sum(n)*100) # calculate proportions
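To make the nested calculation concrete, the following sketch reproduces the same logic in base R on a toy table. The BAPS1 and ST assignments are made up for illustration; the point is that the percentages are computed within each BAPS1 sub-group, so each sub-group sums to 100.

```r
# Toy BAPS1 and ST assignments (made up).
d <- data.frame(
  baps_1 = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  ST     = c(5, 5, 118, 118, 45, 45, 45, 45, 46)
)

# Cross-tabulate, then convert each row (BAPS1 sub-group) to percentages.
counts <- table(d$baps_1, d$ST)
prop <- 100 * prop.table(counts, margin = 1)

round(prop, 1)
rowSums(prop)  # each BAPS1 sub-group sums to 100
```

In this toy example ST5 and ST118 fall in the same BAPS1 sub-group, which is the pattern that would suggest a more recent shared ancestor relative to ST45.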
Plot the distribution of AMR loci across ST lineages using the Resfinder-based gene annotation results (Figure 4).
NOTE: Resfinder has been widely used in ecological and epidemiological studies⁴². Annotation of protein-coding genes can vary depending on how often databases are curated and updated. If using the suggested bioinformatics pipeline, the researcher can compare AMR-based loci classifications across different databases²⁰. Be sure to check which databases are continually being updated. Do not use out-of-date or poorly curated databases, in order to avoid miscalls.
1. Use an empirical or statistical threshold to filter out the most important AMR loci to facilitate visualizations. Provide a raw .csv file containing the calculated proportions of all AMR loci across all ST lineages, such as shown here (https://figshare.com/account/projects/116625/articles/15097503?file=29025687).
2. Calculate the AMR proportion for each ST using the following code (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_4.Rmd):
  # Calculations for ST45
  d2c <- data6 %>% filter(st == "ST45") # filter ST45 data first
  # for ST45, calculate the proportion of AMR loci and only keep proportion greater than 10%
  d3c <- d2c %>% select(id, gene) %>% # select columns
  group_by(id, gene) %>% # group by id and gene
  summarize(count = n()) %>% # count observations
  mutate(count = replace(count, count == 2, 1)) %>% # treat two copies of a gene as one, since duplications may not be reliable; to exclude duplicated genes instead, use filter(count != 2)
  filter(count <= 1) # keep counts below or equal to 1
  d4c <- d3c %>% group_by(gene) %>% # group by gene
  summarize(value = n()) %>% # count observations
  mutate(total = table(data1$st)[6]) %>% # get the total count for this ST (element 6 of the st table)
  mutate(prop = (value/total)*100) # calculate proportions
  d5c <- d4c %>% mutate(st = "ST45") # create a st column and add ST information
3. After calculations are done for all STs, combine datasets as one data frame, using the following code:
  # Combine datasets
  d6 <- rbind(d5a, d5b, d5c, d5d, d5e, d5f, d5g, d5h) # row bind datasets
4. To export the .csv file containing the calculated proportions, use the code:
  # Export data table containing ST and AMR loci information
  abx_newport_st <- d6
  write.csv(abx_newport_st, "abx_newport_st.csv", row.names = FALSE)
5. Before plotting the AMR-based distribution across ST lineages, filter the data based on a threshold to facilitate visualizations, as shown below:
  # Filter AMR loci with proportion higher than or equal to 10%
  d7 <- d6 %>% filter(prop >= 10) # determine the threshold empirically or statistically
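The per-ST calculation above is repeated once for each lineage. As an alternative sketch, the same proportions can be computed for all STs in one pass. The example below uses base R on a made-up long-format table; the object names (amr, st_totals, amr_prop) are hypothetical, the total number of genomes per ST is assumed to be known, and unique() collapses any number of duplicated hits per genome rather than only counts equal to 2.

```r
# Toy long-format AMR table: one row per gene hit per genome (values made up).
amr <- data.frame(
  id   = c("g1", "g1", "g2", "g3", "g3", "g4"),
  st   = c("ST45", "ST45", "ST45", "ST5", "ST5", "ST5"),
  gene = c("tet(A)", "tet(A)", "tet(A)", "sul2", "tet(A)", "sul2"),
  stringsAsFactors = FALSE
)
# Total genomes per ST, including genomes with no AMR hit (assumed known).
st_totals <- c(ST45 = 4, ST5 = 2)

# Collapse duplicated hits so each gene counts once per genome.
presence <- unique(amr[, c("id", "st", "gene")])

# Count genomes carrying each gene within each ST, then convert to percentages.
amr_prop <- aggregate(id ~ st + gene, data = presence, FUN = length)
names(amr_prop)[names(amr_prop) == "id"] <- "value"
amr_prop$total <- unname(st_totals[as.character(amr_prop$st)])
amr_prop$prop <- 100 * amr_prop$value / amr_prop$total

# Keep loci at or above a 10% cut-off, as in the filtering step.
amr_major <- amr_prop[amr_prop$prop >= 10, ]
```

Looking up the denominator by ST name rather than by position avoids errors if the order of lineages changes between datasets.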
Plot the core-genome phylogeny along with the hierarchical genotypic classifications and AMR data in a single plot using ggtree (Figure 5).
1. Optimize the figure size inside ggtree using the abovementioned parameters (see step 5.1.1.).
2. Optimize visualizations by aggregating variables, or using binary classification such as gene presence or absence. The more features are added to the plot, the harder the coloring selection process becomes (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_5.Rmd).
  NOTE: Supplementary figures - detailed description of the entire code can be found here (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd).
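A binary presence/absence table of the kind mentioned for Figure 5 can be built as sketched below. This is a base R illustration with made-up genome identifiers and genes, not the code used for the published figure. The resulting data frame has genome identifiers as row names, which is the layout gheatmap() expects when matching rows to tree tip labels.

```r
# Toy AMR hits (made up); all_ids lists every genome in the tree.
amr <- data.frame(
  id   = c("g1", "g1", "g2", "g3"),
  gene = c("tet(A)", "sul2", "tet(A)", "tet(A)"),
  stringsAsFactors = FALSE
)
all_ids <- c("g1", "g2", "g3", "g4")  # g4 carries none of the genes

# Cross-tabulate genomes by genes, keeping genomes without any hit.
hits <- table(factor(amr$id, levels = all_ids), amr$gene)

# Convert counts to a logical matrix, then to a labeled presence/absence data frame.
pa <- as.data.frame(unclass(hits) > 0)
pa[] <- lapply(pa, function(x) ifelse(x, "Present", "Absent"))
pa
```

Using two categories per gene keeps the color legend small, which becomes important as more layers are added around the tree.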
Use a scatter plot in ggplot2, without data aggregation, to display the distribution of ST lineages or cgMLST variants while highlighting the most frequent genotypes (Supplementary Figure 1) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s1.Rmd).
Do a nested analysis to assess the composition of ST lineages through the proportion of cgMLST variants in order to get a glimpse of the ST-based genetic diversity, while identifying the most frequent variants and their genetic relationships (i.e., cgMLST variants that belong to the same ST shared an ancestor more recently than those belonging to distinct STs) (Supplementary Figure 2) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s2.Rmd).
Use a community ecology metric, namely Simpson's D index of diversity, to measure the degree of clonality or genotypic diversity of each of the major ST lineages⁴³ (Supplementary Figure 3).
1. Calculate the index of diversity across ST lineages at different levels of genotypic resolution including BAPS level 1 through 6 and cgMLST. Below is the code example on how to do this calculation at the BAPS level 1 (BAPS1) of genotypic resolution:
  # BAPS level 1 (BAPS1)
  # drop the STs and BAPS1 with NAs, group by ST and BAPS1 and then calculate Simpson's index
  baps1 <- data6 %>%
  select(st, BAPS1) %>% # select columns
  drop_na(st, BAPS1) %>% # drop NAs
  group_by(st, BAPS1) %>% # group by columns
  summarise(n = n()) %>% # count observations
  mutate(simpson = diversity(n, "simpson")) %>% # calculate diversity
  group_by(st) %>% # group by column
  summarise(simpson = mean(simpson)) %>% # calculate the mean of the index
  melt(id.vars=c("st"), measure.vars="simpson",
  variable.name="index", value.name="value") %>% # convert into long format
  mutate(strat = "BAPS1") # create a strat column
  NOTE: A more genetically diverse population (i.e., more variants at different layers of genotypic resolution) has a higher index at the cgMLST level and produces increasing index values going from BAPS level 2 to 6 (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s3.Rmd).
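For intuition about the index, the sketch below computes it by hand as 1 - sum(p^2), where p is the relative frequency of each variant. This is the same quantity that diversity(x, "simpson") from vegan returns. The function name simpson_diversity() and the two count vectors are made up for illustration.

```r
# Simpson's index of diversity, 1 - sum(p^2), from a vector of counts.
simpson_diversity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

clonal  <- c(98, 1, 1)        # one variant dominates
diverse <- c(25, 25, 25, 25)  # variants evenly represented

simpson_diversity(clonal)   # close to 0: highly clonal
simpson_diversity(diverse)  # 0.75: more diverse
```

A lineage dominated by a single variant therefore scores near zero, while a lineage with many evenly represented variants approaches one.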
Examine the degree of genotypic diversity of ST lineages by plotting the relative frequency of BAPS sub-groups at all levels of resolution (BAPS1-6) (Supplementary Figure 4). The more diverse the population is, the sparser the distribution of BAPS sub-groups (haplotypes) becomes going from BAPS1 (lower level of resolution) to BAPS6 (higher level of resolution) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s4.Rmd).


Results

By utilizing the computational platform ProkEvo for population genomics analyses, the first step in bacterial WGS data mining is comprised of examining the hierarchical population structure in the context of a core-genome phylogeny (Figure 1). In the case of S. enterica lineage I, as exemplified by the S. Newport dataset, the population is hierarchically structured as follows: serovar (lowest level of resolution), BAPS1 sub-groups or haplotypes, ST lineages, and cg...


Discussion

The utilization of a systems-based heuristic and hierarchical population structure analysis provides a framework to identify novel genomic signatures in bacterial datasets that have the potential to explain unique ecological and epidemiological patterns²⁰. Additionally, the mapping of accessory genome data onto the population structure can be used to infer ancestrally-acquired and/or recently-derived traits that facilitate the spread of ST lineages or cgMLST variants across reservoirs


Disclosures

The authors have declared that no competing interests exist.

Acknowledgements

This work was supported by funding provided by the UNL-IANR Agricultural Research Division and the National Institute for Antimicrobial Resistance Research and Education and by the Nebraska Food for Health Center at the Food Science and Technology Department (UNL). This research could only be completed by utilizing the Holland Computing Center (HCC) at UNL, which receives support from the Nebraska Research Initiative. We are also thankful for having access, through the HCC, to resources provided by the Open Science Grid (OSG), which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work used the Pegasus Workflow Management Software which is funded by the National Science Foundation (grant #1664162).


Materials

Name	Company	Catalog Number	Comments
amr_data_filtered			https://figshare.com/account/projects/116625/articles/14829225?file=28758762
amr_data_raw			https://figshare.com/account/projects/116625/articles/14829225?file=28547994
baps_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548003
Core-genome phylogeny			https://figshare.com/account/projects/116625/articles/14829225?file=28548006
genome_sra			https://figshare.com/account/projects/116625/articles/14829225?file=28639209
Linux, Mac, or PC			any high-performance platform
mlst_output			https://figshare.com/account/projects/116625/articles/14829225?file=28547997
sistr_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548000
Figshare credentials are required to log in and access the files.

References

Grad, Y. H., et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. Proceedings of the National Academy of Sciences of the United States of America. 109 (8), 3065-3070 (2012).
Worby, C. J., Chang, H. -H., Hanage, W. P., Lipsitch, M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics. 198 (4), 1395-1404 (2014).
Leekitcharoenphon, P., et al. Global genomic epidemiology of Salmonella enterica serovar Typhimurium DT104. Applied and Environmental Microbiology. 82 (8), 2516-2526 (2016).
Alba, P., et al. Molecular epidemiology of Salmonella Infantis in Europe: insights into the success of the bacterial host and its parasitic pESI-like megaplasmid. Microbial Genomics. 6 (5), (2020).
Zhou, Z., Alikhan, N. -F., Mohamed, K., Fan, Y. the Agama Study Group, Achtman, M. The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Research. 30 (1), 138-152 (2020).
Azarian, T., et al. Global emergence and population dynamics of divergent serotype 3 CC180 pneumococci. PLOS Pathogens. 14 (11), 1007438(2018).
Saltykova, A., et al. Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i. PLOS ONE. 13 (2), 0192504(2018).
Achtman, M., et al. Multi-locus sequence typing as a replacement for serotyping in Salmonella enterica. PLOS Pathogens. 8 (6), 1002776 (2012).
Maiden, M. C. J., et al. Multi-locus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences of the United States of America. 95 (6), 3140-3145 (1998).
Alikhan, N. -F., Zhou, Z., Sergeant, M. J., Achtman, M. A genomic overview of the population structure of Salmonella. PLOS Genetics. 14 (4), 1007261 (2018).
Gupta, A., Jordan, I. K., Rishishwar, L. stringMLST: a fast k-mer based tool for multi-locus sequence typing. Bioinformatics. 33 (1), 119-121 (2017).
Jolley, K. A., Maiden, M. C. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 11 (1), 595 (2010).
Maiden, M. C. J., et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology. 11 (10), 728-736 (2013).
Maiden, M. C. J. Multilocus sequence typing of bacteria. Annual Review of Microbiology. 60 (1), 561-588 (2006).
Shapiro, B. J., Polz, M. F. Ordering microbial diversity into ecologically and genetically cohesive units. Trends in Microbiology. 22 (5), 235-247 (2014).
Cordero, O. X., Polz, M. F. Explaining microbial genomic diversity in light of evolutionary ecology. Nature Reviews Microbiology. 12 (4), 263-273 (2014).
Achtman, M., Wagner, M. Microbial diversity and the genetic nature of microbial species. Nature Reviews Microbiology. 6 (6), 431-440 (2008).
Abudahab, K., et al. PANINI: Pangenome neighbour identification for bacterial populations. Microbial Genomics. 5 (4), (2019).
Laing, C. R., Whiteside, M. D., Gannon, V. P. J. Pan-genome analyses of the species Salmonella enterica, and identification of genomic markers predictive for species, subspecies, and serovar. Frontiers in Microbiology. 8, 1345 (2017).
Pavlovikj, N., Gomes-Neto, J. C., Deogun, J. S., Benson, A. K. ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses. PeerJ. 9, 11376 (2021).
McNally, A., et al. Combined analysis of variation in core, accessory and regulatory genome regions provides a super-resolution view into the evolution of bacterial populations. PLOS Genetics. 12 (9), 1006280 (2016).
Langridge, G. C., et al. Patterns of genome evolution that have accompanied host adaptation in Salmonella. Proceedings of the National Academy of Sciences of the United States of America. 112 (3), 863-868 (2015).
Price, M. N., Dehal, P. S., Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLOS ONE. 5 (3), 9490 (2010).
Page, A. J., et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 31 (22), 3691-3693 (2015).
Yoshida, C. E., et al. The Salmonella In silico typing resource (SISTR): An open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLOS ONE. 11 (1), 0147101 (2016).
Cheng, L., Connor, T. R., Siren, J., Aanensen, D. M., Corander, J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Molecular Biology and Evolution. 30 (5), 1224-1228 (2013).
Tonkin-Hill, G., Lees, J. A., Bentley, S. D., Frost, S. D. W., Corander, J. Fast hierarchical Bayesian analysis of population structure. Nucleic Acids Research. 47 (11), 5539-5549 (2019).
Seemann, T. MLST. GitHub. Available from: https://github.com/tseemann/mlst (2020).
Seemann, T. ABRicate. GitHub. Available from: https://github.com/tseemann/abricate (2020).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available from: https://cran.r-project.org (2021).
RStudio Team. RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. Available from: http://www.rstudio.com (2020).
Wickham, H., et al. Welcome to the Tidyverse. Journal of Open Source Software. 4 (43), 1686 (2019).
rOpenSci. The skimr package. GitHub, Berkeley, CA. Available from: https://github.com/ropensci/skimr/ (2021).
Oksanen, J., et al. vegan: Community ecology package. R package version 2.5-5. Available from: https://CRAN.R-project.org/package=vegan (2019).
Tierney, N. J., Cook, D. H. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. arXiv. Available from: http://arxiv.org/abs/1809.02264 (2020).
Yu, G. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 69 (1), (2020).
Kassambara, A. ggpubr: "ggplot2" Based Publication Ready Plots. R package version 0.4.0. Available from: https://CRAN.R-project.org/package=ggpubr (2020).
Slowikowski, K. ggrepel: Automatically Position Non-Overlapping Text Labels with "ggplot2". R package version 0.9.1. Available from: https://CRAN.R-project.org/package=ggrepel (2021).
Wickham, H. Reshaping Data with the reshape Package. Journal of Statistical Software. 21 (12), (2007).
Neuwirth, E. RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. Available from: https://CRAN.R-project.org/package=RColorBrewer (2014).
Hadfield, J., Croucher, N. J., Goater, R. J., Abudahab, K., Aanensen, D. M., Harris, S. R. Phandango: an interactive viewer for bacterial population genomics. Bioinformatics. 34 (2), 292-293 (2018).
Perron, G. G., et al. Functional characterization of bacteria isolated from ancient arctic soil exposes diverse resistance mechanisms to modern antibiotics. PLOS ONE. 10 (3), 0069533 (2015).
Mitchell, P. K., et al. Population genomics of pneumococcal carriage in Massachusetts children following introduction of PCV-13. Microbial Genomics. 5 (2), (2019).
Klemm, E. J., et al. Emergence of host-adapted Salmonella Enteritidis through rapid evolution in an immunocompromised host. Nature Microbiology. 1 (3), 15023 (2016).
Břinda, K., et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature Microbiology. 5 (3), 455-464 (2020).
MacFadden, D. R., et al. Using genetic distance from archived samples for the prediction of antibiotic resistance in Escherichia coli. Antimicrobial Agents and Chemotherapy. 64 (5), (2020).
Mageiros, L., et al. Genome evolution and the emergence of pathogenicity in avian Escherichia coli. Nature Communications. 12 (1), 765 (2021).
Yahara, K., et al. Genome-wide association of functional traits linked with Campylobacter jejuni survival from farm to fork. Environmental Microbiology. 19 (1), 361-380 (2017).
Walter, J., Maldonado-Gómez, M. X., Martínez, I. To engraft or not to engraft: an ecological framework for gut microbiome modulation with live microbes. Current Opinion in Biotechnology. 49, 129-139 (2018).
Maldonado-Gómez, M. X., et al. Stable engraftment of Bifidobacterium longum AH1206 in the human gut depends on individualized features of the resident microbiome. Cell Host & Microbe. 20 (4), 515-526 (2016).
Zhao, S., et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host & Microbe. 25 (5), 656-667 (2019).
Treangen, T. J., Ondov, B. D., Koren, S., Phillippy, A. M. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology. 15 (11), 524 (2014).
Letunic, I., Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research. 49, 293-296 (2021).
Croucher, N. J., et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research. 43 (3), 15 (2015).
Fenske, G. J., Thachil, A., McDonough, P. L., Glaser, A., Scaria, J. Geography shapes the population genomics of Salmonella enterica Dublin. Genome Biology and Evolution. 11 (8), 2220-2231 (2019).
Lees, J. A., et al. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research. 29 (2), 304-316 (2019).
Cohan, F. M. Towards a conceptual and operational union of bacterial systematics, ecology, and evolution. Philosophical Transactions of the Royal Society B: Biological Sciences. 361 (1475), 1985-1996 (2006).
Cohan, F. M., Koeppel, A. F. The origins of ecological diversity in prokaryotes. Current Biology. 18 (21), 1024-1034 (2008).
Cohan, F. M. Transmission in the origins of bacterial diversity, from ecotypes to phyla. Microbial Transmission. 5 (5), 311-343 (2019).
Davis, J. J., et al. The PATRIC bioinformatics resource center: expanding data and analysis capabilities. Nucleic Acids Research. 48, 606-612 (2019).
Feng, Y., Zou, S., Chen, H., Yu, Y., Ruan, Z. BacWGSTdb 2.0: a one-stop repository for bacterial whole-genome sequence typing and source tracking. Nucleic Acids Research. 49, 644-650 (2021).


