Heuristic Mining of Hierarchical Genotypes and Accessory Genome Loci in Bacterial Populations

Natasha Pavlovikj; Joao Carlos Gomes-Neto; Andrew K. Benson

doi:10.3791/63115

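Summary

This analytical computational platform provides practical guidance for microbiologists, ecologists, and epidemiologists interested in bacterial population genomics. Specifically, the work presented here demonstrates how to perform: i) phylogeny-guided mapping of hierarchical genotypes; ii) frequency-based analysis of genotypes; iii) kinship and clonality analyses; iv) identification of lineage-differentiating accessory loci.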

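Abstract

Routine and systematic use of bacterial whole-genome sequencing (WGS) is improving the accuracy and resolution of epidemiological investigations carried out by public health laboratories and regulatory agencies. Large volumes of publicly available WGS data can be used to study pathogenic populations at a large scale. Recently, a freely available computational platform called ProkEvo was published to enable reproducible, automated, and scalable hierarchical-based population genomic analyses using bacterial WGS data. This implementation of ProkEvo demonstrated the importance of combining standard genotypic mapping of populations with mining of accessory genomic content for ecological inference. In particular, the work highlighted here uses ProkEvo-derived outputs for population-scale hierarchical analyses with the R programming language. The main objective is to show how to: i) use a phylogeny-guided mapping of hierarchical genotypes; ii) assess frequency distributions of genotypes as a proxy for ecological fitness; iii) determine kinship relationships and genetic diversity using specific genotypic classifications; and iv) map lineage-differentiating accessory loci. To enhance reproducibility and portability, R markdown files are used to demonstrate the entire analytical approach. The example dataset contains genomic data from 2,365 isolates of the zoonotic foodborne pathogen Salmonella Newport. Phylogeny-anchored mapping of hierarchical genotypes (serovar -> BAPS1 -> ST -> cgMLST) revealed the population genetic structure, highlighting sequence types (STs) as the keystone differentiating genotype. Across the three most dominant lineages, ST5 and ST118 shared a common ancestor more recently than either did with the highly clonal ST45 phylotype. ST-based differences were further highlighted by the distribution of accessory antimicrobial resistance (AMR) loci. Lastly, a phylogeny-anchored visualization was used to combine hierarchical genotypes and AMR content to reveal the kinship structure and lineage-specific genomic signatures. Combined, this analytical approach provides some guidelines for conducting heuristic bacterial population genomic analyses using pan-genomic information.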

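Introduction

Public health laboratories and regulatory agencies increasingly use bacterial whole-genome sequencing (WGS) as the basis for routine surveillance and epidemiological investigations, which has substantially strengthened pathogen outbreak investigations¹,²,³,⁴. As a consequence, large volumes of de-identified WGS data are now publicly available and can be used to study aspects of the population biology of pathogenic species at an unprecedented scale, including studies of population structure, genotype frequencies, and gene/allele frequencies across multiple reservoirs, geographical regions, and types of environments⁵. The most commonly used WGS-guided epidemiological inquiries rely on analyses of the shared core-genome content only, in which the shared (conserved) content alone is used for genotypic classification (e.g., variant calling), and these variants become the basis for epidemiological analysis and tracing¹,²,⁶,⁷. Typically, bacterial core-genome-based genotyping is carried out with multi-locus sequence typing (MLST) approaches using seven to a few thousand loci⁸,⁹,¹⁰. These MLST-based strategies map pre-assembled or assembled genomic sequences onto highly curated databases, thereby combining allelic information into reproducible genotypic units for epidemiological and ecological analysis¹¹,¹². For instance, this MLST-based classification can generate genotypic information at two levels of resolution: lower-level sequence types (STs) or ST lineages (7 loci), and higher-level core-genome MLST (cgMLST) variants (~300-3,000 loci)¹⁰.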

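MLST-based genotypic classification is computationally tractable and highly reproducible between laboratories, making it widely accepted as an accurate sub-typing approach beneath the bacterial species level¹³,¹⁴. However, bacterial populations are structured with species-specific varying degrees of clonality (i.e., genotypic homogeneity), complex patterns of hierarchical kinship between genotypes¹⁵,¹⁶,¹⁷, and a wide range of variation in the distribution of accessory genomic content¹⁸,¹⁹. Thus, a more holistic approach goes beyond discrete classification into MLST genotypes and incorporates the hierarchical relationships of genotypes at different scales of resolution, along with mapping of accessory genomic content onto genotypic classifications, which facilitates population-based inference¹⁸,²⁰,²¹. Moreover, analyses can also focus on shared patterns of inheritance of accessory genomic loci among even distantly related genotypes²¹,²². Overall, the combined approach enables agnostic interrogation of the relationships between population structure and the distribution of specific genomic compositions (e.g., loci) along geospatial or environmental gradients. Such an approach can yield both fundamental and practical information about the ecological characteristics of specific populations that may, in turn, explain their tropism and dispersion patterns across reservoirs, such as food animals or humans.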

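This systems-based, hierarchical, population-oriented approach demands large volumes of WGS data to achieve sufficient statistical power to predict distinguishable genomic signatures. Consequently, the approach requires a computational platform capable of processing many thousands of bacterial genomes at once. Recently, ProkEvo was developed as a freely available, automated, portable, and scalable bioinformatics platform that allows for integrative hierarchical-based bacterial population analyses, including pan-genomic mapping²⁰. ProkEvo allows for the study of moderate-to-large-scale bacterial datasets while providing a framework to generate testable and inferable epidemiological and ecological hypotheses and phenotypic predictions that can be customized by the user. This work complements that pipeline by providing a guide on how to use ProkEvo-derived output files as input for the analysis and interpretation of hierarchical population classifications and accessory genomic mining. The case study presented here uses the population of the zoonotic Salmonella enterica lineage I serovar S. Newport as an example and is specifically aimed at providing practical guidelines for microbiologists, ecologists, and epidemiologists on how to: i) use an automated phylogeny-dependent approach to map hierarchical genotypes; ii) assess the frequency distribution of genotypes as a proxy for evaluating ecological fitness; iii) determine lineage-specific degrees of clonality using independent statistical approaches; and iv) map lineage-differentiating AMR loci as an example of how to mine accessory genomic content in the context of the population structure. More broadly, this analytical approach provides a generalizable framework for performing population-based genomic analysis at a scale that can be used to infer evolutionary and ecological patterns regardless of the targeted species.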


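Protocol

1. Prepare input files

NOTE: The protocol is available here - https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code. The protocol assumes that the researcher has specifically used ProkEvo (or a comparable pipeline) to obtain the necessary outputs, which are available in this Figshare repository (https://figshare.com/account/projects/116625/articles/15097503 - login credentials are required; the user must create a free account to access the files). Of note, ProkEvo automatically downloads genomic sequences from the NCBI-SRA repository and only requires a .txt file containing a list of genome identifications as input²⁰. The file used for this work on S. Newport isolates from the U.S.A. is provided here (https://figshare.com/account/projects/116625/articles/15097503?file=29025729). Detailed information on how to install and use this bacterial genomics platform is available here (https://github.com/npavlovikj/ProkEvo/wiki/2.-Quick-start)²⁰.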

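Generate the core-genome phylogeny using FastTree²³ as previously described²⁰; FastTree is not part of the bioinformatics platform²⁰. FastTree requires the Roary²⁴ core-genome alignment as an input file. The phylogeny file is named newport_phylogeny.tree (https://figshare.com/account/projects/116625/articles/15097503?file=29025690).
Generate the SISTR²⁵ output containing the serovar classifications for Salmonella and the cgMLST variant calling data (sistr_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025699).
Generate the BAPS file with fastbaps²⁶,²⁷, containing the BAPS levels 1-6 classification of genomes into sub-groups or haplotypes (fastbaps_partition_baps_prior_l6.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025684).
Generate the MLST-based classification of genomes into STs using the MLST program (https://github.com/tseemann/mlst)²⁸ (salmonellast_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025696).
Generate the ABRicate (https://github.com/tseemann/abricate)²⁹ output as a .csv file containing the AMR loci mapped per genome (sabricate_resfinder_output.csv - https://figshare.com/account/projects/116625/articles/15097503?file=29025693).
NOTE: The user can turn off specific parts of the ProkEvo bioinformatics pipeline (for details, check here - https://github.com/npavlovikj/ProkEvo/wiki/4.2.-Remove-existing-bioinformatics-tool-from-ProkEvo). The analytical approach presented here provides guidelines on how to conduct a population-based analysis after the bioinformatics pipeline has been run.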

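2. Download and install the statistical software and integrated development environment (IDE) application

Download the most up-to-date freely available version of the R software for Linux, Mac, or PC³⁰. Follow the default installation steps.
Download the most up-to-date freely available version of the RStudio desktop IDE³¹. Follow the default installation steps.
NOTE: The next steps are included in the available script, which contains detailed information on code utilization, and should be run sequentially to generate the outputs and figures presented in this work (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). The user may decide to use another programming language, such as Python, to conduct this analytical/statistical analysis. In that case, use the steps in the script as a framework to carry out the analysis.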

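3. Install and activate data science libraries

Install all data science libraries at once as the first step in the analysis. Avoid installing the libraries every time the script needs to be re-run. Use the function install.packages() for library installation. Alternatively, the user can click on the Packages tab inside the IDE and install the packages automatically. The code used to install all needed libraries is presented here:
# Install Tidyverse
install.packages("tidyverse")
# Install skimr
install.packages("skimr")
# Install vegan
install.packages("vegan")
# Install forcats
install.packages("forcats")
# Install naniar
install.packages("naniar")
# Install ggpubr
install.packages("ggpubr")
# Install ggrepel
install.packages("ggrepel")
# Install reshape2
install.packages("reshape2")
# Install RColorBrewer
install.packages("RColorBrewer")
# Install ggtree
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")
# Installing ggtree prompts a question about installation - answer "a" to install/update all dependencies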
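To follow the advice about not reinstalling libraries on every run, one option is to check which packages are missing before calling install.packages(). The sketch below is illustrative only: the helper name missing_packages() is hypothetical and not part of the published script, and ggtree would still be installed through BiocManager rather than from CRAN.

```r
# Hypothetical helper (not from the original script): return the packages
# in `pkgs` whose namespace cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# CRAN packages used in this protocol (ggtree is handled via BiocManager)
cran_pkgs <- c("tidyverse", "skimr", "vegan", "forcats", "naniar",
               "ggpubr", "ggrepel", "reshape2", "RColorBrewer")
to_install <- missing_packages(cran_pkgs)

# Install only what is missing; left commented out so that running this
# sketch never triggers a download by accident.
# if (length(to_install) > 0) install.packages(to_install)
```

With this pattern the installation chunk can stay in the script and be re-run safely, because it does nothing once every package is present.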
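Activate all libraries or packages using the library() function at the beginning of the script, right after installation. Here is a demonstration of how to activate all necessary packages:
# Activate libraries and packages
library(tidyverse)
library(skimr)
library(vegan)
library(forcats)
library(naniar)
library(ggtree)
library(ggpubr)
library(ggrepel)
library(reshape2)
library(RColorBrewer)
Suppress the output of the code used for library and package installation and activation by using {r, include = FALSE} in the code chunk, as shown here:
```{r, include = FALSE}
# Install Tidyverse
install.packages("tidyverse")
```
NOTE: This step is optional but avoids showing chunks of unnecessary code in the final html, doc, or pdf report.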
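For a brief description of the specific functions of all libraries, along with some useful links for gathering further information, refer to steps 3.4.1-3.4.11.
1. Tidyverse - use this collection of packages for data science, including data entry, visualization, parsing and aggregation, and statistical modeling. Typically, ggplot2 (data visualization) and dplyr (data wrangling and modeling) are the most practical packages in this library³².
2. skimr - use this package to generate summary statistics of data frames, including the identification of missing values³³.
3. vegan - use this package for community ecology statistical analyses, such as calculating diversity-based statistics (e.g., alpha and beta-diversity)³⁴.
4. forcats - use this package to work with categorical variables, for example to reorder classifications. This package is part of the Tidyverse library³².
5. naniar - use this package to visualize the distribution of missing values across variables in a data frame with the vis_miss() function³⁵.
6. ggtree - use this package to visualize phylogenetic trees³⁶.
7. ggpubr - use this package to improve the quality of ggplot2-based visualizations³⁷.
8. ggrepel - use this package for text labeling inside graphs³⁸.
9. reshape2 - use the melt() function from this package to transform data frames from wide to long format³⁹.
10. RColorBrewer - use this package to manage colors in ggplot2-based visualizations⁴⁰.
11. Use the following basic functions for exploratory data analysis: head() to check the first observations in a data frame, tail() to check the last observations in a data frame, is.na() to flag missing values (combined with sum() to count them), dim() to check the number of rows and columns in a dataset, table() to count observations across a variable, and sum() to count the total number of observations or instances.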
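The base functions listed in step 3.4.11 can be tried on a small toy table before loading real data. The sketch below uses base R only; the data frame and its values are invented for illustration and do not come from the S. Newport dataset.

```r
# Toy data frame standing in for an aggregated genotype table (hypothetical values)
genotypes <- data.frame(
  id     = c("SRR001", "SRR002", "SRR003", "SRR004", "SRR005"),
  ST     = c(45, 45, 118, NA, 5),
  cgmlst = c(1001, NA, 2002, 3003, 4004)
)

first_rows <- head(genotypes, 3)   # first three observations
last_rows  <- tail(genotypes, 2)   # last two observations
dims       <- dim(genotypes)       # number of rows and columns

# Observations per ST; useNA = "ifany" keeps missing calls visible
st_counts <- table(genotypes$ST, useNA = "ifany")

# is.na() flags missing values; sum() turns the flags into a count
n_missing_st <- sum(is.na(genotypes$ST))

# Rows with at least one missing value in any column
n_incomplete <- sum(!complete.cases(genotypes))
```

Keeping NA visible in table() is a deliberate choice here, since the protocol recommends inspecting missing values before deciding how to handle them.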

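4. Data entry and analysis

NOTE: Detailed information on each step of this analysis can be found in the available script (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd). However, here are some important points to consider:

Enter all genomic data, including all genotypic classifications (serovar, BAPS, ST, and cgMLST), using the read_csv() function.
Rename variables, create new variables, and select the columns of interest from each dataset before multi-dataset aggregation.
Do not remove missing values from any independent dataset. Wait until all datasets are aggregated to modify or exclude missing values. If new variables are created for each dataset, the missing values are categorized by default into one of the newly generated classifications.
Check for erroneous characters, such as hyphens or question marks, and replace them with NA (not available). Do the same for missing values.
Aggregate the data based on the hierarchical order of genotypes (serovar -> BAPS1 -> ST -> cgMLST) and by grouping on the individual genome identifications.
Check for missing values using multiple strategies and deal with such inconsistencies explicitly. Only remove a genome or isolate from the data if its classification is unreliable. Otherwise, consider the analysis being done and remove NAs on a case-by-case basis.
NOTE: It is highly recommended to establish a strategy for dealing with such values a priori. Avoid removing all genomes or isolates that have a missing value in any variable. For instance, a genome may have an ST classification without having a cgMLST variant number. In this case, the genome can still be used for the ST-based analysis.
Once all datasets are aggregated, assign them to a data frame name or object that can be used in multiple places in the follow-up analysis, to avoid regenerating the same metadata file for every figure in the paper.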
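The points above (replace erroneous characters with NA, aggregate by genome identification, and drop NAs only for the analysis at hand) can be sketched in base R as follows. The two input tables, their column names, and the helper clean_na() are hypothetical stand-ins for the MLST and SISTR outputs; the published script uses tidyverse functions instead.

```r
# Hypothetical per-tool outputs keyed by genome identification
mlst <- data.frame(id = c("g1", "g2", "g3", "g4"),
                   ST = c("45", "-", "118", "?"),
                   stringsAsFactors = FALSE)
sistr <- data.frame(id = c("g1", "g2", "g3"),
                    serovar = c("Newport", "Newport", "Newport"),
                    cgmlst = c("1001", "2002", "-"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: turn placeholder characters into NA
clean_na <- function(x) {
  x[x %in% c("-", "?", "")] <- NA
  x
}
mlst$ST      <- as.numeric(clean_na(mlst$ST))
sistr$cgmlst <- as.numeric(clean_na(sistr$cgmlst))

# Aggregate by genome id, keeping genomes that are missing from either table
merged <- merge(sistr, mlst, by = "id", all = TRUE)

# Drop NAs only for the variable being analyzed: g3 lacks a cgMLST call
# but is still usable for an ST-based analysis
st_ready <- merged[!is.na(merged$ST), ]
```

Using a full join (all = TRUE) keeps every genome in the aggregated object, so the decision to exclude an isolate is made explicitly per analysis rather than implicitly during merging.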

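5. Conduct analyses and generate visualizations

NOTE: A detailed description of each step needed to produce all analyses and visualizations can be found in the markdown file for this paper (https://github.com/jcgneto/jove_bacterial_population_genomics/tree/main/code). The code for each figure is separated into chunks, and the entire script should be run sequentially. Additionally, the code for each main and supplementary figure is provided as a separate file (see Supplementary File 1 and Supplementary File 2). Here are some essential points (with code snippets) to consider while generating each main and supplementary figure.

Use ggtree to plot a phylogenetic tree along with genotypic information (Figure 1).
1. Optimize the ggtree figure size, including the diameter and width of the rings, by changing the numeric values inside the xlim() and gheatmap(width = ) functions, respectively (see the example code below).
  tree_plot <- ggtree(tree, layout = "circular") + xlim(-250, NA)
  figure_1 <- gheatmap(tree_plot, d4, offset = 0, width = 20, colnames = FALSE)
  NOTE: For a more detailed comparison of programs that can be used for phylogenetic plotting, check this work²⁰. That work highlighted an attempt to identify strategies for improving ggtree-based visualizations, such as decreasing the dataset size, but branch lengths and tree topology were not as clearly discriminated as in phandango⁴¹.
2. Aggregate all metadata into as few categories as possible to facilitate the choice of coloring panel when plotting multiple layers of data with the phylogenetic tree (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_1.Rmd). Conduct the data aggregation based on the question of interest and domain knowledge.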
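Use a bar plot to assess relative frequencies (Figure 2).
1. Aggregate the data for both ST lineages and cgMLST variants to facilitate visualizations. Choose an empirical or statistical threshold for data aggregation while considering the question being asked.
2. For an example code that can be used to inspect the frequency distribution of ST lineages in order to determine the cut-off, see below:
  st_dist <- d2 %>% group_by(ST) %>% # group by the ST column
    count() %>% # count the number of observations
    arrange(desc(n)) # arrange the counts in decreasing order
3. For an example code demonstrating how minor (low-frequency) STs can be aggregated, see below. As shown, STs that are not numbered 5, 31, 45, 46, 118, 132, or 350 are grouped together as "Other ST". Use a similar code for cgMLST variants (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_2.Rmd).
  d2$st <- ifelse(d2$ST == 5, "ST5", # create a new ST column in which minor STs are aggregated as Other ST
    ifelse(d2$ST == 31, "ST31",
    ifelse(d2$ST == 45, "ST45",
    ifelse(d2$ST == 46, "ST46",
    ifelse(d2$ST == 118, "ST118",
    ifelse(d2$ST == 132, "ST132", ifelse(d2$ST == 350, "ST350", "Other ST")))))))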
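As a possible alternative to the nested ifelse() calls, the major STs can be kept in a single vector and matched with %in%. This is a base R sketch on invented data, not the code from the published script; the vector name major_sts is an assumption introduced here. It also shows how relative frequencies for a bar plot could be derived from the aggregated column.

```r
# Toy ST calls (hypothetical values), including a minor ST and a missing call
d2 <- data.frame(ST = c(5, 45, 45, 118, 999, 31, NA, 2000))

# Major STs named in the protocol; everything else becomes "Other ST"
major_sts <- c(5, 31, 45, 46, 118, 132, 350)

d2$st <- ifelse(is.na(d2$ST), NA_character_,
                ifelse(d2$ST %in% major_sts, paste0("ST", d2$ST), "Other ST"))

# Counts and relative frequencies (%) per aggregated ST; NAs are excluded
st_freq <- sort(table(d2$st), decreasing = TRUE)
st_prop <- round(100 * prop.table(st_freq), 1)
```

Changing the aggregation threshold then only requires editing major_sts, and missing ST calls stay NA instead of being silently folded into "Other ST".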
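Use a nested approach to calculate the proportion of each ST lineage within each BAPS1 sub-group in order to identify STs that are ancestrally related (i.e., that belong to the same BAPS1 sub-group) (Figure 3). The code below exemplifies how the ST-based proportion can be calculated across BAPS1 sub-groups (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_3.Rmd):
baps <- d2b %>% filter(serovar == "Newport") %>% # filter Newport serovars
  select(baps_1, ST) %>% # select the baps_1 and ST columns
  mutate(ST = as.numeric(ST)) %>% # change the ST column to numeric
  drop_na(baps_1, ST) %>% # drop NAs
  group_by(baps_1, ST) %>% # group by baps_1 and ST
  summarise(n = n()) %>% # count observations
  mutate(prop = n/sum(n)*100) # calculate proportions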
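The nested proportion can be checked by hand on a small example. The base R sketch below uses invented BAPS1 and ST values and computes, for each BAPS1 sub-group, the percentage of genomes assigned to each ST, which is the quantity the pipeline above is meant to produce. The object names are illustrative only.

```r
# Toy classification table (hypothetical values); one row per genome
baps <- data.frame(
  baps_1 = c(1, 1, 1, 1, 2, 2, 2, NA),
  ST     = c(5, 5, 118, 118, 45, 45, 45, 45)
)

# Drop genomes with a missing BAPS1 or ST call for this analysis only
baps <- baps[complete.cases(baps), ]

# Cross-tabulate, then convert each row (BAPS1 sub-group) to percentages
counts <- table(baps_1 = baps$baps_1, ST = baps$ST)
props  <- 100 * prop.table(counts, margin = 1)

# Long format, keeping only ST/BAPS1 combinations that actually occur
prop_long <- as.data.frame(as.table(props), responseName = "prop")
prop_long <- prop_long[prop_long$prop > 0, ]
```

In this toy case ST5 and ST118 each make up 50% of BAPS1 sub-group 1, while ST45 accounts for all of sub-group 2, which mirrors the kind of pattern used to identify ancestrally related STs.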
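Use Resfinder-based gene annotation results to plot the distribution of AMR loci across ST lineages (Figure 4).
NOTE: Resfinder has been widely used in ecological and epidemiological studies⁴². The annotation of protein-coding genes can vary depending on how often databases are curated and updated. If using the suggested bioinformatics pipeline, the researcher can compare AMR-based loci classifications across different databases²⁰. Make sure to check which databases are being continuously updated. Do not use outdated or poorly curated databases, in order to avoid miscalls.
1. Use an empirical or statistical threshold to select the most important AMR loci and facilitate visualizations. Provide a raw .csv file containing the calculated proportions of all AMR loci across all ST lineages, as shown here (https://figshare.com/account/projects/116625/articles/15097503?file=29025687).
2. Calculate the AMR proportion for each ST using the following code (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_4.Rmd):
  # Calculations for ST45
  d2c <- data6 %>% filter(st == "ST45") # filter the ST45 data first
  # For ST45, calculate the proportion of AMR loci and only keep proportions greater than 10%
  d3c <- d2c %>% select(id, gene) %>% # select columns
    group_by(id, gene) %>% # group by id and gene
    summarize(count = n()) %>% # count observations
    # Replace counts equal to 2 with 1 to consider only one copy of each gene
    # (duplications may not be reliable). To exclude duplicated genes instead,
    # use filter(count != 2); otherwise leave as is
    mutate(count = replace(count, count == 2, 1)) %>%
    filter(count <= 1) # keep counts less than or equal to 1
  d4c <- d3c %>% group_by(gene) %>% # group by gene
    summarize(value = n()) %>% # count observations
    mutate(total = table(data1$st)[6]) %>% # get the total count for the ST
    mutate(prop = (value/total)*100) # calculate proportions
  d5c <- d4c %>% mutate(st = "ST45") # create an st column and add the ST information
3. After the calculations are done for all STs, combine the datasets into one data frame using the following code:
  # Combine datasets
  d6 <- rbind(d5a, d5b, d5c, d5d, d5e, d5f, d5g, d5h) # row-bind the datasets
4. To export the .csv file containing the calculated proportions, use the following code:
  # Export the data table containing ST and AMR loci information
  abx_newport_st <- d6
  write.csv(abx_newport_st, "abx_newport_st.csv", row.names = FALSE)
5. Before plotting the AMR-based distribution across ST lineages, filter the data based on a threshold to facilitate visualizations, as shown below:
  # Filter AMR loci with a proportion higher than or equal to 10%
  d7 <- d6 %>% filter(prop >= 10) # determine the threshold empirically or statistically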
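The per-ST calculation above is repeated once for each lineage. As a hedged illustration of the same idea, the base R sketch below computes the proportion of genomes carrying each AMR locus for all STs in one pass. The tables amr and genomes and their values are invented, and the handling of duplicates differs slightly from the published code: here any repeated gene hit within a genome is collapsed to a single copy with unique(), whereas the original replaces counts of exactly 2 and then filters.

```r
# Hypothetical long-format AMR hits: one row per genome/gene hit
amr <- data.frame(
  id   = c("g1", "g1", "g2", "g3", "g3", "g4", "g5"),
  st   = c("ST45", "ST45", "ST45", "ST45", "ST45", "ST118", "ST118"),
  gene = c("tet(A)", "tet(A)", "tet(A)", "sul1", "tet(A)", "sul1", "sul1")
)

# Hypothetical table of all genomes per ST (includes g6, which has no AMR hit)
genomes <- data.frame(
  id = c("g1", "g2", "g3", "g4", "g5", "g6"),
  st = c("ST45", "ST45", "ST45", "ST118", "ST118", "ST118")
)

# Consider a single copy of each gene per genome
amr_unique <- unique(amr[, c("id", "st", "gene")])

# Number of genomes carrying each gene within each ST
carriers <- aggregate(id ~ st + gene, data = amr_unique, FUN = length)
names(carriers)[3] <- "value"

# Denominator: total genomes per ST, then the proportion in percent
totals <- table(genomes$st)
carriers$total <- as.numeric(totals[as.character(carriers$st)])
carriers$prop  <- 100 * carriers$value / carriers$total

# Keep loci at or above the chosen threshold (10% in the protocol)
threshold <- 10
d7 <- carriers[carriers$prop >= threshold, ]
```

Taking the denominator from the full genome table rather than from the AMR hits matters, because genomes without any AMR locus would otherwise be left out and the proportions inflated.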
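Use ggtree to plot the core-genome phylogeny along with the hierarchical genotypic classifications and AMR data in a single plot (Figure 5).
1. Optimize the figure size inside ggtree by using the parameters mentioned above (see step 5.1.1).
2. Optimize visualizations by aggregating variables or by using a binary classification, such as gene presence or absence. The more features are added to the plot, the harder the color selection process becomes (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/figure_5.Rmd).
  NOTE: Supplementary figures - a detailed description of the entire code can be found here (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/data_analysis_R_code.Rmd).
Use a scatter plot in ggplot2, without data aggregation, to display the distribution of ST lineages or cgMLST variants while highlighting the most frequent genotypes (Supplementary Figure 1) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s1.Rmd).
Do a nested analysis to assess the composition of ST lineages through the proportion of cgMLST variants, in order to get a glimpse of ST-based genetic diversity while identifying the most frequent variants and their genetic relationships (i.e., cgMLST variants that belong to the same ST share an ancestor more recently than those belonging to distinct STs) (Supplementary Figure 2) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s2.Rmd).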
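Use a community ecology metric, namely the Simpson's D index of diversity, to measure the degree of clonality or genotypic diversity of each of the major ST lineages⁴³ (Supplementary Figure 3).
1. Calculate the index of diversity for the ST lineages at different levels of genotypic resolution, including BAPS levels 1 through 6 and cgMLST. Below is a code example of how to do this calculation at BAPS level 1 (BAPS1) of genotypic resolution:
  # BAPS level 1 (BAPS1)
  # Drop STs and BAPS1 with NAs, group by ST and BAPS1, and then calculate the Simpson's index
  baps1 <- data6 %>%
    select(st, BAPS1) %>% # select columns
    drop_na(st, BAPS1) %>% # drop NAs
    group_by(st, BAPS1) %>% # group by columns
    summarise(n = n()) %>% # count observations
    mutate(simpson = diversity(n, "simpson")) %>% # calculate diversity
    group_by(st) %>% # group by column
    summarise(simpson = mean(simpson)) %>% # calculate the mean of the index
    melt(id.vars = c("st"), measure.vars = "simpson",
         variable.name = "index", value.name = "value") %>% # convert into long format
    mutate(strat = "BAPS1") # create a strat column
  NOTE: A more genetically diverse population (i.e., one with more variants at different layers of genotypic resolution) has a higher index at the cgMLST level and produces increasing index-based values going from BAPS level 2 to 6 (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s3.Rmd).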
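To make the index concrete: for sub-group proportions p_i within an ST lineage, the "simpson" option of vegan's diversity() is understood here to return 1 - sum(p_i^2), so a fully clonal lineage scores 0 and the value rises as genomes spread over more sub-groups. The base R sketch below reproduces that formula on invented data; the function name simpson_index() is hypothetical and is used only so that the example runs without vegan.

```r
# Hypothetical stand-in for vegan::diversity(n, "simpson"): 1 - sum(p^2)
simpson_index <- function(n) {
  p <- n / sum(n)
  1 - sum(p^2)
}

# Toy data (hypothetical): ST45 is fully clonal at BAPS1,
# ST118 is split evenly between two BAPS1 sub-groups
d <- data.frame(
  st    = c(rep("ST45", 4), rep("ST118", 4)),
  BAPS1 = c(1, 1, 1, 1, 1, 1, 2, 2)
)

# Genomes per BAPS1 sub-group within each ST, then one index per ST
counts <- table(d$st, d$BAPS1)
simpson_by_st <- apply(counts, 1, simpson_index)
```

Here ST45 yields 0 and ST118 yields 0.5, which matches the interpretation in the note above: higher values indicate a more genotypically diverse, less clonal lineage.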
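Examine the degree of genotypic diversity of ST lineages by plotting the relative frequency of BAPS sub-groups at all levels of resolution (BAPS1-6) (Supplementary Figure 4). The more diverse the population is, the sparser the distribution of BAPS sub-groups (haplotypes) becomes going from BAPS1 (lower level of resolution) to BAPS6 (higher level of resolution) (https://github.com/jcgneto/jove_bacterial_population_genomics/blob/main/code/supplementary_figure_s4.Rmd).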


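Results

By utilizing the computational platform ProkEvo for population genomic analyses, the first step in bacterial WGS data mining consists of examining the hierarchical population structure in the context of a core-genome phylogeny (Figure 1). In the case of S. enterica lineage I, as exemplified by the S. Newport dataset, the hierarchical structure of the population is as follows: serovar (lowest level of resolution), BAPS1 sub-groups or haplotypes, ST lineages, and cgMLST variants (highest level of resolution)²⁰. For the hierarchical population structure, this phylogeny-guided ...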


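Discussion

The use of a systems-based heuristic and hierarchical population structure analysis provides a framework for identifying novel genomic signatures in bacterial datasets that have the potential to explain unique ecological and epidemiological patterns²⁰. Additionally, mapping accessory genome data onto the population structure can be used to infer ancestrally acquired and/or recently derived traits that help ST lineages or cgMLST variants in reservoirs⁶,²⁰,²¹,⁴⁵ ...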


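Disclosures

The authors declare that no competing interests exist.

Acknowledgments

This work was supported by funding provided by the UNL-IANR Agricultural Research Division and the National Institute for Antimicrobial Resistance Research and Education, and by the Nebraska Food for Health Center at the Food Science and Technology Department. This research could only be completed by utilizing the Holland Computing Center (HCC) at UNL, which receives support from the Nebraska Research Initiative. We are also thankful for having access, through the HCC, to resources provided by the Open Science Grid (OSG), which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work used the Pegasus Workflow Management Software, which is funded by the National Science Foundation (grant #1664162).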


Materials

Name	Company	Catalog Number	Comments
amr_data_filtered			https://figshare.com/account/projects/116625/articles/14829225?file=28758762
amr_data_raw			https://figshare.com/account/projects/116625/articles/14829225?file=28547994
baps_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548003
Core-genome phylogeny			https://figshare.com/account/projects/116625/articles/14829225?file=28548006
genome_sra			https://figshare.com/account/projects/116625/articles/14829225?file=28639209
Linux, Mac, or PC			any high-performance platform
mlst_output			https://figshare.com/account/projects/116625/articles/14829225?file=28547997
sistr_output			https://figshare.com/account/projects/116625/articles/14829225?file=28548000
Figshare credentials are required to log in and access the files.

References

Grad, Y. H., et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. Proceedings of the National Academy of Sciences of the United States of America. 109 (8), 3065-3070 (2012).
Worby, C. J., Chang, H. -H., Hanage, W. P., Lipsitch, M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics. 198 (4), 1395-1404 (2014).
Leekitcharoenphon, P., et al. Global genomic epidemiology of Salmonella enterica serovar Typhimurium DT104. Applied and Environmental Microbiology. 82 (8), 2516-2526 (2016).
Alba, P., et al. Molecular epidemiology of Salmonella Infantis in Europe: insights into the success of the bacterial host and its parasitic pESI-like megaplasmid. Microbial Genomics. 6 (5), (2020).
Zhou, Z., Alikhan, N. -F., Mohamed, K., Fan, Y., the Agama Study Group, Achtman, M. The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Research. 30 (1), 138-152 (2020).
Azarian, T., et al. Global emergence and population dynamics of divergent serotype 3 CC180 pneumococci. PLOS Pathogens. 14 (11), 1007438(2018).
Saltykova, A., et al. Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i. PLOS ONE. 13 (2), 0192504(2018).
Achtman, M., et al. Multi-locus sequence typing as a replacement for serotyping in Salmonella enterica. PLoS Pathogens. 8 (6), 1002776(2012).
Maiden, M. C. J., et al. Multi-locus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences of the United States of America. 95 (6), 3140-3145 (1998).
Alikhan, N. -F., Zhou, Z., Sergeant, M. J., Achtman, M. A genomic overview of the population structure of Salmonella. PLOS Genetics. 14 (4), 1007261(2018).
Gupta, A., Jordan, I. K., Rishishwar, L. stringMLST: a fast k-mer based tool for multi-locus sequence typing. Bioinformatics. 33 (1), 119-121 (2017).
Jolley, K. A., Maiden, M. C. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 11 (1), 595(2010).
Maiden, M. C. J., et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology. 11 (10), 728-736 (2013).
Maiden, M. C. J. Multilocus sequence typing of bacteria. Annual Review of Microbiology. 60 (1), 561-588 (2006).
Shapiro, B. J., Polz, M. F. Ordering microbial diversity into ecologically and genetically cohesive units. Trends in Microbiology. 22 (5), 235-247 (2014).
Cordero, O. X., Polz, M. F. Explaining microbial genomic diversity in light of evolutionary ecology. Nature Reviews Microbiology. 12 (4), 263-273 (2014).
Achtman, M., Wagner, M. Microbial diversity and the genetic nature of microbial species. Nature Reviews Microbiology. 6 (6), 431-440 (2008).
Abudahab, K., et al. PANINI: Pangenome neighbour identification for bacterial populations. Microbial Genomics. 5 (4), (2019).
Laing, C. R., Whiteside, M. D., Gannon, V. P. J. Pan-genome analyses of the species Salmonella enterica, and identification of genomic markers predictive for species, subspecies, and serovar. Frontiers in Microbiology. 8, 1345(2017).
Pavlovikj, N., Gomes-Neto, J. C., Deogun, J. S., Benson, A. K. ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses. PeerJ. 9, 11376(2021).
McNally, A., et al. Combined analysis of variation in core, accessory and regulatory genome regions provides a super-resolution view into the evolution of bacterial populations. PLOS Genetics. 12 (9), 1006280(2016).
Langridge, G. C., et al. Patterns of genome evolution that have accompanied host adaptation in Salmonella. Proceedings of the National Academy of Sciences of the United States of America. 112 (3), 863-868 (2015).
Price, M. N., Dehal, P. S., Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLoS ONE. 5 (3), 9490(2010).
Page, A. J., et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 31 (22), 3691-3693 (2015).
Yoshida, C. E., et al. The Salmonella In silico typing resource (SISTR): An open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLOS ONE. 11 (1), 0147101(2016).
Cheng, L., Connor, T. R., Siren, J., Aanensen, D. M., Corander, J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Molecular Biology and Evolution. 30 (5), 1224-1228 (2013).
Tonkin-Hill, G., Lees, J. A., Bentley, S. D., Frost, S. D. W., Corander, J. Fast hierarchical Bayesian analysis of population structure. Nucleic Acids Research. 47 (11), 5539-5549 (2019).
Seemann, T. MLST. GitHub. , Available from: https://github.com/tseemann/mlst (2020).
Seemann, T. ABRicate. GitHub. , Available from: https://github.com/tseemann/abricate (2020).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. , Vienna, Austria. Available from: https://cran.r-project.org (2021).
RStudio Team. RStudio: Integrated Development for R. RStudio, PBC. , Boston, MA. Available from: http://www.rstudio.com (2020).
Wickham, H., et al. Welcome to the Tidyverse. Journal of Open Source Software. 4 (43), 1686(2019).
rOpenSci: The skimr package. GitHub. , Berkeley, CA. Available from: https://github.com/ropensci/skimr/ (2021).
Oksanen, J., et al. vegan: Community ecology package. R package version 2.5-5. , Available from: https://CRAN.R-project.org/package=vegan (2019).
Tierney, N. J., Cook, D. H. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. arXiv. , Available from: http://arxiv.org/abs/1809.02264 (2020).
Yu, G. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 69 (1), (2020).
Kassambara, A. ggpubr: "ggplot2" Based Publication Ready Plots. R package version 0.4.0. , Available from: https://CRAN.R-project.org/package=ggpubr (2020).
Slowikowski, K. ggrepel: Automatically Position Non-Overlapping Text Labels with "ggplot2". R package version 0.9.1. , Available from: https://CRAN.R-project.org/package=ggrepel (2021).
Wickham, H. Reshaping Data with the reshape Package. Journal of Statistical Software. 21 (12), (2007).
Neuwirth, E. RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. , Available from: https://CRAN.R-project.org/package=RColorBrewer (2014).
Hadfield, J., Croucher, N. J., Goater, R. J., Abudahab, K., Aanensen, D. M., Harris, S. R. Phandango: an interactive viewer for bacterial population genomics. Bioinformatics. 34 (2), 292-293 (2018).
Perron, G. G., et al. Functional characterization of bacteria isolated from ancient arctic soil exposes diverse resistance mechanisms to modern antibiotics. PLOS ONE. 10 (3), 0069533(2015).
Mitchell, P. K., et al. Population genomics of pneumococcal carriage in Massachusetts children following introduction of PCV-13. Microbial Genomics. 5 (2), (2019).
Klemm, E. J., et al. Emergence of host-adapted Salmonella Enteritidis through rapid evolution in an immunocompromised host. Nature Microbiology. 1 (3), 15023(2016).
Břinda, K., et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature Microbiology. 5 (3), 455-464 (2020).
MacFadden, D. R., et al. Using genetic distance from archived samples for the prediction of antibiotic resistance in Escherichia coli. Antimicrobial Agents and Chemotherapy. 64 (5), (2020).
Mageiros, L., et al. Genome evolution and the emergence of pathogenicity in avian Escherichia coli. Nature Communications. 12 (1), 765(2021).
Yahara, K., et al. Genome-wide association of functional traits linked with Campylobacter jejuni survival from farm to fork. Environmental Microbiology. 19 (1), 361-380 (2017).
Walter, J., Maldonado-Gómez, M. X., Martínez, I. To engraft or not to engraft: an ecological framework for gut microbiome modulation with live microbes. Current Opinion in Biotechnology. 49, 129-139 (2018).
Maldonado-Gómez, M. X., et al. Stable engraftment of Bifidobacterium longum AH1206 in the human gut depends on individualized features of the resident microbiome. Cell Host & Microbe. 20 (4), 515-526 (2016).
Zhao, S., et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host & Microbe. 25 (5), 656-667 (2019).
Treangen, T. J., Ondov, B. D., Koren, S., Phillippy, A. M. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology. 15 (11), 524(2014).
Letunic, I., Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research. 49, 293-296 (2021).
Croucher, N. J., et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research. 43 (3), 15(2015).
Fenske, G. J., Thachil, A., McDonough, P. L., Glaser, A., Scaria, J. Geography shapes the population genomics of Salmonella enterica Dublin. Genome Biology and Evolution. 11 (8), 2220-2231 (2019).
Lees, J. A., et al. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research. 29 (2), 304-316 (2019).
Cohan, F. M. Towards a conceptual and operational union of bacterial systematics, ecology, and evolution. Philosophical Transactions of the Royal Society B: Biological Sciences. 361 (1475), 1985-1996 (2006).
Cohan, F. M., Koeppel, A. F. The origins of ecological diversity in prokaryotes. Current Biology. 18 (21), 1024-1034 (2008).
Cohan, F. M. Transmission in the origins of bacterial diversity, from ecotypes to phyla. Microbial Transmission. 5 (5), 311-343 (2019).
Davis, J. J., et al. The PATRIC bioinformatics resource center: expanding data and analysis capabilities. Nucleic Acids Research. 48, 606-612 (2019).
Feng, Y., Zou, S., Chen, H., Yu, Y., Ruan, Z. BacWGSTdb 2.0: a one-stop repository for bacterial whole-genome sequence typing and source tracking. Nucleic Acids Research. 49, 644-650 (2021).
