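Comparison of the Predictive Performance of Three Lymph Node Staging Systems for Colorectal Signet Ring Cell Carcinoma Based on Machine Learning Models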

Jinyan Jia; Zixuan Yu; Maorun Zhang; Fang Hu; Gang Liu

doi:10.3791/67941

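Summary

This study uses machine learning models and competing risk analysis to evaluate prognostic systems for patients with colorectal signet ring cell carcinoma. It identifies the log odds of positive lymph nodes as a better predictor than pN staging, demonstrates strong predictive performance, and supports clinical decision-making with robust survival prediction tools.

Abstract

Lymph node status is a key prognostic indicator for patients; however, the prognosis of colorectal signet ring cell carcinoma (SRCC) has received limited attention. This study investigated the prognostic predictive capacity of the log odds of positive lymph nodes (LODDS), the lymph node ratio (LNR), and pN staging in SRCC patients using machine learning models (Random Forest, XGBoost, and Neural Network) alongside competing risk models. Relevant data were extracted from the Surveillance, Epidemiology, and End Results (SEER) database. For the machine learning models, prognostic factors for cancer-specific survival (CSS) were identified through univariate and multivariate Cox regression analyses, and three machine learning methods (XGBoost, RF, and NN) were then applied to determine the optimal lymph node staging system. In the competing risk model, univariate and multivariate competing risk analyses were used to identify prognostic factors, and a nomogram was constructed to predict the prognosis of SRCC patients. The area under the receiver operating characteristic curve (AUC-ROC) and calibration curves were used to assess model performance. A total of 2,409 SRCC patients were included in this study. To validate the models, an additional cohort of 15,122 colorectal cancer patients, excluding SRCC cases, was used for external validation. Both the machine learning models and the competing risk nomogram performed strongly in predicting survival outcomes. Compared with pN staging, the LODDS staging system demonstrated superior prognostic capability. On evaluation, the machine learning and competing risk models showed good discrimination, calibration, and interpretability. Our findings may help inform clinical decision-making for patients.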

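Introduction

Colorectal cancer (CRC) is the third most common malignancy worldwide^1,2,3. Signet ring cell carcinoma (SRCC), a rare subtype of CRC that accounts for about 1% of cases, is characterized by abundant intracellular mucin that displaces the nucleus^1,2,4. SRCC tends to occur in younger patients, is more prevalent in women, and is often at an advanced stage at diagnosis. Compared with colorectal adenocarcinoma, SRCC is more poorly differentiated, carries a higher risk of distant metastasis, and has a 5-year survival rate of only 12%-20%^5,6. Developing an accurate and effective prognostic model for SRCC is essential for optimizing treatment strategies and improving clinical outcomes.

This study aims to construct a robust prognostic model for SRCC patients using advanced statistical methods, including machine learning (ML) and competing risk models. These methods can accommodate complex relationships in clinical data, provide individualized risk assessment, and outperform traditional methods in predictive accuracy. Machine learning models such as Random Forest, XGBoost, and Neural Networks excel at handling high-dimensional data and identifying complex patterns. Studies have shown that AI models can effectively predict survival outcomes in colorectal cancer, highlighting the potential of ML in clinical applications^7,8. As a complement to ML, competing risk models account for multiple event types, such as cancer-specific death versus death from other causes, to refine survival analysis. Unlike traditional methods such as the Kaplan-Meier estimator, competing risk models accurately estimate the marginal probability of an event in the presence of competing risks, providing a more precise survival assessment^8. Integrating ML with competing risk analysis enhances predictive performance and provides a strong framework for individualized prognostic tools in SRCC^9,10,11.

Lymph node metastasis significantly affects prognosis and recurrence in CRC patients. Although assessment of the N stage in the TNM classification is crucial, inadequate lymph node examination (reported in 48%-63% of cases) may lead to understaging. To address this, alternative measures such as the lymph node ratio (LNR) and the log odds of positive lymph nodes (LODDS) have been introduced. LNR, the ratio of positive lymph nodes (PLN) to total lymph nodes (TLN), is less affected by the TLN count and is a prognostic factor in CRC. LODDS, the logarithm of the ratio of PLN to negative lymph nodes (NLN), has shown superior predictive ability in both gastric SRCC and colorectal cancer^10,11. Machine learning is increasingly applied in oncology, with models improving risk stratification and prognostic prediction in various cancers, including breast, prostate, and lung cancer^12,13,14. However, its application to colorectal SRCC remains limited.

This study aims to bridge this gap by integrating LODDS with ML and competing risk models to create a comprehensive prognostic tool. By evaluating the prognostic value of LODDS and leveraging advanced predictive techniques, it seeks to strengthen clinical decision-making and improve outcomes for SRCC patients.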

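Protocol

This study did not require ethical approval or consent to participate. The data used in this study were obtained from the SEER database. We included patients diagnosed with colorectal signet ring cell carcinoma between 2004 and 2015, as well as patients with other types of colorectal cancer. Exclusion criteria were a survival time of less than 1 month, incomplete clinicopathological information, and an unknown cause of death.

1. Data acquisition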

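Download the SEER*Stat software (version 8.4.3) from the SEER database website (http://seer.cancer.gov/about/overview.html). After logging in to the software, click Case Listing Session > Data and select the Incidence - SEER Research Plus Data, 17 Registries, Nov 2021 Sub (2000-2019) database.
Click Selection > Edit and select {Race, Sex, Year Dx.Year of diagnosis} = '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015' AND {Site and Morphology.Site recode ICD-O-3/WHO 2008} = '8490/3'.
Then click Table and, in the Available Variables panel, select Age recode with single ages and 100+, Sex, Marital status, Site recode ICD-O-3/WHO 2008, CS tumor size, Regional nodes examined (1988+), Regional nodes positive (1988+), Derived AJCC Stage Group, 6th ed (2004-2015), Derived AJCC T, 6th ed (2004-2015), Derived AJCC N, 6th ed (2004-2015), Derived AJCC M, 6th ed (2004-2015), CEA, Radiation recode, Chemotherapy recode (yes, no/unk), SEER cause-specific death classification, Vital status recode (study cutoff used), Survival months, and Year of diagnosis.
Finally, click Output, name the data, and click Execute to output and save the data. The detailed inclusion process is shown in Figure 1.
Download the data of colorectal cancer patients (excluding SRCC cases) for the subsequent external validation. Click Selection > Edit and select {Race, Sex, Year Dx.Year of diagnosis} = '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015' AND {Primary Site - labeled} = 'C18-C20'. Repeat the variable selection and export steps above (steps 1.3 and 1.4) to obtain the clinicopathological information, and use {Site and Morphology.Site recode ICD-O-3/WHO 2008} = '8490/3' to exclude SRCC cases from the downloaded file.
For comparison, process several variables. Classify lymph node status using the lymph node ratio (LNR) and the log odds of positive lymph nodes (LODDS).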
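1. Define LNR as the ratio of positive lymph nodes (PLN) to total lymph nodes (TLN). Calculate the LODDS value using the following formula:
  $\mathrm{LODDS} = \log_e\left(\dfrac{\mathrm{PLN} + 0.5}{\mathrm{NLN} + 0.5}\right)$
  where PLN and NLN are the numbers of positive and negative lymph nodes, and 0.5 is added to both counts to avoid an infinite result. Use X-tile software (version 3.6.1), which is based on the minimal P-value approach, to determine the cutoff values for LNR, LODDS, and tumor size.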
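The two definitions above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the authors' code: the function name `compute_ln_metrics` and the example counts are hypothetical, and NLN is taken to be TLN minus PLN.

```r
# Sketch: derive NLN, LNR, and LODDS from node counts.
# Assumption: NLN = TLN - PLN; the function name and inputs are illustrative only.
compute_ln_metrics <- function(pln, tln) {
  stopifnot(all(pln >= 0), all(tln >= pln))
  nln <- tln - pln                              # negative lymph nodes
  lnr <- ifelse(tln > 0, pln / tln, NA_real_)   # undefined when no nodes were examined
  lodds <- log((pln + 0.5) / (nln + 0.5))       # natural log; 0.5 keeps the ratio finite
  data.frame(PLN = pln, TLN = tln, NLN = nln, LNR = lnr, LODDS = lodds)
}

# Three made-up patients: node-negative, partially positive, all nodes positive
nodes <- compute_ln_metrics(pln = c(0, 3, 12), tln = c(12, 15, 12))
print(nodes)
```

Note that LODDS still separates the first and third patients by a wide margin even though their LNR values sit at the extremes 0 and 1, which is one reason it can discriminate among patients with the same LNR but different numbers of examined nodes.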
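Open the X-tile software, click File > Open, and select the data file to import it. After the data are loaded, map the variables: Censor corresponds to survival status, Survival time corresponds to survival time, and marker1 is the variable to be analyzed. Make sure the data are matched correctly.
Then click Do > Kaplan-Meier > Marker1 to perform Kaplan-Meier survival analysis and generate survival curves. Determine the optimal cutoff values based on the separation of the Kaplan-Meier curves, statistical significance (e.g., P value), and clinical relevance, and then record or export the results.
1. Divide LNR into three groups: LNR 1 (≤0.16), LNR 2 (0.16-0.78), and LNR 3 (≥0.78). Divide patients into three groups according to LODDS: LODDS 1 (≤-1.44), LODDS 2 (-1.44-0.86), and LODDS 3 (≥0.86).
2. Divide tumor size into three categories: ≤3.5 cm, 3.5-5.5 cm, and ≥5.5 cm. Convert age from a continuous variable to a categorical variable by classifying age at initial diagnosis as ≥60 years or <60 years. Classify tumor location as right colon, left colon, or rectum according to the distribution of signet ring cell carcinoma (SRCC) tumors. The right colon includes the cecum, ascending colon, hepatic flexure, and transverse colon, whereas the left colon includes the splenic flexure, descending colon, sigmoid colon, and rectosigmoid junction.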
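As a hedged sketch of how the cutoffs above might be applied in R: the helper `three_groups` is hypothetical, and because the stated ranges share their boundary values, the code assumes that a value equal to the lower cutoff falls in group 1 and a value equal to the upper cutoff falls in group 3.

```r
# Sketch: map a continuous marker to three ordered groups using two cutoffs.
# Assumption: x == lo goes to group 1 and x == hi goes to group 3.
three_groups <- function(x, lo, hi, prefix) {
  grp <- ifelse(x <= lo, 1L, ifelse(x >= hi, 3L, 2L))
  factor(paste0(prefix, grp), levels = paste0(prefix, 1:3))
}

# Made-up values, including the boundary values themselves
lnr_group   <- three_groups(c(0.05, 0.16, 0.40, 0.78, 1.00), lo = 0.16, hi = 0.78, prefix = "LNR")
lodds_group <- three_groups(c(-3.2, -1.44, 0.10, 0.86, 2.5), lo = -1.44, hi = 0.86, prefix = "LODDS")

# Age dichotomized at 60 years, as in the protocol
age_group <- factor(ifelse(c(45, 60, 72) >= 60, ">=60", "<60"), levels = c("<60", ">=60"))

print(table(lnr_group))
print(table(lodds_group))
print(table(age_group))
```

Keeping the groups as factors with an explicit level order means that later models treat group 1 as the reference category.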
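Randomly assign the data of the 2,409 eligible SRCC patients to a training cohort (N = 1,686) and a validation cohort (N = 723) at a 7:3 ratio. Use the following code for the random split; data.csv is the file obtained from the SEER database. The files generated by the split are used for further analysis.
library(caret)
data <- read.csv("data.csv")
set.seed(123)
# 'variable' stands for the column used to stratify the split
train_indices <- createDataPartition(data$variable, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
write.csv(train_data, "traindata.csv", row.names = FALSE)
write.csv(test_data, "testdata.csv", row.names = FALSE)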
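To make the idea of a stratified 7:3 split concrete, here is a base R sketch on simulated data. It is an illustration only and not a reimplementation of `caret::createDataPartition`; the data frame `sim`, the helper `stratified_split`, and the simulated event rate are all assumptions.

```r
# Sketch: stratified 7:3 split in base R on simulated data (not SEER data).
set.seed(123)
n <- 2409
sim <- data.frame(id = seq_len(n), status = rbinom(n, 1, 0.6))

stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)        # row indices per outcome class
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx <- stratified_split(sim$status, p = 0.7)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Cohort sizes and event proportions should be close to 70/30 and nearly equal
print(c(train = nrow(train), test = nrow(test)))
print(c(train_event_rate = mean(train$status), test_event_rate = mean(test$status)))
```

Sampling within each outcome class keeps the event rate almost identical in the two cohorts, which a plain random split does not guarantee.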

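2. ML model development and validation

Download RStudio (2024.04.2+764) and R (4.4.1). Open RStudio to run R. Click New File and select R Script to create a new R script. Enter the relevant code in the code editor and click Run to execute it.
Use the following code to screen the variables to include in the ML models by Cox regression analysis, and to explore the effects of LODDS, LNR, and pN staging on cancer-specific survival (CSS) in SRCC patients. traindata.csv contains the data obtained from the SEER database.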
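library(survival)
library(survminer)
library(rms)
library(dplyr)
data <- read.csv("traindata.csv")
data$time <- as.numeric(data$time)
data$status <- as.numeric(data$status)
# Convert the candidate predictors to factors
variables <- c("Sex", "Age", "Race", "Marital", "Stage", "T", "N", "M", "Tumor_size", "LNR", "LODDS", "CEA", "Radiotherapy", "Chemotherapy", "Site")
data <- data %>%
  mutate(across(all_of(variables), as.factor))
# Univariate Cox model for a single variable (T stage)
cox <- coxph(Surv(time, status) ~ T, data = data)
cox$coefficients
pval <- anova(cox)$Pr[2]
clean_data <- data[, c(1:12, 14:18)]
# Return the names of the first `index` columns whose univariate P value is < 0.05
get_coxVariable <- function(your_data, index) {
  cox_list <- c()
  k <- 1
  for (i in 1:index) {
    mod <- coxph(Surv(time, status) ~ your_data[, i], data = your_data)
    pval <- anova(mod)$Pr[2]
    print(pval)
    print(colnames(your_data)[i])
    if (pval < 0.05) {
      cox_list[k] <- colnames(your_data)[i]
      k <- k + 1
    }
  }
  return(cox_list)
}
variable_select <- get_coxVariable(clean_data, 15)
for (i in seq_along(variable_select)) { print(variable_select[i]) }
# Print the univariate Cox summary for each selected variable
for (var in variable_select) {
  formula <- as.formula(paste("Surv(time, status) ~", var))
  cox_model <- coxph(formula, data = data)
  print(summary(cox_model))
}
ggforest(cox, data = data)
# Multivariate Cox model with the retained variables
variables <- c("Sex", "Age", "Race", "Marital", "Stage", "T", "N", "M", "Tumor_size", "LNR", "LODDS", "Chemotherapy")
data <- data %>%
  mutate(across(all_of(variables), as.factor))
cox <- coxph(Surv(time, status) ~ Sex + Age + Race + Marital + T + N + M + Tumor_size + LNR +
  LODDS + Chemotherapy, data = data)
ggforest(cox, data = data)
ggplot_forest <- ggforest(cox, data = data)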
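The univariate screening step can be illustrated on simulated data with the survival package alone. This is a sketch under assumptions: the data are simulated (a hazard that rises with the LODDS group and a noise variable with no effect), the helper `screen_univariate` is hypothetical, and the likelihood ratio test P value from `summary()` stands in for the `anova()` P value used in the protocol code.

```r
library(survival)

# Simulated cohort: hazard rises with LODDS group; Noise has no effect.
set.seed(1)
n <- 500
sim <- data.frame(
  LODDS = factor(sample(c("LODDS1", "LODDS2", "LODDS3"), n, replace = TRUE)),
  Noise = factor(sample(c("a", "b"), n, replace = TRUE))
)
rate <- 0.02 * exp(0.8 * (as.integer(sim$LODDS) - 1))
event_time  <- rexp(n, rate)
censor_time <- rexp(n, 0.01)
sim$time   <- pmin(event_time, censor_time)
sim$status <- as.integer(event_time <= censor_time)

# Fit one Cox model per candidate and keep those with P < alpha
screen_univariate <- function(data, candidates, alpha = 0.05) {
  pvals <- vapply(candidates, function(v) {
    fit <- coxph(as.formula(paste("Surv(time, status) ~", v)), data = data)
    unname(summary(fit)$logtest["pvalue"])     # likelihood ratio test P value
  }, numeric(1))
  list(pvalues = pvals, selected = names(pvals)[pvals < alpha])
}

res <- screen_univariate(sim, c("LODDS", "Noise"))
print(res$pvalues)
print(res$selected)
```

Building the formula from the column name, rather than indexing the data frame inside the formula, keeps the coefficient labels readable in the model summaries.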
Use the following code to compare the prognostic predictive ability of the three LN systems (LODDS, LNR, and pN staging) in the training, validation, and external validation cohorts.
library(rms)
library(survival)
library(survminer)
library(riskRegression)
library(gt)
train_data <- read.csv("train_data123.csv")
validation_data <- read.csv("test_data123.csv")
dd <- datadist(train_data)
options(datadist = "dd")
model_LNR <- cph(Surv(time, status) ~ LNR, data = train_data, x = TRUE, y = TRUE)
model_LODDS <- cph(Surv(time, status) ~ LODDS, data = train_data, x = TRUE, y = TRUE)
model_pN <- cph(Surv(time, status) ~ N, data = train_data, x = TRUE, y = TRUE)
# C-index with 95% CI, AIC, and BIC for one model evaluated on one data set
calculate_performance <- function(model, data) {
  pred <- predict(model, newdata = data, type = "lp")
  # reverse = TRUE because a higher linear predictor implies a shorter survival time
  concordance_result <- concordancefit(Surv(data$time, data$status), x = pred, reverse = TRUE)
  c_index <- unname(concordance_result$concordance)
  se <- sqrt(unname(concordance_result$var))
  ci_lower <- c_index - 1.96 * se
  ci_upper <- c_index + 1.96 * se
  aic <- AIC(model)
  bic <- BIC(model)
  return(c(C_Index = round(c_index, 3), CI_Lower = round(ci_lower, 3), CI_Upper = round(ci_upper, 3), AIC = round(aic, 2), BIC = round(bic, 2)))
}
train_LNR <- calculate_performance(model_LNR, train_data)
train_LODDS <- calculate_performance(model_LODDS, train_data)
train_pN <- calculate_performance(model_pN, train_data)
model_LNR_val <- cph(Surv(time, status) ~ LNR, data = validation_data, x = TRUE, y = TRUE)
model_LODDS_val <- cph(Surv(time, status) ~ LODDS, data = validation_data, x = TRUE, y = TRUE)
model_pN_val <- cph(Surv(time, status) ~ N, data = validation_data, x = TRUE, y = TRUE)
val_LNR <- calculate_performance(model_LNR_val, validation_data)
val_LODDS <- calculate_performance(model_LODDS_val, validation_data)
val_pN <- calculate_performance(model_pN_val, validation_data)
# Format "C-index (lower, upper)" for the summary table
fmt_ci <- function(p) paste0(p["C_Index"], " (", p["CI_Lower"], ", ", p["CI_Upper"], ")")
results <- data.frame(
  Variable = c("LNR", "LODDS", "pN"),
  Training_C_Index = c(fmt_ci(train_LNR), fmt_ci(train_LODDS), fmt_ci(train_pN)),
  Training_AIC = c(train_LNR["AIC"], train_LODDS["AIC"], train_pN["AIC"]),
  Training_BIC = c(train_LNR["BIC"], train_LODDS["BIC"], train_pN["BIC"]),
  Validation_C_Index = c(fmt_ci(val_LNR), fmt_ci(val_LODDS), fmt_ci(val_pN)),
  Validation_AIC = c(val_LNR["AIC"], val_LODDS["AIC"], val_pN["AIC"]),
  Validation_BIC = c(val_LNR["BIC"], val_LODDS["BIC"], val_pN["BIC"])
)
results_table <- gt(results) %>%
  tab_header(title = "Predictive performance of the three lymph node staging systems") %>%
  cols_label(Variable = "Variable", Training_C_Index = "C-index (95% CI) (Training)", Training_AIC = "AIC (Training)", Training_BIC = "BIC (Training)", Validation_C_Index = "C-index (95% CI) (Validation)", Validation_AIC = "AIC (Validation)", Validation_BIC = "BIC (Validation)")
write.csv(results, "prediction_performance.csv", row.names = FALSE)
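The protocol code refits each Cox model on the validation cohort before scoring it. An alternative that is sometimes preferred is to score the model fitted on the training cohort directly on held-out data. The sketch below shows that variant on simulated data using only the survival package; the data, the 70/30 row split, and the helper `c_index_ci` are assumptions for illustration and do not reproduce the authors' analysis.

```r
library(survival)

# Simulated data with one informative continuous predictor
set.seed(2)
n <- 600
sim <- data.frame(x = rnorm(n))
event_time  <- rexp(n, 0.05 * exp(0.9 * sim$x))
censor_time <- rexp(n, 0.02)
sim$time   <- pmin(event_time, censor_time)
sim$status <- as.integer(event_time <= censor_time)
train <- sim[1:420, ]
test  <- sim[421:600, ]

fit <- coxph(Surv(time, status) ~ x, data = train)

# C-index and normal-approximation 95% CI of a fitted model on any data set
c_index_ci <- function(fit, newdata) {
  newdata$lp <- predict(fit, newdata = newdata, type = "lp")
  # reverse = TRUE: a higher risk score should pair with a shorter survival time
  cc <- concordance(Surv(time, status) ~ lp, data = newdata, reverse = TRUE)
  est <- unname(cc$concordance)
  se  <- sqrt(unname(cc$var))
  c(C_Index = est, CI_Lower = est - 1.96 * se, CI_Upper = est + 1.96 * se)
}

perf <- rbind(train = c_index_ci(fit, train), test = c_index_ci(fit, test))
print(round(perf, 3))
print(c(AIC = AIC(fit), BIC = BIC(fit)))
```

Without `reverse = TRUE`, the concordance of a risk score with survival time comes out as one minus the usual C-index, which is easy to misread as poor discrimination.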
Use the following code to build the XGBoost model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC and calibration curves. The data are obtained from the SEER database.
library(xgboost)
library(caret)
library(pROC)
train_data <- read.csv("train_data.csv")
test_data <- read.csv("test_data.csv")
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c('Age', 'T', 'N', 'M', 'LODDS', 'Chemotherapy')]), label = train_data$status)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c('Age', 'T', 'N', 'M', 'LODDS', 'Chemotherapy')]), label = test_data$status)
params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "auc", eta = 0.1, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8)
xgb_model <- xgb.train(params = params, data = train_matrix, nrounds = 100, watchlist = list(train = train_matrix), verbose = 1)
# Predicted probabilities and class labels at a 0.5 threshold
pred_probs <- predict(xgb_model, newdata = test_matrix)
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)
conf_matrix <- confusionMatrix(as.factor(pred_labels), as.factor(test_data$status))
roc_curve <- roc(test_data$status, pred_probs)
auc_value <- auc(roc_curve)
ci_auc <- ci.auc(roc_curve)
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
accuracy <- conf_matrix$overall["Accuracy"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
result_table <- data.frame(Model = "XGBoost", AUC = sprintf("%.3f (%.3f-%.3f)", auc_value, ci_auc[1], ci_auc[3]), Sensitivity = sprintf("%.3f", sensitivity), Specificity = sprintf("%.3f", specificity), Accuracy = sprintf("%.3f", accuracy), PPV = sprintf("%.3f", ppv), NPV = sprintf("%.3f", npv))
write.csv(result_table, "xgboost_model_performance.csv", row.names = FALSE)
# ROC curve
roc_df <- data.frame(FPR = 1 - roc_curve$specificities, TPR = roc_curve$sensitivities)
roc_plot <- ggplot(roc_df, aes(x = FPR, y = TPR)) + geom_line(color = "steelblue", size = 1.2) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") + annotate("text", x = 0.9, y = 0.2, label = paste("AUC =", round(auc_value, 3)), size = 5, color = "black") + labs(title = "ROC curve of the XGBoost model", x = "False positive rate", y = "True positive rate") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve
calibration_data <- data.frame(status = as.factor(test_data$status), pred_probs = pred_probs)
calib_model <- calibration(status ~ pred_probs, data = calibration_data, class = "1", cuts = 5)
ggplot(calib_model$data, aes(x = midpoint, y = Percent)) + geom_line(color = "steelblue", size = 1) + geom_point(color = "red", size = 2) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") + labs(title = "Calibration curve of the XGBoost model", x = "Predicted probability", y = "Observed proportion") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 0.5))
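For readers who want to see what the AUC reported by `pROC` measures, the following base R sketch computes it from ranks (the Mann-Whitney formulation). The function `auc_rank` and the four-observation example are hypothetical and are not part of the protocol.

```r
# Sketch: AUC as the probability that a random event case receives a higher
# score than a random non-event case (ties count one half).
auc_rank <- function(labels, scores) {
  stopifnot(all(labels %in% c(0, 1)), length(labels) == length(scores))
  r  <- rank(scores)                  # average ranks handle tied scores
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Of the 4 event/non-event pairs, 3 are ordered correctly
auc_example <- auc_rank(labels = c(0, 0, 1, 1), scores = c(0.1, 0.4, 0.35, 0.8))
print(auc_example)  # 0.75
```

Because the AUC depends only on the ranking of the scores, it says nothing about whether the predicted probabilities are numerically accurate, which is why the protocol pairs it with a calibration curve.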
Use the following code to build the RF model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC and calibration curves. The data are obtained from the SEER database.
library(randomForest)
library(dplyr)
library(ggplot2)
library(pROC)
library(caret)
library(rms)
trainset <- read.csv("train_data.csv")
testset <- read.csv("test_data.csv")
trainset$status <- factor(trainset$status)
variables1 <- c("Age", "T", "N", "M", "LODDS", "Chemotherapy")
trainset <- trainset %>%
  mutate(across(all_of(variables1), as.numeric))
testset$status <- factor(testset$status)
testset <- testset %>%
  mutate(across(all_of(variables1), as.numeric))
RF <- randomForest(status ~ Age + T + N + M + LODDS + Chemotherapy, data = trainset, ntree = 100, importance = TRUE, proximity = TRUE)
# Variable importance
imp <- importance(RF)
varImpPlot(RF)
impvar <- rownames(imp)[order(imp[, 4], decreasing = TRUE)]
importance_df <- as.data.frame(imp)
importance_df$Variables <- rownames(importance_df)
importance_plot <- ggplot(importance_df, aes(x = reorder(Variables, MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + labs(title = "Variable importance", x = "Variable", y = "Mean decrease in accuracy") + theme_minimal()
# ROC curve
pred_probs <- predict(RF, testset, type = "prob")[, 2]
roc_obj <- roc(testset$status, pred_probs)
auc_value <- auc(roc_obj)
roc_plot <- ggplot() + geom_line(aes(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities), color = "steelblue", size = 1.2) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") + annotate("text", x = 0.8, y = 0.2, label = paste("AUC =", round(auc_value, 3)), color = "black", size = 5, hjust = 0) + labs(title = "ROC curve of the Random Forest model", x = "False positive rate", y = "True positive rate") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve
calibration_data <- data.frame(pred_probs = pred_probs, status = testset$status)
calib_model <- calibration(status ~ pred_probs, data = calibration_data, class = "1", cuts = 5)
calib_df <- as.data.frame(calib_model[["data"]])
calib_df$mid <- calib_df$midpoint
calib_df$Percent <- calib_df$Percent
calibration_plot <- ggplot(calib_df, aes(x = mid, y = Percent)) + geom_line(color = "steelblue", size = 1.2) + geom_point(color = "steelblue", size = 3) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.8) + labs(title = "Calibration curve of the Random Forest model", x = "Predicted probability", y = "Actual probability") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1), plot.title = element_text(hjust = -0.05, vjust = -1.5, face = "bold", size = 12))
# Performance metrics
rf_probs <- predict(RF, newdata = testset, type = "prob")[, 2]
rf_auc <- roc(testset$status, rf_probs)
auc_value <- auc(rf_auc)
ci_auc <- ci.auc(rf_auc)
rf_predictions <- predict(RF, newdata = testset)
conf_matrix <- confusionMatrix(rf_predictions, testset$status)
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
accuracy <- conf_matrix$overall["Accuracy"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
result_table <- data.frame(Model = "RF", AUC = sprintf("%.3f (%.3f-%.3f)", auc_value, ci_auc[1], ci_auc[3]), Sensitivity = sprintf("%.3f", sensitivity), Specificity = sprintf("%.3f", specificity), Accuracy = sprintf("%.3f", accuracy), PPV = sprintf("%.3f", ppv), NPV = sprintf("%.3f", npv))
write.csv(result_table, "RF_model_performance.csv", row.names = FALSE)
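The metrics read from the confusion matrix can be checked by hand. The base R sketch below uses a hypothetical helper, `binary_metrics`, and ten made-up labels, with the event (status 1) treated as the positive class.

```r
# Sketch: sensitivity, specificity, accuracy, PPV, and NPV from raw labels.
# Assumption: `positive` marks the event class (status == 1).
binary_metrics <- function(truth, predicted, positive = 1) {
  tp <- sum(predicted == positive & truth == positive)
  tn <- sum(predicted != positive & truth != positive)
  fp <- sum(predicted == positive & truth != positive)
  fn <- sum(predicted != positive & truth == positive)
  c(Sensitivity = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    Accuracy    = (tp + tn) / length(truth),
    PPV         = tp / (tp + fp),
    NPV         = tn / (tn + fn))
}

# Made-up example: 3 true positives, 1 false negative, 4 true negatives, 2 false positives
truth     <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
metrics <- binary_metrics(truth, predicted)
print(round(metrics, 3))
```

One point worth checking when reading `caret::confusionMatrix` output: for a two-level factor it treats the first level as the positive class unless `positive` is set, so with levels "0" and "1" the reported sensitivity refers to class "0" by default.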
Use the following code to build the NN model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC and calibration curves. The data are obtained from the SEER database.
library(nnet)
library(caret)
library(pROC)
library(ggplot2)
train_data <- read.csv("train_data.csv")
test_data <- read.csv("test_data.csv")
train_data$status <- as.factor(train_data$status)
test_data$status <- as.factor(test_data$status)
features <- c("Age", "T", "N", "M", "LODDS", "Chemotherapy")
x_train <- train_data[, features]
y_train <- train_data$status
x_test <- test_data[, features]
y_test <- test_data$status
nn_model <- nnet(status ~ Age + T + N + M + LODDS + Chemotherapy, data = train_data, size = 5, decay = 0.01, maxit = 200)
# predict() returns a one-column matrix; convert it to a plain numeric vector
pred_probs <- as.numeric(predict(nn_model, newdata = x_test, type = "raw"))
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)
roc_curve <- roc(as.numeric(y_test), pred_probs)
auc_value <- auc(roc_curve)
auc_ci <- ci.auc(roc_curve)
auc_text <- paste0(round(auc_value, 3), " (", round(auc_ci[1], 3), "-", round(auc_ci[3], 3), ")")
conf_matrix <- confusionMatrix(as.factor(pred_labels), y_test)
accuracy <- conf_matrix$overall["Accuracy"]
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
performance_table <- data.frame(Metric = c("AUC (95% CI)", "Accuracy", "Sensitivity", "Specificity", "PPV", "NPV"), Value = c(auc_text, round(accuracy, 3), round(sensitivity, 3), round(specificity, 3), round(ppv, 3), round(npv, 3)))
write.csv(performance_table, "NN_performance_table.csv", row.names = FALSE)
# ROC curve
roc_curve <- roc(y_test, pred_probs)
auc_value <- auc(roc_curve)
roc_plot <- ggplot() + geom_line(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "steelblue", size = 1.2) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") + annotate("text", x = 0.8, y = 0.2, label = paste("AUC =", round(auc_value, 3)), color = "black", size = 5, hjust = 0) + labs(title = "ROC curve of the Neural Network model", x = "False positive rate", y = "True positive rate") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve from five equal-width probability bins
calibration_data <- data.frame(pred_probs = pred_probs, status = as.numeric(y_test) - 1)
calibration_data$pred_probs <- as.numeric(calibration_data$pred_probs)
calibration_data$calibration_bin <- cut(calibration_data$pred_probs, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
calibration_summary <- aggregate(status ~ calibration_bin, data = calibration_data, FUN = mean)
calibration_summary$pred_mean <- aggregate(pred_probs ~ calibration_bin, data = calibration_data, FUN = mean)$pred_probs
calibration_plot <- ggplot(calibration_summary, aes(x = pred_mean, y = status)) + geom_line(color = "steelblue", size = 1.2) + geom_point(color = "red", size = 3) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.8) + labs(title = "Calibration curve of the Neural Network model", x = "Predicted probability", y = "Actual probability") + theme_minimal() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Variable importance
nn_var_importance <- varImp(nn_model)
importance_df <- data.frame(Feature = rownames(nn_var_importance), Importance = nn_var_importance$Overall)
importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) + geom_bar(stat = "identity", fill = "steelblue") + coord_flip() + labs(title = "Variable importance of the Neural Network model", x = "Feature", y = "Importance") + theme_minimal()
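The binned calibration summary used for the calibration curves can be sketched on its own in base R. The example below is illustrative only: the helper `calibration_table` is hypothetical, and the outcomes are simulated from the stated probabilities, so the predictions are well calibrated by construction.

```r
# Sketch: compare mean predicted probability with observed event rate per bin.
calibration_table <- function(prob, outcome, breaks = seq(0, 1, by = 0.2)) {
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin       = levels(bin),
    n         = as.vector(table(bin)),
    predicted = as.vector(tapply(prob, bin, mean)),     # mean predicted probability
    observed  = as.vector(tapply(outcome, bin, mean))   # observed event proportion
  )
}

set.seed(3)
prob    <- runif(2000)
outcome <- rbinom(2000, 1, prob)   # events drawn from the predicted probabilities
cal <- calibration_table(prob, outcome)
print(cal)
```

For a well-calibrated model the `predicted` and `observed` columns track each other closely, which corresponds to points lying near the dashed diagonal in the plots above.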

3. Development and validation of the competing risk model

Use the following code to perform univariate analysis and plot cumulative incidence function (CIF) curves. data.csv contains the data obtained from the SEER database. Subsequent images are saved in the same way as in this step. Replace Site in the code with each of the other factors in turn to perform univariate analysis for all factors.
library(tidycmprsk)
library(gtsummary)
library(ggplot2)
library(ggsurvfit)
library(ggprism)
aa <- read.csv("data.csv")
cif2 <- tidycmprsk::cuminc(Surv(time, Status1) ~ Site, data = aa)
tidy(cif2, times = c(12, 24, 36, 48, 60))
tbl_cuminc(cif2, times = c(12, 24, 36, 48, 60), outcomes = c("CSS", "OSS"), estimate_fun = NULL, label_header = "**{time/12}-year**") %>%
  add_p() %>%
  add_n(location = "level")
cuminc_plot <- ggcuminc(cif2, outcome = c("CSS", "OSS"), size = 1.5) + labs(x = "Time") + add_quantile(y_value = 0.20, size = 1) + scale_x_continuous(breaks = seq(0, 84, by = 12), limits = c(0, 84)) + scale_y_continuous(label = scales::percent, breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) + theme_prism() + theme(legend.position = c(0.2, 0.8), panel.grid = element_blank(), panel.grid.major.y = element_line(color = "grey80")) + theme(legend.spacing.x = unit(0.1, "cm"), legend.spacing.y = unit(0.01, "cm")) + theme(axis.ticks.length.x = unit(-0.2, "cm"), axis.ticks.x = element_line(color = "black", size = 1, lineend = 1)) + theme(axis.ticks.length.y = unit(-0.2, "cm"), axis.ticks.y = element_line(color = "black", size = 1, lineend = 1))
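To show what a cumulative incidence function estimates in the presence of a competing event, here is a hand-rolled nonparametric estimator in base R. It is a teaching sketch under assumptions, not a substitute for `tidycmprsk::cuminc`: the function `cif_estimate`, the status coding (0 = censored, 1 = cancer-specific death, 2 = death from other causes), and the four-patient example are all hypothetical.

```r
# Sketch: nonparametric cumulative incidence for one cause with competing events.
# status: 0 = censored, 1 = event of interest, 2 = competing event (assumed coding).
cif_estimate <- function(time, status, cause) {
  event_times <- sort(unique(time[status != 0]))
  surv <- 1            # probability of being free of any event just before t
  cum  <- 0            # cumulative incidence of `cause`
  out  <- numeric(length(event_times))
  for (i in seq_along(event_times)) {
    t <- event_times[i]
    n_risk  <- sum(time >= t)
    d_cause <- sum(time == t & status == cause)
    d_all   <- sum(time == t & status != 0)
    cum  <- cum + surv * d_cause / n_risk   # increment uses overall event-free survival
    surv <- surv * (1 - d_all / n_risk)
    out[i] <- cum
  }
  data.frame(time = event_times, cif = out)
}

time   <- c(1, 2, 3, 4)
status <- c(1, 2, 0, 1)
cif_css <- cif_estimate(time, status, cause = 1)
cif_oss <- cif_estimate(time, status, cause = 2)
print(cif_css)
print(cif_oss)
```

Each increment is weighted by the probability of having had no event of any type so far, so the cause-specific curves cannot sum to more than 1. Treating competing deaths as censored in a Kaplan-Meier analysis does not have this property and tends to overstate the cumulative incidence.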
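Use the following code to perform multivariate analysis and visualization. data1.csv comes from the result of the previous code. After running the code, click Export, then click Save as PDF, and finally click Save to save the image.
library(tidycmprsk)
library(gtsummary)
aa <- read.csv('data1.csv')
for (i in names(aa)[c(1:16, 19)]) { aa[, i] <- as.factor(aa[, i]) }
# The call that fits mul1 (the multivariate competing risk model) is missing from the source text.
Table2 <- mul1 %>%
  gtsummary::tbl_regression(exponentiate = TRUE) %>%
  add_n(location = "level"); Table2
table_df <- as_tibble(Table2)
tab <- Table2$table_body
tab1 <- tab[, c(12, 19, 20, 22:29)]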
Use the following code to draw the nomogram, ROC curves, and calibration curves. After training the model on the training cohort, use the validation and external validation cohort data to validate the model. The external cohort comprises the colorectal cancer samples other than signet ring cell carcinoma selected in step 1.4.
library(QHScrnomo)
library(rms)
library(timeROC)
library(survival)
aa <- read.csv('data3.csv')
for (i in names(aa)[c(1:16, 19)]) { aa[, i] <- as.factor(aa[, i]) }
dd <- datadist(aa)
options(datadist = "dd")
mul <- cph(Surv(time, Status1 == 1) ~ T + N + M + LODDS + Site, data = aa, x = TRUE, y = TRUE, surv = TRUE)
m3 <- crr.fit(mul, failcode = 1, cencode = 0)
nomo <- Newlabels(fit = m3, labels = c(T = "T", N = "N", M = "M", LODDS = "LODDS", Site = "Site"))
# The start of the next call (including the T levels) is truncated in the source text;
# this reconstruction assumes Newlevels and relabels only the factors that survive.
nomo <- Newlevels(fit = nomo, levels = list(N = c("N0", "N1", "N2"), M = c("M0", "M1"), LODDS = c("LODDS1", "LODDS2", "LODDS3"), Site = c("RSC", "LSC", "Rectum")))
nomogram.crr(fit = nomo, lp = F, xfrac = 0.3, fun.at = seq(from = 0, to = 1, by = 0.1), failtime = c(12, 36, 60), funlabel = c("1-year CSS cumulative incidence", "3-year CSS cumulative incidence", "5-year CSS cumulative incidence"))
# Predicted risks at 12, 36, and 60 months
time_points <- c(12, 36, 60)
pred_risks_list <- lapply(time_points, function(time_point) { predict(m3, newdata = aa, type = "risk", time = time_point) })
pred_risks_df <- data.frame(do.call(cbind, pred_risks_list))
colnames(pred_risks_df) <- paste("risk_at", time_points, "months", sep = "_")
# Time-dependent ROC at 1, 3, and 5 years
roc_1year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_12_months, cause = 1, times = 12, iid = TRUE)
roc_3year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_36_months, cause = 1, times = 36, iid = TRUE)
roc_5year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_60_months, cause = 1, times = 60, iid = TRUE)
legend("bottomright", legend = c("1-year CSS", "3-year CSS", "5-year CSS"), col = c("#BF1D2D", "#262626", "#397FC7"), lwd = 2)
# Ten-fold cross-validated predictions, C-index, and calibration at 36 months
sas.cmprsk(m3, time = 36)
set.seed(123)
aa$pro <- tenf.crr(m3, time = 36)
cindex(prob = aa$pro, fstatus = aa$Status1, ftime = aa$time, type = "crr", failcode = 1, cencode = 0, tol = 1e-20)
groupci(x = aa$pro, ftime = aa$time, fstatus = aa$Status1, failcode = 1, cencode = 0, ci = TRUE, g = 5, m = 1000, u = 36, xlab = "Predicted probability", ylab = "Actual probability", lty = 1, lwd = 2, col = "#262626", xlim = c(0, 1.0), ylim = c(0, 1.0), add = TRUE)

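Results

Patient characteristics
This study focused on patients diagnosed with colorectal SRCC, using data from the SEER database from 2004 to 2015. Exclusion criteria were a survival time of less than 1 month, incomplete clinicopathological information, and an unknown cause of death. A total of 2,409 colorectal SRCC patients who met the inclusion criteria were randomly divided into a training cohort (N = 1,686) and a validation cohort (N = 723). The demographic and clinical parameters of the training and validation cohorts were analyzed using R software, ...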

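Discussion

Colorectal SRCC is a rare, distinct subtype of colorectal cancer (CRC) with a poor prognosis, so the prognosis of SRCC patients deserves more attention. Accurate survival prediction is essential for determining prognosis and making individualized treatment decisions for these patients. In this study, we explored the relationship between clinical characteristics and prognosis in SRCC patients and identified the optimal LN staging system for SRCC patients from the SEER database. To our knowledge, this is the first study to combine machine learning and competing risk analysis to identify the appropriate...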

Disclosures

The authors have no financial conflicts of interest to disclose.

Acknowledgments

None.

Materials

Name	Company	Catalog Number	Comments
SEER database	National Cancer Institute at NIH
X-tile software	Yale School of Medicine
RStudio	Posit

References

1. Siegel, R. L., Giaquinto, A. N., Jemal, A. Cancer statistics, 2024. CA Cancer J Clin. 74 (1), 12-49 (2024).
2. Korphaisarn, K., et al. Signet ring cell colorectal cancer: Genomic insights into a rare subpopulation of colorectal adenocarcinoma. Br J Cancer. 121 (6), 505-510 (2019).
3. Willauer, A. N., et al. Clinical and molecular characterization of early-onset colorectal cancer. Cancer. 125 (12), 2002-2010 (2019).
4. Watanabe, A., et al. A case of primary colonic signet ring cell carcinoma in a young man which preoperatively mimicked Phlebosclerotic colitis. Acta Med Okayama. 73 (4), 361-365 (2019).
5. Kim, H., Kim, B. H., Lee, D., Shin, E. Genomic alterations in signet ring and mucinous patterned colorectal carcinoma. Pathol Res Pract. 215 (10), 152566 (2019).
6. Deng, X., et al. Neoadjuvant radiotherapy versus surgery alone for stage II/III mid-low rectal cancer with or without high-risk factors: A prospective multicenter stratified randomized trial. Ann Surg. 272 (6), 1060-1069 (2020).
7. Buk Cardoso, L., et al. Machine learning for predicting survival of colorectal cancer patients. Sci Rep. 13 (1), 8874 (2023).
8. Monterrubio-Gómez, K., Constantine-Cooke, N., Vallejos, C. A. A review on statistical and machine learning competing risks methods. Biom J. 66 (2), e2300060 (2024).
9. Kim, H. J., Choi, G. S. Clinical implications of lymph node metastasis in colorectal cancer: Current status and future perspectives. Ann Coloproctol. 35 (3), 109-117 (2019).
10. Xu, T., et al. Log odds of positive lymph nodes is an excellent prognostic factor for patients with rectal cancer after neoadjuvant chemoradiotherapy. Ann Transl Med. 9 (8), 637 (2021).
11. Chen, Y. R., et al. Prognostic performance of different lymph node classification systems in young gastric cancer. J Gastrointest Oncol. 12 (4), 1285-1300 (2021).
12. Bouvier, A. M., et al. How many nodes must be examined to accurately stage gastric carcinomas? Results from a population based study. Cancer. 94 (11), 2862-2866 (2002).
13. Coburn, N. G., Swallow, C. J., Kiss, A., Law, C. Significant regional variation in adequacy of lymph node assessment and survival in gastric cancer. Cancer. 107 (9), 2143-2151 (2006).
14. Li Destri, G., Di Carlo, I., Scilletta, R., Scilletta, B., Puleo, S. Colorectal cancer and lymph nodes: the obsession with the number 12. World J Gastroenterol. 20 (8), 1951-1960 (2014).
15. Dinaux, A. M., et al. Outcomes of persistent lymph node involvement after neoadjuvant therapy for stage III rectal cancer. Surgery. 163 (4), 784-788 (2018).
16. Sun, Y., Zhang, Y., Huang, Z., Chi, P. Prognostic implication of negative lymph node count in ypN+ rectal cancer after neoadjuvant chemoradiotherapy and construction of a prediction nomogram. J Gastrointest Surg. 23 (5), 1006-1014 (2019).
17. Xu, Z., Jing, J., Ma, G. Development and validation of prognostic nomogram based on log odds of positive lymph nodes for patients with gastric signet ring cell carcinoma. Chin J Cancer Res. 32 (6), 778-793 (2020).
18. Scarinci, A., et al. The impact of log odds of positive lymph nodes (LODDS) in colon and rectal cancer patient stratification: a single-center analysis of 323 patients. Updates Surg. 70 (1), 23-31 (2018).
19. Nitsche, U., et al. Prognosis of mucinous and signet-ring cell colorectal cancer in a population-based cohort. J Cancer Res Clin Oncol. 142 (11), 2357-2366 (2016).
20. Kang, H., O'Connell, J. B., Maggard, M. A., Sack, J., Ko, C. Y. A 10-year outcomes evaluation of mucinous and signet-ring cell carcinoma of the colon and rectum. Dis Colon Rectum. 48 (6), 1161-1168 (2005).
21. Sung, C. O., et al. Clinical significance of signet ring cells in colorectal mucinous adenocarcinoma. Mod Pathol. 21 (12), 1533-1541 (2008).
22. Alvi, M. A., et al. Molecular profiling of signet ring cell colorectal cancer provides a strong rationale for genomic targeted and immune checkpoint inhibitor therapies. Br J Cancer. 117 (2), 203-209 (2017).
23. Brownlee, S., et al. Evidence for overuse of medical services around the world. Lancet. 390 (10090), 156-168 (2017).
