Comparison of the Predictive Performance of Three Lymph Node Staging Systems in Colorectal Signet Ring Cell Carcinoma Based on Machine Learning Models

Jinyan Jia; Zixuan Yu; Maorun Zhang; Fang Hu; Gang Liu

doi:10.3791/67941

Method Article

April 18th, 2025

Jinyan Jia*¹,²,³, Zixuan Yu*¹,²,³,⁴, Maorun Zhang¹,²,³, Fang Hu⁴, Gang Liu¹,²,³

¹Department of General Surgery, Tianjin Medical University General Hospital, ²China Tianjin General Surgery Institute, ³Tianjin Key Laboratory of Precise Vascular Reconstruction and Organ Function Repair, ⁴Department of Nursing, Tianjin Medical University General Hospital

* These authors contributed equally

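Summary

This study evaluates prognostic staging systems for patients with colorectal signet ring cell carcinoma using machine learning models and competing risk analysis. It identifies the log odds of positive lymph nodes as a better predictor than pN stage, demonstrates strong predictive performance, and supports clinical decision-making with robust survival prediction tools.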

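Abstract

Lymph node status is an important prognostic predictor for patients; however, the prognosis of colorectal signet ring cell carcinoma (SRCC) has received limited attention. This study uses machine learning models (Random Forest [RF], XGBoost, and Neural Network [NN]) together with competing risk models to investigate the prognostic predictive ability of the log odds of positive lymph nodes (LODDS), the lymph node ratio (LNR), and pN stage in patients with SRCC. Relevant data were extracted from the Surveillance, Epidemiology, and End Results (SEER) database. For the machine learning models, prognostic factors for cancer-specific survival (CSS) were identified through univariate and multivariate Cox regression analyses, after which three machine learning methods (XGBoost, RF, and NN) were applied to determine the optimal lymph node staging system. For the competing risk model, univariate and multivariate competing risk analyses were used to identify prognostic factors, and a nomogram was constructed to predict the prognosis of SRCC patients. The area under the receiver operating characteristic curve (AUC-ROC) and calibration curves were used to assess model performance. A total of 2,409 SRCC patients were included in this study. To validate the models, an additional cohort of 15,122 colorectal cancer patients, excluding SRCC, was included for external validation. Both the machine learning models and the competing risk nomogram performed strongly in predicting survival outcomes. Compared with pN stage, the LODDS staging system showed superior prognostic ability. On evaluation, the machine learning and competing risk models achieved excellent predictive performance, characterized by good discrimination, calibration, and interpretability. These findings may help inform clinical decision-making for patients.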

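Introduction

Colorectal cancer (CRC) is the third most common malignant tumor worldwide^1,2,3. Signet ring cell carcinoma (SRCC), a rare subtype of CRC, accounts for approximately 1% of cases and is characterized by abundant intracellular mucin that displaces the cell nucleus^1,2,4. SRCC is often associated with younger patients, is more prevalent in women, and presents at an advanced tumor stage at diagnosis. Compared with colorectal adenocarcinoma, SRCC is more poorly differentiated, carries a higher risk of distant metastasis, and has a 5-year survival rate of only 12%-20%^5,6. Developing an accurate and effective prognostic model for SRCC is critical for optimizing treatment strategies and improving clinical outcomes.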

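This study aims to build a robust prognostic model for SRCC patients using advanced statistical approaches, including machine learning (ML) and competing risk models. These methodologies can accommodate complex relationships in clinical data, provide individualized risk assessment, and outperform traditional methods in predictive accuracy. Machine learning models such as Random Forest, XGBoost, and Neural Networks excel at handling high-dimensional data and identifying complex patterns. Studies have shown that AI models effectively predict survival outcomes in colorectal cancer, highlighting the potential of ML in clinical applications^7,8. Complementing ML, competing risk models address multiple event types, such as cancer-related mortality versus other causes of death, to refine survival analysis. Unlike traditional methods such as the Kaplan-Meier estimator, competing risk models accurately estimate the marginal probability of an event in the presence of competing risks, providing a more precise survival assessment^8. Integrating ML with competing risk analysis improves predictive performance and offers a robust framework for personalized prediction tools in SRCC^9,10,11.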

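Lymph node metastasis significantly affects prognosis and recurrence in CRC patients. Although N-stage assessment is central to the TNM classification, inadequate lymph node examination, reported in 48%-63% of cases, can lead to understaging of the disease. To address this problem, alternative approaches such as the lymph node ratio (LNR) and the log odds of positive lymph nodes (LODDS) have been introduced. LNR, the ratio of positive lymph nodes (PLNs) to total lymph nodes (TLNs), is less affected by the TLN count and serves as a prognostic factor in CRC. LODDS, the logarithm of the ratio of PLNs to negative lymph nodes (NLNs), has shown superior predictive ability in both gastric SRCC and colorectal cancer^10,11. Machine learning is increasingly applied in oncology, with models that improve risk stratification and prognosis prediction for various cancers, including breast, prostate, and lung cancer^12,13,14. However, its application to colorectal SRCC remains limited.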

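This study seeks to close this gap by integrating LODDS with ML and competing risk models to create a comprehensive prediction tool. It aims to assess the prognostic value of LODDS and to use advanced predictive techniques to strengthen clinical decision-making and improve outcomes for SRCC patients.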

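Protocol

This study does not require ethical approval or consent to participate. The data used in this study were obtained from databases. Patients diagnosed with colorectal signet ring cell carcinoma and other types of colorectal cancer from 2004 to 2015 were included. Exclusion criteria were survival of less than 1 month, incomplete clinicopathological information, and an unclear or unspecified cause of death.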

1. Data acquisition

Download the SEER*Stat software (version 8.4.3) from the SEER database website (http://seer.cancer.gov/about/overview.html). After logging in to the software, select Case Listing Session > Data and choose the database Incidence SEER Research Plus Data, 17 Registries, Nov 2021 Sub (2000-2019).
Click Selection > Edit and select {Race, Sex, Year Dx. Year of diagnosis} = '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015' AND {Site and Morphology. Site recode ICD-O-3/WHO 2008} = '8490/3'.
Then click Table and, in the Available Variables interface, select Age recode with single ages and 100+, Sex, Marital status, Site recode ICD-O-3/WHO 2008, CS tumor size, Regional nodes examined (1988+), Regional nodes positive (1988+), Derived AJCC Stage Group, 6th ed (2004-2015), Derived AJCC T, 6th ed (2004-2015), Derived AJCC N, 6th ed (2004-2015), Derived AJCC M, 6th ed (2004-2015), CEA, Radiation recode, Chemotherapy recode (yes, no/unk), SEER cause-specific death classification, Vital status recode (study cutoff used), Survival months, and Year of diagnosis.
Finally, click Output, name the data, and click Execute to output and save the data. The detailed inclusion process is shown in Figure 1.
For subsequent external validation, download data on colorectal cancer patients other than SRCC. Click Selection > Edit and select {Race, Sex, Year Dx. Year of diagnosis} = '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015' AND {Primary Site - labeled} = 'C18-C20'. Repeat steps 1.3 and 1.4 to obtain the clinicopathological information, then exclude samples with {Site and Morphology. Site recode ICD-O-3/WHO 2008} = '8490/3' from the downloaded file.
Process several variables for comparison. Classify lymph node status using both the lymph node ratio (LNR) and the log odds of positive lymph nodes (LODDS).
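1. Define LNR as the ratio of positive lymph nodes (PLNs) to total lymph nodes (TLNs). Calculate the LODDS value using the following formula:
  LODDS = log_e[(number of PLNs + 0.5) / (number of negative lymph nodes (NLNs) + 0.5)]
  where 0.5 is added to both counts to avoid infinite results. Determine the cutoff values for LNR, LODDS, and tumor size using X-tile software (version 3.6.1) based on the minimum P-value method.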
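Open the X-tile software, click File > Open, and select the data file to import it. Once the data are loaded, map the variables: Censor corresponds to survival status, Survival time corresponds to survival time, and Marker1 is the variable to be analyzed. Confirm that the data are matched correctly.
Then click Do > Kaplan-Meier > Marker1 to perform Kaplan-Meier survival analysis and generate survival curves. Determine the optimal cutoff values based on the separation of the Kaplan-Meier survival curves, statistical significance (e.g., p-value), and clinical relevance, and finally record or export the analysis results.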
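1. Divide LNR into three groups: LNR 1 (≤0.16), LNR 2 (0.16-0.78), and LNR 3 (≥0.78). Classify patients by LODDS into three groups: LODDS 1 (≤-1.44), LODDS 2 (-1.44-0.86), and LODDS 3 (≥0.86).
2. Classify tumor size into three categories: ≤3.5 cm, 3.5-5.5 cm, and ≥5.5 cm. Convert age from a continuous to a categorical variable by classifying patients as ≥60 or <60 years at initial diagnosis. Based on the distribution of signet ring cell carcinoma (SRCC) tumors, classify tumor location as right colon, left colon, or rectum. The right colon includes the cecum, ascending colon, hepatic flexure, and transverse colon; the left colon includes the splenic flexure, descending colon, sigmoid colon, and rectosigmoid junction.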
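The two derived lymph node variables and their cutoffs can be computed directly from the node counts. The following is a minimal R sketch rather than the authors' code: the argument names pln and tln and the helper ln_staging are hypothetical, and because the cutoffs above are listed with both ≤ and ≥, the sketch assumes that a value exactly equal to a cutoff falls in the lower group.

```r
# Sketch: derive LNR and LODDS from node counts and apply the cutoffs above.
# `pln` (positive nodes) and `tln` (total nodes examined) are hypothetical names.
ln_staging <- function(pln, tln) {
  nln <- tln - pln                          # negative lymph nodes
  lnr <- pln / tln                          # lymph node ratio
  lodds <- log((pln + 0.5) / (nln + 0.5))   # natural log; 0.5 keeps the ratio finite
  data.frame(
    LNR = lnr,
    LODDS = lodds,
    # Assumption: intervals are closed on the right, so a value equal to a
    # cutoff is assigned to the lower group.
    LNR_group = cut(lnr, breaks = c(-Inf, 0.16, 0.78, Inf),
                    labels = c("LNR1", "LNR2", "LNR3")),
    LODDS_group = cut(lodds, breaks = c(-Inf, -1.44, 0.86, Inf),
                      labels = c("LODDS1", "LODDS2", "LODDS3"))
  )
}

# Three illustrative patients: 0/12, 5/12, and 10/11 positive nodes.
groups <- ln_staging(pln = c(0, 5, 10), tln = c(12, 12, 11))
print(groups)
```

A patient with no examined nodes (tln = 0) yields an undefined LNR, so such rows would need to be handled before applying the helper.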
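For this study, randomly assign the data from a total of 2,409 eligible patients with SRCC to a training cohort (N = 1,686) and a validation cohort (N = 723) at a 7:3 ratio. Use the following code to randomly split data.csv, which contains the data from the SEER database. The files generated by the random split are used for further analysis.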
library(caret)
data <- read.csv("data.csv")
set.seed(123)  # fix the seed so that the split is reproducible
# data$variable is the column on which the split is stratified
train_indices <- createDataPartition(data$variable, p = 0.7, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
write.csv(train_data, "traindata.csv", row.names = FALSE)
write.csv(test_data, "testdata.csv", row.names = FALSE)
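As a cross-check on the cohort sizes, a simple 7:3 split can be sketched in base R. This is an illustration under stated assumptions, not the procedure used above: createDataPartition stratifies on the chosen column, whereas sample() does not, and the data frame here is a placeholder that contains only a row identifier.

```r
# Sketch: unstratified 7:3 split of 2,409 rows using base R only.
set.seed(123)                                  # any fixed seed makes the split reproducible
n_total <- 2409
patients <- data.frame(id = seq_len(n_total))  # placeholder for the SEER extract

n_train <- round(0.7 * n_total)                # 1686 rows for training
train_idx <- sample(n_total, size = n_train)   # row indices drawn without replacement

train_data <- patients[train_idx, , drop = FALSE]
test_data  <- patients[-train_idx, , drop = FALSE]

print(c(train = nrow(train_data), test = nrow(test_data)))
```

The resulting sizes (1,686 and 723) match the cohort sizes reported for the study, although the individual patients assigned to each cohort would differ from a stratified split.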

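2. ML model development and validation

Download RStudio (2024.04.2+764) and R (4.4.1). Open RStudio to run R. Click New File and select R Script to create a new script. Enter the relevant code in the code editor and click Run to execute it.
Use the following code to screen the variables to be included in the ML models by Cox regression analysis, and to examine the effects of LODDS, LNR, and pN stage on cancer-specific survival (CSS) in SRCC patients. traindata.csv contains the data from the SEER database.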
library("survival")
library("survminer")
library("rms")
library("dplyr")
data <- read.csv("traindata.csv")
data$time <- as.numeric(data$time)
data$status <- as.numeric(data$status)
variables <- c("Sex", "Age", "Race", "Marital", "Stage", "T", "N", "M", "Tumor_size", "LNR", "LODDS", "CEA", "Radiation", "Chemotherapy", "Site")
data <- data %>%
  mutate(across(all_of(variables), as.factor))
# Univariate Cox regression for a single variable (T stage)
cox <- coxph(Surv(time, status) ~ T, data = data)
cox$coefficients
pval <- anova(cox)$Pr[2]
clean_data <- data[, c(1:12, 14:18)]
# Fit one univariate Cox model per column and keep the variables with P < 0.05
get_coxVariable <- function(your_data, index) {
  cox_list <- c()
  k <- 1
  for (i in 1:index) {
    mod <- coxph(Surv(time, status) ~ your_data[, i], data = your_data)
    pval <- anova(mod)$Pr[2]
    print(pval)
    print(colnames(your_data)[i])
    if (pval < 0.05) {
      cox_list[k] <- colnames(your_data)[i]
      k <- k + 1
    }
  }
  return(cox_list)
}
variable_select <- get_coxVariable(clean_data, 15)
for (i in 1:15) { print(variable_select[i]) }
# Print the full univariate summary for each retained variable
for (var in variable_select) {
  formula <- as.formula(paste("Surv(time, status) ~", var))
  cox_model <- coxph(formula, data = data)
  print(summary(cox_model))
}
ggforest(cox, data = data)
# Multivariate Cox regression with the variables retained after screening
variables <- c("Sex", "Age", "Race", "Marital", "Stage", "T", "N", "M", "Tumor_size", "LNR", "LODDS", "Chemotherapy")
data <- data %>%
  mutate(across(all_of(variables), as.factor))
cox <- coxph(Surv(time, status) ~ Sex + Age + Race + Marital + T + N + M + Tumor_size + LNR + LODDS + Chemotherapy, data = data)
ggforest(cox, data = data)
ggplot_forest <- ggforest(cox, data = data)
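The univariate screening step can be tried on public data before running it on the SEER extract. The sketch below is an assumption-laden illustration, not the authors' pipeline: it uses the lung example data shipped with the survival package as a stand-in cohort, the helper screen_one is hypothetical, and the likelihood ratio test is used for the P value.

```r
library(survival)

# Sketch: univariate Cox screening on the `lung` example data from the
# survival package (a stand-in for the SEER training cohort).
candidates <- c("age", "sex", "ph.ecog")

screen_one <- function(var, data) {
  fit <- coxph(as.formula(paste("Surv(time, status) ~", var)), data = data)
  s <- summary(fit)
  data.frame(
    variable = var,
    hazard_ratio = unname(s$conf.int[1, "exp(coef)"]),
    p_value = unname(s$logtest["pvalue"])   # likelihood ratio test
  )
}

screening <- do.call(rbind, lapply(candidates, screen_one, data = lung))
selected <- screening$variable[screening$p_value < 0.05]  # variables carried forward
print(screening)
```

Building each formula from the variable name, rather than indexing a column inside the formula, keeps the coefficient labels readable in the model summary.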
Use the following code to compare the prognostic predictive ability of the three LN systems (LODDS, LNR, and pN stage) in the training, validation, and external validation cohorts.
library(rms)
library(survival)
library(survminer)
library(riskRegression)
library(gt)
train_data <- read.csv("train_data123.csv")
validation_data <- read.csv("test_data123.csv")
dd <- datadist(train_data)
options(datadist = "dd")
model_LNR <- cph(Surv(time, status) ~ LNR, data = train_data, x = TRUE, y = TRUE)
model_LODDS <- cph(Surv(time, status) ~ LODDS, data = train_data, x = TRUE, y = TRUE)
model_pN <- cph(Surv(time, status) ~ N, data = train_data, x = TRUE, y = TRUE)
# Return the C-index with its 95% CI, AIC, and BIC for one model
calculate_performance <- function(model, data) {
  pred <- predict(model, newdata = data, type = "lp")
  # reverse = TRUE because a higher linear predictor means shorter survival
  concordance_result <- concordancefit(Surv(data$time, data$status), x = pred, reverse = TRUE)
  c_index <- concordance_result$concordance
  ci_lower <- c_index - 1.96 * sqrt(concordance_result$var)
  ci_upper <- c_index + 1.96 * sqrt(concordance_result$var)
  aic <- AIC(model)
  bic <- BIC(model)
  return(c(C_Index = round(c_index, 3), CI_Lower = round(ci_lower, 3), CI_Upper = round(ci_upper, 3), AIC = round(aic, 2), BIC = round(bic, 2)))
}
train_LNR <- calculate_performance(model_LNR, train_data)
train_LODDS <- calculate_performance(model_LODDS, train_data)
train_pN <- calculate_performance(model_pN, train_data)
model_LNR_val <- cph(Surv(time, status) ~ LNR, data = validation_data, x = TRUE, y = TRUE)
model_LODDS_val <- cph(Surv(time, status) ~ LODDS, data = validation_data, x = TRUE, y = TRUE)
model_pN_val <- cph(Surv(time, status) ~ N, data = validation_data, x = TRUE, y = TRUE)
val_LNR <- calculate_performance(model_LNR_val, validation_data)
val_LODDS <- calculate_performance(model_LODDS_val, validation_data)
val_pN <- calculate_performance(model_pN_val, validation_data)
# Format a C-index and its 95% CI as a single string
fmt_ci <- function(p) paste0(p["C_Index"], " (", p["CI_Lower"], ", ", p["CI_Upper"], ")")
results <- data.frame(
  Variable = c("LNR", "LODDS", "pN"),
  Training_C_Index = c(fmt_ci(train_LNR), fmt_ci(train_LODDS), fmt_ci(train_pN)),
  Training_AIC = c(train_LNR["AIC"], train_LODDS["AIC"], train_pN["AIC"]),
  Training_BIC = c(train_LNR["BIC"], train_LODDS["BIC"], train_pN["BIC"]),
  Validation_C_Index = c(fmt_ci(val_LNR), fmt_ci(val_LODDS), fmt_ci(val_pN)),
  Validation_AIC = c(val_LNR["AIC"], val_LODDS["AIC"], val_pN["AIC"]),
  Validation_BIC = c(val_LNR["BIC"], val_LODDS["BIC"], val_pN["BIC"])
)
results_table <- gt(results) %>%
  tab_header(title = "Predictive Performance of Three Lymph Node Staging Systems") %>%
  cols_label(Variable = "Variable", Training_C_Index = "C-index (95% CI) (Training)", Training_AIC = "AIC (Training)", Training_BIC = "BIC (Training)", Validation_C_Index = "C-index (95% CI) (Validation)", Validation_AIC = "AIC (Validation)", Validation_BIC = "BIC (Validation)")
write.csv(results, "prediction_performance.csv", row.names = FALSE)
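The comparison logic (a higher C-index and a lower AIC indicate the better staging variable) can be illustrated on public data. The following is a hedged sketch, not the study analysis: it uses the lung data from the survival package, with age and ph.ecog standing in for competing staging variables, and the helper summarise_fit is hypothetical.

```r
library(survival)

# Sketch: compare two single-predictor Cox models by C-index and AIC.
# Both models are fitted to the same complete-case rows so that AIC is comparable.
d <- lung[complete.cases(lung[, c("time", "status", "age", "ph.ecog")]), ]

fit_age  <- coxph(Surv(time, status) ~ age, data = d)
fit_ecog <- coxph(Surv(time, status) ~ ph.ecog, data = d)

summarise_fit <- function(fit) {
  cc <- concordance(fit)   # for coxph fits, the risk-score direction is handled internally
  se <- sqrt(cc$var)
  c(C_index = unname(cc$concordance),
    CI_lower = unname(cc$concordance - 1.96 * se),
    CI_upper = unname(cc$concordance + 1.96 * se),
    AIC = AIC(fit))
}

comparison <- rbind(age = summarise_fit(fit_age), ph.ecog = summarise_fit(fit_ecog))
print(round(comparison, 3))
```

Restricting both fits to identical rows matters because AIC values are only comparable between models estimated on the same observations.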
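Use the following code to build the XGBoost model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC curve and the calibration curve. The data come from the SEER database.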
library(xgboost)
library(caret)
library(pROC)
train_data <- read.csv("train_data.csv")
test_data <- read.csv("test_data.csv")
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c('Age', 'T', 'N', 'M', 'LODDS', 'Chemotherapy')]), label = train_data$status)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c('Age', 'T', 'N', 'M', 'LODDS', 'Chemotherapy')]), label = test_data$status)
params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "auc", eta = 0.1, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8)
xgb_model <- xgb.train(params = params, data = train_matrix, nrounds = 100, watchlist = list(train = train_matrix), verbose = 1)
# Predicted probabilities and class labels at the 0.5 threshold
pred_probs <- predict(xgb_model, newdata = test_matrix)
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)
conf_matrix <- confusionMatrix(as.factor(pred_labels), as.factor(test_data$status))
roc_curve <- roc(test_data$status, pred_probs)
auc_value <- auc(roc_curve)
ci_auc <- ci.auc(roc_curve)
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
accuracy <- conf_matrix$overall["Accuracy"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
result_table <- data.frame(Model = "XGBoost", AUC = sprintf("%.3f (%.3f-%.3f)", auc_value, ci_auc[1], ci_auc[3]), Sensitivity = sprintf("%.3f", sensitivity), Specificity = sprintf("%.3f", specificity), Accuracy = sprintf("%.3f", accuracy), PPV = sprintf("%.3f", ppv), NPV = sprintf("%.3f", npv))
write.csv(result_table, "xgboost_model_performance.csv", row.names = FALSE)
# ROC curve
roc_df <- data.frame(FPR = 1 - roc_curve$specificities, TPR = roc_curve$sensitivities)
roc_plot <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  annotate("text", x = 0.9, y = 0.2, label = paste("AUC =", round(auc_value, 3)), size = 5, color = "black") +
  labs(title = "ROC Curve for XGBoost Model", x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve
calibration_data <- data.frame(status = as.factor(test_data$status), pred_probs = pred_probs)
calib_model <- calibration(status ~ pred_probs, data = calibration_data, class = "1", cuts = 5)
ggplot(calib_model$data, aes(x = midpoint, y = Percent)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "red", size = 2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(title = "Calibration Curve for XGBoost Model", x = "Predicted Probability", y = "Observed Proportion") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 0.5))
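The performance metrics reported for each ML model can be reproduced by hand on a small example, which helps when checking the package output. This is a base R sketch on toy labels and probabilities (not study data); the 0.5 threshold mirrors the one used in the code above, and class 1 is treated as the positive class.

```r
# Sketch: AUC (rank formulation) and threshold-based metrics in base R.
# Toy labels and predicted probabilities, not study data.
y <- c(0, 0, 1, 1, 0, 1, 1, 0)
p <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.90, 0.60)

# AUC = probability that a random positive is ranked above a random negative.
n_pos <- sum(y == 1)
n_neg <- sum(y == 0)
auc_value <- (sum(rank(p)[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Confusion-matrix counts at the 0.5 threshold.
pred <- as.integer(p > 0.5)
tp <- sum(pred == 1 & y == 1)
fp <- sum(pred == 1 & y == 0)
fn <- sum(pred == 0 & y == 1)
tn <- sum(pred == 0 & y == 0)

metrics <- c(
  AUC = auc_value,
  Sensitivity = tp / (tp + fn),
  Specificity = tn / (tn + fp),
  Accuracy = (tp + tn) / length(y),
  PPV = tp / (tp + fp),
  NPV = tn / (tn + fn)
)
print(round(metrics, 3))
```

When comparing against caret, note that confusionMatrix treats the first factor level as the positive class unless the positive argument is set, so sensitivity and specificity may appear swapped relative to this sketch.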
Use the following code to build the RF model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC curve and the calibration curve. The data come from the SEER database.
library(randomForest)
library(dplyr)
library(ggplot2)
library(pROC)
library(caret)
library(rms)
trainset <- read.csv("train_data.csv")
testset <- read.csv("test_data.csv")
trainset$status <- factor(trainset$status)
variables1 <- c("Age", "T", "N", "M", "LODDS", "Chemotherapy")
trainset <- trainset %>%
  mutate(across(all_of(variables1), as.numeric))
testset$status <- factor(testset$status)
testset <- testset %>%
  mutate(across(all_of(variables1), as.numeric))
RF <- randomForest(status ~ Age + T + N + M + LODDS + Chemotherapy, data = trainset, ntree = 100, importance = TRUE, proximity = TRUE)
# Variable importance
imp <- importance(RF)
varImpPlot(RF)
impvar <- rownames(imp)[order(imp[, 4], decreasing = TRUE)]
importance_df <- as.data.frame(imp)
importance_df$Variables <- rownames(importance_df)
importance_plot <- ggplot(importance_df, aes(x = reorder(Variables, MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance", x = "Variables", y = "Mean Decrease Accuracy") +
  theme_minimal()
# ROC curve
pred_probs <- predict(RF, testset, type = "prob")[, 2]
roc_obj <- roc(testset$status, pred_probs)
auc_value <- auc(roc_obj)
roc_plot <- ggplot() +
  geom_line(aes(x = 1 - roc_obj$specificities, y = roc_obj$sensitivities), color = "steelblue", size = 1.2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  annotate("text", x = 0.8, y = 0.2, label = paste("AUC =", round(auc_value, 3)), color = "black", size = 5, hjust = 0) +
  labs(title = "ROC Curve for Random Forest Model", x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve
calibration_data <- data.frame(pred_probs = pred_probs, status = testset$status)
calib_model <- calibration(status ~ pred_probs, data = calibration_data, class = "1", cuts = 5)
calib_df <- as.data.frame(calib_model[["data"]])
calib_df$mid <- calib_df$midpoint
calib_df$Percent <- calib_df$Percent
calibration_plot <- ggplot(calib_df, aes(x = mid, y = Percent)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "steelblue", size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.8) +
  labs(title = "Calibration Curve for Random Forest", x = "Predicted Probability", y = "Actual Probability") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1), plot.title = element_text(hjust = -0.05, vjust = -1.5, face = "bold", size = 12))
# Performance metrics
rf_probs <- predict(RF, newdata = testset, type = "prob")[, 2]
rf_auc <- roc(testset$status, rf_probs)
auc_value <- auc(rf_auc)
ci_auc <- ci.auc(rf_auc)
rf_predictions <- predict(RF, newdata = testset)
conf_matrix <- confusionMatrix(rf_predictions, testset$status)
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
accuracy <- conf_matrix$overall["Accuracy"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
result_table <- data.frame(Model = "RF", AUC = sprintf("%.3f (%.3f-%.3f)", auc_value, ci_auc[1], ci_auc[3]), Sensitivity = sprintf("%.3f", sensitivity), Specificity = sprintf("%.3f", specificity), Accuracy = sprintf("%.3f", accuracy), PPV = sprintf("%.3f", ppv), NPV = sprintf("%.3f", npv))
write.csv(result_table, "RF_model_performance.csv", row.names = FALSE)
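The Mean Decrease Accuracy measure plotted above is a permutation-based importance: a variable matters if shuffling it degrades predictions. The sketch below illustrates that idea only and makes several assumptions: it uses simulated data, a logistic regression in place of a random forest, and in-sample accuracy, whereas randomForest evaluates the drop on out-of-bag samples for each tree.

```r
# Sketch: permutation importance on simulated data with a logistic regression.
set.seed(123)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # x2 is pure noise
sim$y <- rbinom(n, 1, plogis(3 * sim$x1))         # outcome depends on x1 only

fit <- glm(y ~ x1 + x2, data = sim, family = binomial)

accuracy <- function(model, data) {
  mean((predict(model, newdata = data, type = "response") > 0.5) == data$y)
}

baseline <- accuracy(fit, sim)
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and the outcome
  baseline - accuracy(fit, shuffled)       # drop in accuracy after shuffling
})
print(round(perm_importance, 3))
```

Shuffling the informative variable x1 should reduce accuracy substantially, while shuffling the noise variable x2 should leave it nearly unchanged, which is the pattern a ranking of LODDS, LNR, and pN stage by this measure relies on.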
Use the following code to build the NN model and generate a bar chart of relative variable importance to compare the importance of the three LN systems. Likewise, generate the ROC curve and the calibration curve. The data come from the SEER database.
library(nnet)
library(caret)
library(pROC)
library(ggplot2)
train_data <- read.csv("train_data.csv")
test_data <- read.csv("test_data.csv")
train_data$status <- as.factor(train_data$status)
test_data$status <- as.factor(test_data$status)
features <- c("Age", "T", "N", "M", "LODDS", "Chemotherapy")
x_train <- train_data[, features]
y_train <- train_data$status
x_test <- test_data[, features]
y_test <- test_data$status
nn_model <- nnet(status ~ Age + T + N + M + LODDS + Chemotherapy, data = train_data, size = 5, decay = 0.01, maxit = 200)
# predict() returns a one-column matrix; convert it to a plain numeric vector
pred_probs <- as.numeric(predict(nn_model, newdata = x_test, type = "raw"))
pred_labels <- ifelse(pred_probs > 0.5, 1, 0)
roc_curve <- roc(as.numeric(y_test), pred_probs)
auc_value <- auc(roc_curve)
auc_ci <- ci.auc(roc_curve)
auc_text <- paste0(round(auc_value, 3), " (", round(auc_ci[1], 3), "-", round(auc_ci[3], 3), ")")
conf_matrix <- confusionMatrix(as.factor(pred_labels), y_test)
accuracy <- conf_matrix$overall["Accuracy"]
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
ppv <- conf_matrix$byClass["Pos Pred Value"]
npv <- conf_matrix$byClass["Neg Pred Value"]
performance_table <- data.frame(Metric = c("AUC (95% CI)", "Accuracy", "Sensitivity", "Specificity", "PPV", "NPV"), Value = c(auc_text, round(accuracy, 3), round(sensitivity, 3), round(specificity, 3), round(ppv, 3), round(npv, 3)))
write.csv(performance_table, "NN_performance_table.csv", row.names = FALSE)
# ROC curve
roc_curve <- roc(y_test, pred_probs)
auc_value <- auc(roc_curve)
roc_plot <- ggplot() +
  geom_line(aes(x = 1 - roc_curve$specificities, y = roc_curve$sensitivities), color = "steelblue", size = 1.2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  annotate("text", x = 0.8, y = 0.2, label = paste("AUC =", round(auc_value, 3)), color = "black", size = 5, hjust = 0) +
  labs(title = "ROC Curve for Neural Network Model", x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Calibration curve: mean observed outcome versus mean predicted probability in five bins
calibration_data <- data.frame(pred_probs = pred_probs, status = as.numeric(y_test) - 1)
calibration_data$pred_probs <- as.numeric(calibration_data$pred_probs)
calibration_data$calibration_bin <- cut(calibration_data$pred_probs, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
calibration_summary <- aggregate(status ~ calibration_bin, data = calibration_data, FUN = mean)
calibration_summary$pred_mean <- aggregate(pred_probs ~ calibration_bin, data = calibration_data, FUN = mean)$pred_probs
calibration_plot <- ggplot(calibration_summary, aes(x = pred_mean, y = status)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "red", size = 3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.8) +
  labs(title = "Calibration Curve for Neural Network", x = "Predicted Probability", y = "Actual Probability") +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))
# Variable importance
nn_var_importance <- varImp(nn_model)
importance_df <- data.frame(Feature = rownames(nn_var_importance), Importance = nn_var_importance$Overall)
importance_plot <- ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance in Neural Network", x = "Feature", y = "Importance") +
  theme_minimal()

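3. Competing risk model development and validation

Use the following code to perform univariate analysis and plot the cumulative incidence function (CIF) curves. data.csv contains the data from the SEER database. Subsequent images are saved in the same way as in this step. Replace Site in the code with each of the other factors in turn to perform univariate analysis for every factor.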
library(tidycmprsk)
library(gtsummary)
library(ggplot2)
library(ggsurvfit)
library(ggprism)
aa <- read.csv("data.csv")
cif2 <- tidycmprsk::cuminc(Surv(time, Status1) ~ Site, data = aa)
tidy(cif2, times = c(12, 24, 36, 48, 60))
tbl_cuminc(cif2, times = c(12, 24, 36, 48, 60), outcomes = c("CSS", "OSS"), estimate_fun = NULL, label_header = "**{time/12}-year cuminc**") %>%
  add_p() %>%
  add_n(location = "level")
cuminc_plot <- ggcuminc(cif2, outcome = c("CSS", "OSS"), size = 1.5) +
  labs(x = "time") +
  add_quantile(y_value = 0.20, size = 1) +
  scale_x_continuous(breaks = seq(0, 84, by = 12), limits = c(0, 84)) +
  scale_y_continuous(label = scales::percent, breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
  theme_prism() +
  theme(legend.position = c(0.2, 0.8), panel.grid = element_blank(), panel.grid.major.y = element_line(colour = "grey80")) +
  theme(legend.spacing.x = unit(0.1, "cm"), legend.spacing.y = unit(0.01, "cm")) +
  theme(axis.ticks.length.x = unit(-0.2, "cm"), axis.ticks.x = element_line(color = "black", size = 1, lineend = 1)) +
  theme(axis.ticks.length.y = unit(-0.2, "cm"), axis.ticks.y = element_line(color = "black", size = 1, lineend = 1))
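To make explicit what the CIF curves estimate, the nonparametric estimator can be written in a few lines of base R. This is an illustrative sketch on toy data, not the tidycmprsk implementation; the function name cif_by_hand and the status coding (0 = censored, 1 = cancer-specific death, 2 = death from other causes) are assumptions.

```r
# Sketch: nonparametric cumulative incidence for one cause in the presence of
# competing events. Status coding is assumed: 0 = censored, 1 and 2 = causes.
cif_by_hand <- function(time, status, cause) {
  event_times <- sort(unique(time[status != 0]))
  surv_prev <- 1      # overall event-free survival just before the current time
  cum <- 0
  out <- numeric(length(event_times))
  for (j in seq_along(event_times)) {
    t <- event_times[j]
    at_risk <- sum(time >= t)
    d_cause <- sum(time == t & status == cause)
    d_any <- sum(time == t & status != 0)
    cum <- cum + surv_prev * d_cause / at_risk      # increment for this cause
    surv_prev <- surv_prev * (1 - d_any / at_risk)  # update all-cause survival
    out[j] <- cum
  }
  data.frame(time = event_times, cif = out)
}

time   <- c(1, 2, 3, 4, 5, 6)
status <- c(1, 2, 1, 0, 2, 1)

cif_cause1 <- cif_by_hand(time, status, cause = 1)
# Naive alternative: treat competing deaths as censored, which equals 1 - Kaplan-Meier.
naive_cause1 <- cif_by_hand(time, ifelse(status == 2, 0, status), cause = 1)

print(c(cif = tail(cif_cause1$cif, 1), naive = tail(naive_cause1$cif, 1)))
```

In this toy example the naive estimate reaches 1 while the CIF ends at 7/12, which shows how censoring competing deaths overstates the cumulative probability of the event of interest.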
Use the following code to perform multivariate analysis and visualization. data1.csv comes from the results of the previous code. After running the code, click Export, then Save as PDF, and finally Save to save the image.
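library(tidycmprsk)
library(gtsummary)
aa <- read.csv('data1.csv')
for (i in names(aa)[c(1:16, 19)]) { aa[, i] <- as.factor(aa[, i]) }
# mul1 is the fitted multivariate competing risk model; the call that creates it is not shown in the source
table2 <- mul1 %>%
  gtsummary::tbl_regression(exponentiate = TRUE) %>%
  add_n(location = "level"); table2
table_df <- as_tibble(table2)
tab <- table2$table_body
tab1 <- tab[, c(12, 19, 20, 22:29)]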
Use the following code to plot the nomogram, ROC curves, and calibration curves. After training the model on the training cohort data, validate it using the validation and external validation cohort data. The external cohort consists of the colorectal cancer samples other than signet ring cell carcinoma selected in step 1.4.
library(QHScrnomo)
library(rms)
library(timeROC)
library(survival)
aa <- read.csv('data3.csv')
for (i in names(aa)[c(1:16, 19)]) { aa[, i] <- as.factor(aa[, i]) }
dd <- datadist(aa)
options(datadist = "dd")
mul <- cph(Surv(time, Status1 == 1) ~ T + N + M + LODDS + Site, data = aa, x = TRUE, y = TRUE, surv = TRUE)
m3 <- crr.fit(mul, failcode = 1, cencode = 0)
nomo <- Newlabels(fit = m3, labels = c(T = "T", N = "N", M = "M", LODDS = "LODDS", Site = "Site"))
# The opening of the next call, including the T-stage labels, is garbled in the source; the remaining levels are reproduced as given
nomo <- Newlevels(fit = nomo, levels = list(N = c("N0", "N1", "N2"), M = c("M0", "M1"), LODDS = c("LODDS1", "LODDS2", "LODDS3"), Site = c("RSC", "LSC", "Rectum")))
nomogram.crr(fit = nomo, lp = F, xfrac = 0.3, fun.at = seq(from = 0, to = 1, by = 0.1), failtime = c(12, 36, 60), funlabel = c("1-year CSS cumulative incidence", "3-year CSS cumulative incidence", "5-year CSS cumulative incidence"))
# Predicted risks at 12, 36, and 60 months for the time-dependent ROC curves
time_points <- c(12, 36, 60)
pred_risks_list <- lapply(time_points, function(time_point) { predict(m3, newdata = aa, type = "risk", time = time_point) })
pred_risks_df <- data.frame(do.call(cbind, pred_risks_list))
colnames(pred_risks_df) <- paste("risk_at", time_points, "months", sep = "_")
roc_1year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_12_months, cause = 1, times = 12, iid = TRUE)
roc_3year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_36_months, cause = 1, times = 36, iid = TRUE)
roc_5year <- timeROC(T = aa$time, delta = ifelse(aa$Status1 == "CSS", 1, 0), marker = pred_risks_df$risk_at_60_months, cause = 1, times = 60, iid = TRUE)
legend("bottomright", legend = c("1-year CSS", "3-year CSS", "5-year CSS"), col = c("#BF1D2D", "#262626", "#397FC7"), lwd = 2)
# Ten-fold cross-validated predictions, C-index, and calibration at 36 months
sas.cmprsk(m3, time = 36)
set.seed(123)
aa$pro <- tenf.crr(m3, time = 36)
cindex(prob = aa$pro, fstatus = aa$Status1, ftime = aa$time, type = "crr", failcode = 1, cencode = 0, tol = 1e-20)
groupci(x = aa$pro, ftime = aa$time, fstatus = aa$Status1, failcode = 1, cencode = 0, ci = TRUE, g = 5, m = 1000, u = 36, xlab = "Predicted probability", ylab = "Actual probability", lty = 1, lwd = 2, col = "#262626", xlim = c(0, 1.0), ylim = c(0, 1.0), add = TRUE)

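Results

Patient characteristics
This study focused on patients diagnosed with colorectal SRCC, using SEER database data from 2004 to 2015. Exclusion criteria were survival of less than 1 month, incomplete clinicopathological information, and an unclear or unspecified cause of death. A total of 2,409 colorectal SRCC patients who met the inclusion criteria were divided into a training cohort (N = 1,686) and ...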

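Discussion

Colorectal SRCC is a rare and distinct subtype of colorectal cancer (CRC) with a poor prognosis. More attention should therefore be paid to the prognosis of SRCC patients. Accurate survival prediction for SRCC patients is critical for determining prognosis and making individualized treatment decisions. In this study, we investigated the relationship between the clinical features and prognosis of SRCC patients and, using the SEER database, the optimal LN staging ...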

Disclosures

The authors have no financial conflicts of interest to disclose.

Acknowledgements

None

Materials

Name	Company	Catalog Number	Comments
SEER database	National Cancer Institute at NIH
X-tile software	Yale School of Medicine
RStudio	Posit

References

Siegel, R. L., Giaquinto, A. N., Jemal, A. Cancer statistics, 2024. CA Cancer J Clin. 74 (1), 12-49 (2024).
Korphaisarn, K., et al. Signet ring cell colorectal cancer: Genomic insights into a rare subpopulation of colorectal adenocarcinoma. Br J Cancer. 121 (6), 505-510 (2019).
Willauer, A. N., et al. Clinical and molecular characterization of early-onset colorectal cancer. Cancer. 125 (12), 2002-2010 (2019).
Watanabe, A., et al. A case of primary colonic signet ring cell carcinoma in a young man which preoperatively mimicked Phlebosclerotic colitis. Acta Med Okayama. 73 (4), 361-365 (2019).
Kim, H., Kim, B. H., Lee, D., Shin, E. Genomic alterations in signet ring and mucinous patterned colorectal carcinoma. Pathol Res Pract. 215 (10), 152566 (2019).
Deng, X., et al. Neoadjuvant radiotherapy versus surgery alone for stage II/III mid-low rectal cancer with or without high-risk factors: A prospective multicenter stratified randomized trial. Ann Surg. 272 (6), 1060-1069 (2020).
Buk Cardoso, L., et al. Machine learning for predicting survival of colorectal cancer patients. Sci Rep. 13 (1), 8874 (2023).
Monterrubio-Gómez, K., Constantine-Cooke, N., Vallejos, C. A. A review on statistical and machine learning competing risks methods. Biom J. 66 (2), e2300060 (2024).
Kim, H. J., Choi, G. S. Clinical implications of lymph node metastasis in colorectal cancer: Current status and future perspectives. Ann Coloproctol. 35 (3), 109-117 (2019).
Xu, T., et al. Log odds of positive lymph nodes is an excellent prognostic factor for patients with rectal cancer after neoadjuvant chemoradiotherapy. Ann Transl Med. 9 (8), 637 (2021).
Chen, Y. R., et al. Prognostic performance of different lymph node classification systems in young gastric cancer. J Gastrointest Oncol. 12 (4), 1285-1300 (2021).
Bouvier, A. M., et al. How many nodes must be examined to accurately stage gastric carcinomas? Results from a population based study. Cancer. 94 (11), 2862-2866 (2002).
Coburn, N. G., Swallow, C. J., Kiss, A., Law, C. Significant regional variation in adequacy of lymph node assessment and survival in gastric cancer. Cancer. 107 (9), 2143-2151 (2006).
Li Destri, G., Di Carlo, I., Scilletta, R., Scilletta, B., Puleo, S. Colorectal cancer and lymph nodes: the obsession with the number 12. World J Gastroenterol. 20 (8), 1951-1960 (2014).
Dinaux, A. M., et al. Outcomes of persistent lymph node involvement after neoadjuvant therapy for stage III rectal cancer. Surgery. 163 (4), 784-788 (2018).
Sun, Y., Zhang, Y., Huang, Z., Chi, P. Prognostic implication of negative lymph node count in ypN+ rectal cancer after neoadjuvant chemoradiotherapy and construction of a prediction nomogram. J Gastrointest Surg. 23 (5), 1006-1014 (2019).
Xu, Z., Jing, J., Ma, G. Development and validation of prognostic nomogram based on log odds of positive lymph nodes for patients with gastric signet ring cell carcinoma. Chin J Cancer Res. 32 (6), 778-793 (2020).
Scarinci, A., et al. The impact of log odds of positive lymph nodes (LODDS) in colon and rectal cancer patient stratification: a single-center analysis of 323 patients. Updates Surg. 70 (1), 23-31 (2018).
Nitsche, U., et al. Prognosis of mucinous and signet-ring cell colorectal cancer in a population-based cohort. J Cancer Res Clin Oncol. 142 (11), 2357-2366 (2016).
Kang, H., O'Connell, J. B., Maggard, M. A., Sack, J., Ko, C. Y. A 10-year outcomes evaluation of mucinous and signet-ring cell carcinoma of the colon and rectum. Dis Colon Rectum. 48 (6), 1161-1168 (2005).
Sung, C. O., et al. Clinical significance of signet ring cells in colorectal mucinous adenocarcinoma. Mod Pathol. 21 (12), 1533-1541 (2008).
Alvi, M. A., et al. Molecular profiling of signet ring cell colorectal cancer provides a strong rationale for genomic targeted and immune checkpoint inhibitor therapies. Br J Cancer. 117 (2), 203-209 (2017).
Brownlee, S., et al. Evidence for overuse of medical services around the world. Lancet. 390 (10090), 156-168 (2017).
