Here, a new model for thyroid nodule detection in ultrasound images is proposed, which uses the Swin Transformer as its backbone to perform long-range context modeling. Experiments show that it performs well in terms of sensitivity and accuracy.
In recent years, the incidence of thyroid cancer has been increasing. Thyroid nodule detection is critical for both the diagnosis and treatment of thyroid cancer. Convolutional neural networks (CNNs) have achieved good results in thyroid ultrasound image analysis tasks. However, due to the limited effective receptive field of convolutional layers, CNNs fail to capture the long-range contextual dependencies that are important for identifying thyroid nodules in ultrasound images. Transformer networks are effective at capturing long-range contextual information. Inspired by this, we propose a novel thyroid nodule detection method that combines a Swin Transformer backbone with Faster R-CNN. Specifically, an ultrasound image is first projected into a 1D sequence of patch embeddings, which is then fed into a hierarchical Swin Transformer.
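The patch-embedding step can be sketched as follows (a minimal illustration, assuming 4 x 4 patches and a 96-dimensional embedding as in the standard Swin-T configuration; the random matrix stands in for the learned linear projection):

```python
import numpy as np

def patch_embed(image, patch=4, dim=96, rng=np.random.default_rng(0)):
    h, w, c = image.shape
    # Cut the image into non-overlapping patch x patch blocks.
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # Linearly project each flattened patch to a 1D token embedding.
    proj = rng.standard_normal((patch * patch * c, dim))
    return patches @ proj  # (num_tokens, dim)

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (3136, 96): a 1D sequence of 56 * 56 embeddings
```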
The Swin Transformer backbone extracts features at five different scales by computing self-attention within shifted windows. A feature pyramid network (FPN) then fuses the features from the different scales. Finally, a detection head predicts bounding boxes and the corresponding confidence scores. Data collected from 2,680 patients were used to conduct the experiments, and the results showed that this method achieved the best mAP score of 44.8%, outperforming CNN-based baselines. In addition, it achieved higher sensitivity (90.5%) than competing methods. This indicates that the context modeling in this model is effective for thyroid nodule detection.
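Both mAP and sensitivity rest on matching predicted boxes to ground-truth boxes by intersection over union (IoU); a minimal sketch of that underlying computation:

```python
def iou(a, b):
    # Boxes as [x1, y1, x2, y2]; overlap region corners.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333...: half-overlapping boxes
```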
The incidence of thyroid cancer has increased rapidly since 1970, especially among middle-aged women1. Thyroid nodules may presage the emergence of thyroid cancer, and most thyroid nodules are asymptomatic2. The early detection of thyroid nodules greatly improves the prospects of curing thyroid cancer. Therefore, according to current practice guidelines, all patients with suspected nodular goiter on physical examination or with abnormal imaging findings should undergo further examination3,4.
Thyroid ultrasound (US) is a common method used to detect and characterize thyroid lesions5,6. US is a convenient, inexpensive, and radiation-free technology. However, the application of US is easily affected by the operator7,8. Features such as the shape, size, echogenicity, and texture of thyroid nodules are easily distinguishable on US images. Although certain US features (calcifications, echogenicity, and irregular borders) are often considered criteria for identifying thyroid nodules, interobserver variability is unavoidable8,9. Diagnostic results differ among radiologists with different levels of experience, and inexperienced radiologists are more likely to misdiagnose than experienced ones. Characteristics of US such as reflections, shadows, and echoes can degrade image quality. This degradation, inherent to US imaging, makes it difficult for even experienced physicians to locate nodules accurately.
Computer-aided diagnosis (CAD) for thyroid nodules has developed rapidly in recent years and can effectively reduce errors caused by different physicians and help radiologists diagnose nodules quickly and accurately10,11. Various CNN-based CAD systems have been proposed for thyroid US nodule analysis, including segmentation12,13, detection14,15, and classification16,17. A CNN is a multilayer, supervised learning model18 whose core modules are the convolution and pooling layers. The convolution layers are used for feature extraction, and the pooling layers are used for downsampling. The shallow convolutional layers extract primary features such as texture, edges, and contours, while the deep convolutional layers learn high-level semantic features.
CNNs have had great success in computer vision19,20,21. However, CNNs fail to capture long-range contextual dependencies due to the limited effective receptive field of the convolutional layers. In the past, backbone architectures for image classification mostly used CNNs. With the advent of the Vision Transformer (ViT)22,23, this trend has changed, and many state-of-the-art models now use transformers as backbones. Operating on non-overlapping image patches, ViT uses a standard transformer encoder25 to model spatial relationships globally. The Swin Transformer24 further introduces shifted windows to learn features. The shifted windows not only improve efficiency but also greatly reduce the sequence length, because self-attention is computed within each window, while the shifting operation enables interaction between adjacent windows. The successful application of the Swin Transformer in computer vision has led to the investigation of transformer-based architectures for ultrasound image analysis26.
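The windowed computation can be illustrated as follows (a sketch on a hypothetical 8 x 8 token map with window size 4; self-attention would be computed inside each window, and the cyclic shift by half a window lets adjacent windows exchange information):

```python
import numpy as np

def window_partition(tokens, win):
    # Group an (H, W, C) token map into non-overlapping win x win windows.
    h, w, c = tokens.shape
    return (tokens.reshape(h // win, win, w // win, win, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, win * win, c))

tokens = np.arange(8 * 8).reshape(8, 8, 1).astype(float)
regular = window_partition(tokens, 4)  # 4 windows of 16 tokens each
# Cyclic shift by half a window before re-partitioning (the "shifted" pass).
shifted = window_partition(np.roll(tokens, (-2, -2), axis=(0, 1)), 4)
print(regular.shape, shifted.shape)  # (4, 16, 1) (4, 16, 1)
```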
Recently, Li et al. proposed a deep learning approach28 for thyroid papillary cancer detection inspired by Faster R-CNN27. Faster R-CNN is a classic CNN-based object detection architecture. The original Faster R-CNN has four modules: the CNN backbone, the region proposal network (RPN), the ROI pooling layer, and the detection head. The CNN backbone uses a set of basic conv+bn+relu+pooling layers to extract feature maps from the input image. The feature maps are then fed into the RPN and the ROI pooling layer. The role of the RPN is to generate region proposals: it uses softmax to determine whether anchors are positive and refines the anchors by bounding box regression. The ROI pooling layer extracts proposal feature maps from the input feature maps and proposals and feeds them into the subsequent detection head. The detection head uses the proposal feature maps to classify objects and obtains accurate positions of the detection boxes by bounding box regression.
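The bounding box regression step can be sketched as follows (the standard delta parameterization used by Faster R-CNN-style heads; the anchor and deltas here are illustrative values):

```python
import numpy as np

def decode(anchor, delta):
    # Refine an anchor [x1, y1, x2, y2] with predicted deltas (dx, dy, dw, dh).
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    dx, dy, dw, dh = delta
    cx, cy = cx + dx * w, cy + dy * h      # shift the box center
    w, h = w * np.exp(dw), h * np.exp(dh)  # rescale width and height
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

print(decode([0, 0, 10, 10], [0.0, 0.0, 0.0, 0.0]))  # zero deltas: [0.0, 0.0, 10.0, 10.0]
```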
This paper presents a new thyroid nodule detection network, Swin Faster R-CNN, formed by replacing the CNN backbone in Faster R-CNN with the Swin Transformer, which enables better extraction of features for nodule detection from ultrasound images. In addition, a feature pyramid network (FPN)29 is used to improve the model's detection performance for nodules of different sizes by aggregating features of different scales.
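The backbone swap can be sketched as an mmdetection-style config fragment (field names follow mmdetection conventions; the values shown are illustrative Swin-T defaults, not necessarily the paper's exact settings):

```python
# Faster R-CNN keeps its RPN and ROI head, but the backbone becomes a
# Swin Transformer and an FPN fuses the multi-scale stage outputs.
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,
        depths=(2, 2, 6, 2),        # four Swin stages
        num_heads=(3, 6, 12, 24),
        window_size=7,
        out_indices=(0, 1, 2, 3)),  # expose all stages to the neck
    neck=dict(
        type='FPN',
        in_channels=[96, 192, 384, 768],  # Swin-T stage channel widths
        out_channels=256,
        num_outs=5),                # five scales fed to the detection heads
    rpn_head=dict(type='RPNHead', in_channels=256),
    roi_head=dict(type='StandardRoIHead'))
print(model['backbone']['type'])
```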
This retrospective study was approved by the institutional review board of the West China Hospital, Sichuan University, Sichuan, China, and the requirement to obtain informed consent was waived.
1. Environment setup
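A possible setup matching the versions in the materials table might look like this (a sketch; the exact mmcv build and CUDA flags depend on the local driver):

```shell
# Create an isolated environment with the versions from the materials table.
conda create -n swin-frcnn python=3.8 -y
conda activate swin-frcnn
pip install torch==1.7.1 torchvision==0.8.2
pip install mmcv-full          # mmdetection's core dependency
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection && git checkout v2.11.0 && pip install -e .
```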
2. Data preparation
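mmdetection expects detection annotations in COCO format; a minimal sketch of such a file follows (the image name, nodule box, and category here are made-up placeholders):

```python
import json

# A single image with one annotated nodule in COCO's JSON schema.
coco = {
    "images": [{"id": 1, "file_name": "thyroid_0001.png",
                "width": 640, "height": 480}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [120, 80, 60, 45],  # [x, y, width, height]
                     "area": 60 * 45, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "nodule"}],
}
with open("train_coco.json", "w") as f:
    json.dump(coco, f)
print(len(coco["annotations"]))  # 1
```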
3. Swin Faster RCNN configuration
4. Training the Swin Faster R-CNN
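Training with mmdetection is typically launched as follows (the config filename and work directory are placeholders for the files prepared in the previous steps):

```shell
# Launch training; checkpoints and logs land in the work directory.
python tools/train.py configs/swin_faster_rcnn_thyroid.py \
    --work-dir work_dirs/swin_faster_rcnn_thyroid
```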
5. Performing thyroid nodule detection on new images
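Post-processing of the detector's output on a new image can be sketched as follows (detections are assumed to be [x1, y1, x2, y2, score] rows, the format mmdetection-style models return; the values are made up):

```python
import numpy as np

# Hypothetical raw detections for one image.
detections = np.array([
    [120.0,  80.0, 180.0, 125.0, 0.93],  # confident nodule
    [300.0, 200.0, 340.0, 230.0, 0.31],  # low-confidence candidate
])
keep = detections[detections[:, 4] > 0.5]  # confidence threshold
for x1, y1, x2, y2, score in keep:
    print(f"nodule at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), score {score:.2f}")
```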
The thyroid US images were collected from two hospitals in China from September 2008 to February 2018. The eligibility criteria for including the US images in this study were conventional US examination before biopsy and surgical treatment, diagnosis with biopsy or postsurgical pathology, and age ≥ 18 years. The exclusion criteria were images without thyroid tissues.
The 3,000 ultrasound images included 1,384 malignant and 1,616 benign nodules. The majority (90%) of the malignant nodules...
This paper describes in detail how to perform the environment setup, data preparation, model configuration, and network training. In the environment setup phase, care must be taken to ensure that the dependent libraries are compatible and matched. Data processing is a very important step; time and effort must be spent to ensure the accuracy of the annotations. When training the model, a "ModuleNotFoundError" may be encountered. In this case, the "pip install" command must be used to install the missing library.
The authors declare no conflicts of interest.
This study was supported by the National Natural Science Foundation of China (Grant No.32101188) and the General Project of Science and Technology Department of Sichuan Province (Grant No. 2021YFS0102), China.
| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| GPU RTX3090 | Nvidia | 1 | 24G GPU |
| mmdetection2.11.0 | SenseTime | 4 | https://github.com/open-mmlab/mmdetection.git |
| python3.8 | — | 2 | https://www.python.org |
| pytorch1.7.1 | — | 3 | https://pytorch.org |
Copyright © 2025 MyJoVE Corporation. All rights reserved