The authors would like to thank Tamara Bucher from the University of Newcastle, Australia, for providing the fake-food image dataset. This work was supported by the European Union&#39;s Horizon 2020 research and innovation programs (grant numbers 863059 - FNS-Cloud, 769661 - SAAM); and the Slovenian Research Agency (grant number P2-0098). The European Union and Slovenian Research Agency had no role in the design, analysis or writing of this article.

1. Food image recognition with NutriNet
<ol>
	<li>Obtaining the food image dataset
	<ol>
		<li>Gather a list of different foods and beverages that will be the outputs of the food image recognition model. A varied list of popular foods and beverages is preferred, as that will allow the training of a robust food image recognition model.</li>
		<li>Save the food and beverage list in a text file (e.g., &#39;txt&#39; or &#39;csv&#39;). 
		NOTE: The text file used by the authors of this article can be found in the supplemental files (&#39;food_items.txt&#39;) and includes a list of 520 Slovenian food items.</li>
		<li>Write or download a Python43 script that uses the Google Custom Search API44 to download images of each food item from the list and saves them into a separate folder for each food item. 
		&#8203;NOTE: The Python script used by the authors of this article can be found in the supplemental files (&#39;download_images.py&#39;). If this script is used, the Developer Key (variable &#39;developerKey&#39;, line 8 in the Python script code) and Custom Search Engine ID (variable &#39;cx&#39;, line 28 in the Python script code) need to be replaced with values specific to the Google account being used.</li>
		<li>Run the Python script from step 1.1.3 (e.g., with the command: &#39;python download_images.py&#39;).</li>
	</ol>
	</li>
	<li>(Optional) Cleaning the food image dataset
	<ol>
		<li>Train a food image detection model in the same way as in section 1.4, except use only two outputs (food, non-food) as opposed to the list of outputs from step 1.1.1. 
		NOTE: The authors of this article used images combined from recipe websites and the ImageNet dataset45 to train the food image detection model. Since the focus here is on food image recognition and this is an optional step for cleaning the recognition dataset, further details are omitted. Instead, more details about this approach can be found in Mezgec et al.2.</li>
		<li>Run the detection model from step 1.2.1 on the food image dataset that is the result of step 1.1.4.</li>
		<li>Delete every image that was tagged as non-food by the detection model from step 1.2.1.</li>
		<li>Manually check the food image dataset for other erroneous or low-quality images, and for image duplicates.</li>
		<li>Delete images found in step 1.2.4.</li>
	</ol>
	</li>
	<li>Augmenting the food image dataset
	<ol>
		<li>Create a new version of each image from the food image dataset by rotating it by 90&#176; using the CLoDSA library46 (lines 19 to 21 in the included Python script). 
		NOTE: The Python script containing all the CLoDSA commands used by the authors of this article can be found in a file included in the supplemental files (&#39;nutrinet_augmentation.py&#39;). If this script is used, the Input Path (variable &#39;INPUT_PATH&#39;, line 8 in the Python script code) and Output Path (variable &#39;OUTPUT_PATH&#39;, line 11 in the Python script code) need to be replaced with paths to the desired folders.</li>
		<li>Create a new version of each image from the food image dataset by rotating it by 180&#176; using the CLoDSA library (lines 19 to 21 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by rotating it by 270&#176; using the CLoDSA library (lines 19 to 21 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by flipping it horizontally using the CLoDSA library (lines 23 and 24 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by adding random color noise to it using the CLoDSA library (lines 26 and 27 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by zooming into it by 25% using the CLoDSA library (lines 29 and 30 in the included Python script).</li>
		<li>Save images from steps 1.3.1-1.3.6, along with the original images (lines 16 and 17 in the included Python script), into a new food image dataset (in total, 7 variants per food image). This is done by executing the command in line 32 of the included Python script.</li>
	</ol>
	</li>
	<li>Performing food image recognition
	<ol>
		<li>Import the food image dataset from step 1.3.7 into the NVIDIA DIGITS environment47, dividing the dataset into training, validation and testing subsets in the NVIDIA DIGITS user interface.</li>
		<li>Copy and paste the definition text of the NutriNet architecture2 into NVIDIA DIGITS as a custom network. 
		NOTE: The NutriNet architecture definition text can be found in the supplemental files (&#39;nutrinet.prototxt&#39;).</li>
		<li>(Optional) Define training hyperparameters in the NVIDIA DIGITS user interface. 
		NOTE: Hyperparameters are parameters that are used to define the training process prior to its start. The hyperparameters used by the authors of this article can be found in a file included in the supplemental files (&#39;nutrinet_hyperparameters.prototxt&#39;). While experimentation is needed for each dataset to find the optimal hyperparameters, the file contains a hyperparameter configuration which can be copied into the NVIDIA DIGITS user interface. Furthermore, NVIDIA DIGITS populates the hyperparameters with default values which can be used as a baseline. This step is therefore optional.</li>
		<li>Run the training of the NutriNet model.</li>
		<li>After training is complete, take the best-performing NutriNet model iteration. This model is then used for testing the performance of this approach. 
		&#8203;NOTE: There are multiple ways to determine the best-performing model iteration. A straightforward way to do this is as follows. NVIDIA DIGITS outputs a graph of accuracy measures for each training epoch. Check which epoch achieved the lowest loss value for the validation subset of the food image dataset - that model iteration can be considered best-performing. An optional step in determining the best-performing model iteration is to observe how the loss value for the training subset changes from epoch to epoch and if it starts to drop continuously while the loss value for the validation subset remains the same or rises continuously, take the epoch prior to this drop in training loss value, as that can signal when the model started overfitting on the training images.</li>
	</ol>
	</li>
</ol>
2. Food image segmentation with FCNs
<ol>
	<li>Obtaining the fake-food image dataset
	<ol>
		<li>Obtain a dataset of fake-food images. Fake-food images are gathered by researchers conducting behavioral studies using food replicas. 
		NOTE: The authors of this article received images of fake food that were collected in a lab environment18.</li>
		<li>Manually annotate every food image on a pixel level - each pixel in the image must contain information about which food class it belongs to. The result of this step is one annotation image for each image from the food image dataset, where each pixel represents one of the food classes. 
		&#8203;NOTE: There are many tools to achieve this - the authors of this article used JavaScript Segment Annotator48.</li>
	</ol>
	</li>
	<li>Augmenting the fake-food image dataset
	<ol>
		<li>Perform the same steps as in section 1.3, but only on images from the training subset of the food image dataset. 
		NOTE: With the exception of step 1.3.5, all data augmentation steps need to be performed on corresponding annotation images as well. If the script from section 1.3 is used, the Input Path (variable &#39;INPUT_PATH&#39;, line 8 in the Python43 script code) and Output Path (variable &#39;OUTPUT_PATH&#39;, line 11 in the Python script code) need to be replaced with paths to the desired folders. In addition, set the Problem (variable &#39;PROBLEM&#39;, line 6 in the Python script code) to &#39;instance_segmentation&#39; and the Annotation Mode (variable &#39;ANNOTATION_MODE&#39;, line 7 in the Python script code) and Output Mode (variable &#39;OUTPUT_MODE&#39;, line 10 in the Python script code) to &#39;coco&#39;.</li>
	</ol>
	</li>
	<li>Performing fake-food image segmentation
	<ol>
		<li>Perform the same steps as in section 1.4, with the exception of step 1.4.2. In place of that step, perform steps 2.3.2 and 2.3.3. 
		NOTE: Hyperparameters are parameters that are used to define the training process prior to its start. The training hyperparameters used by the authors of this article for the optional step 1.4.3 can be found in a file included in the supplemental files (&#39;fcn-8s_hyperparameters.prototxt&#39;). While experimentation is needed for each dataset to find the optimal set of hyperparameters, the file contains a hyperparameter configuration which can be copied into the NVIDIA DIGITS47 user interface. Furthermore, NVIDIA DIGITS populates the hyperparameters with default values which can be used as a baseline.</li>
		<li>Copy and paste the definition text of the FCN-8s architecture15 into the NVIDIA DIGITS environment as a custom network. 
		NOTE: The FCN-8s architecture definition text is publicly available on GitHub49.</li>
		<li>Enter the path to the pre-trained FCN-8s model weights into the NVIDIA DIGITS user interface. 
		&#8203;NOTE: These model weights were pre-trained on the PASCAL VOC dataset17 and can be found on the Internet49.</li>
	</ol>
	</li>
</ol>
3. Food image segmentation with HTC ResNet
<ol>
	<li>Obtaining the food image dataset
	<ol>
		<li>Download the food image dataset from the FRC website19.</li>
	</ol>
	</li>
	<li>Augmenting the food image dataset
	<ol>
		<li>Perform steps 1.3.1-1.3.4. 
		NOTE: The Python43 script containing all the CLoDSA46 commands used by the authors of this article can be found in a file included in the supplemental files (&#39;frc_augmentation.py&#39;). If this script is used, the Input Path (variable &#39;INPUT_PATH&#39;, line 8 in the Python script code) and Output Path (variable &#39;OUTPUT_PATH&#39;, line 11 in the Python script code) need to be replaced with paths to the desired folders.</li>
		<li>Create a new version of each image from the food image dataset by adding Gaussian blur to it using the CLoDSA library (lines 26 and 27 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by sharpening it using the CLoDSA library (lines 29 and 30 in the included Python script).</li>
		<li>Create a new version of each image from the food image dataset by applying gamma correction to it using the CLoDSA library (lines 32 and 33 in the included Python script).</li>
		<li>Save images from steps 3.2.1-3.2.4, along with the original images (lines 16 and 17 in the included Python script), into a new food image dataset (in total, 8 variants per food image). This is done by executing the command in line 35 of the included Python script.</li>
		<li>Save images from steps 3.2.2-3.2.4, along with the original images (lines 16 and 17 in the included Python script), into a new food image dataset (in total, 4 variants per food image). This is done by deleting lines 19 to 24 of the included Python script and executing the command in line 35.</li>
	</ol>
	</li>
	<li>Performing food image segmentation
	<ol>
		<li>Modify the existing HTC20 ResNet-101 architecture16 definition from the MMDetection library50 in sections &#39;model settings&#39; and &#39;dataset settings&#39; of the architecture definition file so that it accepts the food image datasets from steps 3.1.1, 3.2.5 and 3.2.6.</li>
		<li>(Optional) Modify the HTC ResNet-101 architecture definition from step 3.3.1 to define training hyperparameters: batch size in section &#39;dataset settings&#39;, solver type and learning rate in section &#39;optimizer&#39;, learning policy in section &#39;learning policy&#39; and number of training epochs in section &#39;runtime settings&#39; of the architecture definition file. 
		&#8203;NOTE: The modified HTC ResNet-101 architecture definition file can be found in the supplemental files (&#39;htc_resnet-101.py&#39;). Hyperparameters are parameters that are used to define the training process prior to its start. While experimentation is needed for each dataset to find the optimal set of hyperparameters, the file already contains a hyperparameter configuration which can be used without modification. This step is therefore optional.</li>
		<li>Run the training of the HTC ResNet-101 model on the food image dataset from step 3.1.1 using the MMDetection library (e.g., with the command: &#39;python mmdetection/tools/train.py htc_resnet-101.py&#39;).</li>
		<li>After the training from step 3.3.3 is complete, take the best-performing HTC ResNet-101 model iteration and fine-tune it by running the next phase of training on the food image dataset from step 3.2.5. 
		&#8203;NOTE: There are multiple ways to determine the best-performing model iteration. A straightforward way to do this is as follows. The MMDetection library outputs values of accuracy measures for each training epoch in the command line interface. Check which epoch achieved the lowest loss value for the validation subset of the food image dataset - that model iteration can be considered best-performing. An optional step in determining the best-performing model iteration is to observe how the loss value for the training subset changes from epoch to epoch and if it starts to drop continuously while the loss value for the validation subset remains the same or rises continuously, take the epoch prior to this drop in training loss value, as that can signal when the model started overfitting on the training images.</li>
		<li>After the training from step 3.3.4 is complete, take the best-performing HTC ResNet-101 model iteration and fine-tune it by running the next phase of training on the food image dataset from step 3.2.6. 
		&#8203;NOTE: See note for step 3.3.4.</li>
		<li>After the training from step 3.3.5 is complete, take the best-performing HTC ResNet-101 model iteration and fine-tune it by again running the next phase of training on the food image dataset from step 3.2.5. 
		&#8203;NOTE: See note for step 3.3.4.</li>
		<li>After the training from step 3.3.6 is complete, take the best-performing HTC ResNet-101 model iteration. This model is then used for testing the performance of this approach. 
		NOTE: See note for step 3.3.4. Steps 3.3.3-3.3.7 yielded the best results for the purposes defined by the authors of this article. Experimentation is needed for each dataset to find the optimal sequence of training and data augmentation steps.</li>
	</ol>
	</li>
</ol>

The authors have nothing to disclose.

In recent years, deep neural networks have been validated multiple times as a suitable solution for recognizing food images10,11,12,21,23,25,26,29,31,33. Our work presented in this article serves to further prove this1,2. The single-output food image recognition approach is straightforward and can be used for simple applications where images with only one food or beverage item are expected2.
The food image segmentation approach seems particularly suitable for recognizing food images in general, without any restriction on the number of food items1. Because it works by classifying each individual pixel of the image, it is able to not only recognize any number of food items in the image, but also specify where a food item is located, as well as how large it is. The latter can then be used to perform food weight estimation, particularly if used with either a reference object or a fixed-distance camera.
There has been some work done regarding the availability of food image datasets3,22,27,30,36,37,38,39,40,41,42, and we hope more will be done in the future, particularly when it comes to aggregating food image datasets from different regions across the world, which would enable more robust solutions to be developed. Currently, the accuracy of automatic food image recognition solutions has not yet reached human-level accuracy35, and this is likely in large part due to the usage of food image datasets of insufficient size and quality.
In the future, our goal will be to further evaluate the developed procedures on real-world images. In general, datasets in this field often contain images taken in controlled environments or images that were manually optimized for recognition. This is why it is important to gather a large and diverse real-world food image dataset to encompass all the different food and beverage items that individuals might want to recognize. The first step towards this was provided by the Food Recognition Challenge, which included a dataset of real-world food images19, but further work needs to be done to validate this approach on food images from all around the world and in cooperation with dietitians.

Dietary assessment is a crucial step in determining actionable areas of an individual&#39;s diet. However, performing dietary assessment using traditionally manual approaches is associated with considerable costs. These approaches are also prone to errors as they often rely on self-reporting by the individual. Automated dietary assessment addresses these issues by providing a simpler way to quantify and qualify food intake. Such an approach can also alleviate some of the errors present in manual approaches, such as missed meals, inability to accurately assess food volume, etc. Therefore, there are clear benefits to automating dietary assessment by developing solutions that identify different foods and beverages and quantify food intake1. These solutions can also be used to enable an estimation of nutritional values of food and beverage items (henceforth &#39;food items&#39;). Consequently, automated dietary assessment is useful for multiple applications - from strictly medical uses, such as allowing dietitians to more easily and accurately track and analyze their patients&#39; diets, to the usage inside well-being apps targeted at the general population.
Automatically recognizing food items from images is a challenging computer vision problem. This is due to foods being typically deformable objects, and due to the fact that a large amount of the food item&#39;s visual information can be lost during its preparation. Additionally, different foods can appear to be very similar to each other, and the same food can appear to be substantially different on multiple images2. Furthermore, the recognition accuracy depends on many more factors, such as image quality, whether the food item is obstructed by another item, distance from which the image was taken, etc. Recognizing beverage items presents its own set of challenges, the main one being the limited amount of visual information that is available in an image. This information could be the beverage color, beverage container color and structure, and, under optimal image conditions, the beverage density2.
To successfully recognize food items from images, it is necessary to learn features of each food and beverage class. This was traditionally done using manually-defined feature extractors3,4,5,6 that perform recognition based on specific item features like color, texture, size, etc., or a combination of these features. Examples of these feature extractors include multiple kernel learning4, pairwise local features5 and the bag-of-features model6. Due to the complexity of food images, these approaches mostly achieved a low classification accuracy - between 10% and 40%3,4,5. The reason for this is that the manual approach is not robust enough to be sufficiently accurate. Because a food item can vary significantly in appearance, it is not feasible to encompass all these variances manually. Higher classification accuracy can be achieved with manually-defined feature extractors when either the number of food classes is reduced5, or different image features are combined6, thus indicating that there is a need for more complex solutions to this problem.
This is why deep learning proved to be so effective for the food image recognition problem. Deep learning, or deep neural networks, was inspired by biological brains, and allows computational models composed of multiple processing layers to automatically learn features through training on a set of input images7,8. Because of this, deep learning has substantially improved the state of the art in a variety of research fields7, with computer vision, and subsequently food image recognition, being one of them2.
In particular, deep convolutional neural networks (DCNNs) are most popular for food image recognition - these networks are inspired by the visual system of animals, where individual neurons try to gain an understanding of the visual input by reacting to overlapping regions in the visual field9. A convolutional neural network takes the input image and performs a series of operations in each of the network layers, the most common of which are convolutional, fully-connected and pooling layers. Convolutional layers contain learnable filters that respond to certain features in the input data, whereas fully-connected layers compose output data from other layers to gain higher-level knowledge from it. The goal of pooling layers is to down-sample the input data2. There are two approaches to using deep learning models that proved popular: taking an existing deep neural network definition10,11, referred to as a deep learning architecture in this article, or defining a new deep learning architecture12,13, and training either one of these on a food image dataset. There are strengths and weaknesses to both approaches - when using an existing deep learning architecture, an architecture that performed well for other problems can be chosen and fine-tuned for the desired problem, thus saving time and ensuring that a validated architecture has been chosen. Defining a new deep learning architecture, on the other hand, is more time-intensive, but allows the development of architectures that are specifically made to take into account the specifics of a problem and thus theoretically perform better for that problem.
In this article, we present both approaches. For the food image recognition problem, we developed a novel DCNN architecture called NutriNet2, which is a modification of the well-known AlexNet architecture14. There are two main differences compared to AlexNet: NutriNet accepts 512x512-pixel images as input (as opposed to 256x256-pixel images for AlexNet), and NutriNet has an additional convolutional layer at the beginning of the neural network. These two changes were introduced in order to extract as much information from the recognition dataset images as possible. Having higher-resolution images meant that there is more information present on images and having more convolutional layers meant that additional knowledge could be extracted from the images. Compared to AlexNet&#39;s around 60 million parameters, NutriNet contains less parameters: approximately 33 million. This is because of the difference in dimensionality at the first fully-connected layer caused by the additional convolutional layer2. Figure 1 contains a diagram of the NutriNet architecture. The food images that were used to train the NutriNet model were gathered from the Internet - the procedure is described in the protocol text.
For the food image segmentation problem, we used two different existing architectures: fully convolutional networks (FCNs)15 and deep residual networks (ResNet)16, both of which represented the state of the art for image segmentation when we used them to develop their respective food image segmentation solutions. There are multiple FCN variants that were introduced by Long et al.: FCN-32s, FCN-16s and FCN-8s15. FCN-32s outputs a pixel map based on the predictions by the FCN&#39;s final layer, whereas the FCN-16s variant combines these predictions with those by an earlier layer. FCN-8s considers yet another layer&#39;s predictions and is therefore able to make predictions at the finest grain, which is why it is suitable for food image recognition. The FCN-8s that we used was pre-trained on the PASCAL Visual Object Classes (PASCAL VOC) dataset17 and trained and tested on images of food replicas (henceforth &#39;fake food&#39;)18 due to their visual resemblance to real food and due to a lack of annotated images of real food on a pixel level. Fake food is used in different behavioral studies and images are taken for all dishes from all study participants. Because the food contents of these images are known, it makes the image dataset useful for deep learning model training. Dataset processing steps are described in the protocol text.
The ResNet-based solution was developed in the scope of the Food Recognition Challenge (FRC)19. It uses the Hybrid Task Cascade (HTC)20 method with a ResNet-10116 backbone. This is a state-of-the-art approach for the image segmentation problem that can use different feature extractors, or backbones. We considered other backbone networks as well, particularly other ResNet variants such as ResNet-5016, but ResNet-101 was the most suitable due to its depth and ability to represent input images in a complex enough manner. The dataset used for training the HTC ResNet-101 model was the FRC dataset with added augmented images. These augmentations are presented in the protocol text.
This article is intended as a resource for machine learning experts looking for information about which deep learning architectures and data augmentation steps perform well for the problems of food image recognition and segmentation, as well as for nutrition researchers looking to use our approach to automate food image recognition for use in dietary assessment. In the paragraphs below, deep learning solutions and datasets from the food image recognition field are presented. In the protocol text, we detail how each of the three approaches was used to train deep neural network models that can be used for automated dietary assessment. Additionally, each protocol section contains a description of how the food image datasets used for training and testing were acquired and processed.
DCNNs generally achieved substantially better results than other methods for food image recognition and segmentation, which is why the vast majority of recent research in the field is based on these networks. Kawano et al. used DCNNs to complement manual approaches21 and achieved a classification accuracy of 72.26% on the UEC-FOOD100 dataset22. Christodoulidis et al. used them exclusively to achieve a higher accuracy of 84.90% on a self-acquired dataset23. Tanno et al. developed DeepFoodCam - a smartphone app for food image recognition that uses DCNNs24. Liu et al. presented a system that performs an Internet of Things-based dietary assessment using DCNNs25. Martinel et al. introduced a DCNN-based approach that exploits the specifics of food images26 and reported an accuracy of 90.27% on the Food-101 dataset27. Zhou et al. authored a review of deep learning solutions in the food domain28.
Recently, Zhao et al. proposed a network specifically for food image recognition in mobile applications29. This approach uses a smaller &#39;student&#39; network that learns from a larger &#39;teacher&#39; network. With it, they managed to achieve an accuracy of 84% on the UEC-FOOD25630 and an accuracy of 91.2% on the Food-101 dataset27. Hafiz et al. used DCNNs to develop a beverage-only image recognition solution and reported a very high accuracy of 98.51%31. Shimoda et al. described a novel method for detecting plate regions in food images without the usage of pixel-wise annotation32. Ciocca et al. introduced a new dataset containing food items from 20 different food classes in 11 different states (solid, sliced, creamy paste, etc.) and presented their approach for training recognition models that are able to recognize the food state, in addition to the food class33. Knez et al. evaluated food image recognition solutions for mobile devices34. Finally, Furtado et al. conducted a study on how the human visual system compares to the performance of DCNNs and found that human recognition still outperforms DCNNs with an accuracy of 80% versus 74.5%35. The authors noted that with a small number of food classes, the DCNNs perform well, but on a dataset with hundreds of classes, human recognition accuracy is higher35, highlighting the complexity of the problem.
Despite its state-of-the-art results, deep learning has a major drawback - it requires a large input dataset to train the model on. In the case of food image recognition, a large food image dataset is required, and this dataset needs to encompass as many different real-world scenarios as possible. In practice this means that for each individual food or beverage item, a large collection of images is required, and as many different items as possible need to be present in the dataset. If there are not enough images for a specific item in the dataset, that item is unlikely to be recognized successfully. On the other hand, if only a small number of items is covered by the dataset, the solution will be limited in scope, and only able to recognize a handful of different foods and beverages.
Multiple datasets were made available in the past. The Pittsburgh Fast-Food Image Dataset (PFID)3 was introduced to encourage more research in the field of food image recognition. The University of Electro-Communications Food 100 (UEC-FOOD100)22 and University of Electro-Communications Food 256 (UEC-FOOD256)30 datasets contain Japanese dishes, expanded with some international dishes in the case of the UEC-FOOD256 dataset. The Food-101 dataset contains popular dishes acquired from a website27. The Food-5036 and Video Retrieval Group Food 172 (VireoFood-172)37 datasets are Chinese-based collections of food images. The University of Milano-Bicocca 2016 (UNIMIB2016) dataset is composed of images of food trays from an Italian canteen38. Recipe1M is a large-scale dataset of cooking recipes and food images39. The Food-475 dataset40 collects four previously published food image datasets27,30,36,37 into one. The Beijing Technology and Business University Food 60 (BTBUFood-60) is a dataset of images meant for food detection41. Recently, the ISIA Food-500 dataset42 of miscellaneous food images was made available. In comparison to other publicly available food image datasets, it contains a large number of images, divided into 500 food classes, and is meant to advance the development of multimedia food recognition solutions42.

<table><tbody><tr><td>&lt;strong&gt;HARDWARE&lt;/strong&gt;</td><td></td><td></td><td></td></tr><tr><td>NVIDIA GPU</td><td>NVIDIA</td><td>N/A</td><td>An NVIDIA GPU is needed as some of the software frameworks below will not work otherwise. https://www.nvidia.com</td></tr><tr><td>&lt;strong&gt;SOFTWARE&lt;/strong&gt;</td><td></td><td></td><td></td></tr><tr><td>Caffe</td><td>Berkeley AI Research</td><td>N/A</td><td>Caffe is a deep learning framework. https://caffe.berkeleyvision.org</td></tr><tr><td>CLoDSA</td><td>J&amp;oacute;nathan Heras</td><td>N/A</td><td>CLoDSA is a Python image augmentation library. https://github.com/joheras/CLoDSA</td></tr><tr><td>Google API Client</td><td>Google</td><td>N/A</td><td>Google API Client is a Python client library for Google&#039;s discovery-based APIs. https://github.com/googleapis/google-api-python-client</td></tr><tr><td>JavaScript Segment Annotator</td><td>Kota Yamaguchi</td><td>N/A</td><td>JavaScript Segment Annotator is a JavaScript image annotation tool. https://github.com/kyamagu/js-segment-annotator</td></tr><tr><td>MMDetection</td><td>Multimedia Laboratory, CUHK</td><td>N/A</td><td>MMDetection is an object detection toolbox based on PyTorch. https://github.com/open-mmlab/mmdetection</td></tr><tr><td>NVIDIA DIGITS</td><td>NVIDIA</td><td>N/A</td><td>NVIDIA DIGITS is a wrapper for Caffe that provides a graphical web interface. https://developer.nvidia.com/digits</td></tr><tr><td>OpenCV</td><td>Intel</td><td>N/A</td><td>OpenCV is a library for computer vision. https://opencv.org</td></tr><tr><td>Python</td><td>Python Software Foundation</td><td>N/A</td><td>Python is a programming language. https://www.python.org</td></tr><tr><td>PyTorch</td><td>Facebook AI Research</td><td>N/A</td><td>PyTorch is a machine learning framework. https://pytorch.org</td></tr><tr><td>Ubuntu OS</td><td>Canonical</td><td>N/A</td><td>Ubuntu 14.04 is the OS used by the authors and offers compatibility with all of the software frameworks and tools above. https://ubuntu.com</td></tr></tbody></table>

deep neural networks for image-based dietary assessment

Due to the issues and costs associated with manual dietary assessment approaches, automated solutions are required to ease and speed up the work and increase its quality. Today, automated solutions are able to record a person's dietary intake in a much simpler way, such as by taking an image with a smartphone camera. In this article, we will focus on such image-based approaches to dietary assessment. For the food image recognition problem, deep neural networks have achieved the state of the art in recent years, and we present our work in this field. In particular, we first describe the method for food and beverage image recognition using a deep neural network architecture, called NutriNet. This method, like most research done in the early days of deep learning-based food image recognition, is limited to one output per image, and therefore unsuitable for images with multiple food or beverage items. That is why approaches that perform food image segmentation are considerably more robust, as they are able to identify any number of food or beverage items in the image. We therefore also present two methods for food image segmentation - one is based on fully convolutional networks (FCNs), and the other on deep residual networks (ResNet).

NutriNet was tested against three popular deep learning architectures of the time: AlexNet14, GoogLeNet51 and ResNet16. Multiple training parameters were also tested for all architectures to define the optimal values2. Among these is the choice of solver type, which determines how the loss function is minimized. This function is the primary quality measure for training neural networks as it is better suited for optimization during training than classification accuracy. We tested three solvers: Stochastic Gradient Descent (SGD)52, Nesterov&#39;s Accelerated Gradient (NAG)53 and the Adaptive Gradient algorithm (AdaGrad)54. The second parameter is batch size, which defines the number of images that are processed at the same time. The depth of the deep learning architecture determined the value of this parameter, as deeper architectures require more space in the GPU memory - the consequence of this approach was that the memory was completely filled with images for all architectures, regardless of depth. The third parameter is learning rate, which defines the speed with which the neural network parameters are being changed during training. This parameter was set in unison with the batch size, as the number of concurrently processed images dictates the convergence rate. AlexNet models were trained using a batch size of 256 images and a base learning rate of 0.02; NutriNet used a batch size of 128 images and a rate of 0.01; GoogLeNet 64 images and a rate of 0.005; and ResNet 16 images and a rate of 0.00125. Three other parameters were fixed for all architectures: learning rate policy (step-down), step size (30%) and gamma (0.1). These parameters jointly describe how the learning rate is changing in every epoch. The idea behind this approach is that the learning rate is being gradually lowered to fine-tune the model the closer it gets to the optimal loss value. Finally, the number of training epochs was also fixed to 150 for all deep learning architectures2.
The best result among all the parameters tested that NutriNet achieved was a classification accuracy of 86.72% on the recognition dataset, which was around 2% higher than the best result for AlexNet and slightly higher than GoogLeNet&#39;s best result. The best-performing architecture overall was ResNet (by around 1%), however the training time for ResNet is substantially higher compared to NutriNet (by a factor of approximately five), which is important if models are continuously re-trained to improve accuracy and the number of recognizable food items. NutriNet, AlexNet and GoogLeNet achieved their best results using the AdaGrad solver, whereas ResNet&#39;s best model used the NAG solver. NutriNet was also tested on the publicly available UNIMIB2016 food image dataset38. This dataset contains 3,616 images of 73 different food items. NutriNet achieved a recognition accuracy of 86.39% on this dataset, slightly outperforming the baseline recognition result of the authors of the dataset, which was 85.80%. Additionally, NutriNet was tested on a small dataset of 200 real-world images of 115 different food and beverage items, where NutriNet achieved a top-5 accuracy of 55%.
To train the FCN-8s fake-food image segmentation model, we used Adam55 as the solver type, as we found that it performed optimally for this task. The base learning rate was set very low - to 0.0001. The reason for the low number is the fact that only one image could be processed at a time, which is a consequence of the pixel-level classification process. The GPU memory requirements for this approach are significantly greater than image-level classification. The learning rate thus had to be set low so that the parameters were not being changed too fast and converge to less optimal values. The number of training epochs was set to 100, while the learning rate policy, step size and gamma were set to step-down, 34% and 0.1, respectively, as these parameters produced the most accurate models.
Accuracy measurements of the FCN-8s model were performed using the pixel accuracy measure15, which is analogous to the classification accuracy of traditional deep learning networks, the main difference being that the accuracy is computed on the pixel level instead of on the image level:
<img alt="Equation 1" src="/files/ftp_upload/61906/61906eq01.jpg" />
where PA is the pixel accuracy measure, nij&#160;is the number of pixels from class i predicted to belong to class j and ti =&#160;&#931;j nij is the total number of pixels from class i&#160;in the ground-truth labels1. In other words, the pixel accuracy measure is computed by dividing correctly predicted pixels by the total number of pixels. The final accuracy of the trained FCN-8s model was 92.18%. Figure 2 shows three example images from the fake-food image dataset (one from each of the training, validation and testing subsets), along with the corresponding ground-truth and model prediction labels.
The parameters to train the HTC20 ResNet-101 model for food image segmentation were set as follows: the solver type used was SGD because it outperformed other solver types. The base learning rate was set to 0.00125 and the batch size to 2 images. The number of training epochs was set to 40 per training phase, and multiple training phases were performed - first on the original FRC dataset without augmented images, then on the 8x-augmented and 4x-augmented FRC dataset multiple times in an alternating fashion, each time taking the best-performing model and fine-tuning it in the next training phase. More details on the training phases can be found in section 3.3&#160;of the protocol text. Finally, the step-down learning policy was used, with fixed epochs for when the learning rate decreased (epochs 28 and 35 for the first training phase). An important thing to note is that while this sequence of training phases produced the best results in our testing in the scope of the FRC, using another dataset might require a different sequence to produce optimal results.
This ResNet-based solution for food image segmentation was evaluated using the following precision measure19:
<img alt="Equation 2" src="/files/ftp_upload/61906/61906eq02.jpg" />
where P is precision, TP is the number of true positive predictions by the food image segmentation model, FP is the number of false positive predictions and IoU is Intersection over Union, which is computed with this equation:
<img alt="Equation 3" src="/files/ftp_upload/61906/61906eq03.jpg" />
where Area of Overlap&#160;represents the number of predictions by the model that overlap with the ground truth, and Area of Union represents the total number of predictions by the model together with the ground truth, both on a pixel level and for each individual food class. Recall is used as a secondary measure and is calculated in a similar way, using the following formula19:
<img alt="Equation 4" src="/files/ftp_upload/61906/61906eq04.jpg" />
where R is recall and FN is the number of false negative predictions by the food image segmentation model. The precision and recall measures are then averaged across all classes in the ground truth. Using these measures, our model achieved an average precision of 59.2% and an average recall of 82.1%, which ranked second in the second round of the Food Recognition Challenge19. This result was 4.2% behind the first place and 5.3% ahead of the third place in terms of the average precision measure. Table 1 contains the results for the top-4 participants in the competition.
<img alt="Figure 1" class="xfigimg" src="/files/ftp_upload/61906/61906fig01.jpg" /> 
Figure 1: Diagram of the NutriNet deep neural network architecture. This figure has been published in Mezgec et al.2. <a href="https://www.jove.com/files/ftp_upload/61906/61906fig01large.jpg" target="_blank">Please click here to view a larger version of this figure.</a>
<img alt="Figure 2" class="xfigimg" src="/files/ftp_upload/61906/61906fig02.jpg" /> 
Figure 2: Images from the fake-food image dataset. Original images (left), manually-labelled ground-truth labels (middle) and predictions from the FCN-8s model (right). This figure has been published in Mezgec et al.1. <a href="https://www.jove.com/files/ftp_upload/61906/61906fig02large.jpg" target="_blank">Please click here to view a larger version of this figure.</a>
<table border="1" fo:keep-together.within-page="1" fo:keep-with-next.within-page="always">
	<tbody>
		<tr>
			<td>Team Name</td>
			<td>Placement</td>
			<td>Average Precision</td>
			<td>Average Recall</td>
		</tr>
		<tr>
			<td>rssfete</td>
			<td>1</td>
			<td>63.4%</td>
			<td>88.6%</td>
		</tr>
		<tr>
			<td>simon_mezgec</td>
			<td>2</td>
			<td>59.2%</td>
			<td>82.1%</td>
		</tr>
		<tr>
			<td>arimboux</td>
			<td>3</td>
			<td>53.9%</td>
			<td>73.5%</td>
		</tr>
		<tr>
			<td>latentvec</td>
			<td>4</td>
			<td>48.7%</td>
			<td>71.1%</td>
		</tr>
	</tbody>
</table>
Table 1: Top-4 results from the second round of the Food Recognition Challenge. Average precision is taken as the primary performance measure and average recall as a secondary measure. Results are taken from the official competition leaderboard19.
Supplemental Files. <a href="https://www.jove.com/files/ftp_upload/61906/simon_mezgec_supplemental_files_revision1_.zip" target="_blank">Please click here to download this File.</a>

Watch this Scientific Journal Video about Deep Neural Networks for Image-Based Dietary Assessment at JoVE.com

Deep Neural Networks for Image-Based Dietary Assessment

The goal of the work presented in this article is to develop technology for automated recognition of food and beverage items from images taken by mobile devices. The technology comprises of two different approaches - the first one performs food image recognition while the second one performs food image segmentation.

Due to the issues and costs associated with manual dietary assessment approaches, automated solutions are required to ease and speed up the work and ...

deep-neural-networks-for-image-based-dietary-assessment

Research

JoVE Journal

Engineering

8.9K Views.  Jožef Stefan International Postgraduate School. The goal of the work presented in this article is to develop technology for automated recognition of food and beverage items from images taken by mobile devices. The technology comprises of two different approaches - the first one performs food image recognition while the second one performs food image segmentation.

Video: Deep Neural Networks for Image-Based Dietary Assessment

Design and Evaluation of Smart Glasses for Food Intake and Physical Activity Classification

This study presents a series of protocols of designing and manufacturing a glasses-type wearable device that detects the patterns of temporalis muscle activities during food intake and other physical activities. We fabricated a 3D-printed frame of the glasses and a load cell-integrated printed circuit board (PCB) module inserted in both hinges of the frame. The module was used to acquire the force signals, and transmit them wirelessly. These procedures provide the system with higher mobility, which can be evaluated in practical wearing conditions such as walking and waggling. A performance of the classification is also evaluated by distinguishing the patterns of food intake from those physical activities. A series of algorithms were used to preprocess the signals, generate feature vectors, and recognize the patterns of several featured activities (chewing and winking), and other physical activities (sedentary rest, talking, and walking). The results showed that the average F1 score of the classification among the featured activities was 91.4%. We believe this approach can be potentially useful for automatic and objective monitoring of ingestive behaviors with higher accuracy as practical means to treat ingestive problems.

This study presents a protocol of designing and manufacturing a glasses-type wearable device that detects the patterns of food intake and other featured physical activities using load cells inserted in both hinges of the glasses.

This study presents a series of protocols of designing and manufacturing a glasses-type wearable device that detects the patterns of temporalis muscle ...

Iterative Development of an Innovative Smartphone-Based Dietary Assessment Tool: Traqq

To collect dietary intake data in a fast and reliable manner, a flexible and innovative smartphone application (app) called Traqq was developed (iOS/Android). This app can be used as a food record and&#160;24-h recall (or shorter recall periods). Different sampling schemes can be created on either prespecified or random days/times within a predetermined period for both methods, with&#160;push notifications to urge the participants to register their food intake. In case of non-response, notifications are automatically rescheduled to ensure complete data collection. For use as a food record, respondents can access the app and log their food intake throughout the day. Food records close automatically at the end of the day; recalls close after submission of the consumed items. The recall as well as the food record module provide access to an extensive food list based on the Dutch food composition database (FCDB), which can be accustomed to fit different research purposes. When selecting a food item, respondents are simultaneously prompted to insert portion size, i.e., in household measures (e.g., cups, spoons, glasses), standard portion sizes (e.g., small, medium, large), or weight in grams, and eating occasion/time of consumption. Portion size options can be adjusted, e.g., only entry in grams in case of a weighed food record or time of consumption instead of eating occasion). The app also includes a My Dishes function, which allows the respondent to create their own recipes or product combinations (e.g., a daily breakfast) and only report the total quantity consumed. Subsequently, the app accounts for yield and retention factors. The data are stored on a secure server. If desired, additional questions, i.e., in general or those related to specific food items or eating occasions can be incorporated. This paper describes the development of the system (app and backend), including expert evaluations and usability testing.

This article describes the protocol for the development of an innovative smartphone-based dietary assessment application Traqq, including expert evaluations and usability testing.

To collect dietary intake data in a fast and reliable manner, a flexible and innovative smartphone application (app) called Traqq was developed ...

Behavior

Concept Development and Use of an Automated Food Intake and Eating Behavior Assessment Method

The vast majority of dietary and eating behavior assessment methods are based on self-reports. They are burdensome and also prone to measurement errors. Recent technological innovations allow for the development of more accurate and precise dietary and eating behavior assessment tools that require less effort for both the user and the researcher. Therefore, a new sensor-based device to assess food intake and eating behavior was developed. The device is a regular dining tray equipped with a video camera and three separate built-in weighing stations. The weighing stations measure the weight of the bowl, plate, and drinking cup continuously over the course of a meal. The video camera positioned to the face records eating behavior characteristics (chews, bites), which are analyzed using artificial intelligence (AI)-based automatic facial expression software. The tray weight and the video data are transported at real-time to a personal computer (PC) using a wireless receiver. The outcomes of interest, such as the amount eaten, eating rate and bite size, can be calculated by subtracting the data of these measures at the timepoints of interest. The information obtained by the current version of the tray can be used for research purposes, an upgraded version of the device would also facilitate the provision of more personalized advice on dietary intake and eating behavior. Contrary to the conventional dietary assessment methods, this dietary assessment device measures food intake directly within a meal and is not dependent on memory or the portion size estimation. Ultimately, this device is therefore suited for daily main meal food intake and eating behavior measures. In the future, this technology based dietary assessment method can be linked to health applications or smart watches to obtain a complete overview of exercise, energy intake, and eating behavior.

This protocol shows and explains a new technology-based dietary assessment method. The method consists of a dining tray with multiple built-in weighing scales and a video camera. The device is unique in the sense that it incorporates automated measures of food and drink intake and eating behavior over the course of a meal.

The vast majority of dietary and eating behavior assessment methods are based on self-reports. They are burdensome and also prone to measurement errors. ...

Application of Deep Learning-Based Medical Image Segmentation via Orbital Computed Tomography

Recently, deep learning-based segmentation models have been widely applied in the ophthalmic field. This study presents the complete process of constructing an orbital computed tomography (CT) segmentation model based on U-Net. For supervised learning, a labor-intensive and time-consuming process is required. The method of labeling with super-resolution to efficiently mask the ground truth on orbital CT images is introduced. Also, the volume of interest is cropped as part of the pre-processing of the dataset. Then, after extracting the volumes of interest of the orbital structures, the model for segmenting the key structures of the orbital CT is constructed using U-Net, with sequential 2D slices that are used as inputs and two bi-directional convolutional long-term short memories for conserving the inter-slice correlations. This study primarily focuses on the segmentation of the eyeball, optic nerve, and extraocular muscles. The evaluation of the segmentation reveals the potential application of segmentation to orbital CT images using deep learning methods.

Application of Deep Learning-Based Medical Image Segmentation via Orbital Computed Tomography

An object segmentation protocol for orbital computed tomography (CT) images is introduced. The methods of labeling the ground truth of orbital structures by using super-resolution, extracting the volume of interest from CT images, and modeling multi-label segmentation using 2D sequential U-Net for orbital CT images are explained for supervised learning.

Recently, deep learning-based segmentation models have been widely applied in the ophthalmic field. This study presents the complete process of ...

Alcohol Diminishes Wasabi Reactions: Behavioral and AI Analysis in Mice

Integration of Animal Behavioral Assessment and Convolutional Neural Network to Study Wasabi-Alcohol Taste-Smell Interaction

The commercial wasabi pastes commonly used for food preparation contain a homologous compound of chemosensory isothiocyanates (ITCs) that elicit an irritating sensation upon consumption. The impact of sniffing dietary alcoholic beverages on the sensation of wasabi spiciness has never been studied. While most sensory evaluation studies focus on individual food and beverages separately, there is a lack of research on the olfactory study of sniffing liquor while consuming wasabi. Here, a methodology is developed that combines the use of an animal behavioral study and a convolutional neural network to analyze the facial expressions of mice when they simultaneously sniff liquor and consume wasabi. The results demonstrate that the trained and validated deep learning model recognizes 29% of the images depicting co-treatment of wasabi and alcohol belonging to the class of the wasabi-negative liquor-positive group without the need for prior training materials filtering. Statistical analysis of mouse grimace scale scores obtained from the selected video frame images reveals a significant difference (P &lt; 0.01) between the presence and absence of liquor. This finding suggests that dietary alcoholic beverages might have a diminishing effect on the wasabi-elicited reactions in mice. This combinatory methodology holds potential for individual ITC compound screening and sensory analyses of spirit components in the future. However, further study is required to investigate the underlying mechanism of alcohol-induced suppression of wasabi pungency.

This article describes a set of methods for measuring the suppressive ability of sniffing alcoholic beverages on the wasabi-elicited stinging sensation.

The commercial wasabi pastes commonly used for food preparation contain a homologous compound of chemosensory isothiocyanates (ITCs) that elicit an ...

Enhanced Precision in Salient Object Detection Using Deep Neural Networks

End-To-End Deep Neural Network for Salient Object Detection in Complex Environments

Salient object detection has emerged as a burgeoning area of interest within the realm of computer vision. However, prevailing algorithms exhibit diminished precision when tasked with detecting salient objects within intricate and multifaceted environments. In light of this pressing concern, this article presents an end-to-end deep neural network that aims to detect salient objects within complex environments. The study introduces an end-to-end deep neural network that aims to detect salient objects within complex environments. Comprising two interrelated components, namely a pixel-level multiscale full convolutional network and a deep encoder-decoder network, the proposed network integrates contextual semantics to produce visual contrast across multiscale feature maps while employing deep and shallow image features to improve the accuracy of object boundary identification. The integration of a fully connected conditional random field (CRF) model further enhances the spatial coherence and contour delineation of salient maps. The proposed algorithm is extensively evaluated against 10 contemporary algorithms on the SOD and ECSSD databases. The evaluation results demonstrate that the proposed algorithm outperforms other approaches in terms of precision and accuracy, thereby establishing its efficacy in salient object detection within complex environments.

Author Spotlight: Enhancement of Salient Object Detection for Smart Grid Applications

The present protocol describes a novel end-to-end salient object detection algorithm. It leverages deep neural networks to enhance the precision of salient object detection within intricate environmental contexts.

Salient object detection has emerged as a burgeoning area of interest within the realm of computer vision. However, prevailing algorithms exhibit ...

Swin-PSAxialNet: A Streamlined Approach for Segmenting Multiple Organs

Swin-PSAxialNet: An Efficient Multi-Organ Segmentation Technique

Abdominal multi-organ segmentation is one of the most important topics in the field of medical image analysis, and it plays an important role in supporting clinical workflows such as disease diagnosis and treatment planning. In this study, an efficient multi-organ segmentation method called Swin-PSAxialNet based on the nnU-Net architecture is proposed. It was designed specifically for the precise segmentation of 11 abdominal organs in CT images. &#65279;The proposed network has made the following improvements compared to nnU-Net. Firstly, Space-to-depth (SPD) modules and parameter-shared axial attention (PSAA) feature extraction blocks were introduced, enhancing the capability of 3D image feature extraction. Secondly, a multi-scale image fusion approach was employed to capture detailed information and spatial features, improving the capability of extracting subtle features and edge features. Lastly, a parameter-sharing method was introduced to reduce the model's computational cost and training speed. &#65279;The proposed network achieves an average Dice coefficient of 0.93342 for the segmentation task involving 11 organs. Experimental results indicate the notable superiority of Swin-PSAxialNet over previous mainstream segmentation methods. The method shows excellent accuracy and low computational costs in segmenting major abdominal organs.

The present protocol describes an efficient multi-organ segmentation method called Swin-PSAxialNet, which has achieved excellent accuracy compared to previous segmentation methods. The key steps of this procedure include dataset collection, environment configuration, data preprocessing, model training and comparison, and ablation experiments.

Abdominal multi-organ segmentation is one of the most important topics in the field of medical image analysis, and it plays an important role in ...

Objectification of Tongue Diagnosis in Traditional Medicine, Data Analysis, and Study Application

Tongue diagnosis is an essential technique of traditional Chinese medicine (TCM) diagnosis, and the need for objectifying tongue images through image processing technology is growing. The present study provides an overview of the progress made in tongue objectification over the past decade and compares segmentation models. Various deep learning models are constructed to verify and compare algorithms using real tongue image sets. The strengths and weaknesses of each model are analyzed. The findings indicate that the U-Net algorithm outperforms other models regarding precision accuracy (PA), recall, and mean intersection over union (MIoU) metrics. However, despite the significant progress in tongue image acquisition and processing, a uniform standard for objectifying tongue diagnosis has yet to be established. To facilitate the widespread application of tongue images captured using mobile devices in tongue diagnosis objectification, further research could address the challenges posed by tongue images captured in complex environments.

The present study employed U-Net and other deep learning algorithms to segment a tongue image and compared the segmentation results to investigate the objectification of tongue diagnosis.

Tongue diagnosis is an essential technique of traditional Chinese medicine (TCM) diagnosis, and the need for objectifying tongue images through image ...

Medicine

A Swin Transformer-Based Model for Thyroid Nodule Detection in Ultrasound Images

In recent years, the incidence of thyroid cancer has been increasing. Thyroid nodule detection is critical for both the detection and treatment of thyroid cancer. Convolutional neural networks (CNNs) have achieved good results in thyroid ultrasound image analysis tasks. However, due to the limited valid receptive field of convolutional layers, CNNs fail to capture long-range contextual dependencies, which are important for identifying thyroid nodules in ultrasound images. Transformer networks are effective in capturing long-range contextual information. Inspired by this, we propose a novel thyroid nodule detection method that combines the Swin Transformer backbone and Faster R-CNN. Specifically, an ultrasound image is first projected into a 1D sequence of embeddings, which are then fed into a hierarchical Swin Transformer. 
The Swin Transformer backbone extracts features at five different scales by utilizing shifted windows for the computation of self-attention. Subsequently, a feature pyramid network (FPN) is used to fuse the features from different scales. Finally, a detection head is used to predict bounding boxes and the corresponding confidence scores. Data collected from 2,680 patients were used to conduct the experiments, and the results showed that this method achieved the best mAP score of 44.8%, outperforming CNN-based baselines. In addition, we gained better sensitivity (90.5%) than the competitors. This indicates that context modeling in this model is effective for thyroid nodule detection.

Here, a new model for thyroid nodule detection in ultrasound images is proposed, which uses Swin Transformer as the backbone to perform long-range context modeling. Experiments prove that it performs well in terms of sensitivity and accuracy.

In recent years, the incidence of thyroid cancer has been increasing. Thyroid nodule detection is critical for both the detection and treatment of thyroid ...

Artificial Intelligence-Based Integration of Dental Images

Reliability of Artificial Intelligence-Based Cone Beam Computed Tomography Integration with Digital Dental Images

This study aimed to introduce cone-beam computed tomography (CBCT) digitization and integration of digital dental images (DDI) based on artificial intelligence (AI)-based registration (ABR) and to evaluate the reliability and reproducibility using this method compared with those of surface-based registration (SBR). This retrospective study consisted of CBCT images and DDI of 17 patients who had undergone computer-aided bimaxillary orthognathic surgery. The digitization of CBCT images and their integration with DDI were repeated using an AI-based program. CBCT images and DDI were integrated using a point-to-point registration. In contrast, with the SBR method, the three landmarks were identified manually on the CBCT and DDI, which were integrated with the iterative closest points method.
After two repeated integrations of each method, the three-dimensional coordinate values of the first maxillary molars and central incisors and their differences were obtained. Intraclass coefficient (ICC) testing was performed to evaluate intra-observer reliability with each method&#39;s coordinates and compare their reliability between the ABR and SBR. The intra-observer reliability showed significant and almost perfect ICC in each method. There was no significance in the mean difference between the first and second registrations in each ABR and SBR and between both methods; however, their ranges were narrower with ABR than with the SBR method. This study shows that AI-based digitization and integration are reliable and reproducible.

Author Spotlight: Advancing CBCT and Digital Dental Image Integration with AI-Assisted Digitization

A process of registering cone-beam computed tomography scans and digital dental images has been presented using artificial intelligence (AI) -assisted identification of landmarks and merging. A comparison with surface-based registration shows that AI-based digitization and integration are reliable and reproducible.

Deep Neural Networks for Image-Based Dietary Assessment

Summary

Explore More Videos

Chapters in this video

Design and Evaluation of Smart Glasses for Food Intake and Physical Activity Classification

Iterative Development of an Innovative Smartphone-Based Dietary Assessment Tool: Traqq

Concept Development and Use of an Automated Food Intake and Eating Behavior Assessment Method

Application of Deep Learning-Based Medical Image Segmentation via Orbital Computed Tomography

Integration of Animal Behavioral Assessment and Convolutional Neural Network to Study Wasabi-Alcohol Taste-Smell Interaction

End-To-End Deep Neural Network for Salient Object Detection in Complex Environments

Swin-PSAxialNet: An Efficient Multi-Organ Segmentation Technique

Objectification of Tongue Diagnosis in Traditional Medicine, Data Analysis, and Study Application

A Swin Transformer-Based Model for Thyroid Nodule Detection in Ultrasound Images

Reliability of Artificial Intelligence-Based Cone Beam Computed Tomography Integration with Digital Dental Images