This content is Open Access.
13:19 min
March 13th, 2021
0:00
Introduction
0:44
Food Image Recognition with NutriNet
5:08
Food Image Segmentation with FCNs
7:22
Food Image Segmentation with HTC ResNet
11:24
Representative Results
12:22
Conclusion
Transcript
Due to the issues and costs associated with manual dietary assessment approaches, automated solutions are required to ease and speed up the work and increase its quality. Today, automated solutions are able to record a person's dietary intake in a much simpler way, such as by taking an image with a smartphone camera. In this article, we will focus on such image-based approaches to dietary assessment using deep neural networks, which represent the state of the art in the field.
In particular, we will present three solutions, one for food image recognition, one for image segmentation of food replicas, or fake food, and one for image segmentation of real food. Gather a list of different foods and beverages that will be the outputs of the food image recognition model. Save the food and beverage list in a text file, such as TXT or CSV.
Note that the text file used by the authors of this article can be found in the supplemental files under food items dot TXT and includes a list of 520 Slovenian food items. Write or download a Python script that uses the Google Custom Search API to download images of each food item from the list and saves them into a separate folder for each food item. Note that the Python script used by the authors of this article can be found in the supplemental files under download images dot py.
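The download step described above can be sketched as follows. This is a minimal illustration written independently of the authors' script, not a copy of it. The helper names (load_food_items, folder_name, search_image_urls, download_all), the numbered .jpg file names, and the default of 10 results per query are assumptions made for the example. The search helper assumes the google-api-python-client package is installed; the search and fetch steps are passed in as functions so that the folder logic can be tried without network access.

```python
import re
from pathlib import Path


def load_food_items(list_path):
    """Read one food or beverage name per line, skipping blank lines."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def folder_name(item):
    """Turn a food name into a safe folder name, e.g. 'Apple pie' -> 'apple_pie'."""
    return re.sub(r"[^\w]+", "_", item.strip().lower()).strip("_")


def search_image_urls(query, developer_key, cx, count=10):
    """Ask the Google Custom Search API for image URLs.

    Assumes google-api-python-client is installed; developer_key and cx
    are the account-specific values mentioned in the protocol.
    """
    from googleapiclient.discovery import build  # optional dependency, imported lazily

    service = build("customsearch", "v1", developerKey=developer_key)
    response = service.cse().list(
        q=query, cx=cx, searchType="image", num=count
    ).execute()
    return [result["link"] for result in response.get("items", [])]


def download_all(list_path, out_dir, search, fetch):
    """Save the images found for each food item into that item's own folder.

    search(item) returns a list of URLs and fetch(url) returns image bytes.
    Both are hypothetical hooks supplied by the caller.
    """
    saved = {}
    for item in load_food_items(list_path):
        target = Path(out_dir) / folder_name(item)
        target.mkdir(parents=True, exist_ok=True)
        for index, url in enumerate(search(item)):
            (target / f"{index:04d}.jpg").write_bytes(fetch(url))
        saved[item] = len(list(target.iterdir()))
    return saved
```

In real use, search could wrap search_image_urls with the account's own key and engine ID, and fetch could be any HTTP download function.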
If the authors' script is used, the developer key variable (developerKey, line 8 in the Python script code) and the custom search engine ID variable (cx, line 28 in the Python script code) need to be replaced with values specific to the Google account being used. Run the Python script from step 1.1.3. Create a new version of each image from the food image data set by rotating it by 90 degrees, using the CLoDSA library.
Note that the Python script containing all the CLoDSA commands used by the authors of this article can be found in a file included in the supplemental files under NutriNet underscore augmentation dot py. Create a new version of each image from the food image data set by rotating it by 180 degrees, using the CLoDSA library. Create a new version of each image from the food image data set by rotating it by 270 degrees, using the CLoDSA library.
Create a new version of each image from the food image data set by flipping it horizontally, using the CLoDSA library. Create a new version of each image from the food image data set by adding random color noise to it, using the CLoDSA library. Create a new version of each image from the food image data set by zooming into it by 25%, using the CLoDSA library.
Save images from steps 1.3.1 to 1.3.6 along with the original images into a new food image data set. In total, seven variants per food image. Import the food image data set from step 1.3.7 into the NVIDIA DIGITS environment, dividing the data set into training, validation, and testing subsets.
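To make the seven variants concrete, here is a small sketch of the six augmentations plus the original. It deliberately does not use CLoDSA: it is a dependency-free stand-in that works on images stored as lists of rows of (R, G, B) tuples, so the rotation direction, the noise amount of 20, and the nearest-neighbour zoom are assumptions of this example and not the library's behaviour or the authors' settings.

```python
import random


def rotate90(img):
    """Rotate a row-major image 90 degrees counter-clockwise."""
    return [list(column) for column in zip(*img)][::-1]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def add_color_noise(img, amount=20, rng=random):
    """Add a random offset to every channel of every pixel, clamped to 0..255."""
    return [
        [tuple(min(255, max(0, c + rng.randint(-amount, amount))) for c in px)
         for px in row]
        for row in img
    ]


def zoom(img, percent=25):
    """Zoom in by sampling the centre region back up to the original size."""
    h, w = len(img), len(img[0])
    scale = 1 + percent / 100
    top, left = (h - h / scale) / 2, (w - w / scale) / 2
    return [
        [img[min(h - 1, int(top + (y + 0.5) / scale))]
            [min(w - 1, int(left + (x + 0.5) / scale))]
         for x in range(w)]
        for y in range(h)
    ]


def seven_variants(img, rng=random):
    """Return the original image plus its six augmented versions."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return {
        "original": img,
        "rot90": r90,
        "rot180": r180,
        "rot270": r270,
        "flip": flip_horizontal(img),
        "noise": add_color_noise(img, rng=rng),
        "zoom": zoom(img),
    }
```

Each entry of the returned dictionary corresponds to one image that would be saved into the new data set.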
Copy and paste the definition text of the NutriNet architecture into NVIDIA DIGITS. Note that the NutriNet architecture definition can be found in the supplemental files under NutriNet dot prototxt. Optionally, define training hyper-parameters in NVIDIA DIGITS or use the default values.
The hyper-parameters used by the authors of this article can be found in a file included in the supplemental files under NutriNet underscore hyper-parameters dot prototxt. Run the training of the NutriNet model. After training is complete, take the best-performing NutriNet model iteration.
This model is then used for testing the performance of this approach. Note that there are multiple ways to determine the best-performing model iteration. Refer to the article text for more details.
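As one possible way of picking the best-performing iteration, the sketch below selects the saved snapshot with the best value of a chosen validation metric. This is only an illustration of a single criterion. The snapshot records, the field names, and the numbers in the usage example are made up for the example and are not taken from the article.

```python
def best_iteration(history, metric="val_accuracy", higher_is_better=True):
    """Return the snapshot record with the best value of the given metric.

    history is a list of dicts, one per saved model iteration. The field
    names are hypothetical; use whatever the training tool reports.
    """
    if not history:
        raise ValueError("no model iterations to choose from")
    choose = max if higher_is_better else min
    return choose(history, key=lambda snapshot: snapshot[metric])


# Made-up training log with three saved iterations.
example_history = [
    {"epoch": 10, "val_accuracy": 0.71, "val_loss": 1.10},
    {"epoch": 20, "val_accuracy": 0.78, "val_loss": 0.95},
    {"epoch": 30, "val_accuracy": 0.76, "val_loss": 0.99},
]
print(best_iteration(example_history)["epoch"])  # prints 20
```

Choosing by lowest validation loss instead only requires metric="val_loss" and higher_is_better=False.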
Obtain a data set of fake food images. Note that the authors of this article received images of fake food that were collected in a lab environment. Manually annotate every food image on the pixel level.
Each pixel in the image must contain information about which food class it belongs to. Note that there are many tools to achieve this. The authors of this article used JavaScript Segment Annotator.
The result of this step is one annotation image for each image from the food image data set, where each pixel represents one of the food classes. Perform the same steps as in section 1.3, but only on images from the training subset of the food image data set. Note that with the exception of step 1.3.5, all data augmentation steps need to be performed on corresponding annotation images as well.
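The requirement that annotation images follow their photos through every augmentation except the color noise step can be sketched as below. This is a dependency-free illustration rather than the CLoDSA commands the authors used: images are lists of rows of (R, G, B) tuples, annotations are lists of rows of integer class IDs, and all function names are assumptions of this example. The zoom uses nearest-neighbour sampling so that no new, invalid class IDs are created in the annotation.

```python
import random


def rotate90(grid):
    """Rotate a row-major grid 90 degrees counter-clockwise."""
    return [list(column) for column in zip(*grid)][::-1]


def flip_horizontal(grid):
    """Mirror the grid left to right."""
    return [row[::-1] for row in grid]


def zoom(grid, percent=25):
    """Zoom into the centre with nearest-neighbour sampling."""
    h, w = len(grid), len(grid[0])
    scale = 1 + percent / 100
    top, left = (h - h / scale) / 2, (w - w / scale) / 2
    return [
        [grid[min(h - 1, int(top + (y + 0.5) / scale))]
             [min(w - 1, int(left + (x + 0.5) / scale))]
         for x in range(w)]
        for y in range(h)
    ]


def add_color_noise(img, amount=20, rng=random):
    """Perturb pixel colors only; the geometry is unchanged."""
    return [
        [tuple(min(255, max(0, c + rng.randint(-amount, amount))) for c in px)
         for px in row]
        for row in img
    ]


def augment_pair(image, annotation, rng=random):
    """Return (name, image, annotation) triples for one training example."""
    pairs = [("original", image, annotation)]
    img, ann = image, annotation
    for angle in (90, 180, 270):
        # Geometric changes are applied to the photo and its annotation alike.
        img, ann = rotate90(img), rotate90(ann)
        pairs.append((f"rot{angle}", img, ann))
    pairs.append(("flip", flip_horizontal(image), flip_horizontal(annotation)))
    # Color noise changes no pixel positions, so the annotation is reused as-is.
    pairs.append(("noise", add_color_noise(image, rng=rng), annotation))
    pairs.append(("zoom", zoom(image), zoom(annotation)))
    return pairs
```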
Perform the same steps as in section 1.4 with the exception of step 1.4.2. In place of that step, perform steps 2.3.2 and 2.3.3. Note that the training hyper-parameters used by the authors of this article can be found in the file included in the supplemental files under FCN-8S underscore hyper-parameters dot prototxt.
Copy and paste the definition text of the FCN-8S architecture into NVIDIA DIGITS. Enter the pretrained FCN-8S model weights into NVIDIA DIGITS. Note that these model weights were pretrained on the PASCAL Visual Object Classes data set and can be found on the internet.
Download the food image data set from the Food Recognition Challenge website. Perform steps 1.3.1 to 1.3.4. Note that the Python script containing all the CLoDSA commands used by the authors of this article can be found in the file included in the supplemental files under FRC underscore augmentation dot py.
Create a new version of each image from the food image data set by adding Gaussian blur to it, using the CLoDSA library. Create a new version of each image from the food image data set by sharpening it, using the CLoDSA library. Create a new version of each image from the food image data set by applying gamma correction to it, using the CLoDSA library.
Save images from steps 3.2.1 to 3.2.4 along with the original images into a new food image data set. In total, eight variants per food image. Save images from steps 3.2.2 to 3.2.4 along with the original images into a new food image data set.
In total, four variants per food image. Modify the existing HTC ResNet 101 architecture definition from the MMDetection library so that it accepts the food image data sets from steps 3.1.1, 3.2.5, and 3.2.6. Optionally, modify the HTC ResNet 101 architecture definition from step 3.3.1 to define training hyper-parameters or use the default values.
Note that the modified HTC ResNet 101 architecture definition can be found in the supplemental files under HTC underscore ResNet 101 dot py. Run the training of the HTC ResNet 101 model on the food image data set from step 3.1.1 using the MMDetection library. After the training from step 3.3.3 is complete, take the best-performing HTC ResNet 101 model iteration and fine-tune it by running the next phase of training on the food image data set from step 3.2.5.
Note that there are multiple ways to determine the best-performing model iteration. Refer to the article text for more details. This is relevant for the next steps as well.
After the training from step 3.3.4 is complete, take the best-performing HTC ResNet 101 model iteration and fine-tune it by running the next phase of training on the food image data set from step 3.2.6. After the training from step 3.3.5 is complete, take the best-performing HTC ResNet 101 model iteration and fine-tune it by again running the next phase of training on the food image data set from step 3.2.5. After the training from step 3.3.6 is complete, take the best-performing HTC ResNet 101 model iteration.
This model is then used for testing the performance of this approach. Note that steps 3.3.3 to 3.3.7 yielded the best results for the purposes defined by the authors of this article. Experimentation is needed for each data set to find the optimal sequence of training and data augmentation steps.
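The four-phase training sequence from steps 3.3.3 to 3.3.7 can be summarized as a simple loop: train on one data set, keep the best iteration, and start the next phase from it. The sketch below shows only that control flow. The data set labels and the train_phase and pick_best callables are hypothetical placeholders; in practice train_phase would wrap an MMDetection training run, and pick_best would apply whichever selection criterion is chosen.

```python
# Order of data sets used in the four phases: the original data set (step 3.1.1),
# the eight-variant set (step 3.2.5), the four-variant set (step 3.2.6),
# and the eight-variant set again.
SCHEDULE = ["original", "augmented_8", "augmented_4", "augmented_8"]


def run_schedule(train_phase, pick_best, datasets=SCHEDULE):
    """Train phase by phase, each time continuing from the previous best checkpoint.

    train_phase(dataset, start_checkpoint) returns the candidate checkpoints
    saved during that phase; pick_best(candidates) returns one of them.
    """
    checkpoint = None  # None means: start from the initial weights
    log = []
    for dataset in datasets:
        candidates = train_phase(dataset, checkpoint)
        checkpoint = pick_best(candidates)
        log.append((dataset, checkpoint))
    return checkpoint, log
```

Because the schedule is just a list, trying a different order of data sets, as the note above recommends for new data, only means passing a different datasets argument.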
After testing, the trained NutriNet model achieved a classification accuracy of 86.72% on the recognition data set, which was around 2% higher than AlexNet and slightly higher than GoogLeNet, which were popular deep neural network architectures of the time. To measure the accuracy of the FCN-8S fake food image segmentation model, the pixel accuracy measure was used. The accuracy of the trained FCN-8S model was 92.18%. The ResNet-based solution for food image segmentation was evaluated using the precision measure defined in the Food Recognition Challenge.
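For reference, pixel accuracy is commonly computed as the share of pixels whose predicted class matches the annotated class. The sketch below follows that common definition and assumes both images are given as lists of rows of class IDs; it is not the authors' evaluation code, and the function name is chosen for this example.

```python
def pixel_accuracy(predicted, ground_truth):
    """Fraction of pixels whose predicted class equals the annotated class."""
    if len(predicted) != len(ground_truth):
        raise ValueError("images must have the same number of rows")
    total = correct = 0
    for predicted_row, true_row in zip(predicted, ground_truth):
        if len(predicted_row) != len(true_row):
            raise ValueError("rows must have the same length")
        for predicted_class, true_class in zip(predicted_row, true_row):
            total += 1
            correct += predicted_class == true_class
    if total == 0:
        raise ValueError("images are empty")
    return correct / total


# Tiny 2x2 example: three of the four pixels are labelled correctly.
print(pixel_accuracy([[0, 1], [2, 2]], [[0, 1], [2, 0]]))  # prints 0.75
```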
Using the Food Recognition Challenge precision measure, the trained model achieved an average precision of 59.2%, which ranked second in the Food Recognition Challenge. In recent years, deep neural networks have been validated multiple times as a suitable solution for recognizing food images. Our work presented in this article serves to further prove this.
The single-output food image recognition approach is straightforward and can be used for simple applications. The food image segmentation approach requires more work in preparing annotated images, but it is much more applicable to real-world images. In the future, our goal will be to further evaluate the developed procedures on real-world images.
The first step towards real-world validation was provided by the Food Recognition Challenge, which included a data set of real-world food images. However, further work needs to be done to validate this approach on food images from all around the world and in cooperation with dieticians.
The goal of the work presented in this article is to develop technology for automated recognition of food and beverage items from images taken by mobile devices. The technology comprises two different approaches: the first one performs food image recognition while the second one performs food image segmentation.