Ingredient Recognition using a Combination of Image, Vibration Sound and Load Data

Nowadays, many homemakers publish their original recipes on web services designed for sharing user-created recipes. Such services make it easy for individual users to publish recipes; however, writing a recipe still requires great care. A system that automatically describes recipes by observing food preparation would therefore help these users. Since a recipe is a set of directions, each consisting of a culinary task and its target ingredient, such a system must recognize both the ingredients and the tasks through observing food preparation. In this paper, we focus on ingredient recognition.

Many related works use Image data as the input to ingredient recognition, while others use Vibration Sound and Load data captured during the cook's cutting actions. These modalities have been used separately in prior work, but each modality has its own drawbacks and fails to recognize certain ingredients. Such failures can hardly be solved within a single modality, but can be resolved by the modalities compensating for one another. For example, Image is ill-suited to distinguishing cibols (Welsh onions) from radishes because they have similar colors, whereas Load succeeds because the two differ in hardness. Conversely, Load can hardly discriminate radishes from carrots, whereas Image can. In this paper, we propose an ingredient recognition method that combines these three modalities, resolving the misrecognitions caused by the drawbacks of each individual modality.

Simply combining multi-modal time-series data raises the following problem: each modality's data has effective parts and ineffective parts along its time series, and these parts differ between modalities. For instance, the performance of Vibration Sound and Load becomes very high while the cook cuts an ingredient, whereas that of Image becomes low at the same time because the ingredient is hidden by the cook's hands and its colors cannot be observed. Therefore, in this paper we extract the effective parts of each modality separately and then combine them.

Since the performance of Load and Vibration Sound is high while an ingredient is being cut, the cook's cutting actions must be detected first. In this paper, this detection is performed on the Load data, yielding an interval T corresponding to the cutting part, from which a 10-dimensional Load feature vector is extracted. The interval T is also used to extract Vibration Sound features as follows: the moment t_c at which the knife and the cutting board collide is detected within T, and a 16-dimensional Vibration Sound feature vector is extracted from the interval T_S = [t_c - 0.2, t_c]. As for Image, we focus on the moment when the cook places the ingredient on the cutting board before cutting it, and extract a 64-dimensional Image feature vector at that moment.
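The pipeline above can be sketched as follows. This is a minimal illustration, not the thesis's exact design: the sampling rates, the load threshold, and the concrete feature computations (simple statistics for Load, FFT band energies for Vibration Sound) are all assumptions made for the example; only the interval layout (T from Load, T_S = [t_c - 0.2, t_c] for Sound) follows the description above.

```python
import numpy as np

FS_LOAD = 100      # load-cell sampling rate [Hz] (assumed)
FS_SOUND = 8000    # vibration-sound sampling rate [Hz] (assumed)

def detect_cutting_interval(load, threshold=2.0, fs=FS_LOAD):
    """Return the interval T = (t_start, t_end) in seconds where the
    load signal exceeds a threshold, i.e. a candidate cutting part."""
    active = np.flatnonzero(np.asarray(load) > threshold)
    if active.size == 0:
        return None
    return active[0] / fs, active[-1] / fs

def load_features(load_segment):
    """A 10-dimensional Load feature vector: 5 simple statistics plus
    5 coarse amplitude-histogram bins (an illustrative choice)."""
    seg = np.asarray(load_segment, dtype=float)
    stats = [seg.max(), seg.min(), seg.mean(), seg.std(), np.ptp(seg)]
    hist, _ = np.histogram(seg, bins=5, range=(seg.min(), seg.max() + 1e-9))
    return np.concatenate([stats, hist / max(len(seg), 1)])

def sound_features(sound, t_c, fs=FS_SOUND):
    """A 16-dimensional Vibration Sound vector from T_S = [t_c - 0.2, t_c]:
    here, 16 FFT band energies of the 0.2 s window ending at the
    knife-board collision moment t_c."""
    a, b = int((t_c - 0.2) * fs), int(t_c * fs)
    window = np.asarray(sound, dtype=float)[max(a, 0):b]
    spectrum = np.abs(np.fft.rfft(window))
    bands = np.array_split(spectrum, 16)
    return np.array([band.sum() for band in bands])
```

A synthetic load trace that rises above the threshold between 1.0 s and 2.0 s, for instance, yields that interval from `detect_cutting_interval`, and the two feature functions return 10- and 16-dimensional vectors as described above.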

To examine the effectiveness of the proposed method, we conducted an experiment in which we evaluated its recognition accuracy and compared it with that of uni-modal methods using only Image, Vibration Sound, or Load data, respectively. We observed 23 kinds of ingredients that often appear in ordinary home cooking, and the data obtained from the observations was processed with Ivanov's combination method. The results are as follows. With the Image-based method, green bell peppers were often misrecognized as cucumbers, at a misrecognition rate of 61.1%; with the proposed method, which combines Vibration Sound and Load with Image, the rate improved to 3.8%. In addition, radishes and carrots were sometimes misrecognized as each other by the Load-based method: the rate of misrecognizing radishes as carrots was 12.9%, which our method improved to 0.3%, and the rate of misrecognizing carrots as radishes improved from 9.0% to 0.0%. These results indicate that the misrecognitions of each modality are successfully resolved by the proposed multi-modal method, which exploits the complementary relation between the three modalities.
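The abstract names Ivanov's combination method but does not spell out its rule. Purely as an illustration of why fusing the three modalities can repair a uni-modal confusion, the sketch below combines per-modality class posteriors with a simple product rule, a common late-fusion baseline; this is not necessarily the rule used in the thesis.

```python
import numpy as np

def fuse_product(prob_image, prob_sound, prob_load):
    """Multiply per-modality class posteriors and renormalize.
    A generic late-fusion baseline, shown for illustration only."""
    fused = (np.asarray(prob_image, dtype=float)
             * np.asarray(prob_sound, dtype=float)
             * np.asarray(prob_load, dtype=float))
    return fused / fused.sum()

# Hypothetical posteriors over 3 classes: Image slightly prefers class 1
# (cf. green bell pepper misrecognized as cucumber), but Sound and Load
# both favor class 0, so the fused decision recovers class 0.
p_img = np.array([0.45, 0.50, 0.05])
p_snd = np.array([0.70, 0.20, 0.10])
p_lod = np.array([0.60, 0.30, 0.10])
fused = fuse_product(p_img, p_snd, p_lod)
```

With these hypothetical numbers, Image alone picks class 1, while the fused posterior picks class 0, mirroring how combining modalities corrected the Image-based confusions reported above.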

The proposed method currently extracts two or more feature vectors from a single individual ingredient for each modality. However, these vectors sometimes differ greatly from one another owing to the variety of cutting styles, especially for the Vibration Sound and Load modalities. Solving this problem is left for future work. One solution would be to select the most effective features from among all the features extracted from each single ingredient.
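One possible reading of this future-work idea can be sketched as follows. As an assumption made only for the example, "most effective" is interpreted as the feature vector closest to the centroid of all vectors extracted from the same ingredient, i.e. the most typical one; the thesis leaves the actual selection criterion open.

```python
import numpy as np

def select_representative(vectors):
    """Given an (n x d) array of feature vectors extracted from one
    ingredient, return the vector nearest their centroid (an assumed
    notion of the 'most effective' vector)."""
    vectors = np.asarray(vectors, dtype=float)
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return vectors[np.argmin(distances)]
```

For example, among three vectors where one is an outlier caused by an unusual cutting style, the vector nearest the centroid of the group is kept and the outlier is discarded.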