
Detecting and Tracking Handled-Objects for Progress Management in Food Preparation

Daily meals are an important factor in our quality of life. To enrich our food experience, plenty of recipes have been published in books and on web sites; however, trying an unfamiliar recipe is not an easy task. As a result, these recipes are not fully utilized.

To break through this situation, we aim to develop a system that supports trying unfamiliar recipes. Trying an unfamiliar recipe is not uncommon: many annual events have their own ceremonial foods, a nutritional restriction imposed by a doctor on a family member forces us to change our eating habits, and, more generally, we want to eat different dishes every day.

In such cases, the system will be used nearly every day, and a daily tool must be easy to use. To achieve this high degree of usability, our research group has proposed a user-centric concept for a system that supports food preparation. The system requires only a minimal operation from the user, namely choosing a recipe, and no other extra operations.

The system should be useful for various users. For a complete novice cook, such as a child, the system should give step-by-step guidance. In contrast, a professional does not need such intrusive support and only requires recipe-specific information, such as the amount of seasoning and the right heat level.

More typical users fall between the complete novice and the professional. Such users require step-by-step guidance for some parts of a recipe, but not for the parts they are familiar with. Unlike step-by-step guidance, recipe-specific information is useful for cooks of any level.

We aim to make the suggestions given by the step-by-step guidance ignorable. This strategy comes from the idea of car navigation systems: even when the user turns away from the suggested direction, the system recalculates a new route for the user. With such a system, the user always has the choice to follow the guidance or to ignore it. We call this kind of guidance navigating guidance, in the sense that its purpose is to suggest a next step that the user can perform. Navigating guidance should be shown directly after the completion of the previous step.

Recipe-specific information should be shown directly before the related culinary action rather than in the navigating guidance. This is because the user can ignore the navigation; when it is ignored, the user has no chance to hear the navigating guidance for the step that he or she decided to do next. Moreover, there is no guarantee that the user performs the step directly after hearing the navigation. Recipe-specific information contains much quantitative information, such as amounts and heat levels, which is difficult to memorize for a long time. Hence we propose another type of guidance, detailed guidance, which contains the recipe-specific information. It should not be ignored and should therefore attract the cook's attention directly before the related action is performed.

To realize our user-centric concept, there are two main technical agendas. The first is to make navigating guidance ignorable. The second is to detect the appropriate timing for showing the two types of guidance. These agendas must be addressed without any extra operations. We approach both of them by extracting object-handling events, which happen naturally as part of ordinary food-preparation activity.

We describe the j-th handling event of object o_i as e_i^j = {o_i, t_s^j, t_e^j}, where t_s^j is the time when o_i has been taken and t_e^j is the time when o_i has been put back. In our user-centric concept, a recipe must be described as a set of steps, and each step consists of one culinary process performed on one food item with one cooking device. With such a description, t_s^j and t_e^j will often correspond to the start and the end of a step. Hence, t_s^j and t_e^j are good candidates for the timing of showing detailed guidance and navigating guidance, respectively.
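
For concreteness, such an event can be held in a small record like the following sketch (the field names are illustrative, not taken from the thesis):

```python
from dataclasses import dataclass

@dataclass
class HandlingEvent:
    """The j-th handling event of object o_i: e_i^j = {o_i, t_s^j, t_e^j}."""
    object_id: int   # identifier of the handled object o_i
    j: int           # how many times this object has been handled so far
    t_start: float   # t_s^j: time when the object was taken from the counter
    t_end: float     # t_e^j: time when the object was put back on the counter

# Example: the 2nd handling of object 3, taken at 12.4 s and put back at 31.0 s
event = HandlingEvent(object_id=3, j=2, t_start=12.4, t_end=31.0)
```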

The category of the taken object o_i provides good evidence about the next step to be performed, which allows us to show detailed guidance directly before the related action. To confirm that the user actually performed the suggested action, we extract motion features from the handling interval [t_s^j, t_e^j]. This enables the system to suggest a new step based on the current progress of the food preparation.
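
As a hedged illustration of this timing rule only, the sketch below triggers the two guidance types from take and put events; the category-to-step mapping and the motion check are made-up placeholders, not the thesis's actual method:

```python
# Hypothetical mapping from the category of the taken object to the next step
NEXT_STEP = {
    "carrot": "cut the carrot into 5 mm slices",
    "frying pan": "heat 1 tbsp of oil over medium heat",
}

def guidance_on_take(category):
    """Taking an object signals the step about to start: show detailed guidance."""
    step = NEXT_STEP.get(category)
    if step is not None:
        print("detailed guidance:", step)
    return step

def guidance_on_put(step, motion_confirms, remaining_steps):
    """Putting the object back ends the handling: if motion features over
    [t_s^j, t_e^j] confirm the step, update progress and show ignorable
    navigating guidance for a possible next step."""
    if step is not None and motion_confirms:
        remaining_steps.discard(step)
    if remaining_steps:
        print("navigating guidance: you could now", sorted(remaining_steps)[0])

steps = set(NEXT_STEP.values())
s = guidance_on_take("carrot")
guidance_on_put(s, motion_confirms=True, remaining_steps=steps)
```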

The goal of this thesis is to extract object-handling events from observed video sequences. This task serves as a basis for realizing our user-centric concept. To extract object-handling events, we need the following processes: (a) detecting regions of activity on a cooking counter, (b) differentiating between human movement, taken objects and put objects, and (c) matching a put object with a previously taken object.
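
The three processes can be arranged as the following skeleton (the function names are placeholders; each stage is sketched in more detail alongside the paragraphs that follow):

```python
def detect_changed_regions(frame, background):   # (a): background subtraction (TexCut)
    return []

def differentiate_regions(changes, frame):       # (b): cook vs. taken vs. put regions
    return [], [], []

def match_put_with_taken(taken, put):            # (c): tracking / identification
    return []

def extract_handling_events(video_frames, background):
    """Overall pipeline: each stage above is only a stub in this skeleton."""
    events = []
    for frame in video_frames:
        changes = detect_changed_regions(frame, background)
        cook, taken, put = differentiate_regions(changes, frame)
        events += match_put_with_taken(taken, put)
    return events
```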

When detecting regions of activity (a), the main targets are the regions of displaced objects. Traditional methods relying on strong classifiers and discriminant features are not suitable in our case because food items are poorly textured and show extremely large intra-class appearance changes. Hence, we adopt background subtraction, which is applicable to general object detection.

Cast shadows and secondary reflections are the main sources of trouble in background subtraction. In our case, a cook standing by the counter casts shadows and secondary reflections onto the cooking table. Such disturbances interfere with the object detection process and make it unreliable. Hence we need a method that is robust to changes in lighting conditions.

As such a method, we propose TexCut, which subtracts the background by combining texture comparison with smoothing based on the graph-cut technique. A textural feature represents relative pixel values within a region and is robust to changes in lighting conditions. A number of methods using textural features have been proposed; however, none of these traditional methods overcomes the following two essential problems of texture comparison. The first problem derives from the shortage of signal power when comparing homogeneous regions. The second is the pseudo-texture appearing at the contours of strong shadows. In the proposed method, the textural difference at each region is weighted by the region's homogeneity. The weight controls the reliability of the textural difference at each region: if a region is homogeneous, its texture is not reliable for comparison, and the segmentation in such regions is strongly smoothed.

This locally adaptive smoothing is performed by graph cut, which yields a subtraction result while avoiding comparisons in homogeneous regions. TexCut distinguishes pseudo-texture from a real object's texture by its sharpness. Because the light sources in indoor scenes have non-negligible size, a cast shadow always produces blurred contours. Hence, we avoid textural comparison in regions with only blurred edges, just as in homogeneous regions. This enables TexCut to ignore the pseudo-texture at the contours of cast shadows.
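
The following is a rough, runnable sketch of this idea under simplifying assumptions, not the actual TexCut implementation: grayscale frames, a crude gradient-based textural feature, and the PyMaxflow library standing in for the graph-cut step; the blur-based suppression of shadow contours is omitted.

```python
import numpy as np
import maxflow  # PyMaxflow, used here as a stand-in for the graph-cut step

def block_texture(img, bs=8):
    """Per-block gradient pattern as a crude textural feature (illustrative only)."""
    gy, gx = np.gradient(img.astype(float))
    H, W = img.shape[0] // bs, img.shape[1] // bs
    g = np.stack([gx[:H*bs, :W*bs].reshape(H, bs, W, bs),
                  gy[:H*bs, :W*bs].reshape(H, bs, W, bs)], axis=-1)
    return g.transpose(0, 2, 1, 3, 4).reshape(H, W, -1)

def texcut_like(frame, background, bs=8, smooth=2.0):
    """Homogeneity-weighted texture comparison followed by graph-cut smoothing."""
    f, b = block_texture(frame, bs), block_texture(background, bs)
    diff = np.linalg.norm(f - b, axis=-1)
    power = np.maximum(np.linalg.norm(f, axis=-1), np.linalg.norm(b, axis=-1))
    # Homogeneous blocks (low gradient power) get low weight: their textural
    # difference is unreliable, so the smoothness term dominates there.
    weight = 1.0 - np.exp(-power / (power.mean() + 1e-6))
    data = weight * diff / (power + 1e-6)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(data.shape)
    g.add_grid_edges(nodes, smooth)          # pairwise smoothing between blocks
    g.add_grid_tedges(nodes, np.full_like(data, data.mean()), data)
    g.maxflow()
    return g.get_grid_segments(nodes)        # True = foreground block
```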

Using TexCut, we can detect changes on the cooking counter while ignoring cast shadows and secondary reflections. The next problem is how to differentiate between the cook's movements, taken-object regions and put-object regions (b). The cook region is identified by a simple trick using the edge of the cooking counter: because only the cook can cross the counter's edge, a foreground region intersecting the edge must be a cook region.
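
A minimal sketch of this trick, assuming for illustration that the counter's near edge corresponds to the bottom rows of the image:

```python
import numpy as np
from scipy import ndimage

def split_cook_region(fg_mask, counter_edge_rows=slice(-5, None)):
    """Label foreground blobs; any blob touching the counter's edge
    (assumed here to be the bottom rows of the frame) is taken as the cook."""
    labels, _ = ndimage.label(fg_mask)
    edge_labels = np.unique(labels[counter_edge_rows])
    edge_labels = edge_labels[edge_labels != 0]     # drop the background label
    cook = np.isin(labels, edge_labels)
    objects = fg_mask & ~cook                       # remaining blobs: displaced objects
    return cook, objects
```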

Taken objects are distinguished from put objects using a patch-based background model. Because background subtraction cannot detect overlapping objects independently, we design the background model to detect foreground regions only when objects are taken from, or put onto, those regions. This is realized by incorporating each detected object into the background. The incorporated object is not detected as a foreground region as long as it remains at the same location; when the object is taken, the corresponding region is detected again as foreground. The background model keeps the previous appearance of a background region when incorporating an object put on that region, and that appearance reappears when the object is taken away. Hence, keeping the old background appearance enables us to differentiate a foreground region caused by a put object from one caused by a taken object.
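
A minimal sketch of this layered, patch-based background idea; the patch size, the matching threshold and the method names are illustrative assumptions:

```python
import numpy as np

class PatchBackground:
    """Per-patch background keeping a stack of older appearances.

    When an object is put down, its appearance is pushed onto the covered
    patches so it is no longer detected; the previous background stays
    underneath and reappears when the object is taken away again."""

    def __init__(self, first_frame, ps=16):
        self.ps = ps
        self.layers = {}          # (row, col) block index -> list of appearances
        h, w = first_frame.shape[:2]
        for i in range(h // ps):
            for j in range(w // ps):
                self.layers[(i, j)] = [self._patch(first_frame, i, j)]

    def _patch(self, frame, i, j):
        ps = self.ps
        return frame[i*ps:(i+1)*ps, j*ps:(j+1)*ps].copy()

    def incorporate_put(self, frame, blocks):
        """Push the put object's appearance onto the blocks it covers."""
        for i, j in blocks:
            self.layers[(i, j)].append(self._patch(frame, i, j))

    def remove_taken(self, blocks):
        """Pop the object's appearance; the old background reappears underneath."""
        for i, j in blocks:
            if len(self.layers[(i, j)]) > 1:
                self.layers[(i, j)].pop()

    def classify_change(self, frame, i, j, tol=20.0):
        """A changed block that matches the layer below means an object was
        taken; otherwise something new was put there."""
        cur = self._patch(frame, i, j).astype(float)
        stack = self.layers[(i, j)]
        if len(stack) > 1 and np.abs(cur - stack[-2]).mean() < tol:
            return "taken"
        return "put"
```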

We can now detect an object o_i1 which has been taken at t_s^i1 and an object o_i2 which has been put at t_e^i2. If o_i1 and o_i2 are the same object, then an object-handling event e^j is obtained as {o_i1 (= o_i2), t_s^i1, t_e^i2}. The index j is obtained by counting how many times o_i1 has been handled before t_s^i1. In order to count j and judge whether o_i1 and o_i2 are the same object, we need to track objects on the cooking counter (c).
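
Assuming step (c) has identified the put object with a previously taken one, the event assembly and the counting of j can be sketched as follows (names and the dictionary-based event format are illustrative):

```python
from collections import defaultdict

handle_counts = defaultdict(int)   # object id -> number of completed handlings
open_takes = {}                    # object id -> time it was taken (t_s)

def on_taken(obj_id, t):
    open_takes[obj_id] = t

def on_put(obj_id, t):
    """If the put object is identified with a previously taken one, emit e_i^j."""
    if obj_id in open_takes:
        handle_counts[obj_id] += 1
        return {"object": obj_id, "j": handle_counts[obj_id],
                "t_start": open_takes.pop(obj_id), "t_end": t}
    return None

# Example: object 3 is taken at t = 12.4 s and put back at t = 31.0 s
on_taken(3, 12.4)
print(on_put(3, 31.0))   # {'object': 3, 'j': 1, 't_start': 12.4, 't_end': 31.0}
```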

While object o_i1 is on the counter and not being handled, it does not move and does not need to be tracked. Tracking is required when o_i1 is handled. General tracking methods assume consistency of the target's appearance and continuity of its location and motion; however, if o_i1 is peeled or cut during handling, serious problems occur. Peeling and cutting change the appearance of o_i1 drastically, and they are performed while holding the ingredient, so a small ingredient can be completely hidden by the hand. Thus, appearance consistency and location/motion continuity are violated in the food-preparation context, and we need another approach to identify o_i1 and o_i2 as the same object.

We approach this problem by replacing the identification process with a classification process. How an ingredient's color changes depends on the combination of its skin color and its inside color, so the changing patterns can be learned in advance. Hence, o_i1 is likely to be o_i2 if the color combination of o_i1 and o_i2 matches the skin-inside color combination of some ingredient.
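
A hedged sketch of this classification, with made-up skin/inside colors; the candidates argument restricts the comparison to listed ingredients, anticipating the recipe list discussed in the next paragraph:

```python
import numpy as np

# Hypothetical skin/inside colors (mean RGB), assumed to be learned in advance
COLOR_MODEL = {
    "potato": {"skin": (150, 120, 80), "inside": (230, 220, 170)},
    "carrot": {"skin": (220, 110, 40), "inside": (240, 150, 70)},
}

def color_dist(a, b):
    return np.linalg.norm(np.array(a, float) - np.array(b, float))

def same_ingredient(taken_color, put_color, candidates, tol=60.0):
    """Judge whether the taken and put regions can be the same ingredient:
    the pair must match one candidate's learned skin/inside color combination."""
    matches = []
    for name in candidates:
        m = COLOR_MODEL[name]
        if (color_dist(taken_color, m["skin"]) < tol and
                min(color_dist(put_color, m["skin"]),
                    color_dist(put_color, m["inside"])) < tol):
            matches.append(name)
    return matches

# A peeled potato: taken with brown skin, put back showing the pale inside
print(same_ingredient((145, 118, 83), (228, 219, 168), candidates=["potato", "carrot"]))
```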

This inference does not lead to correct tracking results if we consider the color combinations of every possible ingredient. Fortunately, the ingredients that will appear are listed in the recipe, which is chosen in the first operation of our user-centric system. This list efficiently excludes the color combinations of ingredients that do not appear in the current food preparation, and the inference becomes workable. In this way of tracking, appearance continuity is no longer assumed; instead, it is assumed that the appearing ingredients are known and can be learned in advance.

In place of location/motion continuity, we also use another piece of contextual information in the observed data: the taken order. This idea comes from the natural observation that a person cannot hold many objects at the same time.
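
A minimal sketch of how the taken order could constrain the matching; the bound on simultaneously held objects is an illustrative assumption:

```python
def match_by_taken_order(put_candidates, recently_taken, max_held=2):
    """Among color-consistent candidates, prefer the most recently taken object;
    only the last `max_held` takes are considered, since a cook cannot hold
    many objects at once (max_held = 2 is an illustrative assumption)."""
    for obj_id in reversed(recently_taken[-max_held:]):
        if obj_id in put_candidates:
            return obj_id
    return None

# Example: objects 5 then 7 were taken; a put region is consistent with 5 or 7
print(match_by_taken_order({5, 7}, recently_taken=[2, 5, 7]))   # -> 7
```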

By solving the aforementioned problems, we have shown how to extract object-handling events e_i^j from a video sequence. One of our future works is to consolidate our methods with object categorization and motion categorization techniques. We also intend to implement a prototype of the user-centric supporting system. This is important for clarifying what accuracy is required, and how quickly the system should respond when the user puts or takes an object, to achieve a satisfying degree of usability. These evaluations will lead to further agendas in the fields of pattern recognition and cognitive science.