Recently, many static cameras have been installed for fixed-point surveillance in our daily environment. Various surveillance systems have been proposed, and their videos are used for security in places such as malls, stations, and streets.
What do the videos captured by the surveillance cameras contain? They contain stationary objects such as walls, roads, and the ground. They also contain variations, which we divide into two groups according to their causes. The first group consists of variations in the color, texture, and luminance of objects. It includes illumination changes, moving objects such as swaying trees and flapping flags, and objects that are relocated by people, such as panels and benches. Displays for information and digital signage, which have recently become common in our surroundings, also belong to this group. The second group consists of variations caused by new objects entering the observation area, such as pedestrians and cars. In this dissertation, we treat the stationary objects and the objects that cause the first group of variations as the background, and the objects that cause the second group as the foreground.
In many scenarios, foreground objects need to be separated from the background before further processing. Background Subtraction is used to detect only the foreground regions. However, it is not sufficient for obtaining both the foreground regions and the background image, because at least the regions occluded by foreground objects must be estimated. Therefore, in this dissertation, we focus on estimating the background image of images captured by a static camera.
To estimate such a background image, we need to know the value of each pixel in the image. We therefore use the term Background Estimation to denote the process of estimating the pixel values of the background image. Intuitively, for a given static camera, images with foreground objects should share the same background image as those without foreground objects. However, because the background is occluded by the foreground objects, it is difficult to estimate the background image from a single image that contains foreground objects. Therefore, existing background estimation methods use a set of images captured beforehand. We follow this strategy.
There are many kinds of background variations in an observation area. We classify them in terms of their occurrence patterns. Briefly, there are two patterns: statistically biased and not statistically biased. Based on regularity and repetition, we regard illumination changes as cyclic variations, variations of objects that are stationary but can be relocated by people as step-function-like variations, and variations of moving objects such as swaying trees and flapping flags as continuous variations. These three kinds have statistical biases. In contrast, the appearances of displays for digital signage change arbitrarily, so we believe that they do not have such biases. We introduce a different approach to background estimation for each pattern: one selects the images used for modeling the background, and the other builds a specific, precise model of the background.
To begin with, we estimate the background for a scene that contains the former pattern of background variations, those with statistical biases. When many images are used to model the background, each variation is buried among the others. The basic idea here is therefore to select the images used for modeling the background appropriately.
Firstly, we tackle the first two background variations: the cyclic variations caused by illumination and the step-function-like variations caused by objects that are relocated by people. Here, we ignore background objects that move all the time. We assume that the objects causing step-function-like variations are relocated from day to day but remain stationary within a day. In this study, we call the variations caused by such objects structure variations and the illumination changes lighting variations. Under these assumptions, we propose a background estimation technique that preserves both the lighting variation and the structure variation. We propose an image model that can describe these variations; it consists of a stationary component, a lighting difference component, a structure difference component, and a foreground component. Using this model, we realize background estimation that preserves the lighting and structure variations while removing foreground objects. The method is based on extracting the lighting component of an image by Principal Component Analysis (PCA) using images captured at the same time on different days. First, we remove the lighting component from the images captured on the same day using PCA for each time span. Then, a background image that preserves the structure variation is generated by averaging the images that have the same spatial structure but different lighting components. After that, by recovering the lighting component of the target image with respect to the averaged image, we estimate a background image that preserves both the lighting and structure variations. We experimentally evaluated this approach using data from several surveillance cameras, and the results are better than those generated by traditional background estimation methods.
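The three steps described above (remove the lighting component, average, restore the lighting of the target) can be illustrated with a small NumPy sketch. This is not the implementation used in this dissertation: the function names, the use of a single shared PCA subspace instead of one per time span, the number of components k, and the representation of images as flattened vectors are all assumptions made for illustration. The sketch also assumes that foreground objects are sparse enough for averaging to suppress them.

```python
import numpy as np

def lighting_basis(same_time_images, k):
    """Fit a PCA lighting subspace (hypothetical helper).

    same_time_images: array (n_days, n_pixels), one flattened image per row,
    captured at the same time on different days.
    Returns the mean image and the top-k orthonormal principal axes (as rows).
    """
    mean = same_time_images.mean(axis=0)
    # Rows of vt are the principal axes of the centered data, strongest first.
    _, _, vt = np.linalg.svd(same_time_images - mean, full_matrices=False)
    return mean, vt[:k]

def lighting_component(image, mean, basis):
    """Project the deviation from the mean image onto the lighting subspace."""
    return basis.T @ (basis @ (image - mean))

def estimate_background(target, same_day_images, mean, basis):
    """Estimate the background of `target` from images of the same day."""
    # Step 1: remove the lighting component from every image of that day.
    lighting_free = np.stack(
        [img - lighting_component(img, mean, basis) for img in same_day_images]
    )
    # Step 2: average images that share the same spatial structure.
    structure_preserving = lighting_free.mean(axis=0)
    # Step 3: restore the lighting component of the target image.
    return structure_preserving + lighting_component(target, mean, basis)
```

In this simplified form, a foreground object in the target image only leaks into the estimate through its projection onto the lighting subspace, which is small when the object covers few pixels.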
Then, we tackle the remaining variations: the continuous variations caused by moving objects such as swaying trees and flapping flags. We assume that such objects move repeatedly, so that we can observe their motion again and again. We also assume that the variations have the first-order Markov property, and we concentrate on neighboring frames. Traditional background estimation approaches parameterize the lighting component of an image and estimate it by temporal filtering such as Kalman filtering. We extend them to a background estimation method that can deal not only with the lighting variation but also with any other variations that are observed repeatedly. Dynamic Texture, which describes texture transitions with a Kalman filter, is applicable for this purpose; however, it has difficulty dealing with the non-linearity of the background transitions. Some extensions tackle the non-linearity by assuming multiple states, but it is difficult to separate the background variations into such discrete states. To tackle the non-linearity, we propose an exemplar-based approach that applies Kernel Density Estimation to the parameters of neighboring frames. With this process, we realize background estimation that can preserve any background variations, including those caused by moving objects. We evaluated the estimated background images qualitatively and quantitatively, comparing them with those generated by previous Eigen Background methods.
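One way to picture an exemplar-based predictor under the first-order Markov assumption is the kernel-weighted average below. It is only a sketch under assumptions that are not stated in the text above: each frame is summarized by a low-dimensional parameter vector, training data are stored as pairs of parameters from consecutive frames, and the kernel is an isotropic Gaussian with a hand-chosen bandwidth. With those choices, the prediction is the conditional mean under a Gaussian kernel density estimate of the consecutive-frame pairs (the Nadaraya-Watson estimator), which may differ from the estimator actually used in this work. The function and argument names are hypothetical.

```python
import numpy as np

def predict_background_params(prev_params, exemplar_prev, exemplar_next, bandwidth):
    """Kernel-weighted prediction of the current frame's background parameters.

    prev_params:   (d,) parameters estimated for the previous frame.
    exemplar_prev: (n, d) parameters of training frames at time t-1.
    exemplar_next: (n, d) parameters of the following training frames at time t.
    bandwidth:     Gaussian kernel width (hypothetical tuning parameter).
    """
    # Squared distance from the query to every stored "previous frame" exemplar.
    sq_dist = np.sum((exemplar_prev - prev_params) ** 2, axis=1)
    log_w = -0.5 * sq_dist / bandwidth ** 2
    log_w -= log_w.max()  # avoid underflow when all exemplars are far away
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Weighted average of the successors of the nearby exemplars.
    return weights @ exemplar_next
```

Because the prediction is assembled from observed transitions rather than from a single linear state-transition matrix, it can follow non-linear but repeatedly observed motion without dividing the variations into discrete states.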
From the results of these two background estimation methods, we conclude that the background of a scene can be estimated if we can limit the input images appropriately and obtain enough prior information about the variations of the scene to capture the statistical biases of the background.
In contrast to the two methods introduced above, we also estimate the background for a scene that contains the latter pattern of background variations, those without statistical biases. There are many displays for information and digital signage in our surroundings. These displays are observed by surveillance cameras from various orientations and distances, and their appearances change arbitrarily as the displayed contents change. Here, we regard the displays as part of the background of the scene and try to estimate them. Estimating the background image is very difficult when the contents shown on the displays are unknown. However, we think that even if we knew what contents are shown on the displays, it would still be difficult to estimate the background image precisely because of their various orientations, distances, resolutions, and devices. Therefore, assuming that the displayed contents are known, we discuss precise background estimation in an experimental environment. To simplify the situation, we restrict it to a simple system in which a camera observes a display and the image shown on the display is known. Even in such a simple situation, difficulties remain that are caused by the relative position of the camera and the display. We propose a technique to estimate the background image in this situation, in other words, to estimate how the displayed contents are observed by the camera. We realize it by geometric and photometric conversion. Although the display and the camera treat R, G, and B pixels independently and each pixel has its own size, the interpolation in traditional geometric conversion treats a pixel with a three-dimensional value as a point. Therefore, it does not work accurately in this situation. We model the pixel structures of the display and the camera, and then improve the interpolation with this model.
We experimentally evaluated our proposed method and the traditional method by comparing the estimated image with the observed image.
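To illustrate why treating a pixel as a point can fail, the following one-dimensional sketch resamples a display row by area rather than by point lookup. Everything specific in it is an assumption for illustration and is not taken from the pixel model of this dissertation: the display is given a vertical RGB stripe layout in which each channel occupies one third of the pixel width, each camera pixel is modeled as a box that integrates all three channels over its footprint, the geometric mapping is taken as already applied (camera pixel boundaries are given in display-pixel units), and photometric conversion is omitted.

```python
import math

def overlap(a0, a1, b0, b1):
    """Length of the intersection of intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def observe_row(display_row, cam_edges):
    """Simulate one camera row observing one display row (hypothetical model).

    display_row: list of (r, g, b) values, one per display pixel.
    cam_edges:   increasing boundaries of camera pixels, in display-pixel units.
    """
    observed = []
    for lo, hi in zip(cam_edges[:-1], cam_edges[1:]):
        acc = [0.0, 0.0, 0.0]
        first = max(0, math.floor(lo))
        last = min(len(display_row) - 1, math.floor(hi))
        for i in range(first, last + 1):
            for c in range(3):
                # Assumed layout: channel c occupies the c-th third of pixel i.
                sub_lo = i + c / 3.0
                sub_hi = i + (c + 1) / 3.0
                acc[c] += display_row[i][c] * overlap(lo, hi, sub_lo, sub_hi)
        # The factor 3 makes a uniform display map to the same camera value.
        observed.append(tuple(3.0 * v / (hi - lo) for v in acc))
    return observed
```

Under these assumptions, a camera pixel that covers only the first third of a white display pixel observes pure red, an effect that interpolation between point-like RGB values cannot reproduce.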
From the results of the third background estimation method, we also conclude that the background can be estimated well in that trial environment.
There are many possibilities for extending our research. One direction for future work is to determine the minimum number of training images needed to model the background; we want to give a theoretical proof of how many images are sufficient for background estimation. Another is to use the background and foreground models together, which is expected to improve the precision of the estimation. Finally, because the third background estimation method has only been considered in the trial environment, a further direction is to extend our proposed method to handle more practical environments, especially mixtures of the background variation patterns, by making the image model more flexible.