Robust Clustering Algorithm by Inner Product Scaling

Thesis, Researches in Minoh Lab, Minoh Lab
The classification methods can be divided into the two types: supervised and unsupervised. In this thesis, we focus on the unsupervised classification (i.e. clustering). We will point out the properties that the clustering method should have, and propose a new clustering method with the properties.

The first property that the clustering method should have is the robustness against noise. The noises are defined as the isolated samples whose distances to the other samples are large. The robustness against noise is the property that the clustering result does not change when the noises are added to the sample set. The robustness against noise is important for the clustering method to be reliable even when the unexpected samples are included in the sample set.

The second property is the analytical computability. Usually, the clustering method is formulated as a nonlinear optimization problem and the clusters are obtained as the optimal solution. Since the optimization problem usually has many local minima, many initial values have to be tried, which takes a large computational time. When the clusters can be obtained by analytical computations without using the nonlinear optimization algorithm, the clustering method is said to be analytically computable. The analytical computability is important for keeping the computational time small.

In this thesis, we propose a clustering method which have both of the properties. We call this method ``analytically-computable robust clustering method (ARC method)''. The ARC method is derived as the extension of the extractive method, where the extension is performed by the mapping method called ``inner product scaling''.

In the extractive method, the cluster center is determined so that the number of the samples in the neighborhood is maximized. When the cluster center is determined, the samples around the center are extracted as a cluster. Then, the extracted samples are removed from the sample set, and the next cluster is extracted in the same way. The cluster extraction stops when the predetermined number of clusters are extracted or the sample set becomes empty. One feature of the extractive method is the shape of the clusters can be changed by using various distance measures. For example, when the Euclidean distance is used, the shape of the cluster is spherical. When the cone distance is used, the clusters are cone-shaped.

The extractive method has the robustness against noise. On the other hand, in most choices of the distance measure, it does not have the analytical computability. However, it can be shown that, when the cone distance is used, the optimization problem of the extractive method can be solved analytically. But, the method has the drawback that it can only extract cone-shaped clusters, although the actual clusters are considered to be spherical in many cases. So, the mapping method that converts the spherical clusters into the cone-shaped clusters is needed for the preprocessing.

For this purpose, the mapping method called ``inner product scaling'' is used. By this method, the samples in the feature space are mapped into another space so that the Gaussian similarity between the samples is reflected as the inner product. We will show that the inner product scaling can convert the spherical clusters into the cone-shaped clusters. The ARC method is the combination of the inner product scaling and the cone cluster extraction. Since the two elements are analytically computable, the ARC method can perform the spherical cluster extraction by analytical computations.

In regard to the robustness against noise, we compared the ARC method with the Noise Resistent C-Means method, which is a partitional clustering method specially modified to improve the robustness against noise. As a result, the ARC method achieved higher robustness.

We will present three applications of the ARC method. First, the ARC method is applied to image processing: extraction of lines from an image. Clustering line segments is an effective approach for line extraction. Line segments are obtained by applying a line fitting process to the output of edge detection process. The similarity between the line segments is defined so that two segments aligned in line have a large similarity. Then, lines are extracted as clusters of line segments. The line segments which are not aligned in line are considered as noises. The noise robustness of the ARC method works well in the line extraction task.

Second, the ARC method is applied to document clustering. We used document clustering for browsing a document database. The outline of the clustering-based browsing system is as follows: The documents are indexed by the term occurrence frequency and the similarity between documents is defined based on the number of common terms. Similar documents are clustered and a representative document of each cluster is shown to the user. The user can get the whole view of the database without examining documents one by one. In the document database, there are many documents whose contents are not similar to any document. Such documents are considered as noise documents in clustering, so the ARC method is also useful in document clustering.

Third, the ARC method is applied to the prototype generation for the nearest neighbor method. To reduce the computational cost of the nearest neighbor method, the reduction of the training samples is required. The training samples are partitioned into several clusters and the reduced training set (i.e. the prototype set) is obtained as the cluster centers. It is known that a ``noise'' training sample which is distant from the other training samples is harmful for classifiers. So, the robust clustering is needed to generate the prototypes that achieve high classification accuracy. We will show that the ARC method can generate better prototypes than the C-Means with respect to the classification accuracy of the nearest neighbor method.

Go back to Thesis Page