|
|
|
Spatial Clustering Methods in Data Mining
Plamen Tchimev, Joseph Rossi, Iren Valova Department of Computer and Information Sciences University of Massachusetts Dartmouth
1. Introduction Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, marketing, etc. Owing to the huge amounts of data collected in databases, cluster analysis has become an active topic in data mining research. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated as one group in many applications. Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to one another than they do to the remaining instances. Clustering naturally requires different techniques to the classification and association learning methods. Due to its immense applications in various areas, data clustering has been a highly active topic in data mining research, with fruitful, scalable clustering methods developed recently. These spatial clustering methods can be classified into five categories: partitioning method, hierarchical method, density-based method, grid-based method and model based methods, as presented on Fig 1. The characteristics of the data being clustered serve as important factor in determining the clustering algorithm being applied. These characteristics include:
Figure1 Data clustering methods classification. There are different ways in which the result of clustering can be expressed. The groups that are identified may be exclusive, so that any instance belongs in only one group. Or they may be overlapping, so that an instance may fall into several groups. Or they may be probabilistic, whereby an instance belongs to each group with a certain probability. Or they may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further-perhaps all the way down to individual instances. Really, the choice between these possibilities should be dictated by the nature of the mechanisms that are thought to underlie the particular clustering phenomenon. However, because these mechanisms are rarely known - the very existence of clusters is, after all, something that we're trying to discover - and for pragmatic reasons too, the choice is usually dictated by the clustering tools that are available (Fig. 2).
Figure 2 Different ways of cluster representation
In this paper we will look at the basics of the five clustering methods: partitioning method, hierarchical method, density-based method, grid-based method and model-based method.
2. Partitioning method The three major partitioning algorithms are k-means algorithm, EM(Expectation Maximization) algorithm, and k-medoid algorithm.
2.1 The k-means method The k-means method is simple and reasonably effective. In advance is specified how many clusters have been sought, and this will determine the parameter k. Then k points are chosen at random as cluster centers. Instances are assigned to their closets cluster center according to the ordinary Euclidian distance function. Next the centroid, or mean, of all instances in each cluster is calculated, this is the "means" part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which point the cluster centers have stabilized and will remain the same thereafter. The computational complexity of the algorithm is a function of: (number of objects, number of clusters, number of iterations). As with all practical clustering techniques, the final cluster centers do not represent a global minimum but only a local one, and completely different final clusters can arise from differences in the initial randomly chosen cluster centers. It is easy to imagine situations in which the algorithm will fail to find a reasonable clustering. Disadvantages: the k-means algorithm is very sensitive to noise, and time consuming because a substantial number of iterations may be necessary, each involving finding the distance of the k cluster centers from every instance to determine the closest.
2.2 The EM (Expectation Maximization) The EM (Expectation Maximization) algorithm represents each cluster using a probability distribution (For example: Gausian probability distribution). Similarly to k-means, first select the cluster parameters (or guess the classes of the instances), then iterate. Adjustment needed: we know cluster probabilities, not actual clusters for each instance. So, we use these probabilities as weights. For cluster A,
m_A = (w1*x1 + w2*x2 + .. + wn*xn)/(w1 + w2 + ... + wn), where wi is the probability that xi belongs to cluster A; standard deviation for A, A = [w1(x1-m)2 + w2(x2-m)2 + ... + wn(xn-m)2]/(w1 + w2 + ... + wn). When to stop iteration - maximizing overall likelihood that the data come form the dataset with the given parameters ("goodness" of clustering): product (or sum of logs) of P(A)*P(xi|A) over all instances xi and all classes A. Stop when the difference between two successive iteration becomes negligible.
2.3 The k-medoids method The objective of k-medoid clustering is to find a non-overlapping set of clusters such that each cluster has a most representative point, i.e., a point that is most centrally located with respect to some measure, e.g., distance. These representative points are called medoids. Once again, the algorithm is conceptually simple. Basic k-medoid Algorithm for finding K clusters. 1) Select K initial points. These points are the candidate medoids and are intended to be the most central points of their clusters. 2) Consider the effect of replacing one of the selected objects (medioids) with one of the non-selected objects. Conceptually, this is done in the following way. The distance of each non-selected point from the closest candidate medoid is calculated, and this distance is summed over all points. This distance represents the "cost" of the current configuration. All possible swaps of a non-selected point for a selected one are considered, and the cost of each configuration is calculated. 3) Select the configuration with the lowest cost. If this is a new configuration, then repeat step 2. 4) Otherwise, associate each non-selected point with its closest selected point (medoid) and stop. Step 3 requires more explanation. The ith medoid is evaluated by
using
3. Hierarchical Clustering In hierarchical clustering the goal is to produce a hierarchical series of nested clusters, ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top. A diagram called a dendogram graphically represents this hierarchy and is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view). One of the attractions of hierarchical techniques is that they correspond to taxonomies that are very common in the biological sciences, e.g., kingdom, phylum, genus, species, . (Some cluster analysis work occurs under the name of "mathematical taxonomy.") Another attractive feature is that hierarchical techniques do not assume any particular number of clusters. Instead any desired number of clusters can be obtained by "cutting" the dendogram at the proper level. Finally, hierarchical techniques are thought to produce better quality clusters [1]. There are two basic approaches to generating a hierarchical clustering:
a) Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining the notion of cluster proximity.
b) Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step. The hierarchical agglomerative clustering methods are most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm [3]. 1) Compute the proximity graph, if necessary. (Sometimes the proximity graph is all that is available.) 2) Merge the closest (most similar) two clusters. 3) Update the proximity matrix to reflect the proximity between the new cluster and the original clusters. 4) Repeat steps 3 and 4 until only a single cluster remains. The key step is the calculation of the proximity between two clusters, and this is where the various agglomerative hierarchical techniques differ. Any of the cluster proximities that we discuss in this section can be viewed as a choice of different parameters (in the Lance-Williams formula) for the proximity between clusters Q and R, where R is formed by merging clusters A and B.
p(R, Q) = áA p(A, Q) + áB p(B, Q) + â p(A, Q) + ã | p(A, Q) p(B, Q) |
In words, this formula says that after you merge clusters A and B to form cluster R, then the distance of the new cluster, R, to an existing cluster, Q, is a linear function of the distances of Q from the original clusters A and B. Any hierarchical technique that can be phrased in this way does not need the original points, only the proximity matrix, which is updated as clustering occurs. However, while a general formula is nice, it is often easier to understand the different hierarchical methods by looking directly at the definition of cluster distance that each method uses, and that is the approach that we shall take here.
Time and Space Complexity Hierarchical clustering techniques typically use a proximity matrix. This requires the computation and storage of n2 proximities, a factor that limits the size of data sets that can be processed. (It is possible to compute the proximities on the fly and save space, but this increases computation time.) Once the proximity matrix is available, the time required for hierarchical clustering is O(n2).
4. Density-based method Density-based algorithms search for regions of high density. Clusters are regions of a given radius (or shape) that contain at least a minimum number of points. This again demands tuning parameters for the shape and/or area of the region in advance. Alternatively, the algorithm explores a large prescribed list of choices of radius (for example, GAM). This alternative, however, suffers from the explosive computational cost. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise.
4.1 DBSCAN: A density-based clustering method based on connected regions with sufficiently high density DBSCAN is a density based clustering algorithm that works with a number of different distance metrics. When DBSCAN has processed a set of data points, a point will either be in a cluster or will be classified as noise. DBSCAN is based on the concepts of a point being "density reachable" and "density connected" [2]. What follows is a more basic description. Conceptually, data points fall into three classes:
a) Core points. These are points that are at the interior of a cluster. A point is an interior point if there are enough points in its neighborhood, i.e., if the number of points within a given neighborhood around the point, as determined by the distance function and a supplied distance parameter, exceeds a certain threshold - also a supplied parameter. If two core points belong to each others neighborhoods, then the core points belong to the same cluster.
b) Border points. A border point is a point that is not a core point, i.e., there are not enough points in its neighborhood, but it falls within the neighborhood of a core point. Note that a border point may fall within the neighborhoods of core points from several different clusters. If so, the cluster that the border point is grouped with will depend on the details of the DBSCAN algorithm, and is order dependent.
c) Noise points. A noise point is any point that is not a core point or a border point. Thus, for DBSCAN, a cluster is the set of all core points whose neighborhoods transitively connect them together, along with some border points.
4.2 DENCLUE (DENsity CLUstEring) DENCLUE is a density clustering approach that takes a more formal approach to density based clustering by modeling the overall density of a set of points as the sum of "influence" functions associated with each point. The resulting overall density function will have local peaks, i.e., local density maxima, and these local peaks can be used to define clusters in a straightforward way. Specifically, for each data point, a hill climbing procedure finds the nearest peak associated with that point, and the set of all data points associated with a particular peak (called a local density attractor) becomes a (center defined) cluster. However, if the density at a local peak is too low, then the points in the associated cluster are classified as noise and discarded. Also, if a local peak can be connected to a second local peak by a path of data points, and the density at each point on the path is above a minimum density threshold, t, then the clusters associated with these local peaks are merged. Thus, clusters of any shape can be discovered [5].
5. Grid-base method To enhance the efficiency of clustering, a grid-based clustering approach uses a grid data structure. It quantizes the space into a finite number of cells which form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space. Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density based approach for clustering in high-dimensional data space.
5.1 STING STING: A Statistical Information Grid Approach Each cell at a high level is partitioned into a number of smaller cells in the next lower level. Statistical info of each cell is calculated and stored beforehand and is used to answer queries. Parameters of higher level cells can be easily calculated from parameters of lower level cell. Use a top-down approach to answer spatial data queries. Start from a pre-selected layer typically with a small number of cells. For each cell in the current level compute the confidence Interval. Remove the irrelevant cells from further consideration. When finish examining the current layer, proceed to the next lower level. Repeat this process until the bottom layer is reached Advantages: Query -independent, easy to parallelize, incremental update O(K), where K is the number of grid cells at the lowest levelDisadvantages: All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
5.2 WaveCluster WaveCluster is a clustering technique that interprets the original data as a two dimensional signal and then applies signal processing techniques ( the wavelet transform) to map the original data to a new space where cluster identification is more straightforward. More specifically, WaveCluster defines a uniform two-dimensional grid on the data and represents the points in each grid cell by the number of points. Thus, a collection of two-dimensional data points becomes an image, i.e., a set of "gray-scale" pixels, and the problem of finding clusters becomes one of image segmentation. The basic algorithm of WaveCluster is as follows: 1) Create a grid and assign each data object to a cell in the grid. The grid is uniform, but the grid size will vary for different scales of analysis. Each grid cell keeps track of the statistical properties of the points in that cell, but for wave clustering only the number of points in the cell is used. 2) Transform the data to a new space by applying the wavelet transform. This results in 4 "subimages" at several different levels of resolution an "average" image, an image that emphasizes the horizontal features, an image that emphasizes vertical features and an image that emphasizes corners. 3) Find the connected components in the transformed space. The average subimage is used to find connected clusters, which are just groups of connected "pixels," i.e., pixels which are connected to one another horizontally, vertically, or diagonally. 4) Map the clusters labels of points in the transformed space back to points in the original space. WaveCluster creates a lookup table that associates each point in the original with a point in the transformed space. Assignment of cluster labels to the original points is then straightforward. Complexity O(N). Detect arbitrary shaped clusters at different scales. Not sensitive to noise, not sensitive to input order. Only applicable to low dimensional data overlapping rectangular units.
5.3 CLIQUE CLIQUE is a clustering algorithm that finds high-density regions by partitioning the data space into cells (hyper-rectangles) and finding the dense cells. Clusters are found by taking the union of all adjacent, high-density cells. For simplicity and ease of use, clusters are described by expressing the cluster as a DNF (disjunctive normal form) expression and then simplifying the expression [4]. The Major Steps 1. Partition the data space and find the number of points that lie inside each cell of the partition. 2. Identify the subspaces that contain clusters using the apriori principle 3. Identify clusters :Determine dense units in all subspaces of interests Determine connected dense units in all subspaces of interests. 4. Generate minimal description for the clusters Determine maximal regions that cover a cluster of connected dense units for each cluster Determination of minimal cover for each cluster.
6. Model-based method 6.1 Statistics-Based Data Mining COBWEB: A popular a simple method of incremental conceptual learning. Creates a hierarchical clustering in the form of a classification tree. Each node refers to a concept and contains a probabilistic description of that concept.Limitations: The assumption that the attributes are independent of each other is often too strong because correlation may exist. Not suitable for clustering large database data skewed tree and expensive probability distributions.
6.2 Neural Network approach The Kohonen Self-Organizing Feature Map (SOM) is a clustering and data visualization technique based on neural networks. The "self-organizing" part of the name derives from the ability of neural nets to learn their configuration from training data. The "map" part of the name comes from the way that SOM neural nets work, i.e., when an input is fed into a SOM neural network, the output is the label of a particular neuron in the network, and thus, a SOM neural network can be viewed as mapping a point to neuron. Finally, the "feature" portion of the name comes from observation that, in a trained SOM network, a small set of neurons are often associated with a particular data feature. However, while SOM might seem very similar to K-means or other vector quantization approaches, there is a fundamental conceptual difference. During the training process, SOM uses each data point to update the closest reference vector and the reference vectors of nearby neurons. In this way, SOM produces an "ordered" set of reference vectors.
7. Conclusions We have presented an overview of clustering algorithms and categorized them into five categories: partitioning-based, hierarchical-based, density-based, grid-based and model-based, those which we believe are useful for the geographical community.. Clustering has also been investigated by statisticans, machine learning experts and people in other fields. Due to the limited scope of this paper, algorithms from these fields have not been covered in the discussion. We hope that readers will look into them if such situation arises.
8. References [1] L. Kaufman and P. J. Rousseeuw, (1990), Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley and Sons. [2] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231. [3] Richard C. Dubes and Anil K. Jain, (1988), Algorithms for Clustering Data, Prentice Hall. [4] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopoulos, and Prabhakar Raghavan, (1998), Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Technical Report, IBM Almaden Research Center. [5] Alexander Hinneburg Daniel. A. Keim and: An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press. |
|