Logo.jpg (12030 bytes)ASEE  CSI_SEAL.GIF (5382 bytes)

Mid Atlantic Section

FALL 2001 REGIONAL CONFERENCE

PROGRAM AND PROCEEDINGS

Images from the 2001 Regional Conference

AMERICAN SOCIETY FOR ENGINEERING EDUCATION

MID-ATLANTIC SECTION

FALL 2001 REGIONAL CONFERENCE

NOVEMBER 2-3, 2001

COLLEGE OF STATEN ISLAND, CUNY

STATEN ISLAND, NEW YORK 10314

CONFERENCE THEME: "The 21st Century Engineer"

SPONSORSHIP

TELCORDIA TECHNOLOGIES, INC.

DISCOVERY CENTER OF THE COLLEGE OF STATEN ISLAND


Back Up Next

 

 

 

Spatial Clustering Methods in Data Mining

 

Plamen Tchimev, Joseph Rossi, Iren Valova

Department of Computer and Information Sciences

University of Massachusetts Dartmouth

 

Clustering algorithms have been employed in many fields of human knowledge that require finding a "natural association" among some specific data. The way the term "natural association" is defined depends on the field and the particular application that uses it, and can vary quite a lot. For example, in the life sciences, the data to be clustered consists of life forms and the clusters themselves are species or subspecies of a given species. The need for using clustering algorithms is determined by the scientists’ wish to study various characteristics of the life forms. For example, mammals are clustered by similarities in the proportion of three chemicals in their milk. The "natural association" of the animals depends on the percentages of these chemicals in their milk. A number of other domains use clustering algorithms to extract information by grouping a large collection of objects, such as the medical sciences (clustering of symptoms yields information about the diseases), the earth sciences (to determine land use patterns, for example), marketing research (to target advertisements and promotions towards specific consumer groups that have a high probability of responding to them). The scope of this paper is to provide an introduction to cluster analysis in the field of data mining, where we define data mining to be the discovery of useful information or patterns in large collections of data. We discuss a number of clustering representative algorithms that are developed specifically for data mining.

 

 

 

1. Introduction

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, marketing, etc. Owing to the huge amounts of data collected in databases, cluster analysis has become an active topic in data mining research.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated as one group in many applications. Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to one another than they do to the remaining instances. Clustering naturally requires different techniques to the classification and association learning methods.

Due to its immense applications in various areas, data clustering has been a highly active topic in data mining research, with fruitful, scalable clustering methods developed recently. These spatial clustering methods can be classified into five categories: partitioning method, hierarchical method, density-based method, grid-based method and model based methods, as presented on Fig 1.

The characteristics of the data being clustered serve as important factor in determining the clustering algorithm being applied. These characteristics include:

The types of data attributes: The similarity between two data objects is judged by the difference in their data attributes. When all these attributes are numeric, distance measures like Euclidean distance and Manhattan distance can be easily computed. However, when binary, categorical and ordinal attributes are involved, the computation of distance measure is much more complicated. Presently, most clustering algorithms are based on numeric attributes.
Dimensionality: The dimensionality of the data refers to the number of attributes in a data object. Many clustering algorithms which perform well on low-dimensional data degenerate when the number of dimensions increases. The degeneration can come in two forms: increase in running time or decrease in cluster quality. In order to choose a clustering algorithm for high-dimensional data clustering, the running time and cluster quality produced by the algorithm at high dimension must first be assessed to meet the requirement of the application.
Amount of noise in data: As some of the clustering algorithms are very sensitive to noise and outliers, careful choice must be made if the data in the application contains a large amount of noise.

 

 

 

 

 

 

 

Partitioning method

k-means algorithm

EM (Expectation Maximization)

algorithm

k-medoid algorithm

Hierarchical method

AGNES

BIRCH

CURE

CHAMELEON

Density-based method

DBSCAN

OPTICS

DENCLUE

Grid-based method

STING

WaveCluster

CLIQUE

Model-based method

Statistical approach

Neural network approach

COBWEB

CLASSIT

Competitive learning

Self-Organizing Mas(SOM’s)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure1 Data clustering methods classification.

There are different ways in which the result of clustering can be expressed. The groups that are identified may be exclusive, so that any instance belongs in only one group. Or they may be overlapping, so that an instance may fall into several groups. Or they may be probabilistic, whereby an instance belongs to each group with a certain probability. Or they may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further-perhaps all the way down to individual instances. Really, the choice between these possibilities should be dictated by the nature of the mechanisms that are thought to underlie the particular clustering phenomenon. However, because these mechanisms are rarely known - the very existence of clusters is, after all, something that we're trying to discover - and for pragmatic reasons too, the choice is usually dictated by the clustering tools that are available (Fig. 2).

 

 

 

 

 

 

 

 

 

 

 

 

 

 

wpe1.jpg (10137 bytes)

 

Figure 2 Different ways of cluster representation

 

In this paper we will look at the basics of the five clustering methods: partitioning method, hierarchical method, density-based method, grid-based method and model-based method.

 

 

2. Partitioning method

The three major partitioning algorithms are k-means algorithm, EM(Expectation Maximization) algorithm, and k-medoid algorithm.

 

2.1 The k-means method

The k-means method is simple and reasonably effective. In advance is specified how many clusters have been sought, and this will determine the parameter k. Then k points are chosen at random as cluster centers. Instances are assigned to their closets cluster center according to the ordinary Euclidian distance function. Next the centroid, or mean, of all instances in each cluster is calculated, this is the "means" part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which point the cluster centers have stabilized and will remain the same thereafter. The computational complexity of the algorithm is a function of: (number of objects, number of clusters, number of iterations). As with all practical clustering techniques, the final cluster centers do not represent a global minimum but only a local one, and completely different final clusters can arise from differences in the initial randomly chosen cluster centers. It is easy to imagine situations in which the algorithm will fail to find a reasonable clustering.

Disadvantages: the k-means algorithm is very sensitive to noise, and time consuming because a substantial number of iterations may be necessary, each involving finding the distance of the k cluster centers from every instance to determine the closest.

 

2.2 The EM (Expectation Maximization)

The EM (Expectation Maximization) algorithm represents each cluster using a probability distribution (For example: Gausian probability distribution).

Similarly to k-means, first select the cluster parameters (or guess the classes of the instances), then iterate. Adjustment needed: we know cluster probabilities, not actual clusters for each instance. So, we use these probabilities as weights. For cluster A,

 

m_A = (w1*x1 + w2*x2 + .. + wn*xn)/(w1 + w2 + ... + wn),

where wi is the probability that xi belongs to cluster A; standard deviation for A,

A = [w1(x1-m)2 + w2(x2-m)2 + ... + wn(xn-m)2]/(w1 + w2 + ... + wn).

When to stop iteration - maximizing overall likelihood that the data come form the dataset with the given parameters ("goodness" of clustering): product (or sum of logs) of P(A)*P(xi|A) over all instances xi and all classes A. Stop when the difference between two successive iteration becomes negligible.

 

2.3 The k-medoids method

The objective of k-medoid clustering is to find a non-overlapping set of clusters such that each cluster has a most representative point, i.e., a point that is most centrally located with respect to some measure, e.g., distance. These representative points are called medoids. Once again, the algorithm is conceptually simple.

Basic k-medoid Algorithm for finding K clusters.

1) Select K initial points. These points are the candidate medoids and are intended to be the most central points of their clusters.

2) Consider the effect of replacing one of the selected objects (medioids) with one of the non-selected objects. Conceptually, this is done in the following way. The distance of each non-selected point from the closest candidate medoid is calculated, and this distance is summed over all points. This distance represents the "cost" of the current configuration. All possible swaps of a non-selected point for a selected one are considered, and the cost of each configuration is calculated.

3) Select the configuration with the lowest cost. If this is a new configuration, then repeat step 2.

4) Otherwise, associate each non-selected point with its closest selected point (medoid) and stop.

Step 3 requires more explanation. The ith medoid is evaluated by using where pij, is the proximity between the ith medoid and the jth point in the cluster. For similarities (dissimilarities) we want this sum to be as large (small) as possible. It can be seen that such an approach is not restricted to Euclidean spaces and is likely to be more tolerant of outliers. However, finding a better medoid requires trying all points that are currently not medoids and is computationally expensive.

 

 

3. Hierarchical Clustering

In hierarchical clustering the goal is to produce a hierarchical series of nested clusters, ranging from clusters of individual points at the bottom to an all-inclusive cluster at the top. A diagram called a dendogram graphically represents this hierarchy and is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view). One of the attractions of hierarchical techniques is that they correspond to taxonomies that are very common in the biological sciences, e.g., kingdom, phylum, genus, species, … . (Some cluster analysis work occurs under the name of "mathematical taxonomy.") Another attractive feature is that hierarchical techniques do not assume any particular number of clusters. Instead any desired number of clusters can be obtained by "cutting" the dendogram at the proper level. Finally, hierarchical techniques are thought to produce better quality clusters [1]. There are two basic approaches to generating a hierarchical clustering:

 

a) Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining the notion of cluster proximity.

 

b) Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step.

The hierarchical agglomerative clustering methods are most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm [3].

1) Compute the proximity graph, if necessary. (Sometimes the proximity graph is all that is available.)

2) Merge the closest (most similar) two clusters.

3) Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.

4) Repeat steps 3 and 4 until only a single cluster remains.

The key step is the calculation of the proximity between two clusters, and this is where the various agglomerative hierarchical techniques differ. Any of the cluster proximities that we discuss in this section can be viewed as a choice of different parameters (in the Lance-Williams formula) for the proximity between clusters Q and R, where R is formed by merging clusters A and B.

 

 

p(R, Q) = áA p(A, Q) + áB p(B, Q) + â p(A, Q) + ã | p(A, Q) – p(B, Q) |

 

In words, this formula says that after you merge clusters A and B to form cluster R, then the distance of the new cluster, R, to an existing cluster, Q, is a linear function of the distances of Q from the original clusters A and B. Any hierarchical technique that can be phrased in this way does not need the original points, only the proximity matrix, which is updated as clustering occurs. However, while a general formula is nice, it is often easier to understand the different hierarchical methods by looking directly at the definition of cluster distance that each method uses, and that is the approach that we shall take here.

 

Time and Space Complexity

Hierarchical clustering techniques typically use a proximity matrix. This requires the computation and storage of n2 proximities, a factor that limits the size of data sets that can be processed. (It is possible to compute the proximities on the fly and save space, but this increases computation time.) Once the proximity matrix is available, the time required for hierarchical clustering is O(n2).

 

 

4. Density-based method

Density-based algorithms search for regions of high density. Clusters are  regions of a given radius (or shape) that contain at least a minimum number of points. This again demands tuning parameters for the shape and/or area of the region in advance. Alternatively, the algorithm explores a large prescribed list of choices of radius (for example, GAM). This alternative, however, suffers from the explosive computational cost. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise.

 

4.1 DBSCAN: A density-based clustering method based on connected regions with sufficiently high density

DBSCAN is a density based clustering algorithm that works with a number of different distance metrics. When DBSCAN has processed a set of data points, a point will either be in a cluster or will be classified as noise. DBSCAN is based on the concepts of a point being "density reachable" and "density connected" [2]. What follows is a more basic description. Conceptually, data points fall into three classes:

 

a) Core points. These are points that are at the interior of a cluster. A point is an interior point if there are enough points in its neighborhood, i.e., if the number of points within a given neighborhood around the point, as determined by the distance function and a supplied distance parameter, exceeds a certain threshold - also a supplied parameter. If two core points belong to each other’s neighborhoods, then the core points belong to the same cluster.

 

b) Border points. A border point is a point that is not a core point, i.e., there are not enough points in its neighborhood, but it falls within the neighborhood of a core point. Note that a border point may fall within the neighborhoods of core points from several different clusters. If so, the cluster that the border point is grouped with will depend on the details of the DBSCAN algorithm, and is order dependent.

 

c) Noise points. A noise point is any point that is not a core point or a border point. Thus, for DBSCAN, a cluster is the set of all core points whose neighborhoods transitively connect them together, along with some border points.

 

4.2 DENCLUE (DENsity CLUstEring)

DENCLUE is a density clustering approach that takes a more formal approach to density based clustering by modeling the overall density of a set of points as the sum of "influence" functions associated with each point. The resulting overall density function will have local peaks, i.e., local density maxima, and these local peaks can be used to define clusters in a straightforward way. Specifically, for each data point, a hill climbing procedure finds the nearest peak associated with that point, and the set of all data points associated with a particular peak (called a local density attractor) becomes a (center defined) cluster. However, if the density at a local peak is too low, then the points in the associated cluster are classified as noise and discarded. Also, if a local peak can be connected to a second local peak by a path of data points, and the density at each point on the path is above a minimum density threshold, t, then the clusters associated with these local peaks are merged. Thus, clusters of any shape can be discovered [5].

 

 

5. Grid-base method

To enhance the efficiency of clustering, a grid-based clustering approach uses a grid data structure. It quantizes the space into a finite number of cells which form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space. Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density based approach for clustering in high-dimensional data space.

 

5.1 STING

STING: A Statistical Information Grid Approach Each cell at a high level is partitioned into a number of smaller cells in the next lower level. Statistical info of each cell is calculated and stored beforehand and is used to answer queries. Parameters of higher level cells can be easily calculated from parameters of lower level cell. Use a top-down approach to answer spatial data queries. Start from a pre-selected layer typically with a small number of cells. For each cell in the current level compute the confidence Interval. Remove the irrelevant cells from further consideration. When finish examining the current layer, proceed to

the next lower level. Repeat this process until the bottom layer is reached

Advantages: Query -independent, easy to parallelize, incremental update O(K), where K is the number of grid cells at the lowest level

Disadvantages: All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.

 

5.2 WaveCluster

WaveCluster is a clustering technique that interprets the original data as a two dimensional signal and then applies signal processing techniques ( the wavelet transform) to map the original data to a new space where cluster identification is more straightforward. More specifically, WaveCluster defines a uniform two-dimensional grid on the data and represents the points in each grid cell by the number of points. Thus, a collection of two-dimensional data points becomes an image, i.e., a set of "gray-scale" pixels, and the problem of finding clusters becomes one of image segmentation.

The basic algorithm of WaveCluster is as follows:

1) Create a grid and assign each data object to a cell in the grid. The grid is uniform, but the grid size will vary for different scales of analysis. Each grid cell keeps track of the statistical properties of the points in that cell, but for wave clustering only the number of points in the cell is used.

2) Transform the data to a new space by applying the wavelet transform. This results in 4 "subimages" at several different levels of resolution an "average" image, an image that emphasizes the horizontal features, an image that emphasizes vertical features and an image that emphasizes corners.

3) Find the connected components in the transformed space. The average subimage is used to find connected clusters, which are just groups of connected "pixels," i.e., pixels which are connected to one another horizontally, vertically, or diagonally.

4) Map the clusters labels of points in the transformed space back to points in the original space. WaveCluster creates a lookup table that associates each point in the original with a point in the transformed space. Assignment of cluster labels to the original points is then straightforward.

Complexity O(N). Detect arbitrary shaped clusters at different scales. Not sensitive to noise, not sensitive to input order. Only applicable to low dimensional data overlapping rectangular units.

 

5.3 CLIQUE

CLIQUE is a clustering algorithm that finds high-density regions by partitioning the data space into cells (hyper-rectangles) and finding the dense cells. Clusters are found by taking the union of all adjacent, high-density cells. For simplicity and ease of use, clusters are described by expressing the cluster as a DNF (disjunctive normal form) expression and then simplifying the expression [4].

The Major Steps

1. Partition the data space and find the number of points that lie inside each cell of the partition.

2. Identify the subspaces that contain clusters using the apriori principle

3. Identify clusters:

Determine dense units in all subspaces of interests

Determine connected dense units in all subspaces of interests.

4. Generate minimal description for the clusters

Determine maximal regions that cover a cluster of connected dense units for each cluster

Determination of minimal cover for each cluster.

 

 

6. Model-based method

6.1 Statistics-Based Data Mining

COBWEB: A popular a simple method of incremental conceptual learning. Creates a hierarchical clustering in the form of a classification tree. Each node refers to a concept and contains a probabilistic description of that concept.

Limitations: The assumption that the attributes are independent of each other is often too strong because correlation may exist. Not suitable for clustering large database data – skewed tree and expensive probability distributions.

 

6.2 Neural Network approach

The Kohonen Self-Organizing Feature Map (SOM) is a clustering and data visualization technique based on neural networks. The "self-organizing" part of the name derives from the ability of neural nets to learn their configuration from training data. The "map" part of the name comes from the way that SOM neural nets work, i.e., when an input is fed into a SOM neural network, the output is the label of a particular neuron in the network, and thus, a SOM neural network can be viewed as mapping a point to neuron. Finally, the "feature" portion of the name comes from observation that, in a trained SOM network, a small set of neurons are often associated with a particular data feature. However, while SOM might seem very similar to K-means or other vector quantization approaches, there is a fundamental conceptual difference. During the training process, SOM uses each data point to update the closest reference vector and the reference vectors of nearby neurons. In this way, SOM produces an "ordered" set of reference vectors.

 

 

7. Conclusions

We have presented an overview of clustering algorithms and categorized them into five categories: partitioning-based, hierarchical-based, density-based, grid-based and model-based, those which we believe are useful for the geographical community.. Clustering has also been investigated by statisticans, machine learning experts and people in other fields. Due to the limited scope of this paper, algorithms from these fields have not been covered in the discussion. We hope that readers will look into them if such situation arises.

 

 

8. References

[1] L. Kaufman and P. J. Rousseeuw, (1990), Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley and Sons.

[2] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.

[3] Richard C. Dubes and Anil K. Jain, (1988), Algorithms for Clustering Data, Prentice Hall.

[4] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopoulos, and Prabhakar Raghavan, (1998), Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Technical Report, IBM Almaden Research Center.

[5] Alexander Hinneburg Daniel. A. Keim and: An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press.


For problems or questions regarding this web contact the web site designer.
Web site under construction and last updated: November 09, 2001.