Lesson 14: Cluster Analysis

Cluster analysis is a data exploration (mining) tool for dividing a multivariate dataset into “natural” clusters (groups). We use the methods to explore whether previously undefined clusters (groups) exist in the dataset. For instance, a marketing department may wish to use survey results to sort its customers into categories (perhaps those likely to be most receptive to buying a product, those most likely to be against buying a product, and so forth).

Cluster analysis is used when we believe that the sample units come from an unknown number of distinct populations or sub-populations, but there is no a priori definition of those populations. Our objective is to describe those populations using the observed data.

Until relatively recently, cluster analysis attracted comparatively little interest. This has changed with the rise of bioinformatics and genome research. We will use an ecological example in our lesson.

Objectives

Upon completion of this lesson, you should be able to:

14.1 - Example: Woodyard Hammock Data


Example 14-1: Woodyard Hammock Data

We illustrate the various methods of cluster analysis using ecological data from Woodyard Hammock, a beech-magnolia forest in northern Florida. The data involve counts of the number of trees of each species in n = 72 sites. A total of 31 species were identified and counted; however, only p = 13 of the most common species were retained. They are listed below:

Variable Scientific Name Common Name
carcar Carpinus caroliniana Ironwood
corflo Cornus florida Dogwood
faggra Fagus grandifolia Beech
ileopa Ilex opaca Holly
liqsty Liquidambar styraciflua Sweetgum
maggra Magnolia grandiflora Magnolia
nyssyl Nyssa sylvatica Blackgum
ostvir Ostrya virginiana Blue Beech
oxyarb Oxydendrum arboreum Sourwood
pingla Pinus glabra Spruce Pine
quenig Quercus nigra Water Oak
quemic Quercus michauxii Swamp Chestnut Oak
symtin Symplocos tinctoria Horse Sugar

The first column gives the 6-letter code identifying the species, the second column gives its scientific name (Latin binomial), and the third column gives the common name for each species. The most commonly found of these species were the beech and magnolia.

What is our objective with this data?

We hope to group sample sites together into clusters that share similar species compositions as determined by some measure of association. There are several options to measure association. Two common measures are listed below:

  1. Measure of association between sample units: We need some way to measure how similar two subjects or objects are to one another. This could be almost any type of measure of association, and there is a lot of room for creativity here. However, SAS only allows Euclidean distance (defined later).
  2. Measure of association between clusters: How similar are two clusters? There are dozens of techniques that can be used here.

Many different approaches to the cluster analysis problem have been proposed. The approaches generally fall into three broad categories:

Hierarchical methods

Note 1: Agglomerative methods are used much more often than divisive methods.

Note 2: Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.

Non-hierarchical methods

Model-based methods

14.2 - Measures of Association for Continuous Variables


We use the standard notation that we have been using all along:

Johnson and Wichern list four different measures of association (similarity) that are frequently used with continuous variables in cluster analysis:

Some other distances also use a similar concept.

Euclidean Distance

Minkowski Distance

Here are two other methods for measuring association:

Canberra Metric

Czekanowski Coefficient
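For reference, these measures can be written explicitly. The following are the standard forms for measurements \(x_{ik}\) on variables \(k = 1, \dots, p\); the Canberra and Czekanowski measures assume nonnegative values, such as species counts:

Euclidean distance: \(d(i,j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}\)

Minkowski distance: \(d(i,j) = \left[\sum_{k=1}^{p} |x_{ik} - x_{jk}|^m\right]^{1/m}\)

Canberra metric: \(d(i,j) = \sum_{k=1}^{p} \dfrac{|x_{ik} - x_{jk}|}{x_{ik} + x_{jk}}\)

Czekanowski coefficient: \(d(i,j) = 1 - \dfrac{2\sum_{k=1}^{p} \min(x_{ik}, x_{jk})}{\sum_{k=1}^{p} (x_{ik} + x_{jk})}\)

Note that the Minkowski distance reduces to the Euclidean distance when m = 2.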

For each distance measure, similar subjects have smaller distances than dissimilar subjects. Similar subjects are more strongly associated.

Or, if you like, you can invent your own measure! However, whatever you invent, the measure of association must satisfy the following properties:

Symmetry

Positivity

Identity
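As a concrete illustration, here is a minimal Python sketch of the Euclidean, Canberra, and Czekanowski measures, together with checks of the symmetry, positivity, and identity properties. The site vectors are hypothetical, not taken from the Woodyard Hammock data:

```python
from math import sqrt

def euclidean(x, y):
    # Square root of the summed squared coordinate differences.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    # Assumes nonnegative values (e.g., species counts); skips 0-0 coordinates.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0)

def czekanowski(x, y):
    # Assumes nonnegative values; 0 for identical vectors, up to 1 for disjoint ones.
    return 1 - 2 * sum(min(a, b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

site1 = [10, 0, 4]  # hypothetical species counts at one site
site2 = [6, 2, 4]   # hypothetical species counts at another site

for d in (euclidean, canberra, czekanowski):
    assert d(site1, site2) == d(site2, site1)  # symmetry
    assert d(site1, site2) > 0                 # positivity (distinct sites)
    assert d(site1, site1) == 0                # identity
```

Each measure passes all three checks; the choice among them depends on how you want coordinate differences weighted.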

14.3 - Measures of Association for Binary Variables


In the Woodyard Hammock example, the observer recorded how many individuals belong to each species at each site. However, other research methods might only record whether or not the species was present at a site. In sociological studies, we might look at traits that some people have and others do not. Typically 1(0) signifies that the trait of interest is present (absent).

For sample units i and j, consider the following contingency table of frequencies of 1-1, 1-0, 0-1, and 0-0 matches across the variables:

                      Unit j
Unit i      1          0          Total
  1         a          b          a + b
  0         c          d          c + d
Total       a + c      b + d      p = a + b + c + d

If we are comparing two subjects, subject i and subject j, then a is the number of variables present for both subjects. In the Woodyard Hammock example, this is the number of species found at both sites. Similarly, b is the number (of species) found in subject i but not subject j, c is just the opposite, and d is the number that is not found in either subject.

From here we can calculate row totals, column totals, and a grand total.
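In code, these counts can be tallied directly from two presence/absence vectors. A minimal Python sketch (the site vectors are hypothetical):

```python
def binary_counts(u, v):
    # u, v: presence (1) / absence (0) vectors over the same p variables.
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # present in both
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # only in unit i
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # only in unit j
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # absent from both
    return a, b, c, d

site_i = [1, 1, 0, 1, 0]  # species present at site i
site_j = [1, 0, 0, 1, 1]  # species present at site j
a, b, c, d = binary_counts(site_i, site_j)
# The four counts always sum to p, the number of variables.
assert a + b + c + d == len(site_i)
```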

Johnson and Wichern list the following Similarity Coefficients used for binary data:

0-0 matches are irrelevant

Double weights for 1-1 matches

Double weights for unmatched pairs

Ratio of 1-1 matches to mismatches

The first coefficient looks at the number of matches (1-1 or 0-0) and divides it by the total number of variables. If two sites have identical species lists, then this coefficient is equal to one because c = b = 0. The more species that are found at one and only one of the two sites, the smaller the value for this coefficient. If every species is present at exactly one of the two sites, then a = d = 0 and this coefficient takes a value of zero.

The other coefficients give different weights to matched (1-1 or 0-0) or mismatched (1-0 or 0-1) pairs. For example, the second coefficient gives matched pairs double weight and thus emphasizes agreements between the species lists. In contrast, the third coefficient gives mismatched pairs double weight, more strongly penalizing disagreements between the species lists. The final coefficients ignore species found at neither site.
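A few of these coefficients can be sketched in Python. The forms below are the standard simple matching, double-weight, Jaccard, and Dice coefficients, stated in terms of the a, b, c, d counts from the contingency table above; the example counts are hypothetical:

```python
def simple_matching(a, b, c, d):
    # Matches (1-1 and 0-0) over all p variables.
    return (a + d) / (a + b + c + d)

def double_weight_matches(a, b, c, d):
    # Matched pairs receive double weight.
    return 2 * (a + d) / (2 * (a + d) + b + c)

def jaccard(a, b, c, d):
    # 0-0 matches are treated as irrelevant.
    return a / (a + b + c)

def dice(a, b, c, d):
    # 1-1 matches receive double weight; 0-0 matches ignored.
    return 2 * a / (2 * a + b + c)

# Hypothetical counts: 2 joint presences, 1 + 1 mismatches, 1 joint absence.
a, b, c, d = 2, 1, 1, 1
assert simple_matching(a, b, c, d) == 0.6
```

Comparing the values these return for the same counts shows how the weighting choice shifts the measured similarity.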

The choice of coefficient will have an impact on the results of the analysis. Coefficients may be selected based on theoretical considerations specific to the problem at hand, or so as to yield the most parsimonious description of the data. For the latter, the analysis may be repeated using several of these coefficients. The coefficient that yields the most easily interpreted results is selected.

The main thing is that you need some measure of association between your subjects before the analysis can proceed. We will look next at methods of measuring distances between clusters.

14.4 - Agglomerative Hierarchical Clustering


Combining Clusters in the Agglomerative Approach

In the agglomerative hierarchical approach, each data point begins as its own cluster, and existing clusters are combined at each step. Here are five different methods for this approach:

  1. Single Linkage: In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process, we combine the two clusters with the smallest single linkage distance.
  2. Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process, we combine the two clusters that have the smallest complete linkage distance.
  3. Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest average linkage distance.
  4. Centroid Method: In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process we combine the two clusters that have the smallest centroid distance.
  5. Ward’s Method: This method does not directly define a measure of distance between two points or clusters. It is an ANOVA-based approach. One-way univariate ANOVAs are done for each variable, with groups defined by the clusters at that stage of the process. At each stage, the two clusters whose merger yields the smallest increase in the combined error sum of squares are joined.
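The merging loop common to the linkage methods above can be sketched in a few lines of Python. This is a naive O(n³) illustration of single, complete, and average linkage, not the algorithm SAS or Minitab uses, and the points are hypothetical:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def agglomerate(points, linkage="complete", stop_at=1):
    """Merge clusters until `stop_at` remain, using the chosen linkage."""
    clusters = [[p] for p in points]  # each point starts as its own cluster

    def cluster_dist(c1, c2):
        pairwise = [dist(p, q) for p in c1 for q in c2]
        if linkage == "single":
            return min(pairwise)   # closest pair of points
        if linkage == "complete":
            return max(pairwise)   # farthest pair of points
        return sum(pairwise) / len(pairwise)  # average linkage

    while len(clusters) > stop_at:
        # Find and merge the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
groups = agglomerate(pts, linkage="single", stop_at=2)
```

On these four points, all three linkage choices recover the same two well-separated pairs; the methods differ when clusters are elongated or overlapping.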

In the following table, the mathematical form of the distances is provided. The graph gives a geometric interpretation.

Note! None of these five methods is uniformly the best. In practice, it’s advisable to try several methods and then compare the results to form an overall judgment about the final formation of clusters.

Linkage Methods for Measuring Association d12 Between Clusters 1 and 2

The following video gives a visual representation of how the distances are determined for each of the linkage methods listed above.

14.5 - Agglomerative Method Example


Example: Woodyard Hammock Data

SAS uses the Euclidean distance metric and agglomerative clustering, while Minitab can use Manhattan, Pearson, Squared Euclidean, and Squared Pearson distances as well. Both SAS and Minitab use only agglomerative clustering.

Use the datafile wood.csv.

Cluster analysis is carried out in SAS using the CLUSTER procedure (PROC CLUSTER). We will look at how this is carried out in the SAS program below.


Download the SAS Program here: wood1.sas



options ls=78;
title 'Cluster Analysis - Woodyard Hammock - Complete Linkage';

/* After reading in the data, an ident variable is created as the
 * row number for each observation. This is needed for the clustering
 * algorithm. The drop statement removes several variables not used
 * for this analysis. */
data wood;
  infile 'D:\Statistics\STAT 505\data\wood.csv' firstobs=2 delimiter=',';
  input x y acerub carcar carcor cargla cercan corflo faggra frapen
        ileopa liqsty lirtul maggra magvir morrub nyssyl osmame ostvir
        oxyarb pingla pintae pruser quealb quehem quenig quemic queshu
        quevir symtin ulmala araspi cyrrac;
  ident=_n_;
  drop acerub carcor cargla cercan frapen lirtul magvir morrub osmame
       pintae pruser quealb quehem queshu quevir ulmala araspi cyrrac;
run;

/* The observations are sorted by their ident value. */
proc sort data=wood;
  by ident;
run;

/* The cluster procedure is for hierarchical clustering.
 * The method option specifies the cluster distance formula to use.
 * The outtree option saves the results. */
proc cluster data=wood method=complete outtree=clust1;
  var carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb
      pingla quenig quemic symtin;
  id ident;
run;

/* The tree procedure generates a dendrogram of the hierarchical
 * clustering results and saves cluster label assignments if the
 * nclusters option is also specified. */
proc tree data=clust1 horizontal nclusters=6 out=clust2;
  id ident;
run;

/* The data are sorted by their ident value. */
proc sort data=clust2;
  by ident;
run;

/* The results from clust2 are printed. */
proc print data=clust2;
run;

/* This step combines the original wood data set with the results of
 * clust2, which allows the ANOVA statistics to be calculated in the
 * following glm procedure. */
data combine;
  merge wood clust2;
  by ident;
run;

/* The glm procedure views the cluster labels as ANOVA groups and
 * reports several statistics to assess variation between clusters
 * relative to variation within clusters.
 * The mean for each cluster is also reported. */
proc glm data=combine;
  class cluster;
  model carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb
        pingla quenig quemic symtin = cluster;
  means cluster;
run;

Performing a cluster analysis

To perform cluster analysis:

  1. Open the ‘wood’ data set in a new worksheet.
  2. Stat > Multivariate > Cluster Observations
    1. Highlight and select the variables to use in the clustering. For this example, the following variables are selected (13 total): carcar corflo faggra ileopa liqsty maggra nyssyl ostvir oxyarb pingla quenig quemic symtin
    2. Choose Complete as the Linkage and Euclidean as the Distance.
    3. Choose Show dendrogram, and under Customize, choose Distance for Y Axis.

Dendrograms (Tree Diagrams)

The results of cluster analysis are best summarized using a dendrogram. In a dendrogram, distance is plotted on one axis, while the sample units are given on the remaining axis. The tree shows how the sample units are combined into clusters, the height of each branching point corresponding to the distance at which two clusters are joined.

In looking at the cluster history section of the SAS (or Minitab) output, we see that the Euclidean distance between sites 33 and 51 was smaller than that between any other pair of sites (clusters). Therefore, this pair of sites was clustered first in the tree diagram. Following the clustering of these two sites, there are a total of n - 1 = 71 clusters, and so the cluster formed by sites 33 and 51 is designated "CL71". Note that the numerical values of the distances in SAS and in Minitab differ because SAS shows a 'normalized' distance. In any case, we are interested in the relative ranking for cluster formation rather than the absolute value of the distance.

The Euclidean distance between sites 15 and 23 was smaller than that between any other pair of the 70 unclustered sites, or between any of those sites and CL71. Therefore, this pair of sites was clustered second and designated "CL70".

In the seventh step of the algorithm, the distance between site 8 and cluster CL67 was smaller than the distance between any pair of unclustered sites and the distances between those sites and the existing clusters. Therefore, site 8 was joined to CL67 to form the cluster of 3 sites designated CL65.

The clustering algorithm is completed when clusters CL2 and CL5 are joined.

The plot below is generated by Minitab. In SAS, the diagram is horizontal. The color scheme depends on how many clusters are created (discussed later).

What do you do with the information in this tree diagram?

We decide the optimum number of clusters and which clustering technique to use. We adapted the wood1.sas program to specify the use of the other clustering techniques. Here are links to these program changes. In Minitab, you may likewise select other linkage options from the appropriate box.

File Name   Description
wood1.sas   specifies complete linkage
wood2.sas   is identical, except that it uses average linkage
wood3.sas   uses the centroid method
wood4.sas   uses single linkage

As we run each of these programs, we must keep in mind that our goal is a good description of the data.

Applying the Cluster Analysis Process

First, we compare the results of the different clustering algorithms. Note that clusters containing only one or a few members are undesirable, as they give rise to a large number of clusters and defeat the purpose of the whole analysis. That is not to say that we can never have a cluster with a single member! In fact, if that happens, we need to investigate the reason. It may indicate that the single-item cluster is completely different from the other members of the sample and is best left alone.

To arrive at the optimum number of clusters, we may follow this simple guideline: select the number of clusters identified consistently by the different methods. This is accomplished by finding a breakpoint (distance) below which further branching is ignored. In practice, this is not necessarily straightforward, and you will need to try a number of different cut points to see which is most decisive. Here are the results of this type of partitioning using the different clustering algorithms on the Woodyard Hammock data. A dendrogram helps determine the breakpoint.
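To make the breakpoint idea concrete, here is a small Python sketch: each merge in the cluster history occurs at some distance, and cutting the dendrogram at a breakpoint keeps only the merges below it. The merge distances shown are hypothetical, not the Woodyard Hammock values:

```python
def clusters_at_breakpoint(n_units, merge_distances, breakpoint):
    # A hierarchical clustering of n units performs n - 1 merges in total.
    # Each merge at or below the breakpoint reduces the cluster count by one.
    merges_kept = sum(1 for d in merge_distances if d <= breakpoint)
    return n_units - merges_kept

# 6 units, so 5 merges, at these (hypothetical) distances:
history = [0.4, 0.6, 0.9, 2.5, 3.1]
assert clusters_at_breakpoint(6, history, 1.0) == 3  # cut between 0.9 and 2.5
```

A large gap between consecutive merge distances (here, between 0.9 and 2.5) suggests a natural breakpoint.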

Cluster Analysis Linkage Type   Cluster Yield
Complete Linkage   Partitioning into 6 clusters yields clusters of sizes 3, 5, 5, 16, 17, and 26.
Average Linkage    Partitioning into 5 clusters would yield 3 clusters containing only a single site each.
Centroid Linkage   Partitioning into 6 clusters would yield 5 clusters containing only a single site each.
Single Linkage     Partitioning into 7 clusters would yield 6 clusters containing only 1-2 sites each.

For this example, complete linkage yields the most satisfactory result.

For your convenience, the following screenshots demonstrate how alternative clustering procedures may be done in Minitab.

14.6 - Cluster Description

The next step of cluster analysis is to describe the identified clusters.

The SAS program shows how this is implemented.

Download the SAS Program here: wood1.sas

Notice that in the cluster procedure, we created a new SAS dataset called clust1.

In the tree procedure, we chose to investigate 6 clusters with nclusters=6.