3.1 Cluster analysis

The previous chapter has shown how to replace the original data matrix with pairwise similarities, chosen to reflect the particular aspect of community similarity of interest for that study (similarity in counts of abundant species, similarity in location of rare species etc). Typically, the number of pairwise similarities is large, n(n–1)/2 for n samples, and it is difficult visually to detect a pattern in the triangular similarity matrix. Table 3.1 illustrates this for just part (roughly a quarter) of the similarity matrix for the Frierfjord macrofauna data {F}. Close examination shows that the replicates within site A generally have higher within-site similarities than do pairs of replicates within sites B and C, or between-site samples, but the pattern is far from clear. What is needed is a graphical display linking samples that have mutually high levels of similarity.

Table 3.1. Frierfjord macrofauna counts {F}. Bray-Curtis similarities, on $\sqrt{}\sqrt{}$-transformed counts, for every pair of replicate samples from sites A, B, C only (four replicate samples per site).

	A1	A2	A3	A4	B1	B2	B3	B4	C1	C2	C3	C4
A1	–
A2	61	–
A3	69	60	–
A4	65	61	66	–
B1	37	28	37	35	–
B2	42	34	31	32	55	–
B3	45	39	39	44	66	66	–
B4	37	29	29	37	59	63	60	–
C1	35	31	27	25	28	56	40	34	–
C2	40	34	26	29	48	69	62	56	56	–
C3	40	31	37	39	59	61	67	53	40	66	–
C4	36	28	34	37	65	55	69	55	38	64	74	–

Cluster analysis (or classification, see footnote on terminology on page 1.2) aims to find natural groupings of samples such that samples within a group are more similar to each other, generally, than samples in different groups. Cluster analysis is used in the present context in the following ways.

a) Different sites (or different times at the same site) can be seen to have differing community compositions by noting that replicate samples within a site form a cluster that is distinct from replicates within other sites. This can be an important hurdle to overcome in any analysis; if replicates for a site are clustered more or less randomly with replicates from every other site then further interpretation is likely to be dangerous. (A more formal statistical test for distinguishing sites is the subject of Chapter 6).

b) When it is established that sites can be distinguished from one another (or, when replicates are not taken, it is assumed that a single sample is representative of that site or time), sites or times can be partitioned into groups with similar community structure.

c) Cluster analysis of the species similarity matrix can be used to define species assemblages, i.e. groups of species that tend to co-occur in a parallel manner across sites.

Range of methods

Literally hundreds of clustering methods exist, some of them operating on similarity/dissimilarity matrices whilst others are based on the original data. Everitt (1980) and Cormack (1971) give excellent and readable reviews. Clifford & Stephenson (1975) is another well-established text from an ecological viewpoint.

Five classes of clustering methods can be distinguished, following the categories of Cormack (1971) .

Hierarchical methods. Samples are grouped and the groups themselves form clusters at lower levels of similarity.
Optimising techniques. A single set of mutually exclusive groups (usually a pre-specified number) is formed by optimising some clustering criterion, for example minimising a within-cluster distance measure in the species space.
Mode-seeking methods. These are based on considerations of density of samples in the neighbourhood of other samples, again in the species space.
Clumping techniques. The term ‘clumping’ is reserved for methods in which samples can be placed in more than one cluster.
Miscellaneous techniques.

Cormack (1971) also warned against the indiscriminate use of cluster analysis: “availability of … classification techniques has led to the waste of more valuable scientific time than any other ‘statistical’ innovation”. The ever larger number of techniques and their increasing accessibility on modern computer systems makes this warning no less pertinent today. The policy adopted here is to concentrate on a single technique that has been found to be of widespread utility in ecological studies, whilst emphasising the potential arbitrariness in all classification methods and stressing the need to perform a cluster analysis in conjunction with a range of other techniques (e.g. ordination, statistical testing) to obtain balanced and reliable conclusions.

	A1	A2	A3	A4	B1	B2	B3	B4	C1	C2	C3	C4
A1	–
A2	61	–
A3	69	60	–
A4	65	61	66	–
B1	37	28	37	35	–
B2	42	34	31	32	55	–
B3	45	39	39	44	66	66	–
B4	37	29	29	37	59	63	60	–
C1	35	31	27	25	28	56	40	34	–
C2	40	34	26	29	48	69	62	56	56	–
C3	40	31	37	39	59	61	67	53	40	66	–
C4	36	28	34	37	65	55	69	55	38	64	74	–

	A1	A2	A3	A4	B1	B2	B3	B4	C1	C2	C3	C4
A1	–
A2	61	–
A3	69	60	–
A4	65	61	66	–
B1	37	28	37	35	–
B2	42	34	31	32	55	–
B3	45	39	39	44	66	66	–
B4	37	29	29	37	59	63	60	–
C1	35	31	27	25	28	56	40	34	–
C2	40	34	26	29	48	69	62	56	56	–
C3	40	31	37	39	59	61	67	53	40	66	–
C4	36	28	34	37	65	55	69	55	38	64	74	–

0.1 Introduction

0.2 Acknowledgements

0.3 Citing this book

1.1 Introduction

1.2 Univariate techniques

1.3 Example: Frierfjord macrofauna

1.4 Distributional techniques

1.5 Example: Loch Linnhe macrofauna

1.6 Example: Garroch Head macrofauna

1.7 Multivariate techniques

1.8 Example: Nutrient enrichment experiment, Solbergstrand

1.9 Summary

2.1 Similarity for quantitative data matrices

2.2 Example: Loch Linnhe macrofauna

2.3 Presence/absence data

2.4 Species similarities

2.5 Dissimilarity coefficients

2.6 More on resemblance measures

3.1 Cluster analysis

3.2 Hierarchical agglomerative clustering

3.3 Example: Bristol Channel zooplankton

3.4 Recommendations

3.5 Similarity profiles (SIMPROF)

3.6 Binary divisive clustering

3.7 k-R clustering (non-hierarchical)

4.1 Ordinations

4.2 Principal components analysis

4.3 Example: Garroch Head macrofauna

4.4 PCA for environmental data

4.5 Example: Dosing experiment, Solbergstrand mesocosm

5.1 Other ordination methods

5.2 Non-metric multidimensional scaling (MDS)

5.3 Diagnostics: Adequacy of MDS representation

5.4 EXAMPLE: Dosing experiment, Solbergstrand

5.5 Example: Celtic Sea zooplankton

5.6 Example: Amoco-Cadiz oil spill, Morlaix

5.7 MDS strengths and weaknesses

5.8 Further nMDS/mMDS developments

5.9 Example: Okura estuary macrofauna

5.10 Example: Messolongi lagoon diatoms

5.11 Recommendations

6.1 Univariate tests and multivariate tests

6.2 ANOSIM for the one-way layout

6.3 Example: Frierfjord macrofauna

6.4 Example: Indonesian reef-corals

6.5 ANOSIM for two-way layouts

6.6 Example: Clyde nematodes (2-way nested case)

6.7 Example: Eaglehawk Neck meiofauna (two-way crossed case)

6.8 Example: Mesocosm experiment (two-way crossed case with no replication)

6.9 Example: Exe nematodes (no replication and missing data)

6.10 ANOSIM for ordered factors

6.11 Example: Ekofisk oil-field macrofauna

6.12 Two-way ordered ANOSIM designs

6.13 Example: Phuket coral-reef time series

6.14 Three-way ANOSIM designs

6.15 Example: King Wrasse fish diets, WA

6.16 Example: NZ kelp holdfast macrofauna

6.17 Example: Tees Bay macrofauna

6.18 Recommendations

7.1 Species clustering

7.2 Type 2 and type 3 SIMPROF tests

7.3 Example: Amoco-Cadiz oil spill

7.4 Shade plots

7.5 Example: Bristol Channel zooplankton

7.6 Example: Garroch Head macrofauna

7.7 Example: Ekofisk oil-field macrofauna

7.8 Species contributions to sample (dis)similarities – SIMPER

7.9 Example: Tasmanian meiofauna

7.10 Bubble plots (plus examples)

8.1 Univariate measures

8.2 Graphical/distributional plots

8.3 Examples: Garroch Head and Ekofisk macrofauna

8.4 Examples: Loch Linnhe and Garroch Head macrofauna

8.5 Multivariate tools used on univariate data

8.6 Example: Plymouth particle-size data

8.7 Multiple diversity indices

9.1 Introduction

9.2 Univariate case

9.3 Multivariate case

9.4 Recommendations

	A1	A2	A3	A4	B1	B2	B3	B4	C1	C2	C3	C4
A1	–
A2	61	–
A3	69	60	–
A4	65	61	66	–
B1	37	28	37	35	–
B2	42	34	31	32	55	–
B3	45	39	39	44	66	66	–
B4	37	29	29	37	59	63	60	–
C1	35	31	27	25	28	56	40	34	–
C2	40	34	26	29	48	69	62	56	56	–
C3	40	31	37	39	59	61	67	53	40	66	–
C4	36	28	34	37	65	55	69	55	38	64	74	–