4.1 Wilcoxon signed-rank test
Overview
The Wilcoxon signed-rank test was described by Wilcoxon (1945). It is designed for the situation where there are two groups of values, and each value in one group is paired with a specific value in the other group. For example, you might have a treatment and a control value for a given response variable in each of a number of different trials. Interest lies in making a formal comparison of treatment vs control values while acknowledging the inherent non-independence of the paired values within each trial. This test is a non-parametric analogue of the classical paired t-test, and may be implemented in PRIMER under either a directional (one-tailed) or non-directional (two-tailed) alternative hypothesis.
The null hypothesis
The essential null hypothesis tested here is H0: the distribution of paired differences is symmetric about zero. In other words, the rank order of the values within each pair is arbitrary, so the two observations within any pair are exchangeable with one another. Equivalently, we might write this as H0: the median of the distribution of paired differences is equal to zero.
Our alternative hypothesis may be non-directional, simply asserting that the paired differences are symmetric about some value other than zero. This might be phrased as HA: the median of the distribution of paired differences is not equal to zero. In that case, we have a two-tailed test.
We may, however, assert a directional alternative hypothesis, yielding a one-tailed test. For example:
- HA: The distribution of paired differences is symmetric about some value greater than zero (i.e., the median of the distribution of paired differences is positive); or
- HA: The distribution of paired differences is symmetric about some value less than zero (i.e., the median of the distribution of paired differences is negative).
Description of the test
Consider a set of $N$ sampling units. For each sampling unit $i = 1, \ldots, N$, two paired (or matched) observation values have been recorded: $(x_i, y_i)$. For any pair $i$, let the difference in these paired values be $d_i = (x_i - y_i)$. Next, let $r_i$ be the rank of the absolute difference $|d_i|$, with the smallest absolute difference being given a rank of $1$ and the largest absolute difference being given a rank of $N$. Let $\text{sgn}(\cdot)$ be a function attributing an indicative sign, such that $\text{sgn}(d_i) = 1$ if $d_i > 0$ and $\text{sgn}(d_i) = -1$ if $d_i < 0$ (the case $d_i = 0$ is discussed below). We can then obtain the signed ranks as: $r_i^{\text{sgn}} = \text{sgn}(d_i) \cdot r_i$.
We define the test statistic, $W$, as the sum of all the signed ranks, i.e.: $$ W = \sum_{i = 1}^N \text{sgn}(d_i) \cdot r_i $$
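The calculation of $W$ can be sketched in a few lines of Python (an illustrative sketch using hypothetical data, not PRIMER's implementation; it assumes all differences are non-zero and their absolute values are distinct — ties and zeros are treated in later subsections):

```python
def wilcoxon_W(x, y):
    """Sum of signed ranks, assuming no zero differences and no tied |d_i|."""
    d = [xi - yi for xi, yi in zip(x, y)]
    # Rank the absolute differences: smallest |d_i| gets rank 1, largest gets N
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    W = 0
    for rank, i in enumerate(order, start=1):
        W += rank if d[i] > 0 else -rank
    return W

# Hypothetical example: N = 4 paired (treatment, control) observations
x = [5.1, 4.8, 6.0, 5.5]
y = [4.9, 5.1, 5.2, 5.0]
print(wilcoxon_W(x, y))  # d = (0.2, -0.3, 0.8, 0.5), so W = +1 - 2 + 4 + 3 = 6
```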
Having obtained an observed value of the test statistic from the data, $W_\text{obs}$, then (as in many other PRIMER routines) we can obtain a p-value empirically using an appropriate permutation algorithm. Specifically, under the assumption of exchangeability, we can generate a plausible value of the test statistic $W$ under a true null hypothesis by randomizing the ordering of the paired values $(x_i, y_i)$ separately, within each pair, for each and every sampling unit $i = 1, \ldots, N$. Once this randomization has been done, we can re-calculate the differences, $d_i^\pi$, and their associated (unsigned) ranks, $r_i^\pi$, under permutation, to yield:
$$ W^\pi = \sum_{i = 1}^N \text{sgn}(d_i^\pi) \cdot r_i^\pi $$
We repeat the above randomization and re-calculation procedure a large number of times (e.g., $n_\text{perm} = 9999$) to obtain a large number of values of $W^\pi$ under a true null hypothesis. The probability (p-value) associated with the null hypothesis (and two-tailed alternative hypothesis) is then estimated empirically as the proportion of values of $W^\pi$ that are equal to or more extreme (in absolute value) than the observed value of the test statistic, $W_\text{obs}$. Thus, letting $W_k^\pi$ be the value of $W^\pi$ obtained for the $k$th permutation ($k = 1, \ldots, n_\text{perm}$), the p-value is:
$$ P = \frac{ 1 + \sum_{k=1}^{n_\text{perm}} \text{I}\left(|W_k^\pi| \geq |W_\text{obs}| \right)}{n_\text{perm} + 1} $$
with $\text{I}(\textit{expression} ) = 1$ if $\textit{expression}$ is true and zero otherwise. Note that the '$+1$' in the numerator and denominator of this fraction is there to acknowledge the inclusion of the observed value as a member of the distribution of $W$ under a true null hypothesis.
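The permutation procedure can be sketched as follows (illustrative only, not PRIMER's code). A useful simplification: swapping $x_i$ and $y_i$ within a pair simply flips the sign of $d_i$ while leaving $|d_i|$, and hence its rank, unchanged, so it suffices to attach a random sign to each rank:

```python
import random

def perm_test_two_tailed(x, y, n_perm=9999, seed=1):
    """Two-tailed permutation p-value for the Wilcoxon signed-rank statistic W.
    Assumes non-zero differences; ties in |d_i| are broken arbitrarily here."""
    d = [xi - yi for xi, yi in zip(x, y)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    W_obs = sum(r if d[i] > 0 else -r for i, r in enumerate(ranks))
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        # Randomly re-order (x_i, y_i) within each pair, i.e. flip the sign
        # of each rank with probability 1/2; the ranks themselves don't change.
        W_pi = sum(r if rng.random() < 0.5 else -r for r in ranks)
        if abs(W_pi) >= abs(W_obs):
            count += 1
    return (count + 1) / (n_perm + 1)

# All eight differences positive, so W_obs is at its maximum and the p-value
# should land near the exact two-tailed value of 2/2^8, i.e. about 0.008
print(perm_test_two_tailed([2, 3, 4, 5, 6, 7, 8, 9], [1] * 8))
```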
One-tailed alternative hypotheses
As noted above, we may postulate a more specific alternative hypothesis. In such cases, the test-statistic and randomization procedure are all done the same way as described above, but the p-value is calculated differently. For example, if our alternative hypothesis is that the median of the distribution of paired differences is greater than zero, then the p-value is calculated as:
$$ P = \frac{ 1 + \sum_{k=1}^{n_\text{perm}} \text{I}\left(W_k^\pi \geq W_\text{obs} \right)}{n_\text{perm} + 1} $$ which tallies only the values of $W_k^\pi$ that equal or exceed $W_\text{obs}$, in the right-hand tail of the permutation distribution.
If, on the other hand, our alternative hypothesis is that the median of the distribution of paired differences is less than zero, the p-value is calculated as:
$$ P = \frac{ 1 + \sum_{k=1}^{n_\text{perm}} \text{I}\left(W_k^\pi \leq W_\text{obs} \right)}{n_\text{perm} + 1} $$ which tallies only the values of $W_k^\pi$ that equal or fall below $W_\text{obs}$, in the left-hand tail of the permutation distribution.
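Given a set of permuted statistics $W_k^\pi$, the three p-values differ only in how the tally is taken. A small helper (hypothetical function and argument names, for illustration only) makes this explicit:

```python
def perm_pvalue(W_obs, W_perms, alternative="two-sided"):
    """Tally permuted statistics against W_obs for the chosen alternative."""
    if alternative == "greater":        # right-hand tail
        count = sum(1 for w in W_perms if w >= W_obs)
    elif alternative == "less":         # left-hand tail
        count = sum(1 for w in W_perms if w <= W_obs)
    else:                               # two-sided: compare absolute values
        count = sum(1 for w in W_perms if abs(w) >= abs(W_obs))
    return (count + 1) / (len(W_perms) + 1)

W_perms = [-3, -1, 0, 2, 4]                  # toy permutation distribution
print(perm_pvalue(4, W_perms, "greater"))    # (1 + 1) / 6
print(perm_pvalue(4, W_perms, "less"))       # (5 + 1) / 6 = 1.0
print(perm_pvalue(4, W_perms, "two-sided"))  # (1 + 1) / 6
```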
Treatment of tied ranks
What happens to the rank-order values, $r_i$, in the event of a tie? Suppose $|d_1|$ is the smallest absolute difference (so it is given a rank of $r_1 = 1$) and $|d_2|$ is the second-smallest absolute difference (hence $r_2 = 2$), but a third absolute difference, $|d_3|$, is precisely (within double precision) equal to $|d_2|$. In other words, suppose $|d_2|$ and $|d_3|$ are tied for second place in the ranking of values from smallest to largest.
In this case, PRIMER takes the fairly standard approach of averaging the ranks of tied values. We simply order the values from smallest to largest and give them 'raw' ranks corresponding to the set of ordered integers from $1$ to $N$. Then, we replace the ordered integers for any tied values with the average of those integers. Thus, in this case, the ordered integers are $\{1, 2, 3, \ldots \}$. We have $r_1 = 1$, but both $r_2$ and $r_3$ are given the average of the ordered integers in their place, i.e., $r_2 = r_3 = (2+3)/2 = 2.5$. So, the set of ranks used for the analysis will be $\{1, 2.5, 2.5, \ldots \}$. The next-largest absolute difference would be given the (unsigned) rank value of $4$ (presuming it is not tied), and so on. Any other tied values are treated in the same way, by averaging their corresponding ordered integer values, and all subsequent calculations of the test statistic, etc., carry on precisely as described above.
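The average-rank procedure can be sketched as follows (an illustrative implementation, not PRIMER's code):

```python
def average_ranks(values):
    """Rank values from 1..N, replacing each run of ties by the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the run of values tied with values[order[i]]
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of the integer ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# |d_2| and |d_3| tied for second place, as in the example above:
print(average_ranks([0.1, 0.2, 0.2, 0.4]))  # [1.0, 2.5, 2.5, 4.0]
```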
An important point about the PRIMER implementation of the Wilcoxon signed-rank test is that, because p-values are calculated using an appropriate randomization algorithm under a true null hypothesis of exchangeability, these non-parametric tests are indeed exact tests, even in the event of there being ties in the ranks. This contrasts with other available software implementations of the Wilcoxon test (e.g., 'wilcox.test()' in R), which do not compute exact p-values when there are tied ranks.
Treatment of tied pairs (difference = 0)
It is possible to obtain paired values that are identical to one another; i.e., $x_i = y_i$, so $d_i = 0$. This is problematic for the test, not because of the ranking procedure (a zero difference would clearly receive a very low rank, given its absolute value), but because there is no sign to attribute to this sampling unit. The simplest solution is to omit differences equal to zero from the calculation (Wilcoxon, 1949). This is the approach implemented in PRIMER.
It might be argued, however, that this withdraws certain evidence in favour of the null hypothesis. An alternative approach, suggested by Pratt (1959), is to include the zeros when ranking the absolute differences, but subsequently to exclude them from the calculation of the test-statistic; i.e., to assert that $\text{sgn}(0) = 0$.
From a practical point of view, neither the Pratt (1959) nor the Wilcoxon (1949) method was found to be universally most efficient (Conover, 1972). Also, as PRIMER does not rely on any formal derivation of the distribution of the $W$ test statistic, but rather uses permutation algorithms to estimate a p-value empirically, the choice between these two options would seem to have little or no substantive effect on the outcome of PRIMER's implementation of the Wilcoxon test, although this has not been investigated in detail.
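The two zero-handling rules can be contrasted in a small sketch (illustrative only; the hypothetical `zero_method` argument mirrors neither package's API exactly, and ties among the non-zero $|d_i|$ are ignored for simplicity):

```python
def signed_rank_sum(d, zero_method="wilcoxon"):
    """W under two zero-handling rules (assumes no ties among the |d_i|).
    'wilcoxon': drop zero differences before ranking (the PRIMER approach).
    'pratt'   : rank zeros along with the rest, then give them sgn(0) = 0."""
    if zero_method == "wilcoxon":
        d = [di for di in d if di != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    W = 0
    for rank, i in enumerate(order, start=1):
        sign = (d[i] > 0) - (d[i] < 0)  # sgn, with sgn(0) = 0
        W += sign * rank
    return W

d = [0, 1, -2, 3]                      # one zero difference
print(signed_rank_sum(d, "wilcoxon"))  # ranks 1,2,3 of [1,-2,3]: 1 - 2 + 3 = 2
print(signed_rank_sum(d, "pratt"))     # ranks 2,3,4 with sgn(0)=0: 2 - 3 + 4 = 3
```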
Original test-statistic
We note here, in passing, that the original Wilcoxon test-statistic was defined somewhat differently from the above description, which uses $W$. First, let $R^+$ be the sum of the positively signed ranks, i.e. $$ R^+ = \sum_{i=1}^N r_i^{\text{sgn}} \cdot \text{I}(d_i>0) $$ and let $R^-$ be the sum of the negatively signed ranks, i.e., $$ R^- = \sum_{i=1}^N r_i^{\text{sgn}} \cdot \text{I}(d_i<0). $$ Wilcoxon's original test-statistic is then given as the minimum of the absolute values of these two quantities; that is, $$ T = \text{min}(|R^+|, |R^-|). $$
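For comparison, $T$ can be computed from the same signed ranks (a sketch under the same assumptions as before: no zero differences and no ties). Note that $W = R^+ + R^-$, so the two statistics are simple functions of the same quantities:

```python
def wilcoxon_T(d):
    """Original Wilcoxon statistic T = min(|R+|, |R-|), assuming no zeros/ties."""
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    R_plus = R_minus = 0
    for rank, i in enumerate(order, start=1):
        if d[i] > 0:
            R_plus += rank    # sum of positively signed ranks
        else:
            R_minus -= rank   # sum of negatively signed ranks (negative value)
    return min(abs(R_plus), abs(R_minus))

d = [0.2, -0.3, 0.8, 0.5]  # ranks of |d|: 1, 2, 4, 3
print(wilcoxon_T(d))       # R+ = 1 + 4 + 3 = 8, R- = -2, so T = min(8, 2) = 2
# For the same data, W = R+ + R- = 6.
```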
The PRIMER implementation provides both $W$ and $T$ in the output file for this test. Note that the calculation of a p-value using an appropriate permutation algorithm under a true null hypothesis of exchangeability ensures that PRIMER achieves an exact test, no matter which test statistic you prefer to report from the output provided.