2.1 What is an empirical distribution?

Overview

The empirical distribution of a variable can be characterised by considering each unique numerical value observed for that variable in a given sample of size $n$. If certain values are repeated, we simply tally the number of times each unique value occurs. These tallies are the raw frequencies of the values. We can order the values obtained for the variable from smallest to largest and then look at these frequencies cumulatively, as a percentage of the entire sample. A plot of these cumulative percentages as a function of the ordered values in the sample is known as the empirical cumulative distribution function.

Description

More formally, suppose we have $n$ independent and identically distributed random variables, $Y_1, Y_2, \ldots, Y_n$, with a common (but unknown) probability density function (pdf) $f(y)$, or probability mass function in the discrete case, and cumulative distribution function (cdf) $F(y) = \text{Pr} \lbrace Y \leq y \rbrace $.

For the discrete case, we have $F(y) = \sum_{t \leq y} f(t)$.

For the continuous case, we have $F(y) = \int_{-\infty}^y f(t) \, dt$.

We obtain corresponding observed values $y_1, y_2, \ldots, y_n$ in a sample of size $n$. Now, let $I(y_i \leq t)$ be an indicator that takes the value of $1$ if $y_i \leq t$ is true, and zero otherwise. The empirical cumulative distribution function $\hat{F}_ n(t)$ is defined as the proportion of data points in the sample that are less than or equal to $t$, i.e.,

$$ \hat{F}_ n(t) = \frac{1}{n} \cdot \sum_{i=1}^n I(y_i \leq t) $$

This is therefore a step function, continuous from the right, that jumps up by a quantity of $1/n$ at each of the $n$ data points (so a value that occurs $k$ times in the sample produces a single jump of $k/n$). Its shape gives us a basic visual understanding of the distributional shape of the data values.
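The definition of $\hat{F}_ n(t)$ can be sketched in a few lines of Python. This is only an illustration, not how PRIMER 8 computes it: the function name `ecdf` and the four `toy` values are hypothetical and chosen so that one value is tied.

```python
def ecdf(sample, t):
    """Proportion of observations in `sample` that are <= t."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must contain at least one observation")
    # Sum of the indicators I(y_i <= t), divided by n.
    return sum(1 for y in sample if y <= t) / n


# Hypothetical toy sample with one tied value (3 occurs twice).
toy = [1, 3, 3, 4]

for t in [0, 1, 2, 3, 4, 5]:
    print(t, ecdf(toy, t))
```

Evaluating at $t = 2$ and $t = 3$ gives 0.25 and 0.75, so the tied value contributes a single step of $2/4$ rather than two separate steps of $1/4$.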

A small example

Suppose we had the following data with a sample size of $n = 10$ for a variable, $Y$:

Sample	y
1	2
2	7
3	10
4	12
5	2
6	5
7	5
8	8
9	10
10	3

The raw frequencies are as follows:

Value of y	Frequency
2	2
3	1
5	2
7	1
8	1
10	2
12	1

Looking at these values cumulatively, we have:

Value of y	Cumulative frequency
2	2
3	3
5	5
7	6
8	7
10	9
12	10

Expressing these frequencies as cumulative proportions of the total sample, we have:

Value of y	Cumulative proportion
2	0.2
3	0.3
5	0.5
7	0.6
8	0.7
10	0.9
12	1.0

These cumulative proportions comprise the empirical cdf. A plot of this empirical distribution (a step function) is shown below, with open circles being used to show the discontinuities (i.e., at the point of each step).

The above plot was obtained by running Plots > Empirical Distribution Plot... in PRIMER 8, and ticking the option to: '$\checkmark$Express as proportions: [0, 1]'. Note that the default is instead to show the empirical cdf with the values expressed as percentages rather than proportions (i.e., a plot of $100 \times \hat{F}_ n(t)$ versus $t$).
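For readers who want to check the tallies by hand, the three tables in this small example can be reproduced with a short Python sketch. It uses only the ten data values listed above; the variable names are illustrative, and the code is independent of PRIMER 8.

```python
from collections import Counter

# The ten observed values of Y from the small example.
y = [2, 7, 10, 12, 2, 5, 5, 8, 10, 3]
n = len(y)

# Raw frequency of each unique value.
freq = Counter(y)

# Walk through the unique values in increasing order, accumulating counts.
rows = []
running = 0
for value in sorted(freq):
    running += freq[value]
    rows.append((value, freq[value], running, running / n))

print("value  freq  cum_freq  cum_prop")
for value, f, cum_f, cum_p in rows:
    print(f"{value:5d}  {f:4d}  {cum_f:8d}  {cum_p:8.1f}")
```

Each tuple in `rows` holds a unique value, its raw frequency, its cumulative frequency, and its cumulative proportion, matching the columns of the tables above.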

As an aside, a related graphical tool for examining distributional shapes of variables is a histogram. A histogram is a plot of the raw frequencies (as bars, with height on the y-axis) versus the empirical values (on the x-axis). For a histogram, we would typically pool together (sum) the raw frequencies into wider 'bins' (instead of having one bin for every unique value in the dataset), which can be very useful if $n$ is large. That's what the Plots > Histogram Plot... function in PRIMER 8 does. Once you have a histogram, you can also change the bin size by clicking Graph > Special.
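The idea of pooling raw frequencies into bins can be sketched as follows, using the ten values from the small example. The bin width of 5 and the half-open bins of the form [lower, lower + width) are assumptions made for illustration; the binning rule PRIMER 8 actually uses is not described here and may differ.

```python
from collections import Counter

# The ten observed values of Y from the small example.
y = [2, 7, 10, 12, 2, 5, 5, 8, 10, 3]

# Hypothetical bin width; each value falls in [lower, lower + bin_width).
bin_width = 5

# Map every value to the lower edge of its bin and count per bin.
counts = Counter((value // bin_width) * bin_width for value in y)

for lower in sorted(counts):
    print(f"[{lower}, {lower + bin_width}): {counts[lower]}")
```

With this choice of width, the seven unique values collapse into three bins holding 3, 4 and 3 observations.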
