# Sampling from Population

## Sampling API

The package provides functions for sampling from a given population (with or without replacement).

`StatsBase.sample`

— Function`sample([rng], a, [wv::AbstractWeights])`

Select a single random element of `a`

. Sampling probabilities are proportional to the weights given in `wv`

, if provided.

Optionally specify a random number generator `rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)`

Select a random, optionally weighted sample of size `n`

from an array `a`

using a polyalgorithm. Sampling probabilities are proportional to the weights given in `wv`

, if provided. `replace`

dictates whether sampling is performed with replacement and `order`

dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator `rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)`

Select a random, optionally weighted sample from an array `a`

specifying the dimensions `dims`

of the output array. Sampling probabilities are proportional to the weights given in `wv`

, if provided. `replace`

dictates whether sampling is performed with replacement and `order`

dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator `rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`sample([rng], wv::AbstractWeights)`

Select a single random integer in `1:length(wv)`

with probabilities proportional to the weights given in `wv`

.

`rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`StatsBase.sample!`

— Function`sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)`

Draw a random sample of `length(x)`

elements from an array `a`

and store the result in `x`

. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in `wv`

, if provided. `replace`

dictates whether sampling is performed with replacement and `order`

dictates whether an ordered sample, also called a sequential sample, should be taken.

`rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

## Algorithms

Internally, this package implements multiple algorithms, and the `sample`

(and `sample!`

) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in `sample`

are decided based on extensive benchmarking (see `perf/sampling.jl`

and `perf/wsampling.jl`

). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via `using`

though).

### Notations

`a`

: source array representing the population`x`

: the destination array`wv`

: the weight vector (of type`AbstractWeights`

), for weighted sampling`n`

: the length of`a`

`k`

: the length of`x`

. For sampling without replacement,`k`

must not exceed`n`

.`rng`

: optional random number generator (defaults to`Random.GLOBAL_RNG`

)

All following functions write results to `x`

(pre-allocated) and return `x`

.

### Sampling Algorithms (Non-Weighted)

`StatsBase.direct_sample!`

— Method`direct_sample!([rng], a::AbstractArray, x::AbstractArray)`

Direct sampling: for each `j`

in `1:k`

, randomly pick `i`

from `1:n`

, and set `x[j] = a[i]`

, with `n=length(a)`

and `k=length(x)`

.

This algorithm consumes `k`

random numbers.

`StatsBase.samplepair`

— Function`samplepair([rng], n)`

Draw a pair of distinct integers between 1 and `n`

without replacement.

`rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`samplepair([rng], a)`

Draw a pair of distinct elements from the array `a`

without replacement.

`rng`

as the first argument (defaults to `Random.GLOBAL_RNG`

).

`StatsBase.knuths_sample!`

— Function`knuths_sample!([rng], a, x)`

*Knuth's Algorithm S* for random sampling without replacement.

Reference: D. Knuth. *The Art of Computer Programming*. Vol 2, 3.4.2, p.142.

This algorithm consumes `length(a)`

random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

`StatsBase.fisher_yates_sample!`

— Function`fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)`

Fisher-Yates shuffling (with early termination).

Pseudo-code:

```
n = length(a)
k = length(x)
# Create an array of the indices
inds = collect(1:n)
for i = 1:k
# swap element `i` with another random element in inds[i:n]
# set element `i` in `x`
end
```

This algorithm consumes `k=length(x)`

random numbers. It uses an integer array of length `n=length(a)`

internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when `n`

is greater than `k`

. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling

`StatsBase.self_avoid_sample!`

— Function`self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)`

Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.

This algorithm consumes about (or slightly more than) `k=length(x)`

random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n >> k$, with `n=length(a)`

.

However, if `k`

is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.

`StatsBase.seqsample_a!`

— Function`seqsample_a!([rng], a::AbstractArray, x::AbstractArray)`

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(n)$ random numbers, with `n=length(a)`

. The outputs are ordered.

`StatsBase.seqsample_c!`

— Function`seqsample_c!([rng], a::AbstractArray, x::AbstractArray)`

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k^2)$ random numbers, with `k=length(x)`

. The outputs are ordered.

`StatsBase.seqsample_d!`

— Function`seqsample_d!([rng], a::AbstractArray, x::AbstractArray)`

Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k)$ random numbers, with `k=length(x)`

. The outputs are ordered.

### Weighted Sampling Algorithms

`StatsBase.direct_sample!`

— Method`direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)`

Direct sampling.

Draw each sample by scanning the weight vector.

Noting `k=length(x)`

and `n=length(a)`

, this algorithm:

- consumes
`k`

random numbers - has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
- requires no additional memory space.

`StatsBase.alias_sample!`

— Function`alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)`

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." *ACM Transactions on Mathematical Software* 3 (3): 253, 1977.

Noting `k=length(x)`

and `n=length(a)`

, this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.

`StatsBase.naive_wsample_norep!`

— Function`naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)`

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting `k=length(x)`

and `n=length(a)`

, this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.

`StatsBase.efraimidis_a_wsample_norep!`

— Function`efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)`

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." *Information Processing Letters*, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting `k=length(x)`

and `n=length(a)`

, this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.

`StatsBase.efraimidis_ares_wsample_norep!`

— Function`efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)`

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." *Information Processing Letters*, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting `k=length(x)`

and `n=length(a)`

, this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.