Distance sampling

Distance sampling models using data augmentation

Author

Affiliations

Philip T. Patton

Marine Mammal Research Program

Hawaiʻi Institute of Marine Biology

Published

May 14, 2025

In this notebook, I explore how to fit distance sampling models for estimating the size of a closed population. Similar to the occupancy and closed capture-recapture notebooks, I use parameter-expanded data-augmentation (PX-DA) and the zero-inflated binomial model in this notebook.

The idea with distance sampling, also known as line-transect sampling, is that a surveyer traverses a transect, typically in a boat or a plane. As they survey, they note when they detect an individual, or a group, from the species of interest, and further note the distance from the transect to the animal. Further, they note the angle to the animal(s), such that they can calculate the perpendicular distance from the animal to the transect. We assume that probability of detecting an animal \(p\) decreases monotonically as the distance from the transect grows, e.g., \(p=\exp(-x^2/\sigma^2),\) where \(x\) is the distance and \(\sigma\) is a scale parameter to be estimated. These simple assumptions permit the estimation of the population size \(N\) as well as density \(D.\)

Following Hooten and Hefley (2019), Chapter 24 and Royle and Dorazio (2008), Chapter 7, I use the impala data from Burnham, Anderson, and Laake (1980), who credits P. Hemingway with the dataset. In this dataset, 73 impalas were observed along a 60km transect. The distance values below are the perpendicular distances, in meters, from the transect.

%config InlineBackend.figure_format = 'retina'
from scipy.stats import binom
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor.tensor as pt
import seaborn as sns

# plotting defaults
plt.style.use('fivethirtyeight')
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.spines.left'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.bottom'] = False
sns.set_palette("tab10")

# hyper parameters
M = 500
U_X = 400
U_SIGMA = 400

# burnham impala dataset with distances in m
x_observed = np.array(
    [71.933980, 26.047227, 58.474341, 92.349221, 163.830409, 84.523652
    ,163.830409, 157.330098, 22.267696, 72.105330, 86.986979, 50.795047
    ,0.000000, 73.135370,  0.000000, 128.557522, 163.830409,  71.845104
    ,30.467336, 71.073909, 150.960702, 68.829172, 90.000000, 64.983827
    ,165.690874, 38.008322, 378.207430, 78.146226, 42.127052, 0.000000
    ,400.000000, 175.386612, 30.467336, 35.069692, 86.036465, 31.686029
    ,200.000000, 271.892336, 26.047227, 76.604444, 41.042417, 200.000000
    ,86.036465, 0.000000, 93.969262, 55.127471, 10.458689, 84.523652
    ,0.000000, 77.645714, 0.000000, 96.418141, 0.000000, 64.278761
    ,187.938524, 0.000000, 160.696902, 150.453756, 63.603607, 193.185165
    ,106.066017, 114.906666, 143.394109, 128.557522, 245.745613, 123.127252
    ,123.127252, 153.208889, 143.394109, 34.202014, 96.418141, 259.807621
    ,8.715574]
)

# plot the distances 
fig, ax = plt.subplots(figsize=(4,4))

ax.hist(x_observed, edgecolor='white')

ax.set_title('Hemingway Impala Data')
ax.set_ylabel('Number of detections')
ax.set_xlabel('Distance (m)')

plt.show()

Figure 1: Histogram of the number of detected impalas at varying distances.

Again, we treat this as a zero-inflated binomial model using PX-DA. The trick for doing so is to create a binary vector of length \(M\), \(y,\) that represents whether the individual was detected during the study. Then, combine the indicator with the distance vector \(x\) to create a the full dataset \((x,y).\)

n = len(x_observed)
unobserved_count = M - n
zeros = np.zeros(unobserved_count)

y = np.ones(n)
y_augmented = np.concatenate((y, zeros))

The issue is that \(x\) is unobserved for the undetected individuals. To work around this, we put a uniform prior on the unobserved \(x,\) i.e., \(x \sim \text{Uniform}(0, U_x).\) With this “complete” \(x,\) we can construct the detection function \(p\) for the unobserved individuals.

with pm.Model() as distance:
    
    psi = pm.Beta('psi', 0.001, 1)
    sigma = pm.Uniform('sigma', 0, U_SIGMA)
    
    x_unobserved = pm.Uniform('x_unobserved', 0, U_X, shape=unobserved_count)
    x_complete = pt.concatenate((x_observed, x_unobserved))

    p = pm.Deterministic('p', pm.math.exp(- x_complete ** 2 / sigma ** 2))
    
    pm.ZeroInflatedBinomial('y', p=p, psi=psi, n=1, observed=y_augmented)

pm.model_to_graphviz(distance)

Figure 2: Visual representation of the distance sampling model.

with distance:
    distance_idata = pm.sample()

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, sigma, x_unobserved]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details

az.plot_trace(
    distance_idata, 
    figsize=(8,4),
    var_names=['psi', 'sigma']
)
plt.subplots_adjust(hspace=0.4)
plt.show()

Figure 3: Traceplots for the distance sampling model.

This model samples slower than the models in the other notebooks, presumably because of the unobserved \(x.\) As in the closed capture-recapture notebook, we will have to simulate the posterior for \(N\) using the posterior distribution of \(\psi\) and \(M.\)

RNG = np.random.default_rng()

# extract the posterior samples into numpy arrays
posterior = az.extract(distance_idata)
psi_samples = posterior.psi.to_numpy()

# only interested in the unobserved animals
p_samples = posterior.p.to_numpy()[n:]

# posterior probabilities of being present in the population but not detected
p_if_present = psi_samples * binom.pmf(0, n=1, p=p_samples)
p_total = p_if_present + (1 - psi_samples)

# simulate the latent inclusion state 
Z = RNG.binomial(1, p_if_present / p_total)
number_undetected = Z.sum(axis=0)

# # # compute the total number of animals in the population
N_samples = n + number_undetected

# plot the posterior distributions
sigma_samples = posterior.sigma.to_numpy()
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(8,4))

# histograms of the posteriors
ax0.hist(N_samples, edgecolor='white', bins=30)
ax1.hist(sigma_samples, edgecolor='white', bins=30)

# axes labels 
ax0.set_xlabel(r'Abundance $N$')
ax0.set_ylabel('Number of samples')
ax1.set_xlabel(r'Detection range $\sigma$')

# add the point estimates
N_hat = N_samples.mean()
sigma_hat = sigma_samples.mean()
ax0.text(200, 350, rf'$\hat{{N}}$={N_hat:.1f}', ha='left', va='center')
ax1.text(205, 350, rf'$\hat{{\sigma}}$={sigma_hat:.1f}', ha='left', va='center')

# the results from royle and dorazio (2008) for comparison
N_hat_royle = 179.9
sigma_hat_royle = 187

ax0.axvline(N_hat_royle, linestyle='--', linewidth=3, color='C1')
ax1.axvline(sigma_hat_royle, linestyle='--', linewidth=3, color='C1')

plt.show()

Figure 4: Posterior distributions for \(N\) and \(\sigma.\) Estimates from Royle and Dorazio (2008) are shown with vertical lines.

The model shows a high level of agreement with Royle and Dorazio (2008), Chapter 7, although note that they reported \(\sigma\) in terms of 100m units. It is also possible to plot the posterior distribution of the detection function.

xx = np.arange(400)

def det_func(x, s):
    return np.exp(- (x ** 2) / (s ** 2))

p_samps = np.array([det_func(xx, s) for s in sigma_samples])

p_mean = p_samps.mean(axis=0)
p_low = np.quantile(p_samps, 0.02, axis=0)
p_high = np.quantile(p_samps, 0.98, axis=0)

fig, ax = plt.subplots(figsize=(5,4))

ax.plot(xx, p_mean, '-')
ax.fill_between(xx, p_low, p_high, alpha=0.2)

ax.set_title('Detection function')
ax.set_ylabel(r'$p$')
ax.set_xlabel(r'Distance (m)')

plt.show()

Figure 5: Posterior distribution for the detection function. The line represents the posterior mean while the shaded area is the 96% interval.

%load_ext watermark

%watermark -n -u -v -iv -w

Last updated: Wed May 14 2025

Python implementation: CPython
Python version       : 3.13.2
IPython version      : 9.0.2

arviz     : 0.21.0
pytensor  : 2.30.2
numpy     : 2.1.3
seaborn   : 0.13.2
matplotlib: 3.10.1
pymc      : 5.22.0

Watermark: 2.5.0

References

Burnham, Kenneth P, David R Anderson, and Jeffrey L Laake. 1980. “Estimation of Density from Line Transect Sampling of Biological Populations.” Wildlife Monographs, no. 72: 3–202.

Hooten, Mevin B, and Trevor Hefley. 2019. Bringing Bayesian Models to Life. CRC Press.

Royle, J Andrew, and Robert M Dorazio. 2008. Hierarchical Modeling and Inference in Ecology: The Analysis of Data from Populations, Metapopulations and Communities. Elsevier.