Model comparison

Comparing closed capture-recapture models in PyMC

Author: Philip T. Patton

Affiliations: Marine Mammal Research Program, Hawaiʻi Institute of Marine Biology

Published: May 14, 2025

In this notebook, I demonstrate an approach to model selection in PyMC. To do so, I follow the lead of King and Brooks (2008), although not nearly as elegantly. They demonstrate an approach to model selection for a typical suite of closed capture-recapture models. These include the effects of behavior \(b\), time \(t\), and individual heterogeneity \(h\) on capture probabilities \(p\). The eight models considered here are combinations of the three effects: the submodels \(M_{0},\) \(M_{t},\) \(M_{b},\) \(M_{tb},\) \(M_{h},\) \(M_{th},\) and \(M_{bh},\) plus the full model, \(M_{tbh}\), which is
\[ \begin{equation} \text{logit} \; p_{it} = \mu + \alpha_t + \beta x_{it} + \gamma_i, \end{equation} \] where \(\mu\) is the average catchability, \(\alpha_t\) is the effect of each occasion on catchability, \(\beta\) is the behavioral effect, \(x_{it}\) indicates whether the individual has been previously caught, and \(\gamma_i\) is the individual random effect such that \(\gamma_i \sim \text{Normal}(0,\sigma_{\gamma})\). Formulating the model this way makes the other seven models special cases of the full model, each obtained by dropping one or more terms.
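
To make the nesting concrete, here is a minimal NumPy sketch of the linear predictor. The helper name `capture_probability`, the array sizes, and all parameter values are made up for illustration; they are not taken from the dolphin analysis. Setting a term to zero recovers the corresponding submodel.

```python
import numpy as np

def capture_probability(mu, alpha, beta, x, gamma):
    """Return p_it = invlogit(mu + alpha_t + beta * x_it + gamma_i)."""
    # alpha has shape (T,), gamma has shape (n,), and x has shape (n, T)
    logit_p = mu + alpha[None, :] + beta * x + gamma[:, None]
    return 1.0 / (1.0 + np.exp(-logit_p))

# toy sizes and values, chosen only for illustration
n_animals, n_occasions = 3, 4
mu = -1.0
alpha = np.array([0.5, -0.5, 0.0, 0.2])
beta = -0.8
gamma = np.array([0.3, 0.0, -0.3])

# x_it = 1 once the animal has been caught on an earlier occasion
x = np.array([[0, 1, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

# full model, M_tbh
p_full = capture_probability(mu, alpha, beta, x, gamma)

# M_0: drop every effect except the mean catchability
p_m0 = capture_probability(mu, np.zeros(n_occasions), 0.0, x, np.zeros(n_animals))

# M_t: keep only the occasion effect
p_mt = capture_probability(mu, alpha, 0.0, x, np.zeros(n_animals))
```

Under \(M_0\) every cell of the matrix shares one capture probability, while under \(M_t\) the rows are identical and only the columns differ.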

Like King and Brooks (2008), I use the Moray Firth bottlenose dolphin data as a motivating example. Wilson, Hammond, and Thompson (1999) detected \(n=56\) dolphins over the course of \(T=17\) boat surveys between May and September 1992. They generated the capture-recapture histories by way of photo-identification, which is near and dear to my heart (and my dissertation).

%config InlineBackend.figure_format = 'retina'

# libraries 
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt 
import seaborn as sns
from pymc.distributions.dist_math import binomln, logpow

plt.style.use('fivethirtyeight')
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['figure.facecolor'] = 'white'
pal = sns.color_palette("Set2")
sns.set_palette(pal)

# hyperparameters 
SEED = 808
RNG = np.random.default_rng(SEED)

def augment_history(history):
    '''Augment a capture history with all-zero histories.'''
    
    animals_captured, T = history.shape

    # create M - n all zero histories
    zero_history_count = M - animals_captured
    zero_history = np.zeros((zero_history_count, T))

    # tack those on to the capture history
    augmented = np.vstack((history, zero_history))

    return augmented 

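def get_behavior_covariate(history):
    '''Indicate, for each occasion, whether the animal was caught previously.'''
    
    # note the occasion when each individual was first seen
    # NOTE: argmax returns 0 for all-zero (augmented) rows, so those rows
    # are coded as 1 from the second occasion onward
    first_seen = (history != 0).argmax(axis=1)
    
    # create the covariate for the behavior effect
    behavior_covariate = np.zeros_like(history)
    for i, f in enumerate(first_seen):
        behavior_covariate[i, (f + 1):] = 1

    return behavior_covariate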

def get_occasion_covariate(history):

    _, T = history.shape
    l = []
    for t in range(T):
        oc = np.zeros_like(history)
        oc[:, t] = 1
        l.append(oc)

    return np.stack(l, axis=2)

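def sim_N(idata):
    '''Simulate posterior draws of N; relies on the globals M, n, and T.'''
    
    psi_samps = az.extract(idata).psi.values
    p_samps = az.extract(idata).p.values
    not_p = (1 - p_samps)
    
    # NOTE: number_undetected is computed below but does not enter the
    # returned value (see the commented-out line)
    if p_samps.ndim == 1:
        p_included = psi_samps * (not_p) ** T 
        number_undetected = RNG.binomial(M - n, p_included)

    elif p_samps.ndim == 3:
        p_included = psi_samps * not_p.prod(axis=1)
        number_undetected = RNG.binomial(1, p_included).sum(axis=0)

    # N = n + number_undetected
    # draw N from the inclusion probability alone
    N = RNG.binomial(M, psi_samps)
    return N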

# convert the dolphin capture history from '1001001' to array
dolphin = np.loadtxt('firth.txt', dtype=str)
dolphin = np.array([list(map(int, d)) for d in dolphin])

# augment the capture history with all zero histories
n, T = dolphin.shape
M = 500
dolphin_augmented = augment_history(dolphin)

# covariates for t and b
occasion_covariate = get_occasion_covariate(dolphin_augmented)
behavior_covariate = get_behavior_covariate(dolphin_augmented)
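
To show what the augmentation and the behavior covariate look like, here is a small sketch on a made-up capture history. The names `toy_history`, `toy_M`, and `toy_behavior` are hypothetical. One assumption differs from `get_behavior_covariate` above: this sketch leaves never-detected rows at zero, whereas `argmax` returns 0 for an all-zero row, so the function above codes such rows as 1 from the second occasion onward.

```python
import numpy as np

# toy capture history: 2 detected animals over 4 occasions
toy_history = np.array([[0, 1, 0, 1],
                        [0, 0, 1, 0]])
toy_M = 4  # hypothetical size of the augmented data set

# data augmentation: pad with all-zero histories up to toy_M rows
n_detected, n_occasions = toy_history.shape
padding = np.zeros((toy_M - n_detected, n_occasions), dtype=int)
toy_augmented = np.vstack((toy_history, padding))

# behavior covariate: 1 on every occasion after the first capture
detected = toy_augmented.any(axis=1)
first_seen = toy_augmented.argmax(axis=1)
toy_behavior = np.zeros_like(toy_augmented)
for i in np.flatnonzero(detected):
    toy_behavior[i, first_seen[i] + 1:] = 1

print(toy_augmented)
print(toy_behavior)
```

The first animal is first caught on the second occasion, so its covariate switches on from the third occasion; the two augmented rows stay at zero throughout.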


The discovery curve, the number of unique dolphins encountered as a function of the cumulative number of dolphin encounters, may be flattening. This suggests that, by the end of the surveys, Wilson, Hammond, and Thompson (1999) may have encountered many of the unique individuals in the population.

# how many dolphins have been seen?
total_seen = dolphin.sum(axis=0).cumsum()

# how many new dolphins have been seen?
first_seen = (dolphin != 0).argmax(axis=1)
newbies = [sum(first_seen == t) for t in range(T)]
total_newbies = np.cumsum(newbies)

fig, ax = plt.subplots(figsize=(5, 3.5))
ax.plot(total_seen, total_newbies)
ax.fill_between(total_seen, total_newbies, alpha=0.2)
ax.set_title('Discovery curve')
ax.set_xlabel('Total dolphins')
ax.set_ylabel('Unique dolphins')
plt.show()

Figure 1: Discovery curve for the Moray Firth bottlenose dolphin surveys (Wilson, Hammond, and Thompson 1999).

Training each model

This notebook looks messier than the others, in that I train several models with little commentary along the way. In practice, it would probably be better to wrap these up into a function or a class. To complete the model, I used the following priors, \[ \begin{align} \psi &\sim \text{Uniform}(0, 1)\\ \mu &\sim \text{Logistic}(0, 1) \\ \alpha_t &\sim \text{Normal}(0, \sigma_{\alpha}) \\ \beta &\sim \text{Normal}(0, \sigma_{\beta}) \\ \gamma_i &\sim \text{Normal}(0, \sigma_{\gamma}) \\ \sigma_{\alpha} &\sim \text{InverseGamma}(4, 3) \\ \sigma_{\beta} &\sim \text{InverseGamma}(4, 3) \\ \sigma_{\gamma} &\sim \text{InverseGamma}(4, 3), \end{align} \] which were also used by King and Brooks (2008). Here, each \(\sigma\) is a variance, so the code passes its square root to pm.Normal as the standard deviation. There are two departures for \(\psi\): \(M_{th}\) and \(M_{bh}\) use \(\text{Beta}(1, 1)\), which is equivalent to \(\text{Uniform}(0, 1)\), and the full model uses an informative \(\text{Beta}(1, 5)\) prior (see below). I use the same logp seen in the occupancy and closed capture-recapture notebooks, which accounts for row-level zero-inflation. Unlike other notebooks, I did not look at the summaries or the trace plots unless the sampler indicated that it had issues during training.

The cells below call pm.sample(), which, judging by the sampling logs, runs PyMC's default NUTS sampler. Nutpie, a NUTS sampler written in Rust, is often faster than the default and can be selected through the nuts_sampler argument of pm.sample. Also, I have tweaked the sampling keyword arguments for some models, since they are a little finicky.

def logp(value, n, p, psi):
    
    binom = binomln(n, value) + logpow(p, value) + logpow(1 - p, n - value)
    bin_sum = pm.math.sum(binom, axis=1)
    bin_exp = pm.math.exp(bin_sum)

    res = pm.math.switch(
        value.sum(axis=1) > 0,
        bin_exp * psi,
        bin_exp * psi + (1 - psi)
    )
    
    return pm.math.log(res)
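
For intuition, the same row-level calculation can be written in plain NumPy. This is a minimal sketch that assumes binary capture histories (one trial per cell, as in the models below); the function name `zero_inflated_loglik` and the toy inputs are hypothetical.

```python
import numpy as np

def zero_inflated_loglik(y, p, psi):
    """Row-wise log-likelihood of binary histories under data augmentation."""
    y = np.asarray(y, dtype=float)
    p = np.broadcast_to(p, y.shape)

    # Bernoulli likelihood of each row, given the animal is in the population
    row_lik = np.prod(p**y * (1 - p)**(1 - y), axis=1)

    # a detected animal must be in the population; an all-zero row is either
    # a missed member (psi * row_lik) or a non-member (1 - psi)
    detected = y.sum(axis=1) > 0
    lik = np.where(detected, psi * row_lik, psi * row_lik + (1 - psi))
    return np.log(lik)

# two occasions: one animal caught once, one never caught
toy_y = np.array([[1, 0],
                  [0, 0]])
toy_loglik = zero_inflated_loglik(toy_y, p=0.5, psi=0.4)
```

With \(p = 0.5\) and \(\psi = 0.4\), each row has Bernoulli likelihood \(0.25\), so the detected row contributes \(0.4 \times 0.25 = 0.1\) and the all-zero row contributes \(0.1 + 0.6 = 0.7\).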

with pm.Model() as m0:

    # Priors
    # inclusion
    psi = pm.Uniform('psi', 0, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # Linear model
    mu_matrix = (np.ones((T, M)) * mu).T
    p = pm.Deterministic('p', pm.math.invlogit(mu_matrix))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(m0)

Figure 2: Visual representation of model \(M_{0}\).

with m0:
    m0_idata = pm.sample()

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.

with pm.Model() as mt:

    # Priors
    # inclusion
    psi = pm.Uniform('psi', 0, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # time effect
    sigma_alpha = pm.InverseGamma('sigma_alpha', 4, 3)
    alpha = pm.Normal('alpha', 0, pm.math.sqrt(sigma_alpha), shape=T)

    # Linear model
    # nu = mu + pm.math.dot(occasion_covariate, alpha)
    nu = mu + alpha
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mt)

Figure 3: Visual representation of model \(M_t\).

with mt:
    mt_idata = pm.sample()

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_alpha, alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.

with pm.Model() as mb:

    # Priors
    # inclusion
    psi = pm.Uniform('psi', 0, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)
    
    # behavior effect
    sigma_beta = pm.InverseGamma('sigma_beta', 4, 3)
    beta = pm.Normal('beta', 0, pm.math.sqrt(sigma_beta))

    # Linear model
    nu = mu + behavior_covariate * beta 
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mb)

Figure 4: Visual representation of model \(M_b\).

with mb:
    mb_idata = pm.sample()

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_beta, beta]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.

with pm.Model() as mtb:

    # Priors
    # inclusion
    psi = pm.Uniform('psi', 0, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # time effect
    sigma_alpha = pm.InverseGamma('sigma_alpha', 4, 3)
    alpha = pm.Normal('alpha', 0, pm.math.sqrt(sigma_alpha), shape=T)

    # behavior effect
    sigma_beta = pm.InverseGamma('sigma_beta', 4, 3)
    beta = pm.Normal('beta', 0, pm.math.sqrt(sigma_beta))

    # Linear model
    nu = mu + alpha + behavior_covariate * beta
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mtb)

Figure 5: Visual representation of model \(M_{tb}\).

with mtb:
    mtb_idata = pm.sample()

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_alpha, alpha, sigma_beta, beta]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 9 seconds.

with pm.Model() as mh:

    # Priors
    # inclusion
    psi = pm.Uniform('psi', 0, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # individual effect
    sigma_gamma = pm.InverseGamma('sigma_gamma', 4, 3)
    gamma = pm.Normal('gamma', 0, pm.math.sqrt(sigma_gamma), shape=M)

    # Linear model
    individual_effect = (np.ones((T, M)) * gamma).T
    nu = mu + individual_effect
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mh)

Figure 6: Visual representation of model \(M_h\).

with mh:
    mh_idata = pm.sample(draws=3000, target_accept=0.99)

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_gamma, gamma]

Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 61 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

az.summary(mh_idata, var_names=['psi', 'mu', 'sigma_gamma'])

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
psi	0.185	0.038	0.122	0.257	0.002	0.002	488.0	787.0	1.02
mu	-2.796	0.330	-3.415	-2.224	0.020	0.013	297.0	381.0	1.03
sigma_gamma	0.782	0.341	0.270	1.381	0.030	0.024	151.0	223.0	1.05

az.plot_trace(mh_idata, figsize=(8, 6), var_names=['psi', 'mu', 'sigma_gamma']);

with pm.Model() as mth:

    # Priors
    # inclusion
    psi = pm.Beta('psi', 1, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # time effect
    sigma_alpha = pm.InverseGamma('sigma_alpha', 4, 3)
    alpha = pm.Normal('alpha', 0, pm.math.sqrt(sigma_alpha), shape=T)

    # individual effect
    sigma_gamma = pm.InverseGamma('sigma_gamma', 4, 3)
    gamma = pm.Normal('gamma', 0, pm.math.sqrt(sigma_gamma), shape=M)

    # Linear model
    individual_effect = (np.ones((T, M)) * gamma).T
    nu = mu + alpha + individual_effect
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mth)

Figure 7: Visual representation of model \(M_{th}\).

with mth:
    mth_idata = pm.sample(draws=3000, target_accept=0.95)

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_alpha, alpha, sigma_gamma, gamma]

Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 66 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

az.summary(mth_idata, var_names=['psi', 'mu', 'sigma_alpha', 'sigma_gamma', 'alpha'])

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
psi	0.179	0.037	0.118	0.247	0.002	0.001	621.0	1325.0	1.00
mu	-3.035	0.419	-3.819	-2.275	0.018	0.009	541.0	1061.0	1.01
sigma_alpha	0.929	0.367	0.360	1.581	0.005	0.006	7472.0	7430.0	1.00
sigma_gamma	0.857	0.390	0.295	1.618	0.025	0.019	228.0	469.0	1.02
alpha[0]	-1.104	0.608	-2.237	0.033	0.007	0.007	9323.0	6299.0	1.00
alpha[1]	0.574	0.408	-0.213	1.318	0.005	0.004	7081.0	6947.0	1.00
alpha[2]	-0.802	0.562	-1.852	0.258	0.006	0.006	9568.0	7469.0	1.00
alpha[3]	0.570	0.407	-0.190	1.344	0.005	0.004	6293.0	6918.0	1.00
alpha[4]	0.448	0.416	-0.341	1.213	0.005	0.004	6728.0	7205.0	1.00
alpha[5]	0.789	0.401	0.034	1.527	0.005	0.004	5633.0	6286.0	1.00
alpha[6]	0.183	0.435	-0.633	0.991	0.005	0.004	6432.0	7688.0	1.00
alpha[7]	0.031	0.450	-0.830	0.868	0.005	0.004	8462.0	7910.0	1.00
alpha[8]	-0.798	0.559	-1.862	0.239	0.005	0.005	10811.0	8223.0	1.00
alpha[9]	0.023	0.453	-0.850	0.847	0.005	0.004	7440.0	7332.0	1.00
alpha[10]	1.156	0.375	0.439	1.851	0.005	0.003	6280.0	6713.0	1.00
alpha[11]	-0.332	0.495	-1.274	0.592	0.005	0.005	8792.0	7657.0	1.00
alpha[12]	-1.108	0.610	-2.229	0.057	0.006	0.006	11043.0	7865.0	1.00
alpha[13]	-0.141	0.474	-1.037	0.734	0.005	0.005	8281.0	7634.0	1.00
alpha[14]	-1.106	0.618	-2.283	0.011	0.006	0.007	11590.0	7275.0	1.00
alpha[15]	1.610	0.365	0.928	2.289	0.005	0.003	5741.0	6790.0	1.00
alpha[16]	-0.800	0.551	-1.844	0.207	0.005	0.006	10592.0	7003.0	1.00

az.plot_trace(mth_idata, figsize=(8, 10),
              var_names=['psi', 'mu', 'sigma_alpha', 'sigma_gamma', 'alpha']);

Figure 8: Trace plots for model \(M_{th}\).

with pm.Model() as mbh:

    # Priors
    # inclusion
    psi = pm.Beta('psi', 1, 1)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # behavior effect
    sigma_beta = pm.InverseGamma('sigma_beta', 4, 3)
    beta = pm.Normal('beta', 0, pm.math.sqrt(sigma_beta))
    
    # individual effect
    sigma_gamma = pm.InverseGamma('sigma_gamma', 4, 3)
    gamma = pm.Normal('gamma', 0, pm.math.sqrt(sigma_gamma), shape=M)

    # Linear model
    individual_effect = (np.ones((T, M)) * gamma).T
    nu = mu + behavior_covariate * beta + individual_effect
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mbh)

Figure 9: Visual representation of model \(M_{bh}\).

with mbh:
    mbh_idata = pm.sample(draws=3000, target_accept=0.95)

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_beta, beta, sigma_gamma, gamma]

Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 91 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

az.summary(mbh_idata, var_names=['psi', 'mu', 'beta', 'sigma_gamma'])

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
psi	0.554	0.192	0.244	0.920	0.010	0.003	357.0	1179.0	1.01
mu	-3.504	0.604	-4.607	-2.429	0.039	0.017	256.0	797.0	1.02
beta	-1.537	0.263	-2.027	-1.051	0.006	0.002	1703.0	3792.0	1.00
sigma_gamma	2.143	0.803	0.798	3.598	0.059	0.041	193.0	345.0	1.03

az.plot_trace(mbh_idata, figsize=(8, 10),
              var_names=['psi', 'mu', 'beta', 'sigma_beta', 'sigma_gamma']);

with pm.Model() as mtbh:

    # Priors
    # inclusion
    psi = pm.Beta('psi', 1, 5)  

    # mean catchability 
    mu = pm.Logistic('mu', 0, 1)

    # time effect
    sigma_alpha = pm.InverseGamma('sigma_alpha', 4, 3)
    alpha = pm.Normal('alpha', 0, pm.math.sqrt(sigma_alpha), shape=T)

    # behavior effect
    sigma_beta = pm.InverseGamma('sigma_beta', 4, 3)
    beta = pm.Normal('beta', 0, pm.math.sqrt(sigma_beta))

    # individual effect
    sigma_gamma = pm.InverseGamma('sigma_gamma', 4, 3)
    gamma = pm.Normal('gamma', 0, pm.math.sqrt(sigma_gamma), shape=M)

    # Linear model
    individual_effect = (np.ones((T, M)) * gamma).T
    nu = mu + alpha + behavior_covariate * beta + individual_effect
    p = pm.Deterministic('p', pm.math.invlogit(nu))

    # Likelihood 
    pm.CustomDist(
        'y',
        1,
        p,
        psi,
        logp=logp,
        observed=dolphin_augmented
    )
    
pm.model_to_graphviz(mtbh)

Figure 10: Visual representation of model \(M_{tbh}\).

with mtbh:
    mtbh_idata = pm.sample(draws=2000)

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [psi, mu, sigma_alpha, alpha, sigma_beta, beta, sigma_gamma, gamma]

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 30 seconds.

az.summary(mtbh_idata, 
           var_names=['psi', 'mu', 'alpha', 'beta', 'sigma_alpha', 'sigma_beta', 'sigma_gamma'])

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
psi	0.710	0.101	0.522	0.888	0.002	0.001	3704.0	5368.0	1.00
mu	-3.085	0.466	-3.960	-2.218	0.014	0.006	1129.0	2523.0	1.00
alpha[0]	-3.121	0.712	-4.424	-1.763	0.012	0.008	3656.0	4743.0	1.00
alpha[1]	-0.674	0.509	-1.575	0.307	0.010	0.005	2868.0	4367.0	1.00
alpha[2]	-1.692	0.649	-2.958	-0.542	0.010	0.007	4360.0	4435.0	1.00
alpha[3]	-0.057	0.481	-0.995	0.806	0.009	0.005	3203.0	4372.0	1.00
alpha[4]	0.214	0.487	-0.728	1.100	0.009	0.005	3247.0	4631.0	1.00
alpha[5]	0.838	0.452	-0.007	1.685	0.008	0.005	3130.0	4524.0	1.00
alpha[6]	0.383	0.506	-0.578	1.324	0.009	0.005	3499.0	4907.0	1.00
alpha[7]	0.370	0.520	-0.589	1.371	0.009	0.006	3571.0	4768.0	1.00
alpha[8]	-0.558	0.657	-1.843	0.617	0.009	0.008	6131.0	5387.0	1.00
alpha[9]	0.465	0.515	-0.484	1.447	0.008	0.006	3709.0	4590.0	1.00
alpha[10]	1.663	0.433	0.878	2.491	0.008	0.005	2694.0	3892.0	1.00
alpha[11]	0.138	0.568	-0.971	1.157	0.008	0.007	4685.0	4697.0	1.00
alpha[12]	-0.892	0.749	-2.357	0.427	0.009	0.010	7137.0	5044.0	1.00
alpha[13]	0.384	0.548	-0.644	1.431	0.009	0.007	4154.0	4504.0	1.00
alpha[14]	-0.868	0.752	-2.207	0.535	0.009	0.010	6649.0	4471.0	1.00
alpha[15]	2.243	0.420	1.470	3.039	0.009	0.005	2314.0	4031.0	1.00
alpha[16]	-0.268	0.689	-1.630	0.963	0.009	0.009	5985.0	5182.0	1.00
beta	-3.278	0.384	-3.962	-2.527	0.007	0.004	3080.0	4597.0	1.00
sigma_alpha	1.556	0.592	0.625	2.616	0.009	0.010	4625.0	5420.0	1.00
sigma_beta	2.410	1.614	0.651	5.046	0.021	0.106	8379.0	5059.0	1.00
sigma_gamma	2.751	0.629	1.690	3.982	0.030	0.016	454.0	999.0	1.01

az.plot_trace(mtbh_idata, figsize=(8,14),
           var_names=['psi', 'mu', 'alpha', 'beta', 'sigma_alpha', 'sigma_beta', 'sigma_gamma']);

Figure 11: Trace plots for model \(M_{tbh}\).

The trace plots and summary statistics show convergence issues for many of the individual heterogeneity models. The variance parameter, \(\sigma_{\gamma},\) seems to sample poorly: it has the lowest effective sample size in each summary above. Further, models with both behavioral and individual effects lead to much larger estimates of \(\psi\), with posterior means of 0.55 for \(M_{bh}\) and 0.71 for \(M_{tbh}\) compared with about 0.18 for \(M_h\) and \(M_{th}\). This appears to happen regardless of the size of the data augmentation \(M.\)

Note that I upped the target_accept value for some models (0.99 for \(M_h\), and 0.95 for \(M_{th}\) and \(M_{bh}\)). This slows the sampler, but lowers the risk of divergences.

Model comparison

Next, I select a model for inference using an approximation of leave-one-out (loo) cross-validation (Vehtari, Gelman, and Gabry 2017). To do so, I calculate the pointwise log-likelihood for each model with pm.compute_log_likelihood, which adds it to the InferenceData object. ArviZ can then compare the models with az.compare, which ranks them by loo by default.

with m0:
    pm.compute_log_likelihood(m0_idata)

with mt:
    pm.compute_log_likelihood(mt_idata)

with mb:
    pm.compute_log_likelihood(mb_idata)

with mtb:
    pm.compute_log_likelihood(mtb_idata)

with mh:
    pm.compute_log_likelihood(mh_idata)

with mth:
    pm.compute_log_likelihood(mth_idata)

with mbh:
    pm.compute_log_likelihood(mbh_idata)

with mtbh:
    pm.compute_log_likelihood(mtbh_idata)

model_dict = {"m0": m0_idata, "mt": mt_idata, "mb": mb_idata, 
              "mtb": mtb_idata, "mh": mh_idata, "mth": mth_idata, 
              "mbh": mbh_idata, "mtbh": mtbh_idata}

comparison = az.compare(model_dict)

/Users/philtpatton/source/repos/philpatton.github.io/.venv/lib/python3.13/site-packages/arviz/stats/stats.py:797: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.70 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
  warnings.warn(

The comparison tool flags issues with several of the models, suggesting a lack of robustness. The warning column of the comparison table below shows that the flagged models all include the individual effect \(h.\) A more thorough analysis would consider reparameterizing the model, e.g., through the non-centered parameterization. In lieu of that, I simply discard the models that fail this test and redo the comparison with the passing models.

comparison.round(2)

	rank	elpd_loo	p_loo	elpd_diff	weight	se	dse	warning	scale
mtbh	0	-430.33	86.40	0.00	0.97	57.03	0.00	True	log
mth	1	-488.70	38.12	58.37	0.03	58.80	11.56	True	log
mtb	2	-492.53	20.69	62.20	0.00	61.83	11.86	False	log
mt	3	-493.81	14.28	63.49	0.00	59.56	12.82	False	log
mbh	4	-497.66	70.83	67.33	0.00	61.32	11.16	True	log
mh	5	-518.82	26.05	88.49	0.00	61.54	15.21	True	log
mb	6	-521.59	5.48	91.26	0.00	63.16	15.49	False	log
m0	7	-522.62	2.57	92.29	0.00	62.09	16.20	False	log

good_dict = {"m0": m0_idata, "mt": mt_idata, "mb": mb_idata, "mtb": mtb_idata}
good_comparison = az.compare(good_dict)
good_comparison.round(2)

	rank	elpd_loo	p_loo	elpd_diff	weight	se	dse	warning	scale
mtb	0	-492.53	20.69	0.00	0.62	61.83	0.00	False	log
mt	1	-493.81	14.28	1.28	0.38	59.56	6.37	False	log
mb	2	-521.59	5.48	29.06	0.00	63.16	8.43	False	log
m0	3	-522.62	2.57	30.08	0.00	62.09	10.06	False	log

az.plot_compare(good_comparison, figsize=(5, 4));

Figure 12: Differences in the ELPD criteria, calculated using loo, for each model (Vehtari, Gelman, and Gabry 2017).

The comparison shows that all of the model weight belongs to two models: \(M_t\) and \(M_{tb}.\)
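
One rough way to read the table is to express each elpd_diff in units of its standard error, dse. This ratio is a heuristic I am adding for illustration, not a column that az.compare reports; the values are copied from the table above.

```python
# values copied from the comparison table above (differences from mtb)
elpd_diff = {'mt': 1.28, 'mb': 29.06, 'm0': 30.08}
dse = {'mt': 6.37, 'mb': 8.43, 'm0': 10.06}

# difference from the top-ranked model, in units of its standard error
ratio = {name: elpd_diff[name] / dse[name] for name in elpd_diff}

for name, value in ratio.items():
    print(f'{name}: {value:.2f}')
```

The difference between \(M_t\) and \(M_{tb}\) is a small fraction of its standard error, so loo does not clearly separate them, whereas \(M_b\) and \(M_0\) trail by roughly three standard errors.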

Model averaged predictions

Finally, we can use the model weights to simulate a weighted posterior of \(N.\) To do so, I take a weighted sample of each of the posteriors of \(N,\) with the weight dictated by the comparison tool.

posteriors = [sim_N(good_dict[model]) for model in good_dict]
weights = [good_comparison.loc[model].weight for model in good_dict]
sample_count = len(posteriors[0])

l = []
for w, p in zip(weights, posteriors):
    weighted_sample = RNG.choice(p, size=int(w * sample_count))
    l.append(weighted_sample)

weighted_posterior = np.concatenate(l)

fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(7, 6), sharex=True, sharey=True, tight_layout=True)

pal = sns.color_palette("Set2")

# labs = [k for k in good_dict.keys()]
labs = [r'$M_{0}$', r'$M_{t}$', r'$M_{b}$', r'$M_{tb}$']
for i, p in enumerate(posteriors):
    ax0.hist(p, color=pal[i], edgecolor='white', bins=60, alpha=0.6, label=labs[i])

ax0.set_title(r'Posteriors of $N$')
# ax1.set_title(r'Weighted posterior')

ax0.set_xlim((53, 150))
ax0.legend()

ax0.set_ylabel('Number of samples')
ax1.set_ylabel('Number of samples')

ax1.hist(weighted_posterior, edgecolor='white', bins=60, alpha=0.9, color=pal[6], label='Weighted')
ax1.legend()

plt.show()

Figure 13: Posteriors of \(N\) from the four models under consideration (top panel), with the model averaged posterior (bottom panel).
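
The posteriors above come from sim_N, which draws \(N\) from \(\text{Binomial}(M, \psi)\). A common alternative under data augmentation conditions on the \(n\) detected animals and simulates only the undetected ones. The sketch below does this for a model with constant \(p\), such as \(M_0\). The posterior draws `psi_draws` and `p_draws` are made-up stand-ins rather than output from the fitted models, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(808)

# hypothetical posterior draws for a model with constant p
n_draws = 1000
psi_draws = rng.beta(2, 8, size=n_draws)
p_draws = rng.beta(2, 18, size=n_draws)

# study dimensions from the text: n detected, T occasions, M augmented
n_detected, n_occasions, M_aug = 56, 17, 500

# probability that a member goes undetected on every occasion
p_missed = (1 - p_draws) ** n_occasions

# probability that an undetected augmented animal is in the population
p_member = psi_draws * p_missed / (psi_draws * p_missed + 1 - psi_draws)

# detected animals are known members; simulate only the undetected ones
n_undetected = rng.binomial(M_aug - n_detected, p_member)
N_draws = n_detected + n_undetected
```

By construction, every draw of \(N\) is at least \(n\), which the unconditional draw does not guarantee.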

We can also look at the posterior densities of \(p\) from Model \(M_t,\) the second most weighted model.

p_samps = az.extract(mt_idata).p

fig, ax = plt.subplots(figsize=(6, 4))

a = 0.4
pal = sns.color_palette('viridis', T)
for t in range(T):
    label_idx = t % 2
    if label_idx == 0:
        az.plot_dist(p_samps[t], ax=ax, color=pal[t], label=f'$t_{{{t}}}$',
                     plot_kwargs={'linewidth':3, 'alpha': a})
    else:
        az.plot_dist(p_samps[t], ax=ax, color=pal[t],
                     plot_kwargs={'linewidth':3, 'alpha': a})

ax.set_title(r'Posterior densities of $p$ from $M_t$')
ax.set_xlabel(r'$p$')

plt.show()

Figure 14: Posteriors of \(p\) from model \(M_t\)

This notebook demonstrates a simple way to compare models using leave-one-out cross-validation (loo) and a classic example from capture-recapture. This is just one way, however, to perform model comparison using PyMC. Perhaps a more effective solution for this problem would be placing a shrinkage prior on the \(\sigma\) parameters.

%load_ext watermark

%watermark -n -u -v -iv -w

Last updated: Mon Apr 28 2025

Python implementation: CPython
Python version       : 3.13.2
IPython version      : 9.0.2

matplotlib: 3.10.1
pandas    : 2.2.3
arviz     : 0.21.0
pymc      : 5.22.0
numpy     : 2.1.3
seaborn   : 0.13.2

Watermark: 2.5.0

References

King, Ruth, and SP Brooks. 2008. “On the Bayesian Estimation of a Closed Population Size in the Presence of Heterogeneity and Model Uncertainty.” Biometrics 64 (3): 816–24.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27: 1413–32.

Wilson, Ben, Philip S Hammond, and Paul M Thompson. 1999. “Estimating Size and Assessing Trends in a Coastal Bottlenose Dolphin Population.” Ecological Applications 9 (1): 288–300.