Bits of string: compositional data

(§) The article "A flexible Bayesian tool for CoDa mixed models" (Martínez-Minaya and Rue 2024) looks promising; it frames compositional count data so that the models can be fit with INLA.

(§) The dimension of a log ratio vector matters. Compare the CLR vector $(-1, 1, 0)$ with $(-1, 1, 0, 0, \ldots)$ (the zero repeated a hundred times). The former is roughly $(0.09, 0.67, 0.24)$ in the simplex, while the latter is roughly $(0.004, 0.03, 0.01, 0.01, \ldots)$, which is practically uniform. The log ratios between coordinates 1 and 2 (or 1 and 3) are the same in both cases, however.
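The figures above can be checked numerically. This is a minimal sketch with NumPy; the helper name `clr_inv` is my own label rather than a library function, and it is nothing more than a softmax (exponentiate, then close).

```python
import numpy as np

def clr_inv(x):
    """Map a log ratio vector to the simplex: exponentiate, then normalise to sum 1."""
    x = np.asarray(x, dtype=float)
    w = np.exp(x - x.max())  # shifting by the max is for stability; the closure cancels it
    return w / w.sum()

short = clr_inv([-1, 1, 0])            # three parts
long = clr_inv([-1, 1] + [0] * 100)    # same first two parts, a hundred zeros after

print(np.round(short, 2))
print(np.round(long[:4], 3))

# The log ratio between coordinates 1 and 2 is unchanged by the extra parts.
lr_short = np.log(short[0] / short[1])
lr_long = np.log(long[0] / long[1])
print(lr_short, lr_long)
```

The shift by `x.max()` is only there to avoid overflow for large log ratios; it does not change the result.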

(§) Suppose we’re simulating (count) data where a subcomposition made up of $n_1$ components is active and the remaining $n_2$ components are inactive. We could start with log ratios (not centered, but relative to some unknown denominator) and set the inactive variables to all-0, so we have a log ratio vector $x = (x_1, \ldots, x_{n_1}, 0, 0, \ldots, 0)$. What size should the $x_i$ be?

As the note above suggests, this should depend on dimension. We could decide that on average we want some set proportion, $p$, of the counts to fall in the active subcomposition; let’s say $p = 1/4$. Then the probabilities (for a multinomial draw) corresponding to the active parts should sum to $1/4$. We invert log ratios by $\text{lr}^{-1}(x) = \mathcal C(e^{x_1}, \ldots, e^{x_{n_1}}, 1, 1, \ldots, 1)$, where $\mathcal C$ is the closure operator, so the active parts carry mass $\sum e^{x_i} / (\sum e^{x_i} + n_2)$. Setting this equal to $1/4$, it seems that we want $3\sum e^{x_i} = n_2\ (*)$ on average.
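As a sanity check on the condition $(*)$, here is a small deterministic sketch. The function name `lr_inv` and the values of `n1` and `n2` are illustrative choices of mine; the active log ratios are all set to the same value, chosen so that $3\sum e^{x_i} = n_2$ holds exactly.

```python
import numpy as np

def lr_inv(x_active, n2):
    """Close (e^{x_1}, ..., e^{x_{n1}}, 1, ..., 1): active parts first, then n2 inactive ones."""
    w = np.concatenate([np.exp(x_active), np.ones(n2)])
    return w / w.sum()

n1, n2 = 5, 60                          # illustrative sizes
x = np.full(n1, np.log(n2 / (3 * n1)))  # each e^{x_i} = n2 / (3 * n1), so 3 * sum = n2
probs = lr_inv(x, n2)
active_mass = probs[:n1].sum()          # total probability on the active subcomposition
print(active_mass)
```

With these inputs the active parts receive a quarter of the probability mass, up to floating point error.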

We’ll draw independently $X_i \sim N(0, \sigma)$ and choose $\sigma$ so that $(*)$ above is true on average. If $Y \sim N(\mu, \sigma)$ then $\text{E} e^Y = e^{\mu + \sigma^2/2}$ (this is the mean of a log-normal distribution), so if we take the expectation of $(*)$ we get

$$3n_1e^{\sigma^2/2} = n_2 \implies \sigma = \sqrt{2\ln\frac{n_2}{3n_1}}.$$

This needs $n_2 > 3n_1$, otherwise the logarithm is not positive and there is no real solution for $\sigma$.
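A simulation sketch of this recipe, under stated assumptions: the helper `sigma_for` and the sizes `n1`, `n2`, `reps` are hypothetical choices of mine, and the general-$p$ form $n_1 e^{\sigma^2/2} = n_2\,p/(1-p)$ is my extension of the $p = 1/4$ case above (it reduces to the same formula there).

```python
import numpy as np

def sigma_for(n1, n2, p=0.25):
    """sigma solving n1 * exp(sigma^2 / 2) = n2 * p / (1 - p)."""
    target = n2 * p / (1 - p) / n1
    if target <= 1:
        raise ValueError("no real sigma: need n2 * p / (1 - p) > n1")
    return np.sqrt(2 * np.log(target))

rng = np.random.default_rng(0)
n1, n2, reps = 10, 60, 100_000          # illustrative sizes
sigma = sigma_for(n1, n2)

x = rng.normal(0.0, sigma, size=(reps, n1))  # active log ratios, one row per replicate
s = np.exp(x).sum(axis=1)                    # sum of e^{x_i} in each replicate
mean_s = s.mean()                            # should be close to n2 / 3
mean_active = (s / (s + n2)).mean()          # average probability mass on the active parts

# One multinomial draw of 1000 counts from the first replicate.
probs = np.concatenate([np.exp(x[0]), np.ones(n2)])
probs /= probs.sum()
counts = rng.multinomial(1000, probs)

print(sigma, mean_s, mean_active, counts[:n1].sum())
```

Because $(*)$ is matched in expectation of the sum rather than of the proportion itself, the mean active mass should come out somewhat below $p$: the map $s \mapsto s/(s + n_2)$ is concave, so Jensen's inequality applies.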

Martínez-Minaya, Joaquín, and Haavard Rue. 2024. “A Flexible Bayesian Tool for CoDa Mixed Models: Logistic-Normal Distribution with Dirichlet Covariance.” Statistics and Computing 34 (3): 116. https://doi.org/10.1007/s11222-024-10427-3.
this file last touched 2025.09.05