Bits of string: compositional data

(§) The article A flexible Bayesian tool for CoDa mixed models (Martínez-Minaya and Rue 2024) looks promising; it frames compositional count data in a way that makes it possible to fit models using INLA.

(§) The dimension of log ratio vectors matters. Compare the CLR vector $(-1, 1, 0)$ with $(-1, 1, 0, 0, \ldots)$ (a hundred zeros appended). The former maps to roughly $(0.09, 0.67, 0.24)$ in the simplex, while the latter maps to roughly $(0.004, 0.03, 0.01, 0.01, \ldots)$, which is practically uniform. The log ratios between coordinates 1 and 2 (or 1 and 3) remain the same, however.
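A quick numeric check (a Python/numpy sketch; `inv_clr` is my name for the inverse CLR, i.e. the closure of the exponentials):

```python
import numpy as np

def inv_clr(x):
    """Inverse CLR: exponentiate, then apply the closure (normalize to sum 1)."""
    e = np.exp(x)
    return e / e.sum()

short = np.array([-1.0, 1.0, 0.0])
long = np.concatenate([[-1.0, 1.0], np.zeros(100)])

print(inv_clr(short))     # ~ [0.090 0.665 0.245]
print(inv_clr(long)[:4])  # ~ [0.0036 0.0264 0.0097 0.0097], nearly uniform
print(long[1] - long[0])  # the 1:2 log ratio is 2 in both vectors
```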

(§) Suppose we’re simulating (count) data where a subcomposition made up of $n_1$ components is active and the remaining $n_2$ components are inactive. We could start with log ratios (not centered, but relative to some unknown denominator) and set the inactive variables to all 0, so we have a log ratio vector $x = (x_1, \ldots, x_{n_1}, 0, 0, \ldots, 0)$. What size should the $x_i$ be?

As the note above suggests, this should depend on dimension. We could decide that on average we want some set proportion, $p$, of counts to fall in the active subcomposition; let’s say $p = 1/4$. Then the probabilities (for a multinomial draw) corresponding to the active parts should sum to $1/4$. Since we invert log ratios by $\text{lr}^{-1}(x) = \mathcal{C}(e^{x_1}, \ldots, e^{x_{n_1}}, 1, 1, \ldots, 1)$, where $\mathcal{C}$ is the closure operator, it seems that we want $3\sum e^{x_i} = n_2 \ (*)$ on average.
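To spell out where $(*)$ comes from: after closure, the active mass is the sum of the active exponentials over the total, and setting that to $p = 1/4$ rearranges to $(*)$:

$$ \frac{\sum_{i=1}^{n_1} e^{x_i}}{\sum_{i=1}^{n_1} e^{x_i} + n_2} = \frac{1}{4} \iff 3 \sum_{i=1}^{n_1} e^{x_i} = n_2. $$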

We’ll draw the $X_i$ independently, $X_i \sim N(0, \sigma)$, and choose $\sigma$ so that $(*)$ above holds on average. If $Y \sim N(\mu, \sigma)$ then $\text{E}\, e^Y = e^{\mu + \sigma^2/2}$ (the mean of a log-normal), so taking the expectation of $(*)$ we get

$$ 3 n_1 e^{\sigma^2/2} = n_2 \implies \sigma = \sqrt{2 \ln \frac{n_2}{3 n_1}}. $$
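A minimal simulation sketch (Python/numpy; the values $n_1 = 5$, $n_2 = 200$ are just illustrative choices of mine) to check that $(*)$ holds on average. Note the realized active proportion averages a bit below $p$, since the expectation was taken of the unnormalized mass rather than of the ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 5, 200                  # needs n2 > 3*n1 for sigma to be real
sigma = np.sqrt(2 * np.log(n2 / (3 * n1)))

# many replicates of the active log ratios
x = rng.normal(0.0, sigma, size=(100_000, n1))
s = np.exp(x).sum(axis=1)        # unnormalized active mass, sum of e^{x_i}

print(3 * s.mean())              # ~ n2 = 200, i.e. (*) holds on average
print((s / (s + n2)).mean())     # below p = 0.25: s/(s+n2) is concave in s,
                                 # so Jensen pulls the mean proportion down
```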

(§) The sum-to-one constraint forces a negative correlation (when something goes up, something else must go down), but the effect can often be ignored in high dimensions under mild assumptions. Following Townes et al. (2019), if we assume

$$ y \sim \text{multinomial}(\pi, n) $$

with $y = (y_1, \ldots, y_k)$ and $\pi = (\pi_1, \ldots, \pi_k)$, the marginals are

$$ y_i \sim \text{binomial}(\pi_i, n), $$

with $\mathbb{E}\, y_i = \mu_i = n\pi_i$ and $\text{var}(y_i) = n\pi_i(1-\pi_i) = \mu_i - \mu_i^2/n$. This last identity hints at the well-known small-$p$, big-$n$ Poisson horse-kick approximation to the binomial.
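A quick empirical check of the marginal moments (Python/numpy sketch; the equal probabilities $\pi_i = 1/k$ and the sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 50, 1_000
pi = np.full(k, 1 / k)
y = rng.multinomial(n, pi, size=200_000)

mu = n * pi[0]
print(y[:, 0].mean(), mu)             # ~ 20.0
print(y[:, 0].var(), mu - mu**2 / n)  # ~ 19.6: variance just under the mean,
                                      # i.e. nearly Poisson for small pi, big n
```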

More interesting for our purpose is the expression for the correlation between two compositional parts,

$$ \text{cor}(y_i, y_j) = -\sqrt{\frac{\pi_i \pi_j}{(1-\pi_i)(1-\pi_j)}}, $$

which is practically zero if both $\pi_i$ and $\pi_j$ are small. In some settings it might be quite reasonable to assume that $\pi_i \approx 1/k$, i.e. that the parts are roughly of similar magnitude. Then $\pi_i$ gets quite small as $k$ (the dimension) grows.
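Checking the vanishing correlation under the $\pi_i \approx 1/k$ assumption (Python/numpy sketch; the $k$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
for k in (5, 50, 500):
    pi = np.full(k, 1 / k)
    theory = -np.sqrt(pi[0] * pi[1] / ((1 - pi[0]) * (1 - pi[1])))
    y = rng.multinomial(1_000, pi, size=100_000)
    emp = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
    print(f"k={k:4d}  theory={theory:+.4f}  empirical={emp:+.4f}")
# k=5 gives -0.25; by k=500 the correlation is ~ -0.002, practically zero
```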

Note to self: but the big question: what if we assume some prior over $\pi$, and that prior has a covariance of its own? Where does that covariance go?

Backlinks:

Martínez-Minaya, Joaquín, and Håvard Rue. 2024. “A Flexible Bayesian Tool for CoDa Mixed Models: Logistic-Normal Distribution with Dirichlet Covariance.” Statistics and Computing 34 (3): 116. https://doi.org/10.1007/s11222-024-10427-3.
Townes, F. William, Stephanie C. Hicks, Martin J. Aryee, and Rafael A. Irizarry. 2019. “Feature Selection and Dimension Reduction for Single-Cell RNA-Seq Based on a Multinomial Model.” Genome Biology 20 (1): 295. https://doi.org/10.1186/s13059-019-1861-6.
this file last touched 2026.03.16