Description
Hi glmGamPoi author,
I have been reviewing the mathematical formulation for the overdispersion parameter $\theta$ optimization and noticed a potential mismatch between the theoretical second derivative of the log-likelihood ($\Large\frac{d^2l}{d\theta^2}$) and its actual code implementation in src/overdispersion.cpp.
Could you help clarify if this is an intentional optimization (e.g., an approximation like Fisher Scoring) or a mathematical typo?
1. Theoretical Second Derivative
Based on the log-likelihood function, the first derivative with respect to $\theta$ is:
$$\frac{d l}{d \theta} = \frac{1}{\theta} \left[ \sum_{k=1}^{N}{ \left( -\frac{1}{\theta} \left( \psi(C_k + \theta^{-1}) - \psi(\theta^{-1}) \right) + \ln(1+\mu_k\theta) + \frac{C_k - \mu_k}{\mu_k + \theta^{-1}} \right) } \right]$$
Differentiating this a second time yields the theoretical Hessian component for $\theta$:
$$\frac{d^2l}{d\theta^2} =
\sum_{k=1}^{N} \left[
\frac{2}{\theta^3}\left(\psi(C_k + \theta^{-1}) - \psi(\theta^{-1})\right) + \frac{1}{\theta^4}\left(\psi'(C_k + \theta^{-1}) - \psi'(\theta^{-1})\right) - \frac{1}{\theta^2}\ln(1 + \mu_k\theta) + \frac{\mu_k}{\theta(1+\mu_k\theta)} - \frac{\mu_k(C_k-\mu_k)}{(1+\mu_k\theta)^2}
\right]$$
2. Code Implementation Analysis
Looking at src/overdispersion.cpp (around line 233):
for(size_t iter = 0; iter < count_frequencies.size(); ++iter){
digamma_term += count_frequencies[iter] * Rf_digamma(unique_counts[iter] + theta_neg1);
trigamma_term += count_frequencies[iter] * Rf_trigamma(unique_counts[iter] + theta_neg1);
}
trigamma_term *= theta_neg2;
digamma_term -= y.size() * Rf_digamma(theta_neg1);
trigamma_term -= theta_neg2 * y.size() * Rf_trigamma(theta_neg1);
...
ll_part_1 += log(1 + mu[i] * theta) + (y[i] - mu[i]) / (mu[i] + theta_neg1);
ll_part_2 += (mu[i] * mu[i] * theta + y[i]) / (1 + mu[i] * theta) / (1 + mu[i] * theta);
double ll_part = -2 * theta_neg1 * (ll_part_1 - digamma_term) + (ll_part_2 + trigamma_term);
double res = ll_part + cr_term * R_pow_di(theta, 2) + (ll_part_1 - digamma_term) * theta_neg1 + cr_term2 * theta;
If we map these code components directly back to the math notation (ignoring the Cox-Reid adjustment terms cr_term for a moment), the logic computes:
$$
\begin{aligned}
\text{digamma-term} &= \psi(C_k + \theta^{-1}) - \psi(\theta^{-1}) \\
\text{trigamma-term} &= \frac{1}{\theta^2} (\psi'(C_k + \theta^{-1}) - \psi'(\theta^{-1})) \\
\text{ll-part-1} &= \ln(1 + \mu_k\theta) + \frac{C_k - \mu_k}{\mu_k + \theta^{-1}} \\
\text{ll-part-2} &= \frac{\mu^2_k\theta+C_k}{(1+\mu\theta)^2} \\
\text{ll-part} &= \frac{-2}{\theta} \left( \text{ll-part-1} - \text{digamma-term} \right) + \text{ll-part-2} + \text{trigamma-term} \\
\text{res} &= \text{ll-part} + \frac{1}{\theta}(\text{ll-part-1} - \text{digamma-term})
\end{aligned}
$$
The Discrepancy
When expanding the final returned variable res, the resulting combination yields:
$$\text{res} = \sum_{k=1}^{N} \left[ \frac{1}{\theta} \left( \psi(C_k + \theta^{-1}) - \psi(\theta^{-1}) - \ln(1 + \mu_k\theta) - \frac{C_k - \mu_k}{\mu_k + \theta^{-1}} \right) + \frac{\mu^2_k\theta+C_k}{(1+\mu_k\theta)^2} + \frac{1}{\theta^2} (\psi'(C_k + \theta^{-1}) - \psi'(\theta^{-1})) \right]$$
Comparing this to the analytic second derivative ($\Large\frac{d^2l}{d\theta^2}$), the coefficients for $\Large\theta$ do not match (for example, the code ends up with a $\frac{1}{\theta}$ scaling on the digamma differences instead of $\Large\frac{2}{\theta^3}$, and $\Large\frac{1}{\theta^2}$ on the trigamma differences instead of $\Large\frac{1}{\theta^4}$).
Could you please clarify if res is intended to evaluate exactly to $\Large\frac{d^2l}{d\theta^2}$, or if this reflects a specific parameter transformation / expected information approximation step?
Thank you for your time and for maintaining this excellent package!
Description
Hi
glmGamPoiauthor,I have been reviewing the mathematical formulation for the overdispersion parameter$\theta$ optimization and noticed a potential mismatch between the theoretical second derivative of the log-likelihood ($\Large\frac{d^2l}{d\theta^2}$ ) and its actual code implementation in
src/overdispersion.cpp.Could you help clarify if this is an intentional optimization (e.g., an approximation like Fisher Scoring) or a mathematical typo?
1. Theoretical Second Derivative
Based on the log-likelihood function, the first derivative with respect to$\theta$ is:
Differentiating this a second time yields the theoretical Hessian component for$\theta$ :
2. Code Implementation Analysis
Looking at
src/overdispersion.cpp(around line 233):If we map these code components directly back to the math notation (ignoring the Cox-Reid adjustment terms
cr_termfor a moment), the logic computes:The Discrepancy
When expanding the final returned variable
res, the resulting combination yields:Comparing this to the analytic second derivative ($\Large\frac{d^2l}{d\theta^2}$ ), the coefficients for $\Large\theta$ do not match (for example, the code ends up with a $\frac{1}{\theta}$ scaling on the digamma differences instead of $\Large\frac{2}{\theta^3}$ , and $\Large\frac{1}{\theta^2}$ on the trigamma differences instead of $\Large\frac{1}{\theta^4}$ ).
Could you please clarify if$\Large\frac{d^2l}{d\theta^2}$ , or if this reflects a specific parameter transformation / expected information approximation step?
resis intended to evaluate exactly toThank you for your time and for maintaining this excellent package!