Quasi-likelihood gamma regression in statsmodels for zeroes in observations
Warning! I am not a statistician. This article has not been reviewed. Please confer with a responsible adult!
Generalised linear models with a gamma distribution and log link are frequently used to model non-negative right-skewed continuous data, such as costs [1].
For example, in statsmodels:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
# Generate some data
np.random.seed(12345)
dist1 = stats.gamma(a=5, scale=40)
dist2 = stats.gamma(a=5, scale=45)
x = [0] * 500 + [1] * 500
y = np.concatenate([dist1.rvs(500), dist2.rvs(500)])
df = pd.DataFrame(dict(x=x, y=y))
# Fit the model
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 998
Model Family: Gamma Df Model: 1
Link Function: Log Scale: 0.20922
Method: IRLS Log-Likelihood: -5900.3
Date: Sat, 03 Sep 2022 Deviance: 203.31
Time: 20:13:02 Pearson chi2: 209.
No. Iterations: 7 Pseudo R-squ. (CS): 0.008731
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.3191 0.020 260.031 0.000 5.279 5.359
x 0.0857 0.029 2.962 0.003 0.029 0.142
==============================================================================
The regression shows, as expected, that x has a significant effect on y. We can similarly perform a likelihood ratio test, which yields the same result:
lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue) # -> 0.0030633579406539324
However, statsmodels encounters difficulty when the dependent variable y contains zeros. Tinkering with the data a bit to show this:
# Change a single observation to zero
df.loc[0, 'y'] = 0
# Fit the model again
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())
/usr/lib/python3.10/site-packages/statsmodels/genmod/families/family.py:777: RuntimeWarning: divide by zero encountered in log
ll_obs -= special.gammaln(weight_scale) + np.log(endog)
/usr/lib/python3.10/site-packages/statsmodels/genmod/generalized_linear_model.py:1767: RuntimeWarning: invalid value encountered in double_scalars
prsq = 1 - np.exp((self.llnull - self.llf) * (2 / self.nobs))
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 998
Model Family: Gamma Df Model: 1
Link Function: Log Scale: 0.21053
Method: IRLS Log-Likelihood: inf
Date: Sat, 03 Sep 2022 Deviance: 273.37
Time: 20:16:26 Pearson chi2: 210.
No. Iterations: 7 Pseudo R-squ. (CS): nan
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.3175 0.021 259.136 0.000 5.277 5.358
x 0.0873 0.029 3.010 0.003 0.030 0.144
==============================================================================
While the effect of x remains significant, statsmodels is reporting a log-likelihood of inf, which also means we cannot perform a likelihood ratio test.
To satisfy ourselves as to why this should be the case, let us derive the log-likelihood function for a gamma model. The probability density function for the gamma distribution is:
\[f(y; k, μ) = \frac{(k/μ)^k}{Γ(k)} e^{-ky/μ} y^{k-1}\]where $k > 0$ is the shape parameter and $μ > 0$ is the mean [2 p. 153]. Therefore, the log-likelihood for an observation $i$ is:
\[\begin{align*} \ln L_i(\hat{\bm{β}}) = \ln f(y_i; k, \hat{μ}_i) &= \ln\left(\frac{(k/\hat{μ}_i)^k}{Γ(k)}\right) + \ln\left(e^{-ky_i/\hat{μ}_i}\right) + \ln\left(y_i{}^{k-1}\right) \\ &= k\ln\left(\frac{k}{\hat{μ}_i}\right) - \ln Γ(k) - \frac{ky_i}{\hat{μ}_i} + (k - 1)\ln y_i \\ &= k\left[ \ln\left(\frac{k}{\hat{μ}_i}\right) + \ln y_i \right] - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \end{align*}\]When $y_i = 0$, $\ln y_i$ is undefined ($\ln y_i \to -∞$ as $y_i \to 0^+$), and so the log-likelihood is undefined – hence the inf that statsmodels reports.
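As a quick sanity check on this derivation, we can compare it against scipy's gamma log-density. A minimal sketch, where the particular values of $k$, $\hat{μ}_i$ and $y_i$ are arbitrary choices for illustration only:
import numpy as np
from scipy import stats, special
# Arbitrary illustrative values (assumptions for this check only)
k, mu, y_i = 5.0, 200.0, 180.0
# Log-likelihood as derived above
derived = k * np.log(k * y_i / mu) - k * y_i / mu - special.gammaln(k) - np.log(y_i)
# scipy parameterises the gamma by shape a and scale theta (mean = a * theta),
# so a = k and theta = mu / k gives mean mu
reference = stats.gamma.logpdf(y_i, a=k, scale=mu / k)
print(derived, reference)  # the two values agree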
However, note that statsmodels still produces parameter estimates based on the maximum likelihood estimating equations. This can be justified from a quasi-likelihood perspective.1 Maximum likelihood estimation maximises the log-likelihood by taking its derivative (the score function) and solving for the roots. Quasi-likelihood estimation starts from the form of the score function, treating it as if it were the derivative of some quasi-likelihood function to be maximised [2 p. 279]. Quasi-likelihood functions, then, have the same form as the log-likelihood function $\ln L_i(\hat{\bm{β}})$ but may differ by some constant term, such that the derivative is unchanged.
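To make this concrete, differentiating the log-likelihood derived above with respect to $\hat{μ}_i$ gives the score contribution for observation $i$:
\[\frac{\partial \ln L_i(\hat{\bm{β}})}{\partial \hat{μ}_i} = -\frac{k}{\hat{μ}_i} + \frac{ky_i}{\hat{μ}_i{}^2} = \frac{k(y_i - \hat{μ}_i)}{\hat{μ}_i{}^2} = \frac{y_i - \hat{μ}_i}{\mathrm{var}(y_i)}\]where the last step uses the gamma mean–variance relationship $\mathrm{var}(y_i) = \frac{\hat{μ}_i{}^2}{k}$ (see footnote 1). Unlike the log-likelihood itself, this score is perfectly well defined at $y_i = 0$, which is why the estimating equations can still be solved.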
Let us rewrite $\ln L_i(\hat{\bm{β}})$ as follows:
\[\begin{align*} \ln L_i(\hat{\bm{β}}) &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln(ky_i) - k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} + \left[ k\ln(ky_i) - \ln Γ(k) - \ln y_i \right] \end{align*}\]Note that the bracketed term is constant in $\hat{μ}_i$, and so constant in $\hat{\bm{β}}$, so omitting it would not change the derivative of the overall expression with respect to $\hat{\bm{β}}$. In particular, when subtracting two log-likelihoods in the likelihood ratio test, the bracketed terms will cancel.2
So we can take $Q_i(\hat{\bm{β}}) = -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i}$ to be a quasi-likelihood function for each observation $i$.3,4 It has useful similarities to the true log-likelihood – it is maximised whenever the log-likelihood is maximised, and the difference between two quasi-likelihoods equals the difference between the two log-likelihoods – but is also defined when $y_i = 0$.5
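We can also check numerically that $Q_i$ and the true log-likelihood differ only by a term that is constant in $\hat{μ}_i$: their differences across any two candidate means agree. A minimal sketch, again with arbitrary illustrative values:
import numpy as np
from scipy import special
def loglik(y, mu, k):
    # True gamma log-likelihood, as derived above
    return k * np.log(k * y / mu) - k * y / mu - special.gammaln(k) - np.log(y)
def quasi(y, mu, k):
    # Quasi-likelihood Q_i, as defined above
    return -k * np.log(mu) - k * y / mu
# Arbitrary illustrative values (assumptions for this check only)
k, y_i = 5.0, 180.0
mu_a, mu_b = 150.0, 220.0
print(loglik(y_i, mu_a, k) - loglik(y_i, mu_b, k))
print(quasi(y_i, mu_a, k) - quasi(y_i, mu_b, k))  # identical difference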
We can implement this quasi-likelihood in statsmodels as follows:
class GammaQL(sm.families.Gamma):
    def loglike_obs(self, endog, mu, var_weights=1, scale=1):
        # statsmodels parameterises the shape as k = var_weights / scale
        k = var_weights / scale
        # _clean clips endog / mu away from exact zero, mirroring the parent
        # implementation (harmless here, since we never take its log)
        endog_mu = self._clean(endog / mu)
        # Quasi-likelihood Q_i = -k * ln(mu_i) - k * y_i / mu_i
        ll_obs = -np.log(mu) - endog_mu
        return ll_obs * k
model = sm.formula.glm('y ~ x', df, family=GammaQL(sm.families.links.Log())).fit()
print(model.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 998
Model Family: GammaQL Df Model: 1
Link Function: Log Scale: 0.21053
Method: IRLS Log-Likelihood: -30214.
Date: Sat, 03 Sep 2022 Deviance: 273.37
Time: 21:30:19 Pearson chi2: 210.
No. Iterations: 7 Pseudo R-squ. (CS): 0.009015
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.3175 0.021 259.136 0.000 5.277 5.358
x 0.0873 0.029 3.010 0.003 0.030 0.144
==============================================================================
We can now perform a likelihood ratio test. (Note that the reported quasi-log-likelihood of -30214 is not comparable with the true log-likelihood from the zero-free fit, since it omits the constant bracketed term; only differences of quasi-likelihoods are meaningful here.)
lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue) # -> 0.002618765806616574
As expected, the p value is significant, consistent with the p value for x in the regression summary.
References
[1] Malehi AS, Pourmotahari F, Angali KA. Statistical models for the analysis of skewed healthcare cost data: a simulation study. Health Economics Review. 2015; 5(1): 11. doi: 10.1186/s13561-015-0045-7
[2] Agresti A. Foundations of linear and generalized linear models. New Jersey: John Wiley & Sons; 2015.
[3] McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983; 11(1): 59–67.
[4] StataCorp. Stata base reference manual: release 17. Texas: StataCorp. https://www.stata.com/manuals/r.pdf
Footnotes
1. In the maximum likelihood perspective, we assume a gamma distribution for $y_i$, and so the log-likelihood is undefined when $y_i = 0$, because a gamma-distributed variable can never be 0. However, in the quasi-likelihood approach, we do not assume a distribution for $y_i$; rather, we only specify the mean–variance relationship [2 p. 268]. In the case of $\mathrm{var}(y_i) = \frac{\hat{μ}_i{}^2}{k}$, the resulting quasi-likelihood estimating equations are the same as the maximum likelihood estimating equations, but are also defined for $y_i = 0$. Theory tells us that the estimators from quasi-likelihood estimation are consistent [2 p. 279]. ↩
2. This assumes that $k$ is the same in both the null model and the fitted model, which is true in the statsmodels implementation. ↩
3. This can also be obtained by working backwards from the (quasi-)score function and solving the resulting system of partial differential equations. The bracketed terms become a constant of integration $c(\bm{y}, \hat{\bm{μ}})$, the choice of which is ‘entirely arbitrary’ [3]. ↩
4. Stata's glm command uses a similar quasi-likelihood, $-\ln(\hat{μ}_i) - \frac{y_i}{\hat{μ}_i}$ [4 p. 836]. This differs only by a factor of $k$, and so is also maximised whenever the true log-likelihood is maximised. However, the omitted factor causes the likelihood ratio statistic to be off by a factor of $k$. See also https://www.statalist.org/forums/forum/general-stata-discussion/general/1680539-stata-s-log-likelihood-and-likelihood-ratio-test-for-gamma-glm. ↩
5. Suppose we had added a small $\epsilon > 0$ to each $y_i$ – the usual approach to ‘fudging’ away the zeroes – and then calculated a likelihood ratio as standard. Since $y_i + \epsilon > 0$, $\ln(y_i + \epsilon)$ is defined and the bracketed terms will always cancel. Since $-k\ln(\hat{μ}_i) - \frac{k(y_i + \epsilon)}{\hat{μ}_i}$ is clearly continuous with respect to $y_i + \epsilon$, as $\epsilon \to 0$, it approaches $Q_i(\hat{\bm{β}})$. Therefore, the quasi-likelihood ratio given by $Q_i(\hat{\bm{β}})$ is equivalent to the limiting behaviour of ‘fudging’ the zeroes under the conventional likelihood ratio, as illustrated in the sketch following these footnotes. ↩
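To illustrate footnote 5 numerically, here is a minimal sketch (assuming df, still containing the zero observation, and the imports from above remain in scope). It adds progressively smaller values of $\epsilon$ to $y$ and recomputes the conventional likelihood ratio statistic, which should approach the quasi-likelihood ratio statistic from the GammaQL fit (roughly 9.07, working backwards from the p value above):
for eps in [1.0, 1e-2, 1e-4, 1e-6]:
    # Shift every observation away from zero by eps
    df_eps = df.assign(y=df['y'] + eps)
    m = sm.formula.glm('y ~ x', df_eps,
                       family=sm.families.Gamma(sm.families.links.Log())).fit()
    print(eps, -2 * (m.llnull - m.llf))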