Quasi-likelihood gamma regression in statsmodels for zeroes in observations
Warning! I am not a statistician. This article is not reviewed. Please confer with a responsible adult!
Generalised linear models with a gamma distribution and log link are frequently used to model non-negative, right-skewed continuous data, such as costs [1].
For example, in statsmodels:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
# Generate some data
np.random.seed(12345)
dist1 = stats.gamma(a=5, scale=40)
dist2 = stats.gamma(a=5, scale=45)
x = [0] * 500 + [1] * 500
y = np.concatenate([dist1.rvs(500), dist2.rvs(500)])
df = pd.DataFrame(dict(x=x, y=y))
# Fit the model
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Link Function:                    Log   Scale:                         0.20922
Method:                          IRLS   Log-Likelihood:                -5900.3
Date:                Sat, 03 Sep 2022   Deviance:                       203.31
Time:                        20:13:02   Pearson chi2:                     209.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.008731
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3191      0.020    260.031      0.000       5.279       5.359
x              0.0857      0.029      2.962      0.003       0.029       0.142
==============================================================================
The regression shows, as expected, that x has a significant effect on y. We can similarly perform a likelihood ratio test, which yields the same result:
lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue) # > 0.0030633579406539324
However, statsmodels encounters difficulty when the dependent variable y contains zeros. Tinkering with the data a bit to show this:
# Change a single observation to zero
df.loc[0, 'y'] = 0
# Fit the model again
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())
/usr/lib/python3.10/site-packages/statsmodels/genmod/families/family.py:777: RuntimeWarning: divide by zero encountered in log
  ll_obs -= special.gammaln(weight_scale) + np.log(endog)
/usr/lib/python3.10/site-packages/statsmodels/genmod/generalized_linear_model.py:1767: RuntimeWarning: invalid value encountered in double_scalars
  prsq = 1 - np.exp((self.llnull - self.llf) * (2 / self.nobs))
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Link Function:                    Log   Scale:                         0.21053
Method:                          IRLS   Log-Likelihood:                    inf
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        20:16:26   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================
While the effect of x remains significant, statsmodels reports a log-likelihood of inf, which also means we cannot perform a likelihood ratio test.
To see why this is the case, let us derive the log-likelihood function for a gamma model. The probability density function for the gamma distribution is:
\[f(y; k, μ) = \frac{(k/μ)^k}{Γ(k)} e^{-ky/μ} y^{k-1}\]where $k > 0$ is the shape parameter [2 p. 153]. Therefore, the log-likelihood for an observation $i$ is:
\[\begin{align*} \ln L_i(\hat{\bm{β}}) = \ln f(y_i; k, \hat{μ}_i) &= \ln\left(\frac{(k/\hat{μ}_i)^k}{Γ(k)}\right) + \ln\left(e^{-ky_i/\hat{μ}_i}\right) + \ln\left(y_i{}^{k-1}\right) \\ &= k\ln\left(\frac{k}{\hat{μ}_i}\right) - \ln Γ(k) - \frac{ky_i}{\hat{μ}_i} + (k - 1)\ln y_i \\ &= k\left[ \ln\left(\frac{k}{\hat{μ}_i}\right) + \ln y_i \right] - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \end{align*}\]When $y_i = 0$, $\ln y_i$ is undefined, and so the log-likelihood is undefined, as statsmodels reports.
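As a quick sanity check on this derivation, we can compare the final expression against scipy's gamma density (a small sketch; the values of $k$, $\hat{μ}_i$ and $y_i$ below are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats, special

def gamma_loglik_obs(y, k, mu):
    # ln L_i = k*ln(k*y/mu) - k*y/mu - ln(Gamma(k)) - ln(y), as derived above
    return k * np.log(k * y / mu) - k * y / mu - special.gammaln(k) - np.log(y)

k, mu = 5.0, 200.0                  # illustrative shape and mean
y = np.array([50.0, 150.0, 400.0])  # illustrative observations
# scipy parameterises the gamma by shape a = k and scale = mu/k
reference = stats.gamma(a=k, scale=mu / k).logpdf(y)
assert np.allclose(gamma_loglik_obs(y, k, mu), reference)
```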
However, note that statsmodels still produces parameter estimates based on the maximum likelihood estimating equations. This can be justified from a quasi-likelihood perspective.^{1} Maximum likelihood estimation maximises the log-likelihood by taking its derivative (the score function) and solving for its roots. Quasi-likelihood estimation starts with the form of the score function and treats it as if it were the derivative of some quasi-likelihood function to be maximised [2 p. 279]. A quasi-likelihood function, then, has the same form as the log-likelihood function $\ln L_i(\hat{\bm{β}})$ but may differ by some constant term, such that the derivative is unchanged.
Let us rewrite $\ln L_i(\hat{\bm{β}})$ as follows:
\[\begin{align*} \ln L_i(\hat{\bm{β}}) &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln(ky_i) - k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} + \left[ k\ln(ky_i) - \ln Γ(k) - \ln y_i \right] \end{align*}\]Note that the bracketed term is constant in $\hat{μ}_i$, and so constant in $\hat{\bm{β}}$, so omitting it would not change the derivative of the overall expression with respect to $\hat{\bm{β}}$. In particular, when subtracting two log-likelihoods in the likelihood ratio test, the bracketed terms will cancel.^{2}
So we can take $Q_i(\hat{\bm{β}}) = -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i}$ to be a quasi-likelihood function for each observation $i$.^{3,4} It has useful similarities to the true log-likelihood – it is maximised whenever the log-likelihood is maximised, and the difference between two quasi-likelihoods equals the difference between the corresponding log-likelihoods – but it is also defined when $y_i = 0$.^{5}
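These properties are easy to verify numerically: the $\hat{μ}_i$-dependent parts of $Q_i$ and the true log-likelihood agree, so differences across candidate means match exactly (a sketch with arbitrary illustrative values):

```python
import numpy as np
from scipy import stats

k, y = 5.0, 120.0        # illustrative shape and observation
mu0, mu1 = 100.0, 150.0  # two candidate fitted means

def true_loglik(mu):
    # True gamma log-likelihood (scipy parameterisation: a = k, scale = mu/k)
    return stats.gamma(a=k, scale=mu / k).logpdf(y)

def quasi_loglik(mu):
    # Q_i = -k*ln(mu) - k*y/mu
    return -k * np.log(mu) - k * y / mu

# The bracketed constant cancels, so differences between fits agree
assert np.isclose(true_loglik(mu1) - true_loglik(mu0),
                  quasi_loglik(mu1) - quasi_loglik(mu0))
# Unlike the true log-likelihood, Q_i is still finite at y = 0
assert np.isfinite(-k * np.log(mu0) - k * 0.0 / mu0)
```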
We can implement this quasi-likelihood in statsmodels as follows:
class GammaQL(sm.families.Gamma):
    def loglike_obs(self, endog, mu, var_weights=1, scale=1):
        # Quasi-likelihood per observation: Q_i = -k*ln(mu_i) - k*y_i/mu_i
        k = var_weights / scale
        endog_mu = self._clean(endog / mu)
        ll_obs = -np.log(mu) - endog_mu
        return ll_obs * k
model = sm.formula.glm('y ~ x', df, family=GammaQL(sm.families.links.Log())).fit()
print(model.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                 GammaQL   Df Model:                            1
Link Function:                    Log   Scale:                         0.21053
Method:                          IRLS   Log-Likelihood:                -30214.
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        21:30:19   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.009015
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================
We can now perform a likelihood ratio test:
lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue) # > 0.002618765806616574
As expected, the p value is significant, consistent with the p value for the coefficient of x.
References
[1] Malehi AS, Pourmotahari F, Angali KA. Statistical models for the analysis of skewed health-care cost data: a simulation study. Health Economics Review. 2015; 5(1): 11. doi: 10.1186/s13561-015-0045-7
[2] Agresti A. Foundations of linear and generalized linear models. New Jersey: John Wiley & Sons; 2015.
[3] McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983; 11(1): 59–67.
[4] StataCorp. Stata base reference manual: release 17. Texas: StataCorp. https://www.stata.com/manuals/r.pdf
Footnotes

1. In the maximum likelihood perspective, we assume a gamma distribution for $y_i$, and so the log-likelihood is undefined when $y_i = 0$, because a gamma-distributed variable can never be 0. However, in the quasi-likelihood approach, we do not assume a distribution for $y_i$; rather, we only specify the mean–variance relationship [2 p. 268]. In the case of $\mathrm{var}(y_i) = \frac{\hat{μ}_i{}^2}{k}$, the resulting quasi-likelihood estimating equations are the same as the maximum likelihood estimating equations, but are also defined for $y_i = 0$. Theory tells us that quasi-likelihood estimators are consistent [2 p. 279].

2. This assumes that $k$ is the same in both the null model and the fitted model, which is true in the statsmodels implementation.

3. This can also be obtained by working backwards from the (quasi-)score function and solving the resulting system of partial differential equations. The bracketed terms become a constant of integration $c(\bm{y}, \hat{\bm{μ}})$, the choice of which is ‘entirely arbitrary’ [3].

4. Stata's glm command uses a similar quasi-likelihood, $-\ln(\hat{μ}_i) - \frac{y_i}{\hat{μ}_i}$ [4 p. 836]. This differs only by a factor of $k$, and so is also maximised whenever the true log-likelihood is maximised. However, the omitted factor causes the likelihood ratio statistic to be off by a factor of $k$. See also https://www.statalist.org/forums/forum/general-stata-discussion/general/1680539-stata-s-log-likelihood-and-likelihood-ratio-test-for-gamma-glm.

5. Suppose we had added a small $\epsilon > 0$ to each $y_i$ – the usual approach to ‘fudging’ away the zeroes – and then calculated a likelihood ratio as standard. Since $y_i + \epsilon > 0$, $\ln(y_i + \epsilon)$ is defined and the bracketed terms will always cancel. Since $-k\ln(\hat{μ}_i) - \frac{k(y_i + \epsilon)}{\hat{μ}_i}$ is continuous in $\epsilon$, as $\epsilon \to 0$ it approaches $Q_i(\hat{\bm{β}})$. Therefore, the quasi-likelihood ratio given by $Q_i(\hat{\bm{β}})$ is equivalent to the limiting behaviour of ‘fudging’ the zeroes under the conventional likelihood ratio.
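This limiting argument can be checked numerically: as $\epsilon$ shrinks, the ‘fudged’ log-likelihood ratio for a zero observation converges to the quasi-likelihood ratio (a sketch with arbitrary illustrative values for $k$ and the two means):

```python
import numpy as np
from scipy import stats

k, y = 5.0, 0.0          # a zero observation
mu0, mu1 = 100.0, 150.0  # illustrative null and fitted means

def true_loglik(y_, mu):
    # True gamma log-likelihood (scipy parameterisation: a = k, scale = mu/k)
    return stats.gamma(a=k, scale=mu / k).logpdf(y_)

def quasi_loglik(y_, mu):
    # Q_i = -k*ln(mu) - k*y/mu
    return -k * np.log(mu) - k * y_ / mu

target = quasi_loglik(y, mu1) - quasi_loglik(y, mu0)
for eps in [1e-4, 1e-6, 1e-8]:
    # 'Fudged' log-likelihood ratio, defined because y + eps > 0
    fudged = true_loglik(y + eps, mu1) - true_loglik(y + eps, mu0)
    assert np.isclose(fudged, target)  # converges as eps -> 0
```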