Quasi-likelihood gamma regression in statsmodels for zeroes in observations

Warning! I am not a statistician. This article is not reviewed. Please confer with a responsible adult!

Generalised linear models with a gamma distribution and log link are frequently used to model non-negative right-skewed continuous data, such as costs [1].

For example, in statsmodels:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Generate some data
np.random.seed(12345)
dist1 = stats.gamma(a=5, scale=40)
dist2 = stats.gamma(a=5, scale=45)

x = [0] * 500 + [1] * 500
y = np.concatenate([dist1.rvs(500), dist2.rvs(500)])
df = pd.DataFrame(dict(x=x, y=y))

# Fit the model
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Link Function:                    Log   Scale:                         0.20922
Method:                          IRLS   Log-Likelihood:                -5900.3
Date:                Sat, 03 Sep 2022   Deviance:                       203.31
Time:                        20:13:02   Pearson chi2:                     209.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.008731
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3191      0.020    260.031      0.000       5.279       5.359
x              0.0857      0.029      2.962      0.003       0.029       0.142
==============================================================================

The regression shows, as expected, that x has a significant effect on y. We can similarly perform a likelihood ratio test, which yields the same result:

lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue)  # -> 0.0030633579406539324

However, statsmodels encounters difficulty when the dependent variable y contains zeros. Tinkering with the data a bit to show this:

# Change a single observation to zero
df.loc[0, 'y'] = 0

# Fit the model again
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())

/usr/lib/python3.10/site-packages/statsmodels/genmod/families/family.py:777: RuntimeWarning: divide by zero encountered in log
  ll_obs -= special.gammaln(weight_scale) + np.log(endog)
/usr/lib/python3.10/site-packages/statsmodels/genmod/generalized_linear_model.py:1767: RuntimeWarning: invalid value encountered in double_scalars
  prsq = 1 - np.exp((self.llnull - self.llf) * (2 / self.nobs))

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Link Function:                    Log   Scale:                         0.21053
Method:                          IRLS   Log-Likelihood:                    inf
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        20:16:26   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================

While the effect of x remains significant, statsmodels is reporting a log-likelihood of inf, which also means we cannot perform a likelihood ratio test.

To satisfy ourselves of why this should be the case, let us derive the log-likelihood function for a gamma model. The probability density function for the gamma distribution is:

\[f(y; k, μ) = \frac{(k/μ)^k}{Γ(k)} e^{-ky/μ} y^{k-1}\]

where $k > 0$ is the shape parameter [2 p. 153]. Therefore, the log-likelihood for an observation $i$ is:

\[\begin{align*} \ln L_i(\hat{\bm{β}}) = \ln f(y_i; k, \hat{μ}_i) &= \ln\left(\frac{(k/\hat{μ}_i)^k}{Γ(k)}\right) + \ln\left(e^{-ky_i/\hat{μ}_i}\right) + \ln\left(y_i{}^{k-1}\right) \\ &= k\ln\left(\frac{k}{\hat{μ}_i}\right) - \ln Γ(k) - \frac{ky_i}{\hat{μ}_i} + (k - 1)\ln y \\ &= k\left[ \ln\left(\frac{k}{\hat{μ}_i}\right) + \ln y_i \right] - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \end{align*}\]

When $y_i = 0$, $\ln y_i$ is undefined, and so the log-likelihood is undefined, as statsmodels reports.

However, note that statsmodels still produces parameter estimates based on the maximum likelihood estimating equations. This can be justified from a quasi-likelihood perspective.¹ Maximum likelihood estimation maximises the log-likelihood by taking the derivative of the log-likelihood (the score function) and solving for its roots. Quasi-likelihood estimation starts with the form of the score function and treating it as if it were the derivative of some quasi-likelihood function to maximise [2 p. 279]. Quasi-likelihood functions, then, have the same form as the log-likelihood function $\ln L_i(\hat{\bm{β}})$ but may differ by some constant term, such that the derivative is unchanged.

Let us rewrite $\ln L_i(\hat{\bm{β}})$ as follows:

\[\begin{align*} \ln L_i(\hat{\bm{β}}) &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln(ky_i) - k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} + \left[ k\ln(ky_i) - \ln Γ(k) - \ln y_i \right] \end{align*}\]

Note that the bracketed term is constant in $\hat{μ}_i$, and so constant in $\hat{\bm{β}}$, so omitting it would not change the derivative of the overall expression with respect to $\hat{\bm{β}}$. In particular, when subtracting two log-likelihoods in the likelihood ratio test, the bracketed terms will cancel.²

So we can take $Q_i(\hat{\bm{β}}) = -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i}$ to be a quasi-likelihood function for each observation $i$.³^,⁴ It has useful similarities to the true log-likelihood – it is maximised whenever the log-likelihood is maximised, and the difference between two quasi-likelihoods equals the difference between the two log-likelihoods – but is also defined when $y_i = 0$.⁵

We can implement this quasi-likelihood in statsmodels as follows:

class GammaQL(sm.families.Gamma):
	def loglike_obs(self, endog, mu, var_weights=1, scale=1):
		k = var_weights / scale
		endog_mu = self._clean(endog / mu)
		ll_obs = -np.log(mu) - endog_mu
		return ll_obs * k

model = sm.formula.glm('y ~ x', df, family=GammaQL(sm.families.links.Log())).fit()
print(model.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                 GammaQL   Df Model:                            1
Link Function:                    Log   Scale:                         0.21053
Method:                          IRLS   Log-Likelihood:                -30214.
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        21:30:19   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.009015
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================

We can now perform a likelihood ratio test:

lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue)  # -> 0.002618765806616574

As expected, the p value is significant, consistent with the p value of x.

References

[1] Malehi AS, Pourmotahari F, Angali KA. Statistical models for the analysis of skewed healthcare cost data: a simulation study. Health Economics Review. 2015; 5(1): 11. doi: 10.1186/s13561-015-0045-7

[2] Agresti A. Foundations of linear and generalized linear models. New Jersey: John Wiley & Sons; 2015.

[3] McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983; 11(1): 59–67.

[4] StataCorp. Stata base reference manual: release 17. Texas: StataCorp. https://www.stata.com/manuals/r.pdf

Footnotes

In the maximum likelihood perspective, we assume a gamma distribution for $y_i$, and so the log-likelihood is undefined when $y_i = 0$, because a gamma-distributed variable can never be 0. However, in the quasi-likelihood approach, we do not assume a distribution for $y_i$; rather, we only specify the mean–variance relationship [2 p. 268]. In the case of $\mathrm{var}(y_i) = \frac{\hat{μ}_i{}^2}{k}$, the resulting quasi-likelihood estimating equations are the same as the maximum likelihood estimating equations, but are also defined for $y_i = 0$. Theory tells that the estimators from quasi-likelihood estimation are consistent estimators [2 p. 279]. ↩
This assumes that $k$ is same in both the null model and fitted model, which is true in the statsmodels implementation. ↩
This can also be obtained by solving the system of partial differential equations arising from working backwards from the (quasi-)score function. The bracketed terms become a constant of integration $c(\bm{y}, \hat{\bm{μ}})$, the choice of which is ‘entirely arbitrary’ [3]. ↩
Stata's glm command uses a similar quasi-likelihood, $-\ln(\hat{μ}_i) - \frac{y_i}{\hat{μ}_i}$ [4 p. 836]. This differs only by a factor of $k$, and so is also maximised whenever the true log-likelihood is maximised. However, the omitted factor causes the likelihood ratio statistic to be off by a factor of $k$. See also https://www.statalist.org/forums/forum/general-stata-discussion/general/1680539-stata-s-log-likelihood-and-likelihood-ratio-test-for-gamma-glm. ↩
Suppose we had added a small $\epsilon > 0$ to each $y_i$, which would be the usual approach to ‘fudging’ away the zeroes, and then calculate a likelihood ratio as standard. Assuming that $y_i + \epsilon > 0$, $\ln(y_i + \epsilon)$ is defined and the bracketed terms will always cancel. Since $-k\ln(\hat{μ}_i) - \frac{k(y_i + \epsilon)}{\hat{μ}_i}$ is clearly continuous with respect to $y_i + \epsilon$, as $\epsilon \to 0$, it approaches $Q_i(\hat{\bm{β}})$. Therefore, the quasi-likelihood ratio given by $Q_i(\hat{\bm{β}})$ is the equivalent to the limiting behaviour of ‘fudging’ the zeroes under the conventional likelihood ratio. ↩