Generalised linear models with a gamma distribution and log link are frequently used to model non-negative right-skewed continuous data, such as costs [1].

For example, in statsmodels:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

# Generate some data
np.random.seed(12345)
dist1 = stats.gamma(a=5, scale=40)
dist2 = stats.gamma(a=5, scale=45)

x = [0] * 500 + [1] * 500
y = np.concatenate([dist1.rvs(500), dist2.rvs(500)])
df = pd.DataFrame(dict(x=x, y=y))

# Fit the model
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -5900.3
Date:                Sat, 03 Sep 2022   Deviance:                       203.31
Time:                        20:13:02   Pearson chi2:                     209.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.008731
Covariance Type:            nonrobust
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3191      0.020    260.031      0.000       5.279       5.359
x              0.0857      0.029      2.962      0.003       0.029       0.142
==============================================================================


The regression shows, as expected, that x has a significant effect on y. We can similarly perform a likelihood ratio test, which yields the same result:

lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue)  # -> 0.0030633579406539324


However, statsmodels encounters difficulty when the dependent variable y contains zeros. Tinkering with the data a bit to show this:

# Change a single observation to zero
df.loc[0, 'y'] = 0

# Fit the model again
model = sm.formula.glm('y ~ x', df, family=sm.families.Gamma(sm.families.links.Log())).fit()
print(model.summary())

/usr/lib/python3.10/site-packages/statsmodels/genmod/families/family.py:777: RuntimeWarning: divide by zero encountered in log
ll_obs -= special.gammaln(weight_scale) + np.log(endog)
/usr/lib/python3.10/site-packages/statsmodels/genmod/generalized_linear_model.py:1767: RuntimeWarning: invalid value encountered in double_scalars
prsq = 1 - np.exp((self.llnull - self.llf) * (2 / self.nobs))

Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                   Gamma   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                    inf
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        20:16:26   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================


While the effect of x remains significant, statsmodels is reporting a log-likelihood of inf, which also means we cannot perform a likelihood ratio test.

To satisfy ourselves of why this should be the case, let us derive the log-likelihood function for a gamma model. The probability density function for the gamma distribution is:

$f(y; k, μ) = \frac{(k/μ)^k}{Γ(k)} e^{-ky/μ} y^{k-1}$

where $k > 0$ is the shape parameter [2 p. 153]. Therefore, the log-likelihood for an observation $i$ is:

\begin{align*} \ln L_i(\hat{\bm{β}}) = \ln f(y_i; k, \hat{μ}_i) &= \ln\left(\frac{(k/\hat{μ}_i)^k}{Γ(k)}\right) + \ln\left(e^{-ky_i/\hat{μ}_i}\right) + \ln\left(y_i{}^{k-1}\right) \\ &= k\ln\left(\frac{k}{\hat{μ}_i}\right) - \ln Γ(k) - \frac{ky_i}{\hat{μ}_i} + (k - 1)\ln y \\ &= k\left[ \ln\left(\frac{k}{\hat{μ}_i}\right) + \ln y_i \right] - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \end{align*}

When $y_i = 0$, $\ln y_i$ is undefined, and so the log-likelihood is undefined, as statsmodels reports.

However, note that statsmodels still produces parameter estimates based on the maximum likelihood estimating equations. This can be justified from a quasi-likelihood perspective.1 Maximum likelihood estimation maximises the log-likelihood by taking the derivative of the log-likelihood (the score function) and solving for its roots. Quasi-likelihood estimation starts with the form of the score function and treating it as if it were the derivative of some quasi-likelihood function to maximise [2 p. 279]. Quasi-likelihood functions, then, have the same form as the log-likelihood function $\ln L_i(\hat{\bm{β}})$ but may differ by some constant term, such that the derivative is unchanged.

Let us rewrite $\ln L_i(\hat{\bm{β}})$ as follows:

\begin{align*} \ln L_i(\hat{\bm{β}}) &= k\ln\left(\frac{ky_i}{\hat{μ}_i}\right) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= k\ln(ky_i) - k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} - \ln Γ(k) - \ln y_i \\ &= -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i} + \left[ k\ln(ky_i) - \ln Γ(k) - \ln y_i \right] \end{align*}

Note that the bracketed term is constant in $\hat{μ}_i$, and so constant in $\hat{\bm{β}}$, so omitting it would not change the derivative of the overall expression with respect to $\hat{\bm{β}}$. In particular, when subtracting two log-likelihoods in the likelihood ratio test, the bracketed terms will cancel.2

So we can take $Q_i(\hat{\bm{β}}) = -k\ln(\hat{μ}_i) - \frac{ky_i}{\hat{μ}_i}$ to be a quasi-likelihood function for each observation $i$.3,4 It has useful similarities to the true log-likelihood – it is maximised whenever the log-likelihood is maximised, and the difference between two quasi-likelihoods equals the difference between the two log-likelihoods – but is also defined when $y_i = 0$.5

We can implement this quasi-likelihood in statsmodels as follows:

class GammaQL(sm.families.Gamma):
def loglike_obs(self, endog, mu, var_weights=1, scale=1):
k = var_weights / scale
endog_mu = self._clean(endog / mu)
ll_obs = -np.log(mu) - endog_mu
return ll_obs * k

model = sm.formula.glm('y ~ x', df, family=GammaQL(sm.families.links.Log())).fit()
print(model.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                            GLM   Df Residuals:                      998
Model Family:                 GammaQL   Df Model:                            1
Method:                          IRLS   Log-Likelihood:                -30214.
Date:                Sat, 03 Sep 2022   Deviance:                       273.37
Time:                        21:30:19   Pearson chi2:                     210.
No. Iterations:                     7   Pseudo R-squ. (CS):           0.009015
Covariance Type:            nonrobust
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3175      0.021    259.136      0.000       5.277       5.358
x              0.0873      0.029      3.010      0.003       0.030       0.144
==============================================================================


We can now perform a likelihood ratio test:

lr_statistic = -2 * (model.llnull - model.llf)
pvalue = 1 - stats.chi2.cdf(lr_statistic, model.df_model)
print(pvalue)  # -> 0.002618765806616574


As expected, the p value is significant, consistent with the p value of x.

## References

[1] Malehi AS, Pourmotahari F, Angali KA. Statistical models for the analysis of skewed healthcare cost data: a simulation study. Health Economics Review. 2015; 5(1): 11. doi: 10.1186/s13561-015-0045-7

[2] Agresti A. Foundations of linear and generalized linear models. New Jersey: John Wiley & Sons; 2015.

[3] McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983; 11(1): 59–67.

[4] StataCorp. Stata base reference manual: release 17. Texas: StataCorp. https://www.stata.com/manuals/r.pdf

## Footnotes

1. In the maximum likelihood perspective, we assume a gamma distribution for $y_i$, and so the log-likelihood is undefined when $y_i = 0$, because a gamma-distributed variable can never be 0. However, in the quasi-likelihood approach, we do not assume a distribution for $y_i$; rather, we only specify the mean–variance relationship [2 p. 268]. In the case of $\mathrm{var}(y_i) = \frac{\hat{μ}_i{}^2}{k}$, the resulting quasi-likelihood estimating equations are the same as the maximum likelihood estimating equations, but are also defined for $y_i = 0$. Theory tells that the estimators from quasi-likelihood estimation are consistent estimators [2 p. 279]

2. This assumes that $k$ is same in both the null model and fitted model, which is true in the statsmodels implementation.

3. This can also be obtained by solving the system of partial differential equations arising from working backwards from the (quasi-)score function. The bracketed terms become a constant of integration $c(\bm{y}, \hat{\bm{μ}})$, the choice of which is ‘entirely arbitrary’ [3]

4. Stata's glm command uses a similar quasi-likelihood, $-\ln(\hat{μ}_i) - \frac{y_i}{\hat{μ}_i}$ [4 p. 836]. This differs only by a factor of $k$, and so is also maximised whenever the true log-likelihood is maximised. However, the omitted factor causes the likelihood ratio statistic to be off by a factor of $k$. See also https://www.statalist.org/forums/forum/general-stata-discussion/general/1680539-stata-s-log-likelihood-and-likelihood-ratio-test-for-gamma-glm

5. Suppose we had added a small $\epsilon > 0$ to each $y_i$, which would be the usual approach to ‘fudging’ away the zeroes, and then calculate a likelihood ratio as standard. Assuming that $y_i + \epsilon > 0$, $\ln(y_i + \epsilon)$ is defined and the bracketed terms will always cancel. Since $-k\ln(\hat{μ}_i) - \frac{k(y_i + \epsilon)}{\hat{μ}_i}$ is clearly continuous with respect to $y_i + \epsilon$, as $\epsilon \to 0$, it approaches $Q_i(\hat{\bm{β}})$. Therefore, the quasi-likelihood ratio given by $Q_i(\hat{\bm{β}})$ is the equivalent to the limiting behaviour of ‘fudging’ the zeroes under the conventional likelihood ratio.