Generalised linear models for medical biostatistics

Warning! I am not a statistician. This article is not reviewed. Please confer with a responsible adult!

Recently, I've been doing some statistical analysis using log-binomial generalised linear models (GLMs). Resources on the topic seem to fall largely into 2 categories:

Assume you want to know none of the background: ‘Use a log-binomial GLM if you want a risk ratio.’¹
Assume you need to know all of the background: ‘The saddlepoint approximation for the log density is $-\frac12 D(y; μ)/σ^2 - \frac12\log\left(2πσ^2V(y)\right)$ …’²

I have tried to develop an intuition of a degree of understanding somewhere in the middle. My role is not to pore over the theory (we have fantastic biostatisticians!), but nor am I comfortable simply mindlessly clicking buttons in SPSS because the ‘cookbook’ told me to.

After some reading and thinking on the topic, no doubt high off Dunning–Kruger and spurred on by the sort of inflated sense of self-importance promoted by the internet, I have word-vomited some of those thoughts into the void here.

Simple linear regression

Imagine we have conducted a cross-sectional study of patients aged 8–11, and want to determine if height ($y$) varies with age ($x$). We sample 12 patients, and obtain the observations $\{(x_1, y_1), (x_2, y_2), …, (x_{12}, y_{12})\}$. This is a natural application of linear regression, hopefully familiar from high school. We plot $y$ against $x$ and fit a least-squares regression line:

import statsmodels.api as sm
data = ...
sm.OLS(
	data['Height'], sm.add_constant(data['Age'])
).fit().summary()

Linear regression example

But what have we actually done here? What justifies the interpretation of this regression line? If your maths education was like mine, there was probably some handwaving about ‘minimises the sum of the squared errors’ – what is the significance of that?

Let's look more closely at the model for simple linear regression.³

Consider all patients in the population aged 8, with a mean height $\mu_8$. The model assumes that when we sample from these patients, our observed heights $y$ are normally distributed with mean $\mu_8$.
Similarly, the model assumes that, when sampling from patients aged 9 (10, 11), $y$ is normally distributed with mean $\mu_9$ ($\mu_{10}$, $\mu_{11}$).
Finally, given this is linear regression, the model assumes $\mu_8$, $\mu_9$, $\mu_{10}$ and $\mu_{11}$ have a linear relationship, $\mu = β_0 + β_1x$.⁴

Linear regression model

With a particular set of observations, $\{(x_1, y_1), (x_2, y_2), …\}$, we can fit a least-squares regression line, which gives us the predicted mean height for a given age, $\hat\mu = \hatβ_0 + \hatβ_1x$.

A key reason for using least-squares regression is that, using Maths™, we can demonstrate that (under the assumptions of the model) $\hat\mu$ using the least-squares method is also the ‘maximum likelihood estimator’ of $\mu$.⁵^,⁶

In a moment, we will see how to extend our linear regression model into a generalised linear model.

Categorical outcomes; 2×2 contingency tables

The linear regression model deals with continuous outcomes (in the example, height), but we want to know how to deal with categorical data. Say, the effect of smoking or not smoking on developing or not developing lung cancer.

Suppose we have conducted a cohort study of smokers and non-smokers, and observed which patients developed lung cancer over the follow-up period. The results are summarised in the contingency table below:

	Non-smoker (non-exposed)	Smoker (exposed)	Total
No lung cancer (non-cases)	743	238	981
Lung cancer (cases)	7	12	19
Total	750	250	1000

Without using any regression, we can directly calculate various statistics. Using SciPy:

import statsmodels.stats.proportion as smprop
from statsmodels.stats.contingency_tables import Table2x2
ct = Table2x2(...)

risk0 = ct.table[1][0] / ct.table[1].sum()
risk0_ci = sm.stats.proportion_confint(ct.table[1][0], ct.table[1].sum())
risk1 = ct.table[0][0] / ct.table[0].sum()
risk1_ci = sm.stats.proportion_confint(ct.table[0][0], ct.table[0].sum())
print('Baseline risk (95% CI) = {:.3f} ({:.3f}, {:.3f})'.format(risk0, *risk0_ci))
print('Risk in smokers (95% CI) = {:.3f} ({:.3f}, {:.3f})'.format(risk1, *risk1_ci))

ari = risk1 - risk0
ari_ci = smprop.confint_proportions_2indep(ct.table[0][0], ct.table[0].sum(), ct.table[1][0], ct.table[1].sum(), method='wald')
print('ARI (95% CI) = {:.3f} ({:.3f}, {:.3f})'.format(ari, *ari_ci))

print('OR (95% CI) = {:.2f} ({:.2f}, {:.2f})'.format(ct.oddsratio, *ct.oddsratio_confint()))
print('RR (95% CI) = {:.2f} ({:.2f}, {:.2f})'.format(ct.riskratio, *ct.riskratio_confint()))

	Value	(95% CI)
Baseline risk	0.009	(0.002,	0.016)
Risk in smokers	0.048	(0.022,	0.074)
Absolute risk increase ($ARI$)	0.039	(0.011,	0.066)
Odds ratio ($OR$)	5.35	(2.08,	13.75)
Risk ratio ($RR$)	5.14	(2.05,	12.92)

But, instead of using the contingency table, can we instead do something more like the linear regression from the first example? Well, there is no law saying we can't!

Simple linear regression on binary outcomes

For each patient in our cohort study, let's set $x = 0$ if the patient is a non-smoker, and $x = 1$ if the patient is a smoker. Let's set $y = 0$ if the patient did not develop lung cancer over follow-up, and $y = 1$ if the patient did develop lung cancer. We now have a set of observations $\{(x_1, y_1), (x_2, y_2), … (x_{1000}, y_{1000})\}$, just like our first example.

Let's take these now numerical values, plot $y$ against $x$,⁷ and fit an ordinary least-squares linear regression model:

data = ...
sm.OLS(
	data['Cancer'], sm.add_constant(data['Smoker'])
).fit().summary()

	$\bm{\hatβ}$	Std. Err.	$\bm{t}$	$\bm{p}$	(95% CI)
Intercept ($\bm{\hatβ_0}$)	0.009	0.005	1.885	0.060	(−0.000,	0.019)
Smoker ($\bm{\hatβ_1}$)	0.039	0.010	3.904	<0.001	(0.019,	0.058)

Linear regression on binary outcomes

What is the significance of these $\hatβ$ coefficients? Note that $\hatβ_0 = 0.009$, the intercept term, is the value of $\hat\mu$ when $x = 0$, i.e. for non-smokers. Remember that, according to our model, $\hat\mu$ is the ‘maximum likelihood estimator’ for the population means $\mu$.

Therefore, the regression suggests that, if we were to assign every non-smoker in the population $y = 0$ if they did not develop cancer and $y = 1$ if they did, our estimation of the population mean for $y$ among non-smokers is $\hatβ_0 = 0.009$. In other words, the baseline absolute risk of cancer (among non-smokers) is 0.009.

Similarly, when $x = 1$, i.e. for smokers, $\hat\mu = \hatβ_0 + \hatβ_1 × 1 = 0.009 + 0.039 = 0.048$. Our estimation of the population mean for $y$ among smokers is 0.048. In other words, the absolute risk of cancer among smokers is 0.048.

$\hatβ_1$, then, represents the difference in absolute risk between smokers and non-smokers – in other words, the absolute risk increase! We can confirm that $\hatβ_1 = 0.039$ accords with the absolute risk increase we calculated from the contingency table.

However, the confidence interval for $\hatβ_1$ does not match the confidence interval on the absolute risk increase we calculated earlier. Additionally, the confidence interval for $\hatβ_0$, the baseline risk, suggests the risk could even be negative! This is clearly absurd.

This is because, in the simple linear regression model, the outcome data $y$ is assumed to follow a normal distribution. Clearly, this is not an appropriate assumption – our observed outcomes (cancer or no cancer) can take only 2 values, not the infinite range of values suggested by the normal distribution.

The graph below shows the true distribution of $y$, compared with the normal approximation assumed by the linear regression model:

Comparison of normal and binomial distributions

Clearly, the normal distribution is an inappropriate model. The distribution of $y$ is more like flipping a weighted coin. Just as a coin will land either heads or tails, each patient will either develop cancer (with probability, say, $p$) or not develop cancer (with probability $1 - p$). Like in coin flipping, this is a binomial distribution: $y \sim \mathcal{B}(1, p)$.⁸

Random component of a GLM; linear probability model

A generalised linear model extends linear regression in 2 ways. Here we introduce the first.

In simple linear regression, the randomness introduced by sampling is modelled as a normal distribution. In a GLM, more options are available.⁹ In the case of binary outcomes (cancer or no cancer), we will use a binomial distribution.

Let's use the exact same model as before, $\hat\mu = \hatβ_0 + \hatβ_1x$, but fit a GLM using a binomial distribution instead of a normal distribution.¹⁰ Now, $\hat\mu$ is the model's estimate of the probability ($p$, i.e. the risk) of developing cancer. In SciPy:

sm.GLM(
	data['Cancer'], sm.add_constant(data['Smoker']),
	family=sm.families.Binomial(sm.families.links.identity())
	# We will explain the significance of "sm.families.links.identity()"
	# in the next section
).fit().summary()

	$\bm{\hatβ}$	Std. Err.	$\bm{z}$	$\bm{p}$	(95% CI)
Intercept ($\bm{\hatβ_0}$)	0.009	0.004	2.658	0.008	(0.002,	0.016)
Smoker ($\bm{\hatβ_1}$)	0.039	0.014	2.768	0.006	(0.011,	0.066)

The $\hatβ$ coefficients are the same as in the simple linear regression, but the confidence intervals are now different.

Let's plot the model $\hat\mu$:

Linear probability model

The graph demonstrates that $\hatβ_0$ is an estimate of the baseline risk of developing lung cancer (among non-smokers), and $\hatβ_1$ is the amount by which this risk is increased – i.e. the absolute risk increase – in smokers. The point estimate and confidence interval for $\hatβ_1$ now match the absolute risk increase we calculated using the contingency table.

At this point, all we have is a needlessly complicated way of doing exactly what we did with the contingency tables. (Hooray?) But the advantage of the GLM is that it could be easily extended. For example, we could add another covariate (say, male sex, $m$), and change the model to $\hat\mu = \hatβ_0 + \hatβ_1x + \hatβ_2m$: now we have a multivariable analysis adjusting for the effects of $m$. We could even add a continuous covariate like age – our covariates are not restricted to being binary.

However, there is a problem. An expression such as $\hatβ_0 + \hatβ_1x + \hatβ_2m$ can have an unlimited range of values, but we have used this to model a probability, which can only have values from 0 to 1. Suppose, for the sake of argument, we fit the following model:

	$\bm{\hatβ}$
Intercept ($\bm{\hatβ_0}$, baseline risk)	0.20
Smoker ($\bm{\hatβ_1}$, adjusted $ARI$)	0.50
Male sex ($\bm{\hatβ_2}$, adjusted $ARI$)	0.40

The model suggests the baseline risk is 0.20 (makes sense), the absolute risk increase in smokers who are not male is 0.50 (absolute risk $0.20 + 0.50 = 0.70$, also makes sense), and the absolute risk increase in males who are not smokers is 0.40 (absolute risk $0.20 + 0.40 = 0.60$, so far so good). But what is the risk in male smokers? The model suggests the risk is $0.20 + 0.50 + 0.40 = 1.10$, a risk exceeding 100%!

Clearly, this will not do. We need a method of taking the output of the linear part of the model (e.g. $\hatβ_0 + \hatβ_1x + \hatβ_2m$) and transforming it into a probability $\hat\mu$.

Link functions; probit regression

Let's imagine a function $f(z)$ which takes an unbounded real number $z$ from our linear expression, and transforms it into a probability $0 ≤ p ≤ 1$. We could then have the model:

\[\hat\mu = f(β_0 + β_1x + β_2m + …)\]

For Maths™-related reasons, this is usually expressed in terms of the inverse function, $g(p) = f^{-1}(p)$:

\[g(\hat\mu) = β_0 + β_1x + β_2m + …\]

The function $g$ is known as the link function. So far, we have effectively had $g(p) = p$: we did not transform the output of our linear expression at all. This is known as the identity link. The second way that GLMs extend simple linear regression is by permitting different link functions.

Probably, we would like to instead use an S-shaped (sigmoid) link function, so that as the value of our linear expression increases towards infinity, the probability $p$ approaches 1, and as the value of our linear expression decreases towards negative infinity, the probability $p$ approaches 0.

One such function is $f_\mathrm{probit}(z) = Φ(z)$, where $Φ$ is the cumulative distribution function of the standard normal distribution:

Standard normal distribution cumulative distribution function

The equivalent link function is the inverse, $g_\mathrm{probit}(p) = {f_\mathrm{probit}}^{-1}(p) = Φ^{-1}(p)$. This is called the probit function. Note that it has no closed-form representation.

Let us fit a GLM using the probit function as the link. Just as a GLM using an identity link (and normal distribution) is also called linear regression, this is also called probit regression:

sm.GLM(
	data['Cancer'], sm.add_constant(data['Smoker']),
	family=sm.families.Binomial(sm.families.links.probit())
).fit().summary()

	$\bm{\hatβ}$	Std. Err.	$\bm{z}$	$\bm{p}$	(95% CI)
Intercept ($\bm{\hatβ_0}$)	−2.352	0.140	−16.810	<0.001	(−2.626,	−2.078)
Smoker ($\bm{\hatβ_1}$)	0.688	0.195	3.531	<0.001	(0.306,	1.069)

What, now, is the significance of the coefficients in the probit regression?

When $x = 0$ (non-smokers), $\hatβ_0 + \hatβ_1x = \hatβ_0 = -2.35$. Remember that $f$ maps values of the linear expression to probabilities, so if we pass this number through $f_\mathrm{probit}$, we will obtain the probability (i.e. risk) of developing cancer for non-smokers: $f_\mathrm{probit}(-2.35) = Φ(-2.35) = 0.009$, which is the same as the baseline risk we calculated earlier.

Similarly, when $x = 1$ (smokers), $\hatβ_0 + \hatβ_1x = -2.35 + 0.69 = -1.66$. If we pass this number through $f_\mathrm{probit}$, we obtain the probability (risk) of developing cancer for smokers: $f_\mathrm{probit}(-1.66) = 0.048$, also as expected.

More generally, just as $\hatβ_0$ and $\hatβ_1$ could be interpreted as absolute risks in the linear probability model, we can interpret the $\hatβ_0$ and $\hatβ_1$ of probit regression on the z-score scale. $\hatβ_0 = 0.009$ means that the baseline probability of developing cancer (in non-smokers) has the likelihood of something 2.35 standard deviations below the mean of a normal distribution (the z-score is $-2.35$). $\hatβ_1 = 0.69$ means that, in smokers, the probability is increased by the equivalent of 0.69 standard deviations (the z-score is increased by $0.69$).

Compared with the linear probability model, the probit model also has the advantage that $f_\mathrm{probit}$ will never produce invalid probabilities greater than 1 or less than 0.

However, z-scores and standard deviations can be difficult to interpret in clinical practice. It is also somewhat unfortunate that we must refer to $Φ$, which has no closed-form representation, to interpret the results. Can we do better?

Logistic regression

Let's consider another S-shaped function, the standard logistic function, $f(z) = \frac{1}{1 + e^{-z}}$. Its inverse is the logit function, $g(p) = f^{-1}(p) = \ln\left(\frac{p}{1 - p}\right)$. If we use this as the link function, we have:

\[\begin{align*} g(\hat\mu) &= \hatβ_0 + \hatβ_1x \\ \ln\left(\frac{\hat\mu}{1 - \hat\mu}\right) &= \hatβ_0 + \hatβ_1x \end{align*}\]

This is actually simply logistic regression! Notice that on the left-hand side of the equation, $\frac{\hat\mu}{1 - \hat\mu}$ is simply the odds of $\hat\mu$, i.e. the odds of developing cancer. We have, then:

\[\begin{align*} \ln({\rm odds\ of\ cancer}) &= \hatβ_0 + \hatβ_1x \\ {\rm odds\ of\ cancer} &= e^{\hatβ_0 + \hatβ_1x} \\ &= e^{\hatβ_0} \cdot e^{\hatβ_1x} \end{align*}\]

So when $x = 0$ (non-smokers), the odds of developing cancer is simply $e^{\hatβ_0}$. Equivalently, $\hatβ_0$ is the baseline log odds of developing cancer (in non-smokers).

When $x = 1$ (smokers), these baseline odds are multiplied by $e^{\hatβ_1}$, so $e^{\hatβ_1}$ is the odds ratio for developing cancer among smokers. Equivalently, $\hatβ_1$ is the log odds ratio for developing cancer among smokers.

Let's fit a GLM using this logit link:

sm.GLM(
	data['Cancer'], sm.add_constant(data['Smoker']),
	family=sm.families.Binomial(sm.families.links.logit())
).fit().summary()

	$\bm{\hatβ}$	Std. Err.	$\bm{z}$	$\bm{p}$	(95% CI)
Intercept ($\bm{\hatβ_0}$)	−4.665	0.380	−12.284	<0.001	(−5.409,	−3.921)
Smoker ($\bm{\hatβ_1}$)	1.677	0.481	3.485	<0.001	(0.734,	2.621)

As mentioned, $\hatβ_0$ is the baseline log odds of cancer in the non-smokers, and $\hatβ_1$ is the log odds ratio for cancer in smokers. Log odds ratios are difficult to interpret, and odds ratios are so commonly used that statistical packages can generally automatically exponentiate these coefficients for us:

	$\bm{e^{\hatβ}}$	(95% CI)		$\bm{p}$
Intercept ($\bm{e^{\hatβ_0}}$, baseline odds)	0.009	(0.004,	0.020)	<0.001
Smoker ($\bm{e^{\hatβ_1}}$, $OR$)	5.35	(2.08,	13.75)	<0.001

Indeed, the odds ratio ($e^{\hatβ_1}$) and 95% confidence interval accord with what we calculated in the contingency table – but, as previously discussed, this GLM is now easily extended to include more covariates.

Again, as with the probit link, $f_\mathrm{logit}$ maps all real numbers to valid probabilities from 0 to 1, so we do not need to worry about generating invalid probabilities outside the range 0 to 1.

Log-binomial regression

We have now (re-)discovered logistic regression and how to compute odds ratios using GLMs. But our study was a cohort study, and so we are not limited to computing odds ratios, but should be able to compute risk ratios, just as we did using the contingency tables.

We turn, then, to the final link function to introduce in this post. Consider $f_\mathrm{log}(z) = e^z$. The inverse is $g_\mathrm{log}(p) = {f_\mathrm{log}}^{-1}(p) = \ln(p)$: hopefully it is not a surprise that the name of this function is the (natural) log. If we use this as the link function, we have:

\[\begin{align*} \hat\mu &= f(\hatβ_0 + \hatβ_1x) \\ &= e^{\hatβ_0 + \hatβ_1x} \\ &= e^{\hatβ_0} \cdot e^{\hatβ_1x} \end{align*}\]

So when $x = 0$ (non-smokers), the probability (risk) of developing cancer is simply $e^{\hatβ_0}$. When $x = 1$ (smokers), this baseline risk is multiplied by $e^{\hatβ_1}$, so $e^{\hatβ_1}$ is the risk ratio for developing cancer among smokers.

Let's fit a GLM using this log link:

sm.GLM(
	data['Cancer'], sm.add_constant(data['Smoker']),
	family=sm.families.Binomial(sm.families.links.log())
).fit().summary()

	$\bm{\hatβ}$	Std. Err.	$\bm{z}$	$\bm{p}$	(95% CI)
Intercept ($\bm{\hatβ_0}$)	−4.674	0.376	−12.425	<0.001	(−5.411,	−3.937)
Smoker ($\bm{\hatβ_1}$)	1.638	0.470	3.485	<0.001	(0.717,	2.559)

Exponentiating the parameters, we obtain:

	$\bm{e^{\hatβ}}$	(95% CI)		$\bm{p}$
Intercept ($\bm{e^{\hatβ_0}}$, baseline risk)	0.009	(0.004,	0.020)	<0.001
Smoker ($\bm{e^{\hatβ_1}}$, $RR$)	5.14	(2.05,	12.92)	<0.001

The risk ratio ($e^{\hatβ_1}$) and 95% confidence interval again match those we calculated from the contingency table.

Note, however, that the range of $f(z) = e^z$ is not restricted to $[0, 1]$. Therefore, like the linear probability model, multivariable log-binomial regression can in some circumstances produce invalid probabilities which are greater than 1.

‘Cookbook’ summary

No half-baked discussion of medical biostatistics is complete without a ‘cookbook’ encouraging the uncritical application of statistical techniques, so to conclude:

Model	Distribution	Link	$\bm{β_0}$	$\bm{β_1, …, β_n}$	Use it?
Linear regression	Normal	Identity	Intercept	Slope	Probably yes
Linear probability model	Binomial	Linear	Baseline risk	Absolute risk increase	Probably no
Probit regression	Binomial	Probit	Baseline risk as z-score	Increase in z-score	Probably no
Logistic regression	Binomial	Logit	Exponentiate it to get baseline odds	Exponentiate it to get odds ratio	Probably yes
Log-binomial regression	Binomial	Log	Exponentiate it to get baseline risk	Exponentiate it to get risk ratio	Probably maybe

Footnotes

e.g. Spiegelman D, Hertzmark E. Easy SAS calculations for risk or prevalence ratios and differences. Amer J Epidemiol. 2005; 162(3): 199–200. doi: 10.1093/aje/kwi188 ↩
e.g. McCullagh P, Nelder JA. Generalized linear models. London: Chapman and Hall; 1989. 2nd edition. ↩
Further reading: Penn State Eberly College of Science. STAT 501: Regression methods. Lesson 1.3, The simple linear regression model. https://online.stat.psu.edu/stat501/lesson/1/1.3 ↩
We also assume that observed $y$'s are independent of each other, and regardless of $x$, have the same standard deviation $\sigma$. Another way of representing the model is to say that the distribution of $y$ is exactly $β_0 + β_1x + \epsilon$, where $\epsilon$ is the error term, independently and identically distributed $\epsilon \sim \mathcal{N}(0, \sigma^2)$. ↩
Take as an example our previous sample, with regression line $\hat\mu = 81.13 + 5.79x$. Evaluated at ages $x = 8, 9, 10, 11$, the corresponding estimated population means would be 127, 133, 139 and 145. Because $\hat\mu$ is the maximum likelihood estimator of the population means, our interpretation is that (under the assumptions of our model) these are the true population means which are most likely to result in the observations we sampled. ↩
Further reading: Agresti A. Foundations of linear and generalized linear models. New Jersey: Wiley; 2015. ↩
Since there are only 4 distinct points, we plot a bubble chart to show the number represented in each. ↩
Specifically, each patient's observed $y$ follows a Bernoulli distribution, which is the special case of the binomial distribution where the number of trials $n = 1$. ↩
Specifically, a GLM may be constructed from any distribution in the ‘exponential family’ of distributions. This includes the binomial distribution, normal distribution, Poisson distribution and negative binomial distribution, among others. ↩
Unlike in linear regression, there is no closed-form expression for directly computing the maximum likelihood estimator. Instead, iterative methods are used to compute the MLE to fit the model. ↩