Regression
Functions
- yli.regress(model_class, df, dep, formula, *, nan_policy='warn', exposure=None, method=None, maxiter=None, start_params=None, reduced=None, bool_baselevels=False, exp=None)
Fit a statsmodels regression model
- Parameters:
model_class (
yli.regress.RegressionModel
subclass) – Type of regression model to fitdf (DataFrame) – Data to perform regression on
dep (str) – Column(s) in df for the dependent variable (numeric)
formula (str) – Patsy formula for the regression model
nan_policy (str) – How to handle nan values (see NaN handling)
exposure (str) – Column in df for the exposure variable (numeric, some models only)
method – See statsmodels model.fit
maxiter – See statsmodels model.fit
start_params – See statsmodels model.fit
reduced – See
yli.IntervalCensoredCox()
bool_baselevels (bool) – Show reference categories for boolean independent variables even if reference category is False
exp (bool) – Report exponentiated parameters rather than raw parameters, default (None) is to autodetect based on model_class
- Return type:
- yli.vif(df, formula=None, *, nan_policy='warn')
Calculate the variance inflation factor (VIF) for each variable in df
- Parameters:
df (DataFrame) – Data to calculate the VIF for
formula (str) – If specified, calculate the VIF only for the variables in the formula
- Returns:
The variance inflation factors
- Return type:
Series
Example:
df = pd.DataFrame({ 'D': [68.58, 67.33, 67.33, ...], 'T1': [14, 10, 10, ...], 'T2': [46, 73, 85, ...], ... }) yli.vif(df[['D', 'T1', 'T2', ...]])
D 8.318301 T1 6.081590 T2 2.457122 ... dtype: float64
The output shows the variance inflation factor for each variable in df.
Regression models
- class yli.Logit
Logistic regression
Example:
df = pd.DataFrame({ 'Unhealthy': [False, False, False, ...], 'Fibrinogen': [2.52, 2.46, 2.29, ...], 'GammaGlobulin': [38, 36, 36, ...] }) yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
Logistic Regression Results ====================================================== Dep. Variable: Unhealthy | No. Observations: 32 Model: Logit | Df. Model: 2 Date: 2022-10-18 | Df. Residuals: 29 Time: 19:00:34 | Pseudo R²: 0.26 Std. Errors: Non-Robust | LL-Model: -11.47 | LL-Null: -15.44 | p (LR): 0.02* ====================================================== exp(β) (95% CI) p ----------------------------------------------- (Intercept) 0.00 (0.00 - 0.24) 0.03* Fibrinogen 6.80 (1.01 - 45.79) 0.049* GammaGlobulin 1.17 (0.92 - 1.48) 0.19 -----------------------------------------------
The output summarises the results of the regression. Note that the parameter estimates are automatically exponentiated. For example, the odds ratio for unhealthiness per unit increase in fibrinogen is 6.80, with 95% confidence interval 1.01–45.79, and is significant with p value 0.049.
- class yli.OrdinalLogit
Ordinal logistic (or probit) regression
The implementation calls statsmodels’ native OrderedModel, but substitutes an alternative parameterisation used by R and Stata. In the native statsmodels implementation, the first cutoff parameter is the true cutoff, but further cutoff parameter are log differences between consecutive cutoffs. In this parameterisation, cutoff terms are represented directly in the model.
Example:
df = pd.DataFrame(...) yli.regress(yli.OrdinalLogit, df, 'apply', 'pared + public + gpa', exp=False)
Ordinal Logistic Regression Results ========================================================== Dep. Variable: apply | No. Observations: 400 Model: Ordinal Logit | Df. Model: 5 Date: 2022-12-02 | Df. Residuals: 395 Time: 21:30:38 | Pseudo R²: 0.03 Std. Errors: Non-Robust | LL-Model: -358.51 | LL-Null: -370.60 | p (LR): <0.001* ============================================================ β (95% CI) p ------------------------------------------------------------ pared 1.05 (0.53 - 1.57) <0.001* public -0.06 (-0.64 - 0.53) 0.84 gpa 0.62 (0.10 - 1.13) 0.02* (Cutoffs) unlikely/somewhat likely 2.20 (0.68 - 3.73) 0.005* somewhat likely/very likely 4.30 (2.72 - 5.88) <0.001* ------------------------------------------------------------
The output summarises the result of the regression. The parameters shown under “(Cutoffs)” are the cutoff values in the latent variable parameterisation of ordinal regression. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.
- class yli.OLS
Ordinary least squares linear regression
Example:
df = pd.DataFrame(...) yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
Ordinary Least Squares Regression Results ======================================================= Dep. Variable: LNC | No. Observations: 32 Model: OLS | Df. Model: 10 Date: 2023-04-16 | Df. Residuals: 21 Time: 23:34:01 | R²: 0.86 Std. Errors: Non-Robust | F: 13.28 | p (F): <0.001* ======================================================= β (95% CI) p ---------------------------------------------- (Intercept) -10.63 (-22.51 - 1.24) 0.08 D 0.23 (0.05 - 0.41) 0.02* T1 0.01 (-0.04 - 0.05) 0.82 T2 0.01 (-0.00 - 0.02) 0.24 S 0.00 (0.00 - 0.00) <0.001* PR -0.11 (-0.28 - 0.07) 0.21 NE 0.26 (0.09 - 0.42) 0.004* CT 0.12 (-0.03 - 0.26) 0.12 BW 0.04 (-0.18 - 0.26) 0.73 N -0.01 (-0.03 - 0.00) 0.14 PT -0.22 (-0.49 - 0.05) 0.10 ----------------------------------------------
The output summarises the results of the regression. For example, the adjusted mean difference in “LNC” per unit increase in “D” is 0.23, with 95% confidence interval 0.05–0.41, and is significant with p value 0.02.
- class yli.PenalisedLogit
Firth penalised logistic regression
Uses the R logistf library.
Example:
df = pd.DataFrame({ 'Pred': [1] * 20 + [0] * 220, 'Outcome': [1] * 40 + [0] * 200 }) yli.regress(yli.PenalisedLogit, df, 'Outcome', 'Pred', exp=False)
Penalised Logistic Regression Results ============================================================ Dep. Variable: Outcome | No. Observations: 240 Model: Penalised Logit | Df. Model: 1 Date: 2022-10-19 | Pseudo R²: 0.37 Time: 07:50:40 | LL-Model: -66.43 Std. Errors: Non-Robust | LL-Null: -105.91 | p (LR): <0.001* ============================================================ β (95% CI) p --------------------------------------------- (Intercept) -2.28 (-2.77 - -1.85) <0.001* Pred 5.99 (3.95 - 10.85) <0.001* ---------------------------------------------
The output summarises the result of the regression. The summary table shows that a penalised method was applied. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.
- class yli.Poisson
Poisson regression
Example:
df = pd.DataFrame(...) yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
Poisson Regression Results ======================================================= Dep. Variable: num_awards | No. Observations: 200 Model: Poisson | Df. Model: 3 Date: 2023-04-22 | Df. Residuals: 196 Time: 16:58:21 | Pseudo R²: 0.21 Std. Errors: Non-Robust | LL-Model: -182.75 | LL-Null: -231.86 | p (LR): <0.001* ======================================================= exp(β) (95% CI) p -------------------------------------------- (Intercept) 0.01 (0.00 - 0.02) <0.001* prog 1 Ref. 2 2.96 (1.46 - 5.97) 0.002* 3 1.45 (0.61 - 3.44) 0.40 math 1.07 (1.05 - 1.10) <0.001* --------------------------------------------
The output summarises the results of the regression. For example, the adjusted incidence rate ratio in “num_awards” per unit increase in “math” is 1.07, with 95% confidence interval 1.05–1.10, and is significant with p value < 0.001.
Result classes
- class yli.regress.BrantResult(tests)
Result of a Brant test for ordinal regression
See
OrdinalLogit.brant()
.- summary()
Return a stringified summary of the χ2 test
- Return type:
str
- tests
Results for Brant test on each coefficient (Dict[str, ChiSquaredResult])
- class yli.regress.LikelihoodRatioTestResult(statistic, dof, pvalue)
Result of a likelihood ratio test for regression
See
RegressionModel.lrtest_null()
.- dof
Degrees of freedom for the χ2 distribution (int)
- pvalue
p value for the χ2 test (float)
- statistic
χ2 statistic (float)
- summary()
Return a stringified summary of the likelihood ratio test
- Return type:
str
- class yli.regress.RegressionModel
- bayesfactor_beta_zero(term)
Compute a Bayes factor testing the hypothesis that the given beta is 0
Uses the R BFpack library.
Requires the regression to be from statsmodels. The term must be specified as the raw name from the statsmodels regression, available via
SingleTerm.raw_name
.- Parameters:
term (str) – Raw name of the term to be tested
- Return type:
- deviance_chi2()
Perform the deviance χ2 test for goodness of fit
- Return type:
- ftest()
Perform the F test that all slopes are 0
- Return type:
- lrtest_null()
Compute the likelihood ratio test comparing the model to the null model
- Return type:
- property pseudo_rsquared
McFadden’s pseudo R2 statistic
- property pseudo_rsquared_adj
McFadden’s pseudo R2 statistic, adjusted for number of parameters
- shap(**kwargs)
Compute SHAP values for the model
Uses the Python shap library.
- Parameters:
kwargs – Keyword arguments to pass to shap.LinearExplainer
- Return type:
Reference: Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Guyon I, Von Luxburg U, Bengio S, et al., editors. Advances in Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
- summary()
Return a stringified summary of the regression model
- Return type:
str
- terms_flat()
Iterate over each
SingleTerm
in self.terms, recursively stepping throughCategoricalTerm
s
- class yli.shap.ShapResult(model, shap_values, features)
SHAP values for a regression model
See
yli.regress.RegressionModel.shap()
.- mean()
Compute the mean absolute SHAP value for each parameter
- Return type:
Series
- plot(**kwargs)
Generate a scatterplot of the SHAP values
Uses the Python matplotlib library.
- Return type:
(Figure, Axes)
Model terms
- class yli.regress.CategoricalTerm(categories, ref_category)
A group of terms in a
RegressionModel
corresponding to a categorical variable- categories
Terms for each of the categories, excluding the reference category (dict of
SingleTerm
)
- ref_category
Name of the reference category (str)
- class yli.regress.SingleTerm(raw_name, beta, pvalue)
A term in a
RegressionModel
which is a single term- beta
yli.utils.Estimate
of the coefficient
- pvalue
p value for the coefficient (float)
- raw_name
Raw name of the term (str)