Regression

Functions

yli.regress(model_class, df, dep, formula, *, nan_policy='warn', exposure=None, method=None, maxiter=None, start_params=None, reduced=None, bool_baselevels=False, exp=None)

Fit a statsmodels regression model

Parameters:
  • model_class (yli.regress.RegressionModel subclass) – Type of regression model to fit

  • df (DataFrame) – Data to perform regression on

  • dep (str) – Column(s) in df for the dependent variable (numeric)

  • formula (str) – Patsy formula for the regression model

  • nan_policy (str) – How to handle nan values (see NaN handling)

  • exposure (str) – Column in df for the exposure variable (numeric, some models only)

  • method – See statsmodels model.fit

  • maxiter – See statsmodels model.fit

  • start_params – See statsmodels model.fit

  • reduced – See yli.IntervalCensoredCox()

  • bool_baselevels (bool) – Show reference categories for boolean independent variables even if the reference category is False

  • exp (bool) – Report exponentiated parameters rather than raw parameters; the default (None) is to autodetect based on model_class

Return type:

yli.regress.RegressionModel

Example: See yli.OLS, yli.Logit, etc.

yli.vif(df, formula=None, *, nan_policy='warn')

Calculate the variance inflation factor (VIF) for each variable in df

Parameters:
  • df (DataFrame) – Data to calculate the VIF for

  • formula (str) – If specified, calculate the VIF only for the variables in the formula

Returns:

The variance inflation factors

Return type:

Series

Example:

df = pd.DataFrame({
        'D': [68.58, 67.33, 67.33, ...],
        'T1': [14, 10, 10, ...],
        'T2': [46, 73, 85, ...],
        ...
})
yli.vif(df[['D', 'T1', 'T2', ...]])
D     8.318301
T1    6.081590
T2    2.457122
...
dtype: float64

The output shows the variance inflation factor for each variable in df.

Regression models

class yli.Logit

Logistic regression

Example:

df = pd.DataFrame({
        'Unhealthy': [False, False, False, ...],
        'Fibrinogen': [2.52, 2.46, 2.29, ...],
        'GammaGlobulin': [38, 36, 36, ...]
})
yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
             Logistic Regression Results              
======================================================
Dep. Variable:  Unhealthy  |  No. Observations:     32
        Model:      Logit  |         Df. Model:      2
         Date: 2022-10-18  |     Df. Residuals:     29
         Time:   19:00:34  |         Pseudo R²:   0.26
  Std. Errors: Non-Robust  |          LL-Model: -11.47
                           |           LL-Null: -15.44
                           |            p (LR):  0.02*
======================================================
                exp(β)   (95% CI)          p   
-----------------------------------------------
  (Intercept)     0.00 (0.00 -  0.24)    0.03* 
   Fibrinogen     6.80 (1.01 - 45.79)    0.049*
GammaGlobulin     1.17 (0.92 -  1.48)    0.19  
-----------------------------------------------

The output summarises the results of the regression. Note that the parameter estimates are automatically exponentiated. For example, the odds ratio for unhealthiness per unit increase in fibrinogen is 6.80, with 95% confidence interval 1.01–45.79, and is significant with p value 0.049.

class yli.OrdinalLogit

Ordinal logistic (or probit) regression

The implementation calls statsmodels’ native OrderedModel, but substitutes an alternative parameterisation used by R and Stata. In the native statsmodels parameterisation, the first cutoff parameter is the true cutoff, but each further cutoff parameter is the log of the difference between consecutive cutoffs. In the parameterisation used here, each cutoff is represented directly in the model.
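
For clarity, a minimal numpy sketch of the relationship between the two parameterisations; the cutoff values are illustrative, taken from the example output below:

import numpy as np

# Native statsmodels parameterisation: the first cutoff, then the log
# differences between consecutive cutoffs
native_params = np.array([2.20, np.log(4.30 - 2.20)])

# Parameterisation used here: each cutoff represented directly
direct_cutoffs = np.concatenate((native_params[:1], np.exp(native_params[1:]))).cumsum()
# direct_cutoffs is approximately [2.20, 4.30], matching the "(Cutoffs)" rows below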

Example:

df = pd.DataFrame(...)
yli.regress(yli.OrdinalLogit, df, 'apply', 'pared + public + gpa', exp=False)
           Ordinal Logistic Regression Results            
==========================================================
Dep. Variable:         apply  |  No. Observations:     400
        Model: Ordinal Logit  |         Df. Model:       5
         Date:    2022-12-02  |     Df. Residuals:     395
         Time:      21:30:38  |         Pseudo R²:    0.03
  Std. Errors:    Non-Robust  |          LL-Model: -358.51
                              |           LL-Null: -370.60
                              |            p (LR): <0.001*
============================================================
                                β      (95% CI)         p   
------------------------------------------------------------
                      pared    1.05  (0.53 - 1.57)   <0.001*
                     public   -0.06 (-0.64 - 0.53)    0.84  
                        gpa    0.62  (0.10 - 1.13)    0.02* 
                  (Cutoffs)                                 
   unlikely/somewhat likely    2.20  (0.68 - 3.73)    0.005*
somewhat likely/very likely    4.30  (2.72 - 5.88)   <0.001*
------------------------------------------------------------

The output summarises the result of the regression. The parameters shown under “(Cutoffs)” are the cutoff values in the latent variable parameterisation of ordinal regression. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.

class yli.OLS

Ordinary least squares linear regression

Example:

df = pd.DataFrame(...)
yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
       Ordinary Least Squares Regression Results       
=======================================================
Dep. Variable:        LNC  |  No. Observations:      32
        Model:        OLS  |         Df. Model:      10
         Date: 2023-04-16  |     Df. Residuals:      21
         Time:   23:34:01  |                R²:    0.86
  Std. Errors: Non-Robust  |                 F:   13.28
                           |             p (F): <0.001*
=======================================================
                β        (95% CI)         p   
----------------------------------------------
(Intercept)   -10.63 (-22.51 - 1.24)    0.08  
          D     0.23   (0.05 - 0.41)    0.02* 
         T1     0.01  (-0.04 - 0.05)    0.82  
         T2     0.01  (-0.00 - 0.02)    0.24  
          S     0.00   (0.00 - 0.00)   <0.001*
         PR    -0.11  (-0.28 - 0.07)    0.21  
         NE     0.26   (0.09 - 0.42)    0.004*
         CT     0.12  (-0.03 - 0.26)    0.12  
         BW     0.04  (-0.18 - 0.26)    0.73  
          N    -0.01  (-0.03 - 0.00)    0.14  
         PT    -0.22  (-0.49 - 0.05)    0.10  
----------------------------------------------

The output summarises the results of the regression. For example, the adjusted mean difference in “LNC” per unit increase in “D” is 0.23, with 95% confidence interval 0.05–0.41, and is significant with p value 0.02.

class yli.PenalisedLogit

Firth penalised logistic regression

Uses the R logistf library.

Example:

df = pd.DataFrame({
        'Pred': [1] * 20 + [0] * 220,
        'Outcome': [1] * 40 + [0] * 200
})
yli.regress(yli.PenalisedLogit, df, 'Outcome', 'Pred', exp=False)
           Penalised Logistic Regression Results            
============================================================
Dep. Variable:         Outcome  |  No. Observations:     240
        Model: Penalised Logit  |         Df. Model:       1
         Date:      2022-10-19  |         Pseudo R²:    0.37
         Time:        07:50:40  |          LL-Model:  -66.43
  Std. Errors:      Non-Robust  |           LL-Null: -105.91
                                |            p (LR): <0.001*
============================================================
                β      (95% CI)          p   
---------------------------------------------
(Intercept)   -2.28 (-2.77 - -1.85)   <0.001*
       Pred    5.99  (3.95 - 10.85)   <0.001*
---------------------------------------------

The output summarises the result of the regression. The summary table shows that a penalised method was applied. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.

class yli.Poisson

Poisson regression

Example:

df = pd.DataFrame(...)
yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
               Poisson Regression Results              
=======================================================
Dep. Variable: num_awards  |  No. Observations:     200
        Model:    Poisson  |         Df. Model:       3
         Date: 2023-04-22  |     Df. Residuals:     196
         Time:   16:58:21  |         Pseudo R²:    0.21
  Std. Errors: Non-Robust  |          LL-Model: -182.75
                           |           LL-Null: -231.86
                           |            p (LR): <0.001*
=======================================================
              exp(β)   (95% CI)         p   
--------------------------------------------
(Intercept)     0.01 (0.00 - 0.02)   <0.001*
       prog                                 
          1     Ref.                        
          2     2.96 (1.46 - 5.97)    0.002*
          3     1.45 (0.61 - 3.44)    0.40  
       math     1.07 (1.05 - 1.10)   <0.001*
--------------------------------------------

The output summarises the results of the regression. For example, the adjusted incidence rate ratio in “num_awards” per unit increase in “math” is 1.07, with 95% confidence interval 1.05–1.10, and is significant with p value < 0.001.

Result classes

class yli.regress.BrantResult(tests)

Result of a Brant test for ordinal regression

See OrdinalLogit.brant().

summary()

Return a stringified summary of the χ2 test

Return type:

str

tests

Results of the Brant test for each coefficient (Dict[str, ChiSquaredResult])
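
A minimal usage sketch, assuming the ordinal model is fitted as in the yli.OrdinalLogit example above:

model = yli.regress(yli.OrdinalLogit, df, 'apply', 'pared + public + gpa')
brant_result = model.brant()     # Brant test of the proportional odds assumption
print(brant_result.summary())    # stringified summary of the χ2 test for each coefficient
# brant_result.tests maps each coefficient name to a ChiSquaredResult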

class yli.regress.LikelihoodRatioTestResult(statistic, dof, pvalue)

Result of a likelihood ratio test for regression

See RegressionModel.lrtest_null().

dof

Degrees of freedom for the χ2 distribution (int)

pvalue

p value for the χ2 test (float)

statistic

χ2 statistic (float)

summary()

Return a stringified summary of the likelihood ratio test

Return type:

str

class yli.regress.RegressionModel
bayesfactor_beta_zero(term)

Compute a Bayes factor testing the hypothesis that the given beta is 0

Uses the R BFpack library.

Requires the regression to be from statsmodels. The term must be specified as the raw name from the statsmodels regression, available via SingleTerm.raw_name.

Parameters:

term (str) – Raw name of the term to be tested

Return type:

yli.bayes_factors.BayesFactor
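
A sketch of looking up a raw term name via terms_flat() and testing it, using the yli.Logit example above; the raw name 'Fibrinogen' is an assumption about how statsmodels labels the term:

model = yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
raw_names = [term.raw_name for term in model.terms_flat()]   # names accepted by bayesfactor_beta_zero
bf = model.bayesfactor_beta_zero('Fibrinogen')               # assumes 'Fibrinogen' appears in raw_names
# bf is a yli.bayes_factors.BayesFactor for the hypothesis that the beta is 0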

deviance_chi2()

Perform the deviance χ2 test for goodness of fit

Return type:

yli.sig_tests.ChiSquaredResult
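
A minimal sketch, assuming the model is fitted as in the yli.Poisson example above:

model = yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
gof = model.deviance_chi2()    # yli.sig_tests.ChiSquaredResult for the goodness-of-fit test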

ftest()

Perform the F test that all slopes are 0

Return type:

yli.sig_tests.FTestResult
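
A minimal sketch, with the model fitted as in the yli.OLS example above:

model = yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
f_result = model.ftest()    # yli.sig_tests.FTestResult for the test that all slopes are 0
# corresponds to the F and p (F) entries in the OLS summary table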

lrtest_null()

Compute the likelihood ratio test comparing the model to the null model

Return type:

LikelihoodRatioTestResult
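
A minimal sketch, with the model fitted as in the yli.Logit example above:

model = yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
lrtest = model.lrtest_null()
print(lrtest.summary())    # χ2 statistic, degrees of freedom and p value
# this is the test reported as p (LR) in the regression summary table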

property pseudo_rsquared

McFadden’s pseudo R2 statistic

property pseudo_rsquared_adj

McFadden’s pseudo R2 statistic, adjusted for number of parameters

shap(**kwargs)

Compute SHAP values for the model

Uses the Python shap library.

Parameters:

kwargs – Keyword arguments to pass to shap.LinearExplainer

Return type:

yli.shap.ShapResult

Reference: Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Guyon I, Von Luxburg U, Bengio S, et al., editors. Advances in Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html

summary()

Return a stringified summary of the regression model

Return type:

str

terms_flat()

Iterate over each SingleTerm in self.terms, recursively stepping through CategoricalTerms

class yli.shap.ShapResult(model, shap_values, features)

SHAP values for a regression model

See yli.regress.RegressionModel.shap().

mean()

Compute the mean absolute SHAP value for each parameter

Return type:

Series

plot(**kwargs)

Generate a scatterplot of the SHAP values

Uses the Python matplotlib library.

Return type:

(Figure, Axes)
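
A minimal sketch of computing and inspecting SHAP values, assuming the model is fitted as in the yli.OLS example above:

model = yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
shap_result = model.shap()      # keyword arguments are passed to shap.LinearExplainer
shap_result.mean()              # mean absolute SHAP value per parameter (Series)
fig, ax = shap_result.plot()    # scatterplot of the SHAP values (matplotlib Figure, Axes)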

Model terms

class yli.regress.CategoricalTerm(categories, ref_category)

A group of terms in a RegressionModel corresponding to a categorical variable

categories

Terms for each of the categories, excluding the reference category (dict of SingleTerm)

ref_category

Name of the reference category (str)

class yli.regress.SingleTerm(raw_name, beta, pvalue)

A single term in a RegressionModel, representing one coefficient

beta

yli.utils.Estimate of the coefficient

pvalue

p value for the coefficient (float)

raw_name

Raw name of the term (str)
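
A sketch of walking the terms of a fitted model, assuming terms is a mapping from term names to SingleTerm and CategoricalTerm objects (the yli.Poisson example above includes the categorical C(prog)):

model = yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
for name, term in model.terms.items():    # assumption: terms is a dict keyed by term name
        if isinstance(term, yli.regress.CategoricalTerm):
                print(name, 'reference:', term.ref_category, 'levels:', list(term.categories))
        else:
                print(name, term.raw_name, term.beta, term.pvalue)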