Regression

Functions

yli.regress(model_class, df, dep, formula, *, nan_policy='warn', exposure=None, method=None, maxiter=None, start_params=None, reduced=None, bool_baselevels=False, exp=None)

Fit a statsmodels regression model

Parameters:
  • model_class (yli.regress.RegressionModel subclass) – Type of regression model to fit

  • df (DataFrame) – Data to perform regression on

  • dep (str) – Column(s) in df for the dependent variable (numeric)

  • formula (str) – Patsy formula for the regression model

  • nan_policy (str) – How to handle nan values (see NaN handling)

  • exposure (str) – Column in df for the exposure variable (numeric, some models only)

  • method – See statsmodels model.fit

  • maxiter – See statsmodels model.fit

  • start_params – See statsmodels model.fit

  • reduced – See yli.IntervalCensoredCox()

  • bool_baselevels (bool) – Show reference categories for boolean independent variables even if the reference category is False

  • exp (bool) – Report exponentiated parameters rather than raw parameters; the default (None) is to autodetect based on model_class

Return type:

yli.regress.RegressionModel

Example: See yli.OLS, yli.Logit, etc.

yli.vif(df, formula=None, *, nan_policy='warn')

Calculate the variance inflation factor (VIF) for each variable in df

Parameters:
  • df (DataFrame) – Data to calculate the VIF for

  • formula (str) – If specified, calculate the VIF only for the variables in the formula

Returns:

The variance inflation factors

Return type:

Series

Example:

df = pd.DataFrame({
        'D': [68.58, 67.33, 67.33, ...],
        'T1': [14, 10, 10, ...],
        'T2': [46, 73, 85, ...],
        ...
})
yli.vif(df[['D', 'T1', 'T2', ...]])
D     8.318301
T1    6.081590
T2    2.457122
...
dtype: float64

The output shows the variance inflation factor for each variable in df.

Regression models

class yli.Logit

Logistic regression

Example:

df = pd.DataFrame({
        'Unhealthy': [False, False, False, ...],
        'Fibrinogen': [2.52, 2.46, 2.29, ...],
        'GammaGlobulin': [38, 36, 36, ...]
})
yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
             Logistic Regression Results              
======================================================
Dep. Variable:  Unhealthy  |  No. Observations:     32
        Model:      Logit  |         Df. Model:      2
         Date: 2022-10-18  |     Df. Residuals:     29
         Time:   19:00:34  |         Pseudo R²:   0.26
  Std. Errors: Non-Robust  |          LL-Model: -11.47
                           |           LL-Null: -15.44
                           |            p (LR):  0.02*
======================================================
                exp(β)   (95% CI)          p   
-----------------------------------------------
  (Intercept)     0.00 (0.00 -  0.24)    0.03* 
   Fibrinogen     6.80 (1.01 - 45.79)    0.049*
GammaGlobulin     1.17 (0.92 -  1.48)    0.19  
-----------------------------------------------

The output summarises the results of the regression. Note that the parameter estimates are automatically exponentiated. For example, the odds ratio for unhealthiness per unit increase in fibrinogen is 6.80, with 95% confidence interval 1.01–45.79, and is significant with p value 0.049.

class yli.OrdinalLogit

Ordinal logistic (or probit) regression

The implementation calls statsmodels’ native OrderedModel, but substitutes an alternative parameterisation used by R and Stata. In the native statsmodels parameterisation, the first cutoff parameter is the true cutoff, but each further cutoff parameter is the log of the difference between consecutive cutoffs. In the parameterisation used here, each cutoff is represented directly in the model.
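
For clarity, a minimal numpy sketch of the relationship between the two parameterisations; the cutoff values are illustrative, taken from the example output below:

import numpy as np

# Native statsmodels parameterisation: the first cutoff, then the log
# differences between consecutive cutoffs
native_params = np.array([2.20, np.log(4.30 - 2.20)])

# Parameterisation used here: each cutoff represented directly
direct_cutoffs = np.concatenate((native_params[:1], np.exp(native_params[1:]))).cumsum()
# direct_cutoffs is approximately [2.20, 4.30], matching the "(Cutoffs)" rows below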

Example:

df = pd.DataFrame(...)
yli.regress(yli.OrdinalLogit, df, 'apply', 'pared + public + gpa', exp=False)
           Ordinal Logistic Regression Results            
==========================================================
Dep. Variable:         apply  |  No. Observations:     400
        Model: Ordinal Logit  |         Df. Model:       5
         Date:    2022-12-02  |     Df. Residuals:     395
         Time:      21:30:38  |         Pseudo R²:    0.03
  Std. Errors:    Non-Robust  |          LL-Model: -358.51
                              |           LL-Null: -370.60
                              |            p (LR): <0.001*
============================================================
                                β      (95% CI)         p   
------------------------------------------------------------
                      pared    1.05  (0.53 - 1.57)   <0.001*
                     public   -0.06 (-0.64 - 0.53)    0.84  
                        gpa    0.62  (0.10 - 1.13)    0.02* 
                  (Cutoffs)                                 
   unlikely/somewhat likely    2.20  (0.68 - 3.73)    0.005*
somewhat likely/very likely    4.30  (2.72 - 5.88)   <0.001*
------------------------------------------------------------

The output summarises the result of the regression. The parameters shown under “(Cutoffs)” are the cutoff values in the latent variable parameterisation of ordinal regression. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.

class yli.OLS

Ordinary least squares linear regression

Example:

df = pd.DataFrame(...)
yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
       Ordinary Least Squares Regression Results       
=======================================================
Dep. Variable:        LNC  |  No. Observations:      32
        Model:        OLS  |         Df. Model:      10
         Date: 2023-04-16  |     Df. Residuals:      21
         Time:   23:34:01  |                R²:    0.86
  Std. Errors: Non-Robust  |                 F:   13.28
                           |             p (F): <0.001*
=======================================================
                β        (95% CI)         p   
----------------------------------------------
(Intercept)   -10.63 (-22.51 - 1.24)    0.08  
          D     0.23   (0.05 - 0.41)    0.02* 
         T1     0.01  (-0.04 - 0.05)    0.82  
         T2     0.01  (-0.00 - 0.02)    0.24  
          S     0.00   (0.00 - 0.00)   <0.001*
         PR    -0.11  (-0.28 - 0.07)    0.21  
         NE     0.26   (0.09 - 0.42)    0.004*
         CT     0.12  (-0.03 - 0.26)    0.12  
         BW     0.04  (-0.18 - 0.26)    0.73  
          N    -0.01  (-0.03 - 0.00)    0.14  
         PT    -0.22  (-0.49 - 0.05)    0.10  
----------------------------------------------

The output summarises the results of the regression. For example, the adjusted mean difference in “LNC” per unit increase in “D” is 0.23, with 95% confidence interval 0.05–0.41, and is significant with p value 0.02.

class yli.PenalisedLogit

Firth penalised logistic regression

Uses the R logistf library.

Example:

df = pd.DataFrame({
        'Pred': [1] * 20 + [0] * 220,
        'Outcome': [1] * 40 + [0] * 200
})
yli.regress(yli.PenalisedLogit, df, 'Outcome', 'Pred', exp=False)
           Penalised Logistic Regression Results            
============================================================
Dep. Variable:         Outcome  |  No. Observations:     240
        Model: Penalised Logit  |         Df. Model:       1
         Date:      2022-10-19  |         Pseudo R²:    0.37
         Time:        07:50:40  |          LL-Model:  -66.43
  Std. Errors:      Non-Robust  |           LL-Null: -105.91
                                |            p (LR): <0.001*
============================================================
                β      (95% CI)          p   
---------------------------------------------
(Intercept)   -2.28 (-2.77 - -1.85)   <0.001*
       Pred    5.99  (3.95 - 10.85)   <0.001*
---------------------------------------------

The output summarises the result of the regression. The summary table shows that a penalised method was applied. Note that because exp=False was passed, the parameter estimates are not automatically exponentiated.

class yli.Poisson

Poisson regression

Example:

df = pd.DataFrame(...)
yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
               Poisson Regression Results              
=======================================================
Dep. Variable: num_awards  |  No. Observations:     200
        Model:    Poisson  |         Df. Model:       3
         Date: 2023-04-22  |     Df. Residuals:     196
         Time:   16:58:21  |         Pseudo R²:    0.21
  Std. Errors: Non-Robust  |          LL-Model: -182.75
                           |           LL-Null: -231.86
                           |            p (LR): <0.001*
=======================================================
              exp(β)   (95% CI)         p   
--------------------------------------------
(Intercept)     0.01 (0.00 - 0.02)   <0.001*
       prog                                 
          1     Ref.                        
          2     2.96 (1.46 - 5.97)    0.002*
          3     1.45 (0.61 - 3.44)    0.40  
       math     1.07 (1.05 - 1.10)   <0.001*
--------------------------------------------

The output summarises the results of the regression. For example, the adjusted incidence rate ratio in “num_awards” per unit increase in “math” is 1.07, with 95% confidence interval 1.05–1.10, and is significant with p value < 0.001.

Result classes

class yli.regress.BrantResult(tests)

Result of a Brant test for ordinal regression

See OrdinalLogit.brant().

summary()

Return a stringified summary of the χ2 test

Return type:

str

tests

Results of the Brant test for each coefficient (Dict[str, ChiSquaredResult])
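
A minimal usage sketch, assuming the ordinal model is fitted as in the yli.OrdinalLogit example above:

model = yli.regress(yli.OrdinalLogit, df, 'apply', 'pared + public + gpa')
brant_result = model.brant()     # Brant test of the proportional odds assumption
print(brant_result.summary())    # stringified summary of the χ2 test for each coefficient
# brant_result.tests maps each coefficient name to a ChiSquaredResult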

class yli.regress.LikelihoodRatioTestResult(statistic, dof, pvalue)

Result of a likelihood ratio test for regression

See RegressionModel.lrtest_null().

dof

Degrees of freedom for the χ2 distribution (int)

pvalue

p value for the χ2 test (float)

statistic

χ2 statistic (float)

summary()

Return a stringified summary of the likelihood ratio test

Return type:

str

class yli.regress.RegressionModel
bayesfactor_beta_zero(term)

Compute a Bayes factor testing the hypothesis that the given beta is 0

Uses the R BFpack library.

Requires the regression to be from statsmodels. The term must be specified as the raw name from the statsmodels regression, available via SingleTerm.raw_name.

Parameters:

term (str) – Raw name of the term to be tested

Return type:

yli.bayes_factors.BayesFactor
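
A sketch of looking up a raw term name via terms_flat() and testing it, using the yli.Logit example above; the raw name 'Fibrinogen' is an assumption about how statsmodels labels the term:

model = yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
raw_names = [term.raw_name for term in model.terms_flat()]   # names accepted by bayesfactor_beta_zero
bf = model.bayesfactor_beta_zero('Fibrinogen')               # assumes 'Fibrinogen' appears in raw_names
# bf is a yli.bayes_factors.BayesFactor for the hypothesis that the beta is 0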

deviance_chi2()

Perform the deviance χ2 test for goodness of fit

Return type:

yli.sig_tests.ChiSquaredResult
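
A minimal sketch, assuming the model is fitted as in the yli.Poisson example above:

model = yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
gof = model.deviance_chi2()    # yli.sig_tests.ChiSquaredResult for the goodness-of-fit test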

ftest()

Perform the F test that all slopes are 0

Return type:

yli.sig_tests.FTestResult
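
A minimal sketch, with the model fitted as in the yli.OLS example above:

model = yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
f_result = model.ftest()    # yli.sig_tests.FTestResult for the test that all slopes are 0
# corresponds to the F and p (F) entries in the OLS summary table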

lrtest_null()

Compute the likelihood ratio test comparing the model to the null model

Return type:

LikelihoodRatioTestResult
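
A minimal sketch, with the model fitted as in the yli.Logit example above:

model = yli.regress(yli.Logit, df, 'Unhealthy', 'Fibrinogen + GammaGlobulin')
lrtest = model.lrtest_null()
print(lrtest.summary())    # χ2 statistic, degrees of freedom and p value
# this is the test reported as p (LR) in the regression summary table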

property pseudo_rsquared

McFadden’s pseudo R2 statistic

property pseudo_rsquared_adj

McFadden’s pseudo R2 statistic, adjusted for number of parameters

shap(**kwargs)

Compute SHAP values for the model

Uses the Python shap library.

Parameters:

kwargs – Keyword arguments to pass to shap.LinearExplainer

Return type:

yli.shap.ShapResult

Reference: Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Guyon I, Von Luxburg U, Bengio S, et al., editors. Advances in Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html

summary()

Return a stringified summary of the regression model

Return type:

str

terms_flat()

Iterate over each SingleTerm in self.terms, recursively stepping through CategoricalTerms

class yli.shap.ShapResult(model, shap_values, features)

SHAP values for a regression model

See yli.regress.RegressionModel.shap().

mean()

Compute the mean absolute SHAP value for each parameter

Return type:

Series

plot(**kwargs)

Generate a scatterplot of the SHAP values

Uses the Python matplotlib library.

Return type:

(Figure, Axes)
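
A minimal sketch of computing and inspecting SHAP values, assuming the model is fitted as in the yli.OLS example above:

model = yli.regress(yli.OLS, df, 'LNC', 'D + T1 + T2 + S + PR + NE + CT + BW + N + PT')
shap_result = model.shap()      # keyword arguments are passed to shap.LinearExplainer
shap_result.mean()              # mean absolute SHAP value per parameter (Series)
fig, ax = shap_result.plot()    # scatterplot of the SHAP values (matplotlib Figure, Axes)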

Model terms

class yli.regress.CategoricalTerm(categories, ref_category)

A group of terms in a RegressionModel corresponding to a categorical variable

categories

Terms for each of the categories, excluding the reference category (dict of SingleTerm)

ref_category

Name of the reference category (str)

class yli.regress.SingleTerm(raw_name, beta, pvalue)

A single term in a RegressionModel, representing one coefficient

beta

yli.utils.Estimate of the coefficient

pvalue

p value for the coefficient (float)

raw_name

Raw name of the term (str)
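
A sketch of walking the terms of a fitted model, assuming terms is a mapping from term names to SingleTerm and CategoricalTerm objects (the yli.Poisson example above includes the categorical C(prog)):

model = yli.regress(yli.Poisson, df, 'num_awards', 'C(prog) + math')
for name, term in model.terms.items():    # assumption: terms is a dict keyed by term name
        if isinstance(term, yli.regress.CategoricalTerm):
                print(name, 'reference:', term.ref_category, 'levels:', list(term.categories))
        else:
                print(name, term.raw_name, term.beta, term.pvalue)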