Internal utilities

Data wrangling

yli.utils.as_2groups(df, data, group)

Group the data by the given variable, asserting that there are exactly 2 groups

Parameters:
  • df (DataFrame) – Data to group

  • group (str) – Column to group by

Returns:

(group1, data1, group2, data2)

  • group1, group2 (str) – The 2 values of the grouping variable

  • data1, data2 (DataFrame) – The 2 corresponding subsets of df
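
The grouping logic described above can be sketched as follows. This is a minimal illustration, not yli's actual implementation. The data parameter is not documented in this section, so the sketch assumes it names the column(s) to keep in each subset; the name as_2groups_sketch and the use of ValueError are likewise hypothetical.

```python
import pandas as pd

def as_2groups_sketch(df, data, group):
    # Map each distinct value of the grouping column to the row labels it covers
    groups = list(df.groupby(group).groups.items())

    # Refuse to continue unless there are exactly 2 groups
    if len(groups) != 2:
        raise ValueError('Got {} values for {}, expected 2'.format(len(groups), group))

    (group1, index1), (group2, index2) = groups

    # ASSUMPTION: data selects the column(s) returned for each group
    return group1, df.loc[index1, data], group2, df.loc[index2, data]

df = pd.DataFrame({'arm': ['a', 'b', 'a', 'b'], 'score': [1, 2, 3, 4]})
group1, data1, group2, data2 = as_2groups_sketch(df, ['score'], 'arm')
```

Passing a list of column names for data keeps each subset a DataFrame, matching the return types listed above.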

yli.utils.as_numeric(data)

Convert the data to a numeric type, factorising if required

Parameters:

data – Data to convert

Returns:

See pandas.factorize
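
Because the return value is deferred to pandas.factorize, a short illustration of that function may help. The example below calls pandas.factorize directly rather than as_numeric, and the sample data is made up; how as_numeric treats data that is already numeric is not shown.

```python
import pandas as pd

colours = pd.Series(['red', 'blue', 'red', 'green'])

# factorize returns a pair: integer codes (one per row) and the unique values,
# listed in order of first appearance, that those codes index into
codes, uniques = pd.factorize(colours)
```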

yli.utils.convert_pandas_nullable(df)

Convert pandas nullable dtypes (e.g. Int64) to non-nullable numpy dtypes

Behaviour on encountering NA values is undefined, so the data should be passed through check_nan() first.

Parameters:

df (DataFrame) – Data to check for pandas nullable dtypes

Returns:

Data with pandas nullable dtypes converted, which may or may not be copied

Return type:

DataFrame
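
One way such a conversion could work is sketched below, under stated assumptions: the function name convert_nullable_sketch and the dtype mapping are hypothetical and cover only a few nullable dtypes, and the data is assumed to contain no NA values. It is not yli's implementation.

```python
import pandas as pd

# ASSUMPTION: illustrative, non-exhaustive mapping from nullable dtype names
# to their numpy equivalents
NULLABLE_TO_NUMPY = {'Int64': 'int64', 'Float64': 'float64', 'boolean': 'bool'}

def convert_nullable_sketch(df):
    # Collect only the columns whose dtype is a known nullable dtype
    targets = {
        col: NULLABLE_TO_NUMPY[str(dtype)]
        for col, dtype in df.dtypes.items()
        if str(dtype) in NULLABLE_TO_NUMPY
    }
    if not targets:
        return df  # nothing to convert, so the input is returned uncopied
    return df.astype(targets)  # astype returns a new DataFrame

df = pd.DataFrame({'n': pd.array([1, 2, 3], dtype='Int64'), 'x': [0.5, 1.5, 2.5]})
converted = convert_nullable_sketch(df)
```

The early return shows why the result "may or may not be copied": a copy is only made when at least one column needs converting.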

p values

yli.utils.fmt_p(p, style)

Format p value for display

Parameters:
  • p (float) – p value to display

  • style (PValueStyle) – Style in which to format the p value

Returns:

Formatted p value

Return type:

str

class yli.utils.PValueStyle(value)

An enum.Flag representing how to render a p value

VALUE_ONLY

Display only the p value (e.g. 0.08, <0.001*)

This is an alias for specifying no flags.

RELATION

Force displaying a relational operator before the p value (e.g. = 0.08, < 0.001*)

TABULAR

Pad with spaces to ensure decimal points align (incompatible with RELATION)

HTML

Format as HTML (e.g. escape <)
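
The flag semantics above can be sketched with a stand-in enum and formatter. Everything here is an assumption made for illustration: the names PValueStyleSketch and fmt_p_sketch, the numeric flag values, the rounding rules, the 0.05 threshold for the asterisk, and the way TABULAR pads. The real fmt_p may differ on all of these.

```python
import enum
import html

class PValueStyleSketch(enum.Flag):
    VALUE_ONLY = 0  # no flags set
    RELATION = 1
    TABULAR = 2
    HTML = 4

def fmt_p_sketch(p, style=PValueStyleSketch.VALUE_ONLY, alpha=0.05):
    S = PValueStyleSketch
    if S.RELATION in style and S.TABULAR in style:
        raise ValueError('RELATION and TABULAR cannot be combined')

    # ASSUMPTION: values below 0.001 are reported as "<0.001"
    below_floor = p < 0.001
    if below_floor:
        number = '0.001'
    elif p < 0.01:
        number = '{:.3f}'.format(p)
    else:
        number = '{:.2f}'.format(p)

    # ASSUMPTION: an asterisk marks p < alpha
    star = '*' if p < alpha else ''

    if S.RELATION in style:
        prefix = '< ' if below_floor else '= '
    elif S.TABULAR in style:
        prefix = '<' if below_floor else ' '  # pad so decimal points line up
    else:
        prefix = '<' if below_floor else ''

    text = prefix + number + star
    if S.HTML in style:
        text = html.escape(text)  # e.g. "<" becomes "&lt;"
    return text
```

Flags combine with the | operator, so fmt_p_sketch(0.0004, PValueStyleSketch.RELATION | PValueStyleSketch.HTML) requests both a relational operator and HTML escaping.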

Formula manipulation

yli.utils.cols_for_formula(formula, df)

Return the columns corresponding to the Patsy formula

Parameters:
  • formula (str) – Patsy formula to parse

  • df (DataFrame) – Data to apply the formula on

Returns:

Columns in (the right-hand side of) the formula

Return type:

List[str]

yli.utils.formula_factor_ref_category(formula, df, factor)

Get the reference category for a term in a Patsy formula referring to a categorical factor

Parameters:
  • formula (str) – Patsy formula to parse

  • df (DataFrame) – Data to apply the formula on

  • factor – Factor to determine reference category for (e.g. Country, C(Country), C(Country, Treatment), C(Country, Treatment("Australia")))

Returns:

Reference category for the specified factor

yli.utils.parse_patsy_term(formula, df, term)

Parse a Patsy term into its component parts

Example: The term "C(x, Treatment(y))[T.z]" parses to ("C(x, Treatment(y))", "x", "z").

Returns:

(factor, column, contrast)

  • factor (str) – Name of the factor, as specified in the Patsy formula

  • column (str) – Name of the DataFrame column corresponding to the factor

  • contrast (str) – Name of the contrast for the factor, or None if not applicable
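
The decomposition in the example above can be approximated with plain string handling. This sketch is hypothetical: parse_term_sketch takes only the term, does not consult the formula or DataFrame as parse_patsy_term does, and assumes column names contain no commas.

```python
import re

def parse_term_sketch(term):
    # Split off an optional trailing "[T.level]" contrast suffix
    match = re.fullmatch(r'(?P<factor>.+?)(?:\[T\.(?P<contrast>.+)\])?', term)
    factor = match.group('factor')
    contrast = match.group('contrast')  # None when there is no suffix

    # For C(...) factors, the column is the first argument inside the parentheses
    if factor.startswith('C(') and factor.endswith(')'):
        column = factor[2:-1].split(',', 1)[0].strip()
    else:
        column = factor

    return factor, column, contrast

parsed = parse_term_sketch('C(x, Treatment(y))[T.z]')
```

A term with no contrast suffix, such as a continuous variable, yields None in the third position, consistent with the description of contrast above.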

Library style

For API nomenclature, the following guidelines are used:

  • Prefer to call a test by its specific name (e.g. anova rather than ftest where applicable), unless it is most commonly known only by the distribution of its test statistic (e.g. chi2, ttest).

  • A test/statistic is not referred to by both a distribution and a specific name (e.g. mannwhitney rather than mannwhitneyu), unless required for disambiguation (e.g. pearsonr to distinguish it from the Pearson χ² test).

  • The word “test” is omitted (e.g. chi2 rather than chi2test), unless the name would otherwise be a single letter (e.g. ttest, ftest), or unless required for disambiguation (e.g. LikelihoodRatioTestResult to distinguish from the unrelated meaning of “likelihood ratio” in epidemiology).

  • Underscores are usually omitted from the names of specific tests, test families and statistics (e.g. ttest, oddsratio, pearsonr, pvalue), but are used to separate these from other components (e.g. ttest_ind, anova_oneway, lrtest_null). There are a few exceptions (e.g. rank_biserial, pseudo_rsquared, f_statistic).

  • The result class for a test has the same naming convention as the test function (e.g. TTestResult for ttest_ind), with abbreviations spelled out (e.g. PearsonChiSquaredResult, LikelihoodRatioTestResult), unless the result class is generic among several tests (e.g. FTestResult for anova_oneway and RegressionResult.ftest), or unless required for disambiguation (e.g. PearsonChiSquaredResult for chi2, as there are other χ² tests).