Internal utilities

Data wrangling

yli.utils.as_2groups(df, data, group)

Group the data by the given variable, asserting that there are exactly 2 groups

Parameters:

df (DataFrame) – Data to group
group (str) – Column to group by

Returns:

(group1, data1, group2, data2)

group1, group2 (str) – The 2 values of the grouping variable
data1, data2 (DataFrame) – The 2 corresponding subsets of df
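
The grouping behaviour described above can be sketched with pandas roughly as follows. This is a minimal illustration and not yli's implementation: the helper name split_into_2groups, its two-argument signature and the sample data are all hypothetical.

```python
import pandas as pd

def split_into_2groups(df, group):
    """Hypothetical helper: split df by the column `group`, requiring exactly 2 groups."""
    # groupby sorts the group keys by default, so the order is deterministic
    groups = list(df.groupby(group))
    if len(groups) != 2:
        raise ValueError(f"Got {len(groups)} values for {group}, expected 2")
    (group1, data1), (group2, data2) = groups
    return group1, data1, group2, data2

# Sample data (made up for illustration)
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Height": [160.0, 175.0, 165.0, 180.0],
})
group1, data1, group2, data2 = split_into_2groups(df, "Sex")
```

Raising an error when the number of groups is not 2 means a mis-specified grouping column is caught before any two-sample comparison is attempted.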

yli.utils.as_numeric(data)

Convert the data to a numeric type, factorising if required

Parameters:

data – Data to convert

Returns:

See pandas.factorize
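
Since the return value is described by reference to pandas.factorize, the following shows what that pandas function returns on its own. The sample data is made up, and how as_numeric treats data that is already numeric, or whether it sorts the categories, is not stated here and is not assumed.

```python
import pandas as pd

# Made-up categorical data for illustration
data = pd.Series(["low", "high", "low", "medium"])

# pandas.factorize returns (codes, uniques): integer codes for each
# observation, and the distinct values in order of first appearance
codes, uniques = pd.factorize(data)
```

Each code indexes into uniques, so the original values can be recovered with uniques[codes].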

yli.utils.convert_pandas_nullable(df)

Convert pandas nullable dtypes (e.g. Int64) to non-nullable numpy dtypes

Behaviour on encountering NA values is undefined, so the data should be passed through check_nan() first.

Parameters:

df (DataFrame) – Data to check for pandas nullable dtypes

Returns:

Data with pandas nullable dtypes converted, which may or may not be copied

Return type:

DataFrame
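
To illustrate the kind of conversion involved, the sketch below casts a nullable Int64 column to the numpy int64 dtype using plain pandas. It is not yli's implementation; the column name and data are made up, and the data deliberately contains no NA values.

```python
import pandas as pd

# Made-up data using the nullable Int64 extension dtype, with no NA values
df = pd.DataFrame({"x": pd.array([1, 2, 3], dtype="Int64")})

# Cast the nullable column to the corresponding numpy dtype
converted = df.astype({"x": "int64"})
```

A numpy integer dtype has no way to represent a missing value, which is one reason NA values need to be dealt with before a conversion of this kind.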

p values

yli.utils.fmt_p(p, style)

Format p value for display

Parameters:

p (float) – p value to display
style (PValueStyle) – Style in which to format the p value

Returns:

Formatted p value

Return type:

str

class yli.utils.PValueStyle(value)

An enum.Flag representing how to render a p value

VALUE_ONLY

Display only the p value (e.g. 0.08, <0.001*)

This is an alias for specifying no flags.

RELATION: Force displaying a relational operator before the p value (e.g. = 0.08, < 0.001*)

TABULAR: Pad with spaces to ensure decimal points align (incompatible with RELATION)

HTML: Format as HTML (e.g. escape <)

Formula manipulation

yli.utils.cols_for_formula(formula, df)

Return the columns corresponding to the Patsy formula

Parameters:

formula (str) – Patsy formula to parse
df (DataFrame) – Data to apply the formula on

Returns:

Columns in (the right-hand side of) the formula

Return type:

List[str]

yli.utils.formula_factor_ref_category(formula, df, factor)

Get the reference category for a term in a Patsy formula referring to a categorical factor

Parameters:

formula (str) – Patsy formula to parse
df (DataFrame) – Data to apply the formula on
factor – Factor to determine reference category for (e.g. Country, C(Country), C(Country, Treatment), C(Country, Treatment("Australia")))

Returns:

Reference category for the specified factor

yli.utils.parse_patsy_term(formula, df, term)

Parse a Patsy term into its component parts

Example: The term "C(x, Treatment(y))[T.z]" parses to ("C(x, Treatment(y))", "x", "z").

Returns:

(factor, column, contrast)

factor (str) – Name of the factor, as specified in the Patsy formula
column (str) – Name of the DataFrame column corresponding to the factor
contrast (str) – Name of the contrast for the factor, or None if not applicable
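
The decomposition in the example above can be approximated with a string-only parse, sketched below. This is not how yli does it: split_term is a hypothetical helper that ignores the formula and df arguments, handles only a single factor, and assumes the column is the first argument of C(...).

```python
import re

def split_term(term):
    """Hypothetical string-only parse of a single-factor Patsy term."""
    # Split off a trailing contrast such as "[T.z]", if present
    match = re.fullmatch(
        r"(?P<factor>.*?)(?:\[(?:T\.)?(?P<contrast>[^\[\]]*)\])?", term
    )
    factor = match.group("factor")
    contrast = match.group("contrast")  # None when there is no "[...]" suffix
    # For C(...) factors, take the first argument as the column name
    c_match = re.fullmatch(r"C\(\s*([^,()]+?)\s*(?:,.*)?\)", factor)
    column = c_match.group(1) if c_match else factor
    return factor, column, contrast

parsed = split_term("C(x, Treatment(y))[T.z]")
```

A term with no contrast suffix, such as a plain numeric column, yields None for the contrast, matching the "or None if not applicable" case above.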

Library style

For API nomenclature, the following guidelines are used:

Prefer to call a test by its specific name (e.g. anova rather than ftest where applicable), unless it is most commonly known only by the distribution of the test statistic (e.g. chi2, ttest).

A test/statistic is not referred to by both a distribution and a specific name (e.g. mannwhitney rather than mannwhitneyu), unless required for disambiguation (e.g. pearsonr to distinguish it from the Pearson χ² test).

The word “test” is omitted (e.g. chi2 rather than chi2test), unless the name would otherwise be a single letter (e.g. ttest, ftest), or unless required for disambiguation (e.g. LikelihoodRatioTestResult to distinguish from the unrelated meaning of “likelihood ratio” in epidemiology).

Underscores are usually omitted from the names of specific tests, test families and statistics (e.g. ttest, oddsratio, pearsonr, pvalue), but are used to separate these from other components (e.g. ttest_ind, anova_oneway, lrtest_null). There are a few exceptions (e.g. rank_biserial, pseudo_rsquared, f_statistic).

The result class for a test has the same naming convention as the test function (e.g. TTestResult for ttest_ind), with abbreviations spelled out (e.g. PearsonChiSquaredResult, LikelihoodRatioTestResult), unless the result class is generic among several tests (e.g. FTestResult for anova_oneway and RegressionResult.ftest), or unless required for disambiguation (e.g. PearsonChiSquaredResult for chi2, as there are other χ² tests).