Combine discrete p values (in Python)

This module provides a toolbox for combining p values of rank tests and other tests with a discrete null distribution. The focus lies on the analysis of segmented datasets as opposed to a meta-analysis of separate studies, although the latter is also possible.

Installation

To install from PyPI use something along the lines of:

pip install combine-pvalues-discrete

Or use the following to directly install from GitHub:

pip install git+git://github.com/BPSB/combine-p-values-discrete

When do you need this?

If you want a more hands-on introduction to what kinds of problems this module handles and how it can make a difference, feel free to first read An extensive example.

This module has a scope similar to SciPy’s combine_pvalues:

  • You have a dataset consisting of independent sub-datasets. (So this is not about multiple testing or pseudo-replication.)

  • For each sub-dataset, you have performed a test investigating the same or similar research hypothesis. (Often, this is the same one-sided test and the sub-datasets only differ in size.)

  • There is no straightforward test to apply to the entire dataset.

  • You want a single p value for the null hypothesis taking into account the entire dataset, i.e., you want to combine your test results for the sub-datasets.

However, SciPy’s combine_pvalues assumes that the individual tests are continuous (see below for a definition), and applying it to discrete tests will yield a systematically wrong combined p value [Kincaid1962], [Mielke2004]. For example, for Fisher’s method it systematically overestimates the p value, i.e., you may falsely accept the null hypothesis (false negative). This module addresses this and thus you should consider it if:

  • At least one of the sub-tests is discrete with a low number of possible p values. What counts as a “low number” depends on the details, but 30 almost always qualifies.

  • The combined p value returned by combine_pvalues is not very low already.

See An extensive example for a case where combining p values only yields the correct result when the discreteness of the tests is taken into account.

As a side product, this module also implements Monte Carlo-based weighted variants of combining methods other than Stouffer’s, which combine_pvalues does not provide.

Discrete and continuous tests

If the null hypothesis of a given test holds, its p values are uniformly distributed on the interval \((0,1]\) in the sense that \(\text{CDF}(p_0) ≡ P(p≤p_0) = p_0\). However, for some tests, there is a limited number of possible outcomes for a given sample size. For the purposes of this module, I call such tests discrete. By contrast, for a continuous test, all values on the interval \((0,1]\) are possible outcomes (for any given sample size).

For example, the sign test is discrete: Its one-sample variant evaluates how often each sign occurs in the dataset. Therefore, for a sample size of five, every dataset can be boiled down to one of six possible scenarios: \([+++++]\), \([++++-]\), …, \([-----]\). For the one-sided test, these correspond to the p values \(\frac{1}{32}\), \(\frac{3}{16}\), \(\frac{1}{2}\), \(\frac{13}{16}\), \(\frac{31}{32}\), and \(1\). These six p values are the only possible outcomes of the test.
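
These six values can be reproduced with a quick binomial computation (a small sketch, assuming the one-sided alternative that plus signs dominate):

from scipy.stats import binom

# p value for observing k or more plus signs amongst five under the null:
possible_ps = [ binom.sf(k-1,5,0.5) for k in range(5,-1,-1) ]
print(possible_ps)
# [0.03125, 0.1875, 0.5, 0.8125, 0.96875, 1.0]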

By contrast, the t test is continuous: Even for a sample size of two, every value in the interval \((0,1]\) can be obtained with the right data. For example, for the dataset \([98,99]\), the one-sided, one-sample variant of the test yields \(p=0.0016\).
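
This number can be reproduced with SciPy’s one-sample t test (a sketch assuming a null mean of zero):

from scipy.stats import ttest_1samp

print( ttest_1samp([98,99],0,alternative="greater").pvalue )
# ≈ 0.0016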

Discrete tests include all rank tests, since there is only a finite number of ways to rank a given number of samples. Moreover, they include tests of bounded integer data. The most relevant discrete tests are:

  • the sign test,

  • Wilcoxon’s signed rank test,

  • the Mann–Whitney U test,

  • any test based on a ranked correlation such as Kendall’s τ and Spearman’s ρ,

  • Boschloo’s exact test and any other test for integer contingency tables.

Tests whose result continuously depends on the samples are continuous. The most relevant continuous tests are:

  • all flavours of the t test,

  • the test for significance of Pearson’s r.

Tests such as the Kruskal–Wallis test, ANOVA or the Kolmogorov–Smirnov test are not listed above because I cannot imagine a scenario where combining their p values makes sense.

How this module works

To correctly compute the combined p value, we need to take into account the null distributions of the individual tests, i.e., what p values are possible. This module determines these values for popular tests or lets you specify them yourself. Of course, if you have continuous tests in the mix, you can also include them. Either way, the relevant information is stored in a CTR object (“combinable test result”). These objects can then be combined using the combine function.

The difficulty in determining the combined p value lies in convolving the respective null distributions. While this is analytically possible for continuous tests or a small number of discrete tests, it requires numerical approximations otherwise due to a combinatorial explosion (e.g., even for the small dataset in An extensive example, we would have to handle 53508000 combinations). To perform these approximations, we use a Monte Carlo sampling of combinations of individual p values (from the respective null distributions). Thanks to modern computing and NumPy, it is easy to make the number of samples very high and the result very accurate.
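
To sketch the idea for Fisher’s method (this is only an illustration, not the module’s actual implementation): under the null hypothesis, each possible p value of a discrete test occurs with a probability equal to its gap to the next-lower possible value. We can sample combinations of such p values, compute Fisher’s statistic for each, and compare with the statistic of the observed p values:

import numpy as np

rng = np.random.default_rng()

def sample_null_ps(all_ps,size):
	# Under the null hypothesis, each possible p value occurs with a probability
	# equal to its gap to the next-lower possible value (assuming the largest
	# possible p value is 1, as for any valid test).
	all_ps = np.sort(all_ps)
	probs = np.diff(np.concatenate(([0.0],all_ps)))
	return rng.choice(all_ps,p=probs,size=size)

def fisher_combine(observed_ps,null_distributions,n_samples=100000):
	# Fisher's statistic; here, large values indicate significance.
	observed_stat = -2*np.sum(np.log(observed_ps))
	sampled = np.array([
			sample_null_ps(dist,n_samples)
			for dist in null_distributions
		])
	null_stats = -2*np.sum(np.log(sampled),axis=0)
	return np.mean(null_stats>=observed_stat)

The module’s combine function follows the same principle, but additionally supports several combining methods and weights and provides an error estimate of the result.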

Complements

In several cases, this module uses the complement q of a p value. For example, combining methods such as Pearson’s or Mudholkar’s and George’s use it as part of their statistics. For continuous tests, this complement is straightforwardly computed as \(q = 1-p\). However, for discrete tests this leads to implausible results, in particular if \(p=1\). To avoid this, this module uses for q the probability of observing such a p value or a higher one. In analogy to \(\text{CDF}(p_0) = P(p≤p_0) = p_0\), we have \(\text{CCDF}(p_0) = P(p≥p_0) = q\) (both under the null hypothesis). This applies whenever the complement of a p value is relevant.
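
For illustration (again, a sketch rather than the module’s code), the complement of an observed p value can be computed from the list of possible p values: since \(\text{CDF}(p_0) = p_0\) at every possible value, q is 1 minus the largest possible p value strictly below the observed one:

def complement(p,all_ps):
	# probability of observing this p value or a higher one under the null
	smaller = [ p0 for p0 in all_ps if p0<p ]
	return 1-max(smaller,default=0.0)

# one-sided sign test with five samples:
sign_test_ps = [1/32,3/16,1/2,13/16,31/32,1]
print( complement(  1,sign_test_ps) )  # 1/32 – whereas 1-p would give 0
print( complement(1/2,sign_test_ps) )  # 13/16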

A simple example

We have three pairs of independent datasets (A_1,A_2), (B_1,B_2), and (C_1,C_2). Our research hypothesis is that the values in the respective first dataset are lower. However, the datasets have different properties and thus cannot simply be pooled:

  • (A_1,A_2) is a paired dataset. Since we cannot assume any distribution, we want to apply the sign test.

  • (B_1,B_2) is an unpaired dataset (with unequal sizes). Again, we cannot assume a distribution. Hence we want to apply the Mann–Whitney U test.

  • (C_1,C_2) is a paired dataset; however, we can assume the differences to be normally distributed and thus apply the t test for two dependent samples.
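
To make the following snippets runnable, let’s define some hypothetical placeholder data with the properties described above (the actual numbers are arbitrary):

A_1, A_2 = [1,3,2,4,5], [2,5,3,6,9]              # paired
B_1, B_2 = [12,14,11], [15,13,17,16,18]          # unpaired, unequal sizes
C_1, C_2 = [0.5,0.7,0.6,0.8], [0.9,1.0,0.8,1.2]  # paired, differences roughly normal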

We start by performing the sign test on A_1 and A_2 and create a CTR from this:

from combine_pvalues_discrete import CTR, combine
result_A = CTR.sign_test(A_1,A_2,alternative="less")

result_A now contains the p value of the sign test as well as the null distribution of possible p values. It is ready to be combined with other test results, but we have to create those first. We do so by doing something very similar with B_1 and B_2 and the Mann–Whitney U test:

result_B = CTR.mann_whitney_u(B_1,B_2,alternative="less")

Finally, we perform the t test on C_1 and C_2. Since the t test is a continuous test, we do not need a special constructor to create a CTR, but can use the generic one with only the p value computed by an existing function, here scipy.stats.ttest_rel. Leaving the second argument empty specifies that the test is continuous:

from scipy.stats import ttest_rel
p_C = ttest_rel(C_1,C_2).pvalue
result_C = CTR(p_C)

After we have performed the individual tests, we can obtain the compound p value using combine (which we imported earlier).

print( combine([result_A,result_B,result_C]) )

An extensive example

In this example, we illustrate the benefits of this module by comparing it with other approaches to the same dataset, all of which yield wrong results or consume a lot of time.

Suppose we want to explore the effect of a drug on dogs. We expect our observable (by which we measure the effect of the drug) to be more affected by the dog breed than by the drug, e.g., because a poodle is generally weaker than a mastiff. Therefore we group the dogs by breed a priori, creating sub-datasets. Each breed group is further randomly split into a treatment group (T) and a control group (C). Since not all dogs complete the study, our sub-datasets become very inhomogeneous in sample size.

Our data looks like this:

data = [# control group  , treatment group
        ( [20,44,14,68]  , [73,22,80]    ), # Breed 1
        ( [60,68,46,45]  , [30]          ), # Breed 2
        ( [92,97,98]     , [84,89]       ), # …
        ( [14]           , [21,45,31,23] ),
        ( [24,58,0,24,33], [65,51,61]    ),
        ( [93,76,70,83]  , [84]          ),
        ( [10,2]         , [28,36,11]    ),
        ( [27]           , [38,58]       ),
        ( [8,13,37]      , [43,51]       ),
        ( [18]           , [12]          ),
]

Each pair represents one breed, with the first half being the control group and the second the treatment group. If our drug works as desired, the second half should exhibit higher values. Finally, let’s assume that due to the nature of our observable, we only want to use rank-based statistics.

Thus, we want to investigate the null hypothesis that, for each sub-dataset, both samples are from the same distribution (or, more precisely, that they fulfil the null hypothesis of the Mann–Whitney U test). The alternative hypothesis is that the first sample of each pair is from a distribution with a lower median.

First, suppose we discard our information on breeds and pool the control and treatment groups. We then apply the Mann–Whitney U test to the pooled samples. This way, we do not need to combine tests, but we lose statistical power, i.e., we increase the false-negative rate.

from scipy.stats import mannwhitneyu

pooled_Cs = [ c for C,T in data for c in C ]
pooled_Ts = [ t for C,T in data for t in T ]
print( mannwhitneyu(pooled_Cs,pooled_Ts,alternative="less") )
# MannwhitneyuResult(statistic=282.0, pvalue=0.30908071682819527)

Alternatively, we can summarise the samples in each pair by their median and use the sign test to compare the groups. Again, we discard information and thus lose statistical power:

# Summarizing data and sign test
import numpy as np
from combine_pvalues_discrete import sign_test

reduced_Cs = [ np.median(C) for C,T in data ]
reduced_Ts = [ np.median(T) for C,T in data ]
print( sign_test( reduced_Cs, reduced_Ts, alternative="less" ) )
# SignTestResult(pvalue=0.171875, not_tied=10, statistic=3)

To properly take into account all information, we have to apply the Mann–Whitney U test to each pair (breed) and then combine the p values. SciPy’s combine_pvalues allows us to do this, but it requires continuous tests. Since the Mann–Whitney U test does not fulfil this requirement, we will overestimate the combined p value:

from scipy.stats import combine_pvalues, mannwhitneyu

pvalues = [ mannwhitneyu(C,T,alternative="less").pvalue for C,T in data ]
statistic,pvalue = combine_pvalues(pvalues,method="fisher")
print(statistic,pvalue)
# 27.447712265267114 0.123131292229715

Finally, by using this module, we can take into account the discreteness of tests, obtaining a correct combined p value:

from combine_pvalues_discrete import CTR, combine

ctrs = [ CTR.mann_whitney_u(C,T,alternative="less") for C,T in data ]
print( combine(ctrs,method="fisher") )
# Combined_P_Value(pvalue=0.0014229998577000142, std=1.1920046440408576e-05)

Here, we use Fisher’s method since it simplifies the following demonstration. See Choosing a combining method for how this choice affects our results and their interpretation.

Checking the result

Let’s convince ourselves that the result of combine is actually correct. To this end, we first implement the statistic of Fisher’s method (as used by SciPy) for combining Mann–Whitney U tests. Note how the result agrees with that of applying combine_pvalues above:

def fisher_statistic(dataset):
	pvalues = [ mannwhitneyu(C,T,alternative="less").pvalue for C,T in dataset ]
	return -2*np.sum(np.log(pvalues))

data_statistic = fisher_statistic(data)
print(data_statistic)
# 27.447712265267114

Next, we implement a function that samples analogous datasets corresponding to our null hypothesis (surrogates). (Since we only care about the order of samples, we do not have to recreate the magnitude of values.)

rng = np.random.default_rng()
def null_sample(data):
	return [
		( rng.random(len(C)), rng.random(len(T)) )
		for C,T in data
	]

Finally, we sample n=10000 times from our null model and estimate the p value by comparing the values of Fisher’s statistic for the null model and the original data.

n = 10000
null_statistic = [ fisher_statistic(null_sample(data)) for _ in range(n) ]
count = np.sum( null_statistic >= data_statistic )
p_from_simulation = (count+1)/(n+1)
print(p_from_simulation)
# 0.0016998300169983002

This confirms the low p value we obtained with combine above and that the p values obtained with the other methods were too high. You may note that, although this value is low, it is not within the confidence interval of the result of combine from above. The reason for this is that n and thus count are low (in the above, count is just 16). Thus, the p value estimate from the null model is subject to much higher fluctuations than that of combine. Obtaining a precision comparable to combine would require an excessive amount of time.

Once more with weights

So far, we have treated the p values for each sub-dataset equally. However, the sub-datasets differ considerably in size and thus in informativeness, e.g., the first contains seven data points while the last contains only two. We can account for this by weighting the sub-datasets with their degrees of freedom (number of samples minus one), which are stored with the test results for this purpose:

print( combine(ctrs,method="fisher",weights="dof") )
# Combined_P_Value(pvalue=0.0011117998888200112, std=1.0537855099673787e-05)

And like above, we can check this result by comparing to an explicit, computationally expensive simulation of the null hypothesis:

def weighted_fisher_statistic(dataset,weights):
	pvalues = [ mannwhitneyu(C,T,alternative="less").pvalue for C,T in dataset ]
	return -2*weights.dot(np.log(pvalues))

weights = np.array([len(C)+len(T)-1 for C,T in data])
weighted_data_statistic = weighted_fisher_statistic(data,weights)
weighted_null_statistic = [
		weighted_fisher_statistic(null_sample(data),weights)
		for _ in range(n)
	]
weighted_count = np.sum( weighted_null_statistic >= weighted_data_statistic )
weighted_p_from_simulation = (weighted_count+1)/(n+1)
print(weighted_p_from_simulation)
# 0.0010998900109989002

Implementing your own test

If you want to analyse a given dataset with a test that this module does not provide, you need to determine two things:

  • The p value of the test applied to your dataset.

  • A list of all possible p values that your test can yield for datasets with the same sample size(s).

You can use these as arguments of CTR’s default constructor.

The best way to find all possible p values is to get a rough understanding of the test statistic and to look into an existing implementation of the test, so you don’t have to fully re-invent the wheel.

If your test is continuous, you do not need to implement anything, but can just use CTR(p), where p is the p value of your individual test.

Note that individual tests should always be one-sided for the following reason: If you have two equally significant but opposing sub-results, they should not add in effect but cancel each other. This is not possible when you use two-sided sub-tests, since all information on the directionality of a result is lost. You can still obtain a two-sided result with combine, which accounts for trends in either direction as long as they are consistent across all datasets.

Example: Mann–Whitney U test

Suppose we want to implement the Mann–Whitney U test (with the alternative “less”) into the CTR framework. (This test is already implemented, so we don’t actually need to do this.)

The test is named for its statistic U, which can exhibit values between 0 and \(n·m\), where \(n\) and \(m\) are the sizes of the datasets being compared. These U values are then translated into p values. If we look into SciPy’s implementation of this test, we find that it uses an implementation of the test’s null distribution (scipy.stats._mannwhitneyu._mwu_state), which we can exploit. Finally, we can use SciPy’s implementation to compute the p value for our given datasets.

With that we have all the ingredients to write a small function to return a combinable test result for a given pair of datasets:

from combine_pvalues_discrete import CTR
from scipy.stats._mannwhitneyu import _mwu_state, mannwhitneyu

def mwu_ctr(x,y):
	p = mannwhitneyu(x,y,method="exact",alternative="less").pvalue
	n,m = len(x),len(y)
	possible_ps = [ _mwu_state.cdf(U,n,m) for U in range(n*m+1) ]
	return CTR( p, possible_ps )

Supported combining methods

This module supports the following combining methods [Heard2018]. They are listed together with their test statistics – with \(p_i\) being the p values of the individual tests and \(q_i\) being their complements (see Complements):

  • Fisher: \(\prod\limits_i p_i\)

  • Pearson: \(\prod\limits_i q_i^{-1}\)

  • Mudholkar and George (default): \(\prod\limits_i \frac{p_i}{q_i}\)

  • Edgington: \(\sum\limits_i p_i\) and Edgington symmetrised: \(\sum\limits_i (p_i+1-q_i)\) (see below as to why you may want this)

  • Stouffer: \(\sum\limits_i \Phi^{-1} (p_i)\), where \(\Phi\) is the CDF of the standard normal distribution.

  • Simes: \(\min\limits_i \left( \frac{p_i}{R(p_i)} \right)\), where \(R(p_i)\) is the rank of \(p_i\) amongst all combined p values [Simes1986].

  • Tippett: \(\min\limits_i(p_i)\)

Weighted variants exist for all but the latter two (for which they do not make sense).

Note that we use different, but equivalent statistics internally for numerical reasons. In particular, we transform products to sums in logarithmic space to avoid numerical underflows.
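
To see why this matters, consider Fisher’s statistic for many small p values (a tiny demonstration, unrelated to the module’s API):

import numpy as np

ps = np.full(300,1e-3)
print( np.prod(ps) )         # 0.0 – the product underflows
print( np.sum(np.log(ps)) )  # ≈ -2072.3 – the equivalent log-space statistic is fine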

Choosing a combining method

The choice of Mudholkar’s and George’s method as the default is based on the assumption that the research hypothesis is that all datasets are subject to the same trend. In An extensive example, this corresponds to the drug being beneficial to dogs in general.

The trend may manifest more clearly in some of the datasets (and you don’t know which a priori), but it should not be inverted (other than by chance). In this case, you would perform one-sided sub-tests. (If you would consider a trend in either direction a finding, the combination needs to be two-sided, not the sub-tests.)

If the p value of such a sub-test is small, the sub-dataset exhibits the trend you hypothesised. Conversely, if the complement \(q ≈ 1-p\) of a sub-test is small, the sub-dataset exhibits a trend opposite to what you hypothesised – with a p value q. (See Complements on how q is defined for the purposes of this module.) I think that the combined p values should reflect this, i.e., the complement q should indicate the significance of the opposite one-sided hypothesis (not to be confused with the null hypothesis) just like the p value indicates the significance of the research hypothesis.

To achieve this, the combining method must treat p and q in a symmetrical fashion. This also means that the following results exactly negate each other:

  • a sub-test with \(p=p_0\).

  • a sub-test with \(q=p_0\), i.e., \(p≈1-p_0\).

Of the supported methods, only three fulfil this:

  • Stouffer’s method. However, its statistic becomes infinite if \(p=1\) for any sub-test and thus the method cannot distinguish between this happening for one or for almost all tests.

  • Edgington’s symmetrised method. However, this does not give extreme p values the emphasis they deserve (in my humble opinion), e.g., a p value changing from 0.1 to 0.001 has the same effect as one changing from 0.5 to 0.401.

  • Mudholkar’s and George’s method. This one puts emphasis on extreme p values, i.e., close to 0 or 1.

I therefore prefer the latter in this case.
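
As a quick numerical illustration of this symmetry for Mudholkar’s and George’s statistic in log space, \(\sum_i \log\frac{p_i}{q_i}\): a sub-result with \(p=p_0, q=q_0\) and one with the roles swapped contribute terms that cancel:

import numpy as np

p0, q0 = 0.02, 0.98
print( np.log(p0/q0) + np.log(q0/p0) )  # ≈ 0 – the contributions cancel (up to rounding)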

This changes if your research hypothesis is that some datasets exhibit a given trend. In the dog example, this corresponds to the research hypothesis that the drug is beneficial to some dog breeds, while it may be harmful to others. In this case, a method emphasising low p values is more appropriate, e.g., Fisher’s.

For other research hypotheses, you have yet other considerations and appropriate methods. Also see [Adcock1960] for a discussion of this.

Supported Tests

Currently, this module supports:

  • the sign test and Wilcoxon’s signed rank test,

  • the Mann–Whitney U test,

  • Fisher’s and Boschloo’s exact tests,

  • Spearman’s ρ and Kendall’s τ,

  • all continuous tests (just use CTR(p)).

Ties are not supported in every case. If you require any further test or support for ties, please tell me. Two-sided tests are not supported because I cannot imagine a combination scenario where they make sense.

Command reference

class CTR(p, all_ps=None, dof=None)

CTR = combinable test result

Represents a single test result. Use the default constructor to implement a test yourself or use one of the class methods for the respective test.

Parameters:
p

The p value yielded by the test for the investigated sub-dataset.

all_ps

An iterable containing all possible p values of the test for datasets with the same size as the dataset for this individual test. If None or empty, all p values will be considered possible, i.e., the test will be assumed to be continuous.

dof

Number of degrees of freedom that affect the test’s outcome; only relevant for automatic weighting (see the weights argument of combine).

Examples

p = 0.1
possible_ps = [0.1,0.5,1]
dof = 7
ctr = CTR(p,possible_ps,dof)
classmethod mann_whitney_u(x, y, **kwargs)

Creates an object representing the result of a single Mann–Whitney U test (using SciPy’s mannwhitneyu).

Ties are not supported yet because I expect them not to occur in the scenarios that require test combinations (but I may be wrong about this) and they make things much more complicated.

For automatic weighting, the number of degrees of freedom is taken to be len(x) + len(y) - 1. Note how this aligns with the sign test and Wilcoxon’s signed rank test when either x or y has size 1.

Parameters:
x,y

The two iterables of samples to compare.

kwargs

Further keyword arguments to be passed on to SciPy’s mannwhitneyu, such as alternative.

Examples

dataset_A = [20,44,14,68]
dataset_B = [73,22,80]
ctr = CTR.mann_whitney_u(dataset_A,dataset_B,alternative="less")
classmethod sign_test(x, y=0, alternative='less')

Creates an object representing the result of a single sign test.

For automatic weighting, the number of pairs is taken as the number of degrees of freedom.

Parameters:
x,y

The two arrays of paired samples to compare. If y is a number, a one-sample sign test is performed with y as the median. With y as an iterable, the two-sample test is performed.

alternative: “less” or “greater”

Examples

dataset_A = [20,44,14,68]
dataset_B = [73,22,80,53]
ctr = CTR.sign_test(dataset_A,dataset_B,alternative="less")
classmethod wilcoxon_signed_rank(x, y=None, alternative='less')

Creates an object representing the result of a single Wilcoxon signed-rank test.

For automatic weighting, the number of pairs is taken as the number of degrees of freedom.

Parameters:
x,y

The two arrays of paired samples to compare. If y is None, the one-sample test is performed, otherwise the two-sample one.

alternative: “less” or “greater”

Examples

dataset_A = [20,44,14,68]
dataset_B = [73,22,80,53]
ctr = CTR.wilcoxon_signed_rank(dataset_A,dataset_B,alternative="less")
classmethod spearmanr(x, y, alternative='greater', n_thresh=9)

Creates an object representing the result of a single Spearman’s ρ test. If the size of x and y, n, is smaller than n_thresh, p values are exactly determined using a permutation test. Otherwise p values are computed using SciPy’s spearmanr assuming a uniform distribution of p values and ensuring \(p≥\frac{1}{n!}\).

For automatic weighting, the number of degrees of freedom is taken to be one less than the number of pairs.

Parameters:
x,y

The two arrays of samples to correlate.

alternative: “greater” or “less”
n_thresh:

Threshold under which a permutation test is used.

Examples

dataset_A = [1,3,4,2,5,6]
dataset_B = [3,1,2,5,6,4]
ctr = CTR.spearmanr(dataset_A,dataset_B,alternative="greater")
classmethod kendalltau(x, y, **kwargs)

Creates an object representing the result of a single Kendall’s τ test using SciPy’s kendalltau to compute p values.

NaNs and ties are not supported yet.

For automatic weighting, the number of degrees of freedom is taken to be one less than the number of pairs.

Parameters:
x,y

The two arrays of samples to correlate.

alternative: “greater” or “less”

Examples

dataset_A = [1,3,4,2,5,6]
dataset_B = [3,1,2,5,6,4]
ctr = CTR.kendalltau(dataset_A,dataset_B,alternative="greater")
classmethod fisher_exact(C, alternative='less')

Creates an object representing the result of Fisher’s exact test for a single contingency table C. This is unrelated to Fisher’s method of combining p values. Note that, for most scientific applications, the restrictive conditions of this test are not met and Boschloo’s exact test is more appropriate.

For automatic weighting, the sum over C is taken as the number of degrees of freedom.

Parameters:
C: 2×2 array or nested iterable

The contingency table.

alternative: “less” or “greater”

Examples

contingency_table = [[10,4],[5,4]]
ctr = CTR.fisher_exact(contingency_table,alternative="greater")
classmethod boschloo_exact(C, alternative='less', n=32, atol=1e-10)

Creates an object representing the result of Boschloo’s exact test for a single contingency table C using SciPy’s implementation.

For automatic weighting, the sum over C is taken as the number of degrees of freedom.

Parameters:
C: 2×2 array or nested iterable

The contingency table.

alternative: “less” or “greater”
n

The same parameter of SciPy’s boschloo_exact.

atol

p values that are closer than this are treated as identical.

Examples

contingency_table = [[10,4],[5,4]]
ctr = CTR.boschloo_exact(contingency_table,alternative="greater")
combine(ctrs, weights=None, method='mudholkar_george', alternative='less', n_samples=10000000, sampling_method='proportional', rtol=1e-15, atol=1e-15, RNG=None)

Estimates the combined p value of combinable test results. Usually, this result is why you are using this module.

Parameters:
ctrs: iterable of CTRs

The test results that shall be combined.

method: string or function

One of “fisher”, “pearson”, “mudholkar_george”, “stouffer”, “tippett”, “edgington”, “edgington_sym”, “simes”, or a self-defined function.

In the latter case, the function can have the following arguments (which must be named as given):

  • A two-dimensional array p containing the p values.

  • A two-dimensional array q containing their complements.

  • A one-dimensional array w containing the weights.

The function must return the statistic computed along the zeroth axis. For example, for the weighted Mudholkar–George method, this function would be lambda p,q,w: w.dot(np.log(p/q)). The sign of the statistic must be such that low values indicate high significance.

alternative: “less”, “greater”, or “two-sided”

The direction of the (common) trend that your compound null hypothesis is testing against. Mind that this is not about the sidedness of the individual tests: Those should always be one-sided.

  • If “less”, the compound research hypothesis is that the subtests exhibit a trend towards a low p value.

  • If “greater”, the compound research hypothesis is that the subtests exhibit a trend towards high p values (close to 1). In this case, the method of choice will be applied to the complements of the p values (see Complements).

  • If “two-sided”, the compound research hypothesis is that the subtests exhibit either of the two above trends. Beware that this is not necessarily the same as just doubling the p value of the respective one-sided test, since for some combining methods, a compound dataset may exhibit both trends.

weights: iterable of numbers or “dof”

Weights for individual results. Does not work for minimum-based methods (Tippett and Simes).

If "dof", each test will be weighted with its degrees of freedom, i.e., the number of samples that can independently affect the test’s result. As there is some potential for ambiguity as to what the degrees of freedom of a given test are, please check that they are what you expect by looking at the tests’ documentation or the attribute dof of a CombinableTestResult. This particularly applies when combining different tests.

n_samples

Number of samples used for Monte Carlo simulation. High numbers increase the accuracy, but also the runtime and memory requirements.

rtol: non-negative float
atol: non-negative float

Values of the statistics closer than specified by atol and rtol are regarded as identical (as in numpy.isclose). A small value (such as the default) may improve the results if numerical noise makes values different.

RNG

NumPy random-number generator used for the Monte Carlo simulation. If None, it will be automatically generated.

sampling_method: “proportional” or “stochastic”

How to sample from the p value distributions of the subtests.

If "proportional", the frequency of p values for each individual result will be exactly proportional to its probability – except for rounding. Only the rounding and the order of elements will be random.

If "stochastic", the values will be randomly sampled and thus their sampled frequencies are subject to stochastic fluctuations. This usually leads to slightly less accurate results, but the simulations are statistically independent.

The author of these lines cannot think of any disadvantage to the first approach and has not found any in numerical experiments.

Returns:
pvalue

The estimated combined p value.

std

The estimated standard deviation of p values when repeating the sampling. This is accurate for stochastic sampling and an overestimate for proportional sampling.

Examples

ctrs = [
        CTR.sign_test( [24,58,10], [65,51,61], alternative="less" ),
        CTR.mann_whitney_u( [20,44,14,68], [73,22,80], alternative="less" ),
        CTR( ttest_ind( [28,36,11], [93,76,70,83], alternative="less" ).pvalue ),
]
compound_p = combine(ctrs,alternative="less").pvalue
direction(ctrs, weights=None, method='mudholkar_george')

A service function to indicate whether the ctrs are trending towards high or low p values.

If you are combining two-sidedly, this tells you the direction of the strongest trend, whether it’s significant or not (that’s what combine is for). Beware that for some methods such as Fisher’s, a compound dataset may exhibit a significant trend in both directions and this function won’t tell you. This cannot happen for symmetric methods (Mudholkar–George, Stouffer, and symmetrised Edgington).

Parameters:
weights and method: as for combine.
Returns:
direction: string

One of “less”, “greater” or “equal”. The latter is only returned if the statistics of the combining method are exactly equal in either direction.

Examples

ctrs = [
        CTR.sign_test( [24,58,10], [65,51,61], alternative="less" ),
        CTR.mann_whitney_u( [20,44,14,68], [73,22,80], alternative="less" ),
        CTR( ttest_ind( [28,36,11], [93,76,70,83], alternative="less" ).pvalue ),
]
trend = direction(ctrs)
sign_test(x, y=0, alternative='less')

Just the sign test without any combination features, provided because I have it anyway.

Two-sample: Pass paired samples x and y as arguments. The tested null hypothesis is that x[i] and y[i] are from the same distribution (separately for each i).

One-sample: Pass a single sample x and a number y. The tested null hypothesis is that x is sampled from a distribution with a median larger than y.

Returns a named tuple consisting of the p value, the number of non-tied samples, and the statistic of the sign test (number of samples which are greater than the reference).

Examples

dataset_A = [20,44,14,68]
dataset_B = [73,22,80,53]
p = sign_test(dataset_A,dataset_B,alternative="less").pvalue

References

[Adcock1960]
    C. J. Adcock: A note on combining probabilities, Psychometrika 25, pp. 303–305 (1960), 10.1007/BF02289734

[Kincaid1962]
    W. M. Kincaid: The combination of tests based on discrete distributions, Journal of the American Statistical Association 57 (297), pp. 10–19 (1962), 10.1080/01621459.1962.10482147

[Simes1986]
    R. J. Simes: An improved Bonferroni procedure for multiple tests of significance, Biometrika 73 (3), pp. 751–754 (1986), 10.1093/biomet/73.3.751

[Mielke2004]
    P. W. Mielke, J. E. Johnston, and K. J. Berry: Combining probability values from independent permutation tests: A discrete analog of Fisher’s classical method, Psychological Reports 95 (2), pp. 449–458 (2004), 10.2466/pr0.95.2.449-458

[Heard2018]
    N. A. Heard and P. Rubin-Delanchy: Choosing between methods of combining p-values, Biometrika 105 (1), pp. 239–246 (2018), 10.1093/biomet/asx076