Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: random: Add the multivariate hypergeometric distribution. #13794

Merged
merged 2 commits into from Oct 20, 2019

Conversation

WarrenWeckesser
Copy link
Member

@WarrenWeckesser WarrenWeckesser commented Jun 17, 2019

The new method

multivariate_hypergeometric(self, object colors, object nsample,
                            size=None, method='marginals')

of the class numpy.random.Generator implements the multivariate
hypergeometric distribution; see
https://en.wikipedia.org/wiki/Hypergeometric_distribution,
specifically the section "Multivariate hypergeometric distribution".

Two algorithms are implemented. The user selects which algorithm
to use with the method parameter. The default, method='marginals',
is based on repeated calls of the univariate hypergeometric
distribution function. The other algorithm, selected with
method='count', is a brute-force method that allocates an
internal array of length sum(colors). It should only be used
when that value is small, but it can be much faster than the
"marginals" algorithm in that case.

The C implementations of the two methods are in the files
random_mvhg_count.c and random_mvhg_marginals.c in
numpy/random/src/distributions.

The new file numpy/random/src/distributions/util.c contains the
function safe_sum_nonneg_int64() for summing nonnegative int64
values while checking for overflow.

@WarrenWeckesser
Copy link
Member Author

This is the updated version of #8056.

@WarrenWeckesser
Copy link
Member Author

WarrenWeckesser commented Jun 17, 2019

The refguide check doesn't like the blank line in the output of this example in the docstring:

    >>> gen.multivariate_hypergeometric(colors, 6, size=(2, 2))
    array([[[5, 1, 0],
            [4, 2, 0]],

           [[2, 2, 2],
            [4, 2, 0]]])

The error reported in the refguide check on the azure macOS build is

File "build/testenv/lib/python3.6/site-packages/numpy/random/generator.cpython-36m-darwin.so", line ?, in Generator.multivariate_hypergeometric
Failed example:
    gen.multivariate_hypergeometric(colors, 6, size=(2, 2))
Expected:
    array([[[5, 1, 0],
            [4, 2, 0]],
Got:
    array([[[5, 1, 0],
            [4, 2, 0]],
    <BLANKLINE>
           [[2, 2, 2],
            [4, 2, 0]]])

It looks like the checker ignores the part of the array that occurs after the blank line.

Is this is a known issue? Is there a work-around?


EDIT: Worked around: I removed the blank line from the docstring.

.lgtm.yml Outdated Show resolved Hide resolved
@WarrenWeckesser
Copy link
Member Author

LGTM C/C++ passed. 🎉 🎆 😀 I don't know if that is just luck or if the LGTM folks have fixed something. Anyway, it's nice to see the green check mark.

@WarrenWeckesser WarrenWeckesser force-pushed the new-mvhg branch 2 times, most recently from e350e65 to 570ea89 Compare June 17, 2019 22:59
@WarrenWeckesser WarrenWeckesser changed the title WIP/ENH: random: Add the multivariate hypergeometric distribution. ENH: random: Add the multivariate hypergeometric distribution. Jun 17, 2019
@WarrenWeckesser WarrenWeckesser changed the title ENH: random: Add the multivariate hypergeometric distribution. WIP/ENH: random: Add the multivariate hypergeometric distribution. Jun 17, 2019
@WarrenWeckesser WarrenWeckesser force-pushed the new-mvhg branch 2 times, most recently from 07ec3c3 to f0753d7 Compare June 19, 2019 02:08
@WarrenWeckesser
Copy link
Member Author

I've run some validation scripts for the multivariate hypergeometric distribution. For a given set of parameters and sample size, the basic calculation is to compute the expected counts for each point in the support of the distribution, and then do a G-test to compare that to the observed counts in a sample of that size. Under the usual statistical assumptions of the G-test, we expect the p-value to be uniformly distributed on [0, 1] if multivariate_hypergeometric is correctly generating variates from the distribution.

I put the scripts in my mvhg_validation repository. The two main scripts are "check1.py" and "check2.py". "check1.py" is for parameters where the both sum(colors) and nsample are small. The script selects the sample size so that the smallest expected value of any point in the support is at least 100. (If this script is run with large parameters, that calculation could easily require a sample size of a gazillion or more.)

"check2.py" can handle bigger input values (but the inputs still can't be too big if you want the calculation to complete in a reasonable amount of time). The script will aggregate points in the support where the expected value is less than 100 by grouping points such that the sum of the expected value in each group is at least 100. I've been running "check2.py" so that it generates one to three thousand p-values. The script plots a histogram of the p-values. A visual inspection can confirm that the distribution of the p-values looks uniform.

Here's a typical output printed by "check1.py":

=== method: marginals ===

colors: [10, 90]   nsample: 40 size: 2042128   min expected count: 100.00002039706943
p-value: 0.76614134883673501696
p-value: 0.80032715886854746921
p-value: 0.28646592404637900574

colors: [5, 10, 10]   nsample: 12 size: 52003000   min expected count: 100.0
p-value: 0.47885957645276647213
p-value: 0.15488694832529717484
p-value: 0.90967770179322239413

colors: [40, 6, 5]   nsample: 19 size: 63012380   min expected count: 100.00000017133492
p-value: 0.35531305116108942806
p-value: 0.74187174727261006604
p-value: 0.21760603383501619266

colors: [40, 6, 5]   nsample: 30 size: 13502653   min expected count: 100.00000122932808
p-value: 0.34773389687467755291
p-value: 0.8784454513961536854
p-value: 0.66566849746858336538

colors: [3, 4, 5, 6]   nsample: 8 size: 4375800   min expected count: 100.0
p-value: 0.62493583602354243111
p-value: 0.46977151593588086797
p-value: 0.12208932027017217148

colors: [2, 3, 3, 3, 3, 4, 5]   nsample: 13 size: 114406600   min expected count: 100.0
p-value: 0.40644759025965258828
p-value: 0.13206413901661218496
p-value: 0.88757976584140347645

=== method: count ===

colors: [10, 90]   nsample: 40 size: 2042128   min expected count: 100.00002039706943
p-value: 0.7924735267779910425
p-value: 0.10806047197238961777
p-value: 0.6017646563892657599

colors: [5, 10, 10]   nsample: 12 size: 52003000   min expected count: 100.0
p-value: 0.68786055960586100563
p-value: 0.6552933632895854508
p-value: 0.33080547907126165367

colors: [40, 6, 5]   nsample: 19 size: 63012380   min expected count: 100.00000017133492
p-value: 0.83923738455884193607
p-value: 0.71541951382572945313
p-value: 0.15817358197836481528

colors: [40, 6, 5]   nsample: 30 size: 13502653   min expected count: 100.00000122932808
p-value: 0.040057685646735838088
p-value: 0.75932942773257168664
p-value: 0.62433805773198801097

colors: [3, 4, 5, 6]   nsample: 8 size: 4375800   min expected count: 100.0
p-value: 0.41021088926200340939
p-value: 0.50847417267434287225
p-value: 0.83636688795034767405

colors: [2, 3, 3, 3, 3, 4, 5]   nsample: 13 size: 114406600   min expected count: 100.0
p-value: 0.39852057363014235066
p-value: 0.51818919493381321297
p-value: 0.11866139396718165838

And here are several plots generated by "check2.py":

histogram2

histogram4

histogram3

histogram12341234-marginals

@WarrenWeckesser
Copy link
Member Author

WarrenWeckesser commented Jun 21, 2019

One more p-value histogram, using method='count', with len(colors) = 11, sum(colors) = 24 and nsample=12. Each trial used 270415600 variates, and 2000 p-values were computed to generate the histogram. The large value for size meant that no aggregation was required for the G-test; the expected count for each value in the support was at least 100. (Note: it is just a coincidence that 100 also happens to be the expected size of the bars in the histrogram. I computed 2000 p-values and then created the histogram with 20 bins. This value is the dashed line in the plot.)

histogram

@WarrenWeckesser WarrenWeckesser changed the title WIP/ENH: random: Add the multivariate hypergeometric distribution. ENH: random: Add the multivariate hypergeometric distribution. Jun 21, 2019
@WarrenWeckesser
Copy link
Member Author

This is ready for review. (Reminder: this is the updated version of #8056.)

@WarrenWeckesser
Copy link
Member Author

WarrenWeckesser commented Jun 24, 2019

A friendly request: could someone remove the "WIP" tag from this pull request? It is ready for review.

I know that the folks who would likely review this are focused on finishing the randomgen update, so I don't expect a review to happen soon, but I want to be sure that someone who does find time doesn't pass over this one because of the "WIP" tag.

@seberg seberg removed the 25 - WIP label Jun 24, 2019
@WarrenWeckesser
Copy link
Member Author

Thanks @seberg!

@mattip
Copy link
Member

mattip commented Jul 5, 2019

@WarrenWeckesser there is a merge conflict.

@bashtage, @rkern any opinions?

@WarrenWeckesser
Copy link
Member Author

The pull request includes two methods for generating a sample:

  • The "count" method generates an array containing the full population defined by colors (i.e. if colors is [2, 5, 3], the method generates a temporary array containing [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]). It then does a partial Fisher-Yates shuffle to create a random selection of length nsample from that array, and then counts the occurrences of the items in the sample.

  • The "marginals" method uses the hypergeometric distribution iteratively to generate the sample.

It is not in the pull request, but I also experimented with a third implementation that used Floyd's algorithm:

  • select nsample integers from range(0, sum(colors)) without replacement;
  • map each of those integers to its color (binary search on cumsum(colors)), and increment the color's count.

It avoided creating the (potentially big) array that is created by the "count" method. I used the C library khash for the hash table. This third method is not in the pull request because, in all my tests, either the "count" method or the "marginals" method was faster.

After seeing the pull request "MAINT: Rewrite Floyd algorithm" (#13812), I'm inspired to experiment a bit more with the hash table method. I only timed the full process of generating samples, so I don't know if swapping out khash for the simple hash table with linear probing would make a signifcant difference.

rkern
rkern previously requested changes Jul 6, 2019
Copy link
Member

@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code generally LGTM. master needs to be merged in and the tests adapted to the current style.

Whether this goes into 1.17, I leave up to @charris.

@mattip
Copy link
Member

mattip commented Jul 14, 2019

@WarrenWeckesser ping

@mattip
Copy link
Member

mattip commented Aug 8, 2019

It is a shame this nice addition didn't make it into 1.17. @WarrenWeckesser will you be picking it up again?

@WarrenWeckesser
Copy link
Member Author

Yes, I will get back to this in a few weeks, if not sooner.

@WarrenWeckesser
Copy link
Member Author

Rebased and updated.

@WarrenWeckesser
Copy link
Member Author

@rkern's last comment was

The code generally LGTM. master needs to be merged in and the tests adapted to the current style.

Whether this goes into 1.17, I leave up to @charris.

I have since rebased and updated the tests.

For anyone taking a fresh look at this PR, the essential additions are:

  • The C functions random_mvhg_count and random_mvhg_marginals implement two algorithms for generating random variates from the multivariate hypergeometric distribution. They take the same arguments, the first of which is a bitgen_t pointer. The implementations are in the files (respectively):
    • numpy/random/src/distributions/random_mvhg_count.c
    • numpy/random/src/distributions/random_mvhg_marginals.c
  • The method multivariate_hypergeometric is added to the Generator class. The method has a method argument that lets the user choose which algorithm to use. The implementation is in
    • numpy/random/generator.pyx

All the other changes are to support the above, add tests, or update the documentation.

@mattip
Copy link
Member

mattip commented Oct 12, 2019

xref #14608, since both that and this should be merged before 1.18, and both touch generator.pyx

@mattip mattip added this to the 1.18.0 release milestone Oct 12, 2019
The new method

  multivariate_hypergeometric(self, object colors, object nsample,
                              size=None, method='marginals')

of the class numpy.random.Generator implements the multivariate
hypergeometric distribution; see
  https://en.wikipedia.org/wiki/Hypergeometric_distribution,
specifically the section "Multivariate hypergeometric distribution".

Two algorithms are implemented.  The user selects which algorithm
to use with the `method` parameter. The default, `method='marginals'`,
is based on repeated calls of the univariate hypergeometric
distribution function.  The other algorithm, selected with
`method='count'`, is a brute-force method that allocates an
internal array of length ``sum(colors)``.  It should only be used
when that value is small, but it can be much faster than the
"marginals" algorithm in that case.

The C implementations of the two methods are in the files
random_mvhg_count.c and random_mvhg_marginals.c in
numpy/random/src/distributions.
@WarrenWeckesser
Copy link
Member Author

I updated the PR to match the changes in #14608.

I don't know why the Azure tests didn't run.

@mattip mattip dismissed rkern’s stale review October 18, 2019 09:05

rebased and changed as requested

@mattip mattip closed this Oct 18, 2019
@mattip mattip reopened this Oct 18, 2019
@mattip
Copy link
Member

mattip commented Oct 18, 2019

no idea why azure is not running :(

@mattip mattip closed this Oct 18, 2019
@mattip mattip reopened this Oct 18, 2019
@mattip mattip merged commit 2aa3bba into numpy:master Oct 20, 2019
@mattip
Copy link
Member

mattip commented Oct 20, 2019

Thanks @WarrenWeckesser

@WarrenWeckesser
Copy link
Member Author

Thanks @mattip & @rkern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants