Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: add statistical tests between histograms #503

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

ikrommyd
Copy link

I'm starting an early PR to add statistical tests to histograms after this feature was requested in #500 and I actually believe it is a good and useful feature.
This is an early WIP PR but I'm starting it now in order to make it publicly visible and ask for feedback on the statistics.

@ikrommyd ikrommyd marked this pull request as draft June 10, 2023 22:43
@ikrommyd
Copy link
Author

I think so far all 4 implemented tests have a decent core functionality. I'm saying this by running toys and looking at the pvalue distributions because they should be uniform(0, 1) under the null hypothesis.
Next steps are to add the option to consider flow bins in the tests and I also think there is a way to take bin errors into consideration when peforming Kolmogorov tests.

@ikrommyd
Copy link
Author

@nsmith- I don't think having an option to consider flow bins in ks_1samp tests makes sense right?
Cause like where exactly in the flow bins are you gonna compare the empirical CDF with an actual CDF function. The flow bins technically extend to infinity.

@henryiii
Copy link
Member

I'm thinking having flow=True doesn't make ever make sense, how about we just remove that argument for all of them?

@ikrommyd
Copy link
Author

I'm thinking having flow=True doesn't make ever make sense, how about we just remove that argument for all of them?

Hmmm, for chi2 tests it does. You are including the observed number of counts in the flow bins in your tests statistic and comparing vs the expected number of counts (in one sample tests) or with the observed number of counts of another histogram (in two sample tests).
For Kolmogorov tests it's iffy. It doesn't make sense to me at all for one sample, and for two sample it's kinda weird. You can see I did some trick there where I sent the flow bin centers to infinity. It is included in ROOT though for two sample tests betwen TH1s but we can remove that.

@ikrommyd
Copy link
Author

By the way, pre-commit is complaining for "Too many public methods (23/20)" in basehist.

@fabriceMUKARAGE fabriceMUKARAGE marked this pull request as ready for review June 22, 2023 21:46
@ikrommyd
Copy link
Author

@fabriceMUKARAGE Why did you change it from draft?

@fabriceMUKARAGE
Copy link
Collaborator

@iasonkrom. oooh, that was a mistake I made when checking the unsuccessful checks. My bad, can we bring it back to the draft?

@ikrommyd ikrommyd marked this pull request as draft June 23, 2023 01:04
@ikrommyd
Copy link
Author

Yup it's back to draft. I was just scared about an accidental merge. Don't worry about failing pytests though. I'm changing things in the statistics and therefore the tests fail if I don't update the values there. But the "real" tests are ran offline where I run toy models to see if my statistics are correct. Proper pytests will be written later since I'm tired of chaging them all the time to pass

@fabriceMUKARAGE
Copy link
Collaborator

I see it's back to draft mode now. That's true, the proper pytests can be done later, thanks for the insights.

fabriceMUKARAGE and others added 3 commits June 27, 2023 21:41
Added pytest.importorskip("mplhep")
Added pytest.importorskip("mplhep")
)

if mode == "exact":
success, d, pvalue = _attempt_exact_2kssamp(n1, n2, g, d, alternative)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this really worries me. Using a hidden object is a really bad idea - see copier-org/copier#1225 for the most recent time I've seen someone broken by this! Is it possible to avoid this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean the function written by scipy, I was trying to avoid writing too many lines since after defining the test statistic from the histograms, performing the test is the same. I could write a similar function and have it here in hist but it would be mainly done by copying code from scipy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added the code in the stats.py file instead of importing from scipy. Do you think this is resolved?

@ikrommyd
Copy link
Author

@henryiii BTW, I was just busy with other things, I’ll get back to this PR soon. Sorry if I’m being slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants