
Failing health check #115

Closed
mattbarrett98 opened this issue Apr 21, 2022 · 18 comments

@mattbarrett98

E hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 6 valid examples in 1.04 seconds (14 invalid ones and 2 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters). E See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.too_slow to the suppress_health_check settings for this test.

When testing against JAX this error tends to pop up every now and then, is there any easy way to disable this health check?

@honno
Member

honno commented Apr 21, 2022

@mattbarrett98 Would you know which test case was this for? This is most likely a problem on our end!

@mattbarrett98
Author

It seems to pop up for a variety of functions. From memory I've seen it for test_reshape, test_inv, test_remainder and I think I'm probably forgetting a few. Would it be possible to add a flag to allow disabling of the health check? Thanks!

@honno
Member

honno commented Apr 21, 2022

Okay I just tried testing with jax and the test suite won't work with it anyway (missing namespaced dtypes e.g. jax.int16), so I'm guessing you're using a wrapper... or maybe I missed a development version of jax that implements the Array API?

Anywho, I can't figure out why say a test like test_reshape will have a health check problem (in Hypothesis terms, that's making sure the examples it generates are actually what we want to test with), so I'll close this for now—please feel free to re-open if you come up with reproducible steps!

Would it be possible to add a flag to allow disabling of the health check?

Not keen on this unless it becomes a common use case; for now see https://hypothesis.readthedocs.io/en/latest/settings.html (e.g. you could decorate the failing tests with @settings(suppress_health_check=[HealthCheck.too_slow])). More often than not, these health checks are important to deal with properly.
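A minimal sketch of that suggestion (hypothesis's suppress_health_check takes a collection of HealthCheck members; the test name and body here are illustrative):

```python
# Sketch (assumes hypothesis is installed; test name is illustrative):
# suppress only the too_slow health check on a single test, leaving
# every other health check active.
from hypothesis import HealthCheck, given, settings, strategies as st

@settings(suppress_health_check=[HealthCheck.too_slow])
@given(st.integers())
def test_example(x):
    assert isinstance(x, int)  # placeholder test body

test_example()
```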

@honno honno closed this as completed Apr 21, 2022
@djl11
Contributor

djl11 commented Apr 27, 2022

We're diving deeper into what the exact issue is right now. We'll try to provide a minimal example ASAP. For a bit more context, this is an example commit where the health check has failed, tested against this Array API test suite commit, with this unit test. A stack trace for the failure is as follows:

=================================== FAILURES ===================================
______________________________ test_matrix_power _______________________________

    @pytest.mark.xp_extension('linalg')
>   @given(
        # Generate any square matrix if n >= 0 but only invertible matrices if n < 0
        x=matrix_power_n.flatmap(lambda n: invertible_matrices() if n < 0 else
                                 xps.arrays(dtype=xps.floating_dtypes(),
                                            shape=square_matrix_shapes)),
        n=matrix_power_n,
    )
E   hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 7 valid examples in 2.15 seconds (0 invalid ones and 3 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).
E   See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.too_slow to the suppress_health_check settings for this test.

ivy/ivy_tests/test_array_api/array_api_tests/test_linalg.py:339: FailedHealthCheck
---------------------------------- Hypothesis ----------------------------------
You can add @seed(51313401586751272551885782834508242891) to this test or run pytest with --hypothesis-seed=51313401586751272551885782834508242891 to reproduce this failure.

Before continuing the discussion, we'll determine whether Ivy-specific inefficiencies are causing the failures, or perhaps JAX-specific slowdowns on the first forward pass during the JIT compilation or something similar. We'll sync back when we know a bit more.

@asmeurer
Member

The health checks are there for a reason, so it would be a bad idea to just ignore them. The most likely cause is that our test generation strategies are written poorly in some way. Hypothesis health checks are also very often a symptom of a much more serious problem, like a strategy that doesn't actually generate what we thought it did. So it's always a good idea to dig into this when it happens.

Of course, it could just be that the health check is tripping because the array library is slower than hypothesis expects. If we determine that's really all that's going on here, it might make sense to modify the hypothesis array_api submodule to be smarter about this.

It looks like this is the same test as in #117, which has some other problems (by all accounts it isn't generating what it should be), so there really is some more serious issue going on here.

@mattbarrett98
Author

Perhaps it was confusing to show you the health check for matrix_power, which has another issue. Here's another example, for reshape:

=================================== FAILURES ===================================

_________________________________ test_reshape _________________________________
    @given(
>       x=xps.arrays(dtype=xps.scalar_dtypes(), shape=hh.shapes(max_side=MAX_SIDE)),
        data=st.data(),
    )
E   hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 9 valid examples in 1.22 seconds (11 invalid ones and 4 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).
E   See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.too_slow to the suppress_health_check settings for this test.

ivy/ivy_tests/test_array_api/array_api_tests/test_manipulation_functions.py:252: FailedHealthCheck
---------------------------------- Hypothesis ----------------------------------
You can add @seed(339177177210368709403771675328146131636) to this test or run pytest with --hypothesis-seed=339177177210368709403771675328146131636 to reproduce this failure.

and we experience the same thing for other functions, always with JAX; we never see issues with NumPy, torch or tensorflow.

@honno
Member

honno commented Apr 28, 2022

My guess is the unhealthy generation comes from how xps.arrays() uses array creation/manipulation functions, which could mean either JAX or ivy's wrapper has some faulty creation/manipulation functions, or Hypothesis itself is doing something wrong. If folks are feeling adventurous, you could try running the extensive test suite for hypothesis.extra.array_api against ivy's JAX + Array API namespace.

@mattbarrett98
Author

mattbarrett98 commented Apr 28, 2022

I have done some timings of just the execution of inv in test_inv (a test which frequently causes us health check failures for JAX) like so:

from time import perf_counter

t0 = perf_counter()
res = linalg.inv(x)  # time a single call inside the test body
t1 = perf_counter()
print(t1 - t0, x)

and when using JAX the timings and inputs look like this:

0.28734600000098 ivy.array([], shape=(0, 0), dtype=float32)
0.0008987999972305261 ivy.array([], shape=(0, 0), dtype=float32)
0.001311700001679128 ivy.array([], shape=(0, 0), dtype=float32)
0.000825700000859797 ivy.array([], shape=(0, 0), dtype=float32)
0.4795199000000139 ivy.array([], shape=(1, 0, 34, 34), dtype=float32)

Every time a new shape or dtype is used, presumably JAX has to JIT-compile again, resulting in very slow runs. The last time (0.4795) is about 580 times slower than the previous, already-compiled run (0.0008). You can see that when a shape and dtype are reused, the execution is much quicker. Naturally Hypothesis will cover many different shapes and dtypes, meaning that JAX is constantly recompiling functions.
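The cost model behind this can be sketched without JAX at all; the following stdlib-only simulation (the cache and return labels are illustrative, not JAX's real machinery) shows why each fresh (shape, dtype) pair pays the full compile cost:

```python
# Stdlib-only model of JIT caching keyed on (shape, dtype): the first
# call for each key takes the slow "compile" path, repeat calls hit the
# cache. An illustration of the cost model, not JAX's real machinery.
compile_cache = {}

def simulated_inv(shape, dtype):
    key = (shape, dtype)
    if key not in compile_cache:
        compile_cache[key] = True  # stand-in for an expensive trace/compile
        return "compiled"          # slow path: ~0.3-0.5s in the timings above
    return "cached"                # fast path: ~0.001s in the timings above

calls = [((0, 0), "float32"), ((0, 0), "float32"),
         ((0, 0), "float32"), ((1, 0, 34, 34), "float32")]
results = [simulated_inv(*c) for c in calls]
print(results)  # → ['compiled', 'cached', 'cached', 'compiled']
```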

For comparison, these are the first few timings when using NumPy:

0.000525999999808846 ivy.array([], shape=(0, 0), dtype=float32)
0.0002558999985922128 ivy.array([], shape=(0, 0), dtype=float32)
0.0003217999983462505 ivy.array([], shape=(0, 0), dtype=float32)
0.00021559999731834978 ivy.array([], shape=(0, 0), dtype=float32)
0.00022950000129640102 ivy.array([], shape=(0, 45, 45), dtype=float32)

The average time here is about 0.0003s, roughly 500x faster than JAX's average of 0.15s, which seems sufficient to trigger Hypothesis' speed concerns.
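For what it's worth, those averages can be checked directly against the printed timings:

```python
# Recomputing the quoted averages from the timings printed above.
jax_times = [0.28734600000098, 0.0008987999972305261, 0.001311700001679128,
             0.000825700000859797, 0.4795199000000139]
numpy_times = [0.000525999999808846, 0.0002558999985922128, 0.0003217999983462505,
               0.00021559999731834978, 0.00022950000129640102]
jax_avg = sum(jax_times) / len(jax_times)        # ≈ 0.154
numpy_avg = sum(numpy_times) / len(numpy_times)  # ≈ 0.0003
print(round(jax_avg / numpy_avg))  # → 497
```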

The health checks being caused by JAX's frequent jit recompilation is also supported by the fact that we don't experience issues with NumPy, tensorflow or torch.

@honno
Member

honno commented Apr 28, 2022

Ah, thanks for the hard numbers, that's really good to know. There's still a problem of faulty array creation/manipulation (i.e. why are there invalid examples for test_reshape and test_matrix_power), but yes, I see that might just be exacerbating an underlying issue with JAX.

Ideally we'd wait and see how an official JAX Array API namespace works, but no idea when that'll happen. As @asmeurer said, we really don't want to ignore health checks; maybe a slightly less bad solution is introducing a flag which can change deadline. In any case, I want to explore how ivy runs on Hypothesis' internal test suite first.

@mattbarrett98
Author

Makes sense 🙂 Although some, like test_inv above, don't seem to get many invalid examples, e.g.

ivy_tests/test_array_api/array_api_tests/test_linalg.py:272 (test_inv)
@pytest.mark.xp_extension('linalg')
>   @given(x=invertible_matrices())
E   hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 5 valid examples in 1.21 seconds (0 invalid ones and 1 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).
E   See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.too_slow to the suppress_health_check settings for this test.

so this might just be happening for the reason explained before. When I add

hypothesis.settings.register_profile(
    "disable_speed_check",
    suppress_health_check=(hypothesis.HealthCheck.too_slow,))
hypothesis.settings.load_profile("disable_speed_check")

to the code it resolves the issue. This has the effect of disabling only the too_slow health check.

Adding a flag for deadline may be useful as well, but it wouldn't resolve the errors we get from the too_slow health check. Would it not also be possible to add a flag which allows the user to disable specific health checks? In our case that would let us disable just the too_slow health check and leave all others enabled (and leave everything enabled when JAX isn't the backend).

As long as the default behaviour for the test suite is that all health checks are enabled then would that be okay?

@honno
Member

honno commented Apr 28, 2022

for some like test_inv above it doesn't seem to get too many invalid examples

Note that even 1 invalid example should very rarely happen for a first-party Hypothesis strategy, and I don't see how it could for xps.arrays() without faulty creation/manipulation functions. So I definitely want to identify why that's happening before anything else (it could be related to the relatively slow JAX array creation).

Would it not also be possible to add a flag which allows the user to disable specific health checks.

As a last resort 😅

@mattbarrett98
Author

okay thanks 🙂

I've just run test_reshape against the NumPy array api (not with Ivy) to see how it compares with invalid examples. I get

============================ Hypothesis Statistics =============================
ivy_tests/test_array_api/array_api_tests/test_manipulation_functions.py::test_reshape:

  - during reuse phase (0.20 seconds):
    - Typical runtimes: 20-37 ms, ~ 87% in data generation
    - 5 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (11.04 seconds):
    - Typical runtimes: 15-48 ms, ~ 95% in data generation
    - 95 passing examples, 0 failing examples, 196 invalid examples
    - Events:
      * 68.73%, Retried draw from ListStrategy(integers(min_value=0), min_size=0, max_size=inf).filter(lambda s: math.prod(s) == size) to satisfy filter
      * 53.61%, Aborted test because unable to satisfy ListStrategy(integers(min_value=0), min_size=0, max_size=inf).filter(lambda s: math.prod(s) == size)
      * 5.15%, Retried draw from lists(integers(min_value=0, max_value=156), max_size=2).map(tuple).filter(lambda shape: prod(i for i in shape if i) < MAX_ARRAY_SIZE) to satisfy filter

  - Stopped because settings.max_examples=100

The test statistics show 196 invalid examples, should this not be occurring?

@honno
Member

honno commented Apr 28, 2022

test_reshape

I'm guessing all that filtering is from our custom strategies, like here

assume(all(side <= MAX_SIDE for side in rshape))

and everything we use from hypothesis_helpers.py. Invalidating these examples is infrequent enough, and they don't take much time to generate, so they won't fail the health checks. It looks like a lot, but really it's from internal calls of other strategies that also invalidate examples.
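A rough stdlib-only illustration (the size value and side ranges are made up, not the suite's real parameters) of why a filter like math.prod(s) == size forces so many retried draws:

```python
# Rough stdlib illustration of filter-heavy generation: random shape
# lists rarely satisfy math.prod(s) == size exactly, so a .filter() on
# that predicate must retry many draws, as the "Retried draw" events
# in the statistics show. The size value and side range are made up.
import math
import random

random.seed(0)
size = 12
draws = 1000
hits = sum(
    1 for _ in range(draws)
    if math.prod(random.choices(range(5), k=random.randint(0, 4))) == size
)
print(f"{hits}/{draws} random draws satisfied the filter")
```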

@mattbarrett98
Author

So we should expect some invalid examples to occur? Before, you were questioning why this was happening.

@honno
Member

honno commented Apr 28, 2022

We expect invalid examples for our custom strategies, certainly. But I'm assuming* the ones you saw were from xps.arrays(), where we don't.

*seems quite likely for multiple reasons, but yes this all needs a proper review

@honno
Member

honno commented May 9, 2022

@mattbarrett98 Yeah, so running your master branch of ivy in the internal Hypothesis test suite produces some errors, most of which do seem to be because of the timings you demonstrated, and these issues ultimately propagate into filtered examples. All those tests definitely need to be passing for Ivy's JAX backend to be reliably tested in array-api-tests too.

FYI to run ivy's JAX backend on the internal Hypothesis test suite, I replaced these lines with:

import ivy
ivy.set_framework("jax")
params = [pytest.param(ivy, make_strategies_namespace(ivy), id="ivyjax")]

Now this could very well be a hard limitation of JAX's use of JIT, where we might struggle without introducing a flag like --deadline=... (0 could mean disable). I would want a JAX maintainer to start implementing a namespace before supporting this use case, seeing as Ivy is third-party and could be handling things like device management inefficiently, which would exacerbate slow first-time array creation into something not practical to test with in the first place.

@mattbarrett98
Author

This all makes sense, thanks for looking into it 🙂 I've actually just found a way to disable the too_slow health check on our end to allow us to just focus on checking the functionality. We no longer seem to have any issues with health checks.

Thanks again!

@honno
Member

honno commented Jun 27, 2022

HypothesisWorks/hypothesis#3369 might be relevant, will explore but low prio.
