
Prevent nulls in index for non-numeric dtypes #8963

Merged
merged 18 commits into from
May 11, 2022

Conversation

jorloplaz
Contributor

@jorloplaz jorloplaz commented Apr 21, 2022

Major changes

  • Moved all checks made on index to DataFrame.set_index, so that shuffle.set_index and shuffle.set_sorted_index use index directly with no checks.
  • Besides the former checks, two other checks are performed:
    • Non-existence of nulls. If there are nulls, an informative NotImplementedError is raised.
    • Matching lengths (only if the series provided isn't a current column). If there's no match, an informative ValueError is raised.
  • Added a test_index_errors test that checks former errors and mismatching lengths.
  • Added a test_index_nulls test that ensures no NaNs are allowed, with different null values (NaT, NA, None, etc.).
  • Type annotations added to:
    • DataFrame.set_index and DataFrame.sort_values
    • Several methods in shuffle.py.
  • Analogous checks and tests in io for from_pandas, where the dataframe passed as argument is checked against this.

Minor changes

  • Instead of checking for a list or a tuple of columns to index with, the more general Sequence is allowed.
  • Eliminated context managers in tests that used get as a keyword (a DeprecationError was raised otherwise).
  • Modified an assertion in a test that checks for non-repeated computations, since now there's an additional compute() call for the NaN checks.
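A minimal sketch of the null check described above, using plain pandas (the helper name is hypothetical; in the PR the check lives inside DataFrame.set_index):

```python
import pandas as pd

def check_index_nulls(s: pd.Series) -> None:
    # Hypothetical helper mirroring the check described above:
    # reject any prospective index that contains nulls.
    if s.isna().any():
        raise NotImplementedError(
            "Setting an index with null values is not supported"
        )

check_index_nulls(pd.Series([1, 2, 3]))  # fine: no nulls

err = None
try:
    # NaT counts as a null for isna(), just like None and pd.NA
    check_index_nulls(pd.Series([pd.Timestamp("2022-01-01"), pd.NaT]))
except NotImplementedError as e:
    err = e
```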

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jorloplaz
Contributor Author

No idea why it fails for MacOS and Python 3.8. Seems to be a parquet-related test, but nothing in io has been changed. Any clues?

Contributor

@bryanwweber bryanwweber left a comment

@jorloplaz Thanks for the contribution! I think it will definitely improve the user experience here and I like that the checks of the index are being combined into a single place for consistency. I do have a few specific concerns below and one high-level concern about _check_index().

Specifically, it seems like _check_index() is doing both the checking and the getting of the index, which seems like a bit of mixing of concerns in the same function. I personally find it confusing that None is a sentinel value to mean "the index that's been passed is already the index", especially because the other possible return value from _check_index() is the index that should be used. Perhaps it's worth having separate helper functions to handle the various tasks here?

dask/dataframe/core.py
@@ -4555,14 +4556,80 @@ def sort_values(
**kwargs,
)

def _check_index(self, other: Any) -> Optional[Any]:
Contributor

Do the type annotations help here? I think there's an existing effort to annotate the dask/dask codebase, see #8854

Contributor Author

As a rule of thumb, I always add annotations in the code I write, so that another contributor doesn't have to figure out the types later. In this particular case maybe it's not worth it, as other can be of several types, but in set_index I think it's definitely worth it.

Contributor Author

I've added more type annotations to several methods in shuffle.py as a bonus.

dask/dataframe/core.py
dask/dataframe/core.py
dask/dataframe/core.py
dask/dataframe/core.py
)

# Ensure there are no nulls
if s.isna().any().compute():
Contributor

I realize that it's probably not possible to check for na without doing a compute(). However, I just want to flag that this will eagerly compute the index, which may or may not be otherwise desired.

Contributor Author

Yeah, I couldn't figure out any other way of doing this. Anyway, since it's an any() it should be lightweight: as soon as there's a True, there's no need to compute anything else. But I'm not really sure it stops there...

Contributor

@jsignell I'm not sure what happens here, and if this compute() will be a problem (I'm thinking of the recent vindex issue). Do you have any insight?

Contributor Author

@jorloplaz jorloplaz Apr 25, 2022

Here it's not like the len case, where we know that the length doesn't need to be checked if we're setting the index to an existing column. Here we must always check, since there may be nulls in any column, or in the series provided directly by the user.

Anyway, AFAIK set_index always triggers computations (calculating quantiles, etc.), except for the very particular case where sorted=True and divisions are explicitly provided by the user. So I'd say it's reasonable that there's a .compute() here. Do you agree?

dask/dataframe/core.py
dask/dataframe/shuffle.py
dask/dataframe/tests/test_dataframe.py
@bryanwweber
Contributor

Just wanted to add that the failure on the macOS test suite appears to be a flaky test. I'm not sure why it's failing, but I've seen similar failures related to parquet tests in other PRs.

@bryanwweber bryanwweber self-assigned this Apr 21, 2022
@github-actions github-actions bot added array documentation Improve or add to documentation io labels Apr 22, 2022
@jorloplaz
Contributor Author

jorloplaz commented Apr 25, 2022

I'd say all points have been covered @bryanwweber (except for possibly flaky tests in other modules that remain unchanged). I've edited the PR description to be more accurate.

@bryanwweber
Contributor

@jorloplaz can you rebase this on main again now that a few of the typing commits have been merged? I'm having trouble telling which changes are yours vs. merged from other PRs.

@jorloplaz
Contributor Author

jorloplaz commented Apr 25, 2022

@bryanwweber Not sure how I should do that. It still says that I changed 25 files, but I only changed 3 (core.py, shuffle.py, and test_dataframe.py). I tried this:

git checkout main
git pull upstream main --tags
git checkout prevent_nans_in_index
git rebase main

And then resolved all conflicts, etc., but it seems I'm doing something wrong. Any help is appreciated.

@bryanwweber
Contributor

@jorloplaz I did those same steps and resolved all the conflicts in favor of the incoming change. This resulted in 4 files changed (the three you mentioned, plus test_shuffle.py). I'm not sure if it is equivalent to your changes, but at least it resolved the number of files problem 😄 Unfortunately, I can't push to your branch because I don't have merge rights, but perhaps that gives you a direction to go.

@jorloplaz
Contributor Author

jorloplaz commented Apr 26, 2022

I did as you said, but it still claims 25 files changed (perhaps because my fork was created weeks ago?).

Anyway, you are right: my changes are only in those 3 files, and I also forgot to mention test_shuffle.py, so it's actually 4 files. All the other changes are from other PRs that have been merged into main in the meantime, I suppose, so you can ignore them.

Another possibility is that I try another fork from current main and open a new PR that should reflect just the changes I made. However, I think we'd lose all this discussion (unless we reference this PR in the second PR). What do you think?

@bryanwweber
Contributor

@jorloplaz Did you pull from upstream main first?

git switch main
git pull upstream main
git switch prevent_nans_in_index
git rebase main

Another option, if you know which files have changed, is to put them in a commit on a new branch and then move that over to this branch:

git switch main
git pull upstream main
git switch prevent_nans_in_index
git switch -c prevent_nans_rebase # create new branch
git reset main
git add <the files you changed>
git commit
git switch prevent_nans_in_index
git reset --hard prevent_nans_rebase

@github-actions github-actions bot removed io array documentation Improve or add to documentation labels Apr 26, 2022
@jorloplaz
Contributor Author

jorloplaz commented Apr 26, 2022

@bryanwweber Had to resort to force pushing eventually, but now it looks fine 🤞

I also changed some small things related to pycodestyle (e.g. replacing lambda functions with defs).

@bryanwweber
Contributor

bryanwweber commented Apr 26, 2022

@jorloplaz 🎉 Yes, I should have mentioned it would require a force-push. Any rewrite of the commit history like that needs a force-push 😄

I'll look at this again tomorrow morning Eastern US time but I think it's really close to being good now.

@github-actions github-actions bot added the io label Apr 27, 2022
@jorloplaz
Contributor Author

No idea what's going on with pytest-timeout in the distributed package... 😭

@crusaderky
Collaborator

> No idea what's going on with pytest-timeout in the distributed package... 😭

it broke with dask/distributed#6218

@jorloplaz
Contributor Author

@crusaderky Perhaps we should add pytest-timeout in dask's requirements as well?

@crusaderky
Collaborator

@jorloplaz yes, in a separate PR please

@fjetter
Member

fjetter commented Apr 27, 2022

I opened dask/distributed#6224 to remove it as a hard requirement

s = self[other]

# Ensure there are no nulls
if s.isna().any().compute():
Member

If we think it's important to compute nulls, then I think we should compute them when we do min, max, and len. That will allow us to only read in the data once. If that means that we miss the case where the index is sorted and the divisions are passed, I think that's ok.

Contributor Author

Not sure I'm following you. Currently it's when the code enters the divisions-figuring-out part in shuffle.py that things fail, because that part isn't robust to nulls. So I'm doing this check beforehand to prevent that from happening.

Of course, once that part is robust to nulls (and also to Pandas extension dtypes that use pd.NA), we'll be able to remove this check safely.

Member

Right, so could we fail within that method when we encounter a NaN? We could wrap this

divisions, sizes, mins, maxes = compute(divisions, sizes, mins, maxes)

in a try/except, or we could drill down into

def partition_quantiles(df, npartitions, upsample=1.0, random_state=None):

and raise from somewhere within that function.

Contributor Author

Wrapped the compute line in a try-except. Interestingly, that only fails when the column is non-numeric (nan in a numeric column is fine), so I was more flexible regarding what to accept in from_pandas. Now errors only happen when: 1) the index has some null, and 2) it is non-numeric.

Even so, I still think that for numeric cases nulls shouldn't really be accepted as part of the index either. Correct me if I'm wrong, but the main purpose of Dask's index is to know which partition to look in for a particular value (that is, for loc), right?

However, given that comparisons with null values behave unexpectedly (np.nan == np.nan yields False!), and that you can't really tell whether nan is greater than or less than some non-null value, I think allowing them can only bring trouble.
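The comparison behavior referred to above is easy to demonstrate in plain Python/NumPy:

```python
import numpy as np

# NaN compares unequal to everything, including itself...
self_equal = np.nan == np.nan   # False
# ...and it has no defined ordering against other values either
less = np.nan < 1.0             # False
greater = np.nan > 1.0          # False
```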

Contributor Author

Update: I slightly improved this so that nulls_presence is computed along with sizes, mins, maxes, etc. When an exception is raised, there are nulls, and the series dtype is non-numeric, we inform the user that those nulls are likely the cause of the problem. Hope you find this reasonable.

Contributor

@bryanwweber bryanwweber left a comment

This looks good from my side, thanks for considering all my feedback @jorloplaz! I believe @jsignell still had some concerns to resolve.

@jorloplaz
Contributor Author

Again there was a mysterious error for macOS 3.8 (a flaky test?) and gpuCI failed while building, but I think neither of those is my fault.

@pavithraes pavithraes requested a review from jsignell May 3, 2022 12:43
Member

@jsignell jsignell left a comment

I have a couple of last suggestions. Thanks so much for sticking with this @jorloplaz

Comment on lines 47 to 53
try:
    divisions, sizes, mins, maxes, nulls_presence = compute(
        divisions, sizes, mins, maxes, nulls_presence
    )
except Exception as e:
    # Check if there are nulls and, if so, inform the user that they are probably the cause behind the error
    if nulls_presence.any().compute() and not is_numeric_dtype(partition_col.dtype):
Member

Thanks for making this change @jorloplaz! I am wondering if we can tell from the error message whether there were NaNs, rather than having to do another compute before raising.

Contributor Author

@jorloplaz jorloplaz May 6, 2022

Not from the error message, because different things can happen. For example, we could have an error like:

File "/home/jorge.lopez/anaconda3/envs/AML-dev/lib/python3.9/site-packages/numpy/core/_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'float' and 'str'

But also I've found things like:

dask/array/percentile.py:40: in _percentile
result[0] = min(result[0], values.min())
TypeError: Cannot convert NaTType to pandas._libs.tslibs.timestamps._Timestamp

So what I did was change the general Exception to a TypeError, without checking the message itself and also without computing the nulls explicitly.
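The kind of TypeError being caught can be reproduced in plain Python: ordering comparisons between floats (including nan) and strings raise it, which is exactly what min/max over a mixed column hits.

```python
# min/max over a sequence mixing floats (e.g. nan) and strings raises
# TypeError, the error class the try/except around compute() now catches.
err = None
try:
    max([float("nan"), "a"])
except TypeError as e:
    err = e
```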

raise Exception()

with dask.config.set(get=throw):
    ddf2 = ddf.set_index("x", divisions=[1, 3, 5])
Member

I think you should be able to go back to the original on these tests now that the compute is only called from within the _calculate_division function.

Contributor Author

In test_shuffle.py I could remove it, but gpuCI keeps failing unless I remove this. It doesn't like the get keyword and requires a specific scheduler.

Member

This makes me a little worried, because this test is meant to ensure that no compute is happening in this case. What happens if you just change get= to scheduler=?

Contributor Author

Replaced with scheduler=. Let's see what happens.
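For reference, a sketch of the scheduler= form (assuming dask is installed); it sets the scheduler config key directly, which is what the deprecated get= keyword used to do.

```python
import dask

def throw(*args, **kwargs):
    # Scheduler stub that fails if any computation is triggered
    raise RuntimeError("compute was called")

with dask.config.set(scheduler=throw):
    # Inside the context, the custom scheduler is the active one
    active = dask.config.get("scheduler")
```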

@jorloplaz jorloplaz changed the title Prevent nans in index Prevent nans in index for non-numeric dtypes May 6, 2022
@jorloplaz jorloplaz changed the title Prevent nans in index for non-numeric dtypes Prevent nulls in index for non-numeric dtypes May 6, 2022
return self
# Otherwise, check length matches when other isn't one of the data columns
is_column = any(other._name == self[c]._name for c in self)
if not is_column and len(other) != len(self):
Member

This is another case where we are inadvertently computing, by using len. I don't think this is strictly related to the goal of the work, so can you remove this catch?

Contributor Author

Removed

Member

@jsignell jsignell left a comment

Thanks for sticking with this @jorloplaz!

@jsignell
Member

I'm just double checking the failure in CI to make sure it is unrelated

@jsignell
Member

Known flaky test (#8795)

@jsignell jsignell merged commit d458533 into dask:main May 11, 2022
@jsignell
Member

This is in!

@jorloplaz jorloplaz deleted the prevent_nans_in_index branch May 11, 2022 18:58
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
More informative error messages in `set_index` when:

1. There are nulls (not necessarily `nan`, but also `None`, `pd.NaT`, etc.)
2. The series to become the index has a non-numeric `dtype`.

Successfully merging this pull request may close these issues.

set_index on a column containing NaN raises confusing error (partition_quantiles not robust to NaNs)
6 participants