
Prevent nulls in index for non-numeric dtypes #8963

Merged
merged 18 commits into from
May 11, 2022

Conversation

jorloplaz
Contributor

@jorloplaz jorloplaz commented Apr 21, 2022

Major changes

  • Moved all checks made on index to DataFrame.set_index, so that shuffle.set_index and shuffle.set_sorted_index use index directly with no checks.
  • Besides the former checks, two other checks are performed:
    • Non-existence of nulls. If there are nulls, an informative NotImplementedError is raised.
    • Matching lengths (only if the series provided isn't a current column). If there's no match, an informative ValueError is raised.
  • Added a test_index_errors test that checks former errors and mismatching lengths.
  • Added a test_index_nulls test that ensures no NaNs are allowed, with different null values (NaT, NA, None, etc.).
  • Type annotations added to:
    • DataFrame.set_index and DataFrame.sort_values
    • Several methods in shuffle.py.
  • Analogous checks and tests in io for from_pandas, where the dataframe passed as argument is checked against this.

Minor changes

  • Instead of checking for a list or a tuple of columns to index with, the more general Sequence is allowed.
  • Eliminated context managers in tests that used get as a keyword (a DeprecationError was raised otherwise).
  • Modified an assertion in a test that checks for non-repeated computations, since now there's an additional compute() call for the NaN checks.
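A minimal sketch of the null check described above, using plain pandas (the helper name is hypothetical; in the PR the check lives inside DataFrame.set_index):

```python
import pandas as pd

def check_index_nulls(s: pd.Series) -> None:
    # Hypothetical helper mirroring the check described above:
    # reject any prospective index that contains nulls.
    if s.isna().any():
        raise NotImplementedError(
            "Setting an index with null values is not supported"
        )

check_index_nulls(pd.Series([1, 2, 3]))  # fine: no nulls

err = None
try:
    # NaT counts as a null for isna(), just like None and pd.NA
    check_index_nulls(pd.Series([pd.Timestamp("2022-01-01"), pd.NaT]))
except NotImplementedError as e:
    err = e
```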

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jorloplaz
Contributor Author

No idea why it fails for MacOS and Python 3.8. Seems to be a parquet-related test, but nothing in io has been changed. Any clues?

Contributor

@bryanwweber bryanwweber left a comment

@jorloplaz Thanks for the contribution! I think it will definitely improve the user experience here and I like that the checks of the index are being combined into a single place for consistency. I do have a few specific concerns below and one high-level concern about _check_index().

Specifically, it seems like _check_index() is doing both the checking and the getting of the index, which seems like a bit of mixing of concerns in the same function. I personally find it confusing that None is a sentinel value to mean "the index that's been passed is already the index", especially because the other possible return value from _check_index() is the index that should be used. Perhaps it's worth having separate helper functions to handle the various tasks here?

dask/dataframe/core.py
@@ -4555,14 +4556,80 @@ def sort_values(
**kwargs,
)

def _check_index(self, other: Any) -> Optional[Any]:
Contributor

Do the type annotations help here? I think there's an existing effort to annotate the dask/dask codebase, see #8854

Contributor Author

As a rule of thumb, I always add annotations in the code I write, so that another contributor doesn't have to figure out the types later. In this particular case maybe it's not worth it, as other can be of several types, but in set_index I think it's definitely worth it.

Contributor Author

I've added more type annotations to several methods in shuffle.py as a bonus.

dask/dataframe/core.py
dask/dataframe/core.py
dask/dataframe/core.py
dask/dataframe/core.py
)

# Ensure there are no nulls
if s.isna().any().compute():
Contributor

I realize that it's probably not possible to check for na without doing a compute(). However, I just want to flag that this will eagerly compute the index, which may or may not be otherwise desired.

Contributor Author

Yeah, I couldn't figure out any other way of doing this. Anyway, since it's an any() it should be lightweight: as soon as there's a True, there's no need to compute anything else. But I'm not really sure it stops there...

Contributor

@jsignell I'm not sure what happens here, and if this compute() will be a problem (I'm thinking of the recent vindex issue). Do you have any insight?

Contributor Author

@jorloplaz jorloplaz Apr 25, 2022

Here it's not like the len case, where we know that the length doesn't need to be checked if we're setting the index to an existing column. Here we must always check, since there may be nulls in any column, or in the series provided directly by the user.

Anyway, AFAIK set_index always triggers computations (calculating quantiles, etc.), except for the very particular case where sorted=True and divisions are explicitly provided by the user. So I'd say it's reasonable that there's a .compute() here. Do you agree?

dask/dataframe/core.py
dask/dataframe/shuffle.py
dask/dataframe/tests/test_dataframe.py
@bryanwweber
Contributor

Just wanted to add that the failure on the macOS test suite appears to be a flaky test. I'm not sure why it's failing, but I've seen similar failures related to parquet tests in other PRs.

@bryanwweber bryanwweber self-assigned this Apr 21, 2022
@github-actions github-actions bot added array documentation Improve or add to documentation io labels Apr 22, 2022
@jorloplaz
Contributor Author

jorloplaz commented Apr 25, 2022

I'd say all points have been covered @bryanwweber (except for possibly flaky tests in other modules that remain unchanged). I've edited the PR description to be more accurate.

@bryanwweber
Contributor

@jorloplaz can you rebase this on main again now that a few of the typing commits have been merged? I'm having trouble telling which changes are yours vs. merged from other PRs.

@jorloplaz
Contributor Author

jorloplaz commented Apr 25, 2022

@bryanwweber Not sure how I should do that. It still says that I changed 25 files, but I only changed 3 (core.py, shuffle.py, and test_dataframe.py). I tried this:

git checkout main
git pull upstream main --tags
git checkout prevent_nans_in_index
git rebase main

And then resolved all conflicts, etc., but it seems I'm doing something wrong. Any help is appreciated.

@bryanwweber
Contributor

@jorloplaz I did those same steps and resolved all the conflicts in favor of the incoming change. This resulted in 4 files changed (the three you mentioned, plus test_shuffle.py). I'm not sure if it is equivalent to your changes, but at least it resolved the number of files problem 😄 Unfortunately, I can't push to your branch because I don't have merge rights, but perhaps that gives you a direction to go.

@jorloplaz
Contributor Author

jorloplaz commented Apr 26, 2022

I did as you said, but it still claims 25 files changed (perhaps because my fork was created weeks ago?).

Anyway, you are right: my changes are only in those 3 files, and I also forgot to mention test_shuffle.py, so it's actually 4 files. All the other changes are from other PRs that have been merged into main in the meantime, I suppose, so you can ignore them.

Another possibility is that I try another fork from current main and open a new PR that should reflect just the changes I made. However, I think we'd lose all this discussion (unless we reference this PR in the second PR). What do you think?

@bryanwweber
Contributor

@jorloplaz Did you pull from upstream main first?

git switch main
git pull upstream main
git switch prevent_nans_in_index
git rebase main

Another option, if you know which files have changed, is to put them in a commit on a new branch and then move that over to this branch:

git switch main
git pull upstream main
git switch prevent_nans_in_index
git switch -c prevent_nans_rebase # create new branch
git reset main
git add <the files you changed>
git commit
git switch prevent_nans_in_index
git reset --hard prevent_nans_rebase

@github-actions github-actions bot removed io array documentation Improve or add to documentation labels Apr 26, 2022
@jorloplaz
Contributor Author

jorloplaz commented Apr 26, 2022

@bryanwweber Had to resort to force pushing eventually, but now it looks fine 🤞

I also changed some small things related to pycodestyle (e.g. replacing lambda functions with defs).

@bryanwweber
Contributor

bryanwweber commented Apr 26, 2022

@jorloplaz 🎉 Yes, I should have mentioned it would require a force-push. Any rewrite of the commit history like that needs a force-push 😄

I'll look at this again tomorrow morning Eastern US time but I think it's really close to being good now.

@github-actions github-actions bot added the io label Apr 27, 2022
@jorloplaz
Contributor Author

No idea what's going on with pytest-timeout in the distributed package... 😭

@crusaderky
Collaborator

> No idea what's going on with pytest-timeout in the distributed package... 😭

it broke with dask/distributed#6218

@jorloplaz
Contributor Author

@crusaderky Perhaps we should add pytest-timeout in dask's requirements as well?

@crusaderky
Collaborator

@jorloplaz yes, in a separate PR please

@fjetter
Member

fjetter commented Apr 27, 2022

I opened dask/distributed#6224 to remove it as a hard requirement

s = self[other]

# Ensure there are no nulls
if s.isna().any().compute():
Member

If we think it's important to compute nulls, then I think we should compute them when we do min, max, and len. That will allow us to only read in the data once. If that means that we miss the case where the index is sorted and the divisions are passed, I think that's ok.

Contributor Author

Not sure I'm following you. Currently it's when the code enters the divisions-figuring-out part in shuffle.py that things fail, because that part isn't robust to nulls. So I'm doing this check beforehand to prevent that from happening.

Of course, once that part is robust to nulls (and also to Pandas extension dtypes that use pd.NA), we'll be able to remove this check safely.

Member

Right, so could we fail within that method when we encounter a NaN? We could wrap this

divisions, sizes, mins, maxes = compute(divisions, sizes, mins, maxes)

in a try/except, or we could drill down into

def partition_quantiles(df, npartitions, upsample=1.0, random_state=None):

and raise from somewhere within that function.

Contributor Author

Wrapped the compute line in a try-except. Interestingly, that only fails when the column is non-numeric (nan in a numeric column is fine), so I was more flexible regarding what to accept in from_pandas. Now errors only happen when: 1) the index has some null, and 2) it is non-numeric.

Even so, I still think that for numeric cases nulls shouldn't really be accepted as part of the index either. Correct me if I'm wrong, but the main purpose of Dask's index is to know which partition to look in for a particular value (that is, for loc), right?

However, given that comparisons with null values behave unexpectedly (np.nan == np.nan yields False!), and that you can't really tell whether nan is greater than or less than some non-null value, I think allowing them can only bring trouble.
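The comparison behavior referred to above is easy to demonstrate in plain Python/NumPy:

```python
import numpy as np

# NaN compares unequal to everything, including itself...
self_equal = np.nan == np.nan   # False
# ...and it has no defined ordering against other values either
less = np.nan < 1.0             # False
greater = np.nan > 1.0          # False
```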

Contributor Author

Update: I slightly improved this so that nulls_presence is computed along with sizes, mins, maxes, etc. When an exception is raised, there are nulls, and the series dtype is non-numeric, we inform the user that those nulls are likely the cause of the problem. Hope you find this reasonable.

Contributor

@bryanwweber bryanwweber left a comment

This looks good from my side, thanks for considering all my feedback @jorloplaz! I believe @jsignell still had some concerns to resolve.

@jorloplaz
Contributor Author

Again there was a mysterious error for macOS 3.8 (a flaky test?) and gpuCI failed while building, but I think neither of those is my fault.

@pavithraes pavithraes requested a review from jsignell May 3, 2022 12:43
Member

@jsignell jsignell left a comment

I have a couple of last suggestions. Thanks so much for sticking with this @jorloplaz

Comment on lines 47 to 53
try:
    divisions, sizes, mins, maxes, nulls_presence = compute(
        divisions, sizes, mins, maxes, nulls_presence
    )
except Exception as e:
    # Check if there are nulls and, if so, inform the user that they are probably the cause behind the error
    if nulls_presence.any().compute() and not is_numeric_dtype(partition_col.dtype):
Member

Thanks for making this change @jorloplaz! I am wondering if we can tell from the error message whether there were NaNs, rather than having to do another compute before raising.

Contributor Author

@jorloplaz jorloplaz May 6, 2022

Not from the error message, because different things can happen. For example, we could have an error like:

File "/home/jorge.lopez/anaconda3/envs/AML-dev/lib/python3.9/site-packages/numpy/core/_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'float' and 'str'

But also I've found things like:

dask/array/percentile.py:40: in _percentile
result[0] = min(result[0], values.min())
TypeError: Cannot convert NaTType to pandas._libs.tslibs.timestamps._Timestamp

So what I did was change the general Exception to a TypeError, without checking the message itself and also without computing the nulls explicitly.
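The kind of TypeError being caught can be reproduced in plain Python: ordering comparisons between floats (including nan) and strings raise it, which is exactly what min/max over a mixed column hits.

```python
# min/max over a sequence mixing floats (e.g. nan) and strings raises
# TypeError, the error class the try/except around compute() now catches.
err = None
try:
    max([float("nan"), "a"])
except TypeError as e:
    err = e
```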

raise Exception()

with dask.config.set(get=throw):
    ddf2 = ddf.set_index("x", divisions=[1, 3, 5])
Member

I think you should be able to go back to the original on these tests now that the compute is only called from within the _calculate_division function.

Contributor Author

In test_shuffle.py I could remove it, but gpuCI keeps failing unless I remove this. It doesn't like the get keyword and requires a specific scheduler.

Member

This makes me a little worried, because this test is meant to ensure that no compute is happening in this case. What happens if you just change get= to scheduler=?

Contributor Author

Replaced with scheduler=. Let's see what happens.
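For reference, a sketch of the scheduler= form (assuming dask is installed); it sets the scheduler config key directly, which is what the deprecated get= keyword used to do.

```python
import dask

def throw(*args, **kwargs):
    # Scheduler stub that fails if any computation is triggered
    raise RuntimeError("compute was called")

with dask.config.set(scheduler=throw):
    # Inside the context, the custom scheduler is the active one
    active = dask.config.get("scheduler")
```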

@jorloplaz jorloplaz changed the title Prevent nans in index Prevent nans in index for non-numeric dtypes May 6, 2022
@jorloplaz jorloplaz changed the title Prevent nans in index for non-numeric dtypes Prevent nulls in index for non-numeric dtypes May 6, 2022
return self
# Otherwise, check length matches when other isn't one of the data columns
is_column = any(other._name == self[c]._name for c in self)
if not is_column and len(other) != len(self):
Member

This is another case where we are inadvertently computing, by using len. I don't think this is strictly related to the goal of the work, so can you remove this catch?

Contributor Author

Removed

Member

@jsignell jsignell left a comment

Thanks for sticking with this @jorloplaz!

@jsignell
Member

I'm just double checking the failure in CI to make sure it is unrelated

@jsignell
Member

Known flaky test (#8795)

@jsignell jsignell merged commit d458533 into dask:main May 11, 2022
@jsignell
Member

This is in!

@jorloplaz jorloplaz deleted the prevent_nans_in_index branch May 11, 2022 18:58
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
More informative error messages in `set_index` when:

1. There are nulls (not necessarily `nan`, but also `None`, `pd.NaT`, etc.)
2. The series to become the index has a non-numeric `dtype`.

Successfully merging this pull request may close these issues.

set_index on a column containing NaN raises confusing error (partition_quantiles not robust to NaNs)
6 participants