Prevent nulls in index for non-numeric dtypes #8963
Conversation
Can one of the admins verify this patch?
No idea why it fails for MacOS and Python 3.8. Seems to be a parquet-related test, but nothing in
@jorloplaz Thanks for the contribution! I think it will definitely improve the user experience here, and I like that the checks of the index are being combined into a single place for consistency. I do have a few specific concerns below and one high-level concern about `_check_index()`.

Specifically, it seems like `_check_index()` is doing both the checking and the getting of the index, which seems like a bit of mixing of concerns in the same function. I personally find it confusing that `None` is a sentinel value to mean "the index that's been passed is already the index", especially because the other possible return value from `_check_index()` is the index that should be used. Perhaps it's worth having separate helper functions to handle the various tasks here?
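One way to read that suggestion, as a minimal sketch — the helper names and class stub here are hypothetical, not code from this PR:

```python
from typing import Any


class _Frame:  # illustrative stand-in for the real dask class
    def _validate_index(self, other: Any) -> None:
        """Only checks: raise if `other` cannot become the index."""
        if isinstance(other, (list, tuple)) and len(other) > 1:
            raise NotImplementedError(
                "Dask dataframe does not yet support multi-indexes"
            )

    def _resolve_index(self, other: Any) -> Any:
        """Only gets: return the object to use as the new index."""
        if isinstance(other, str):
            return self[other]  # a column name resolves to that column
        return other  # already an index-like object; no None sentinel
```

With validation and resolution separated, the `None` sentinel disappears: callers always get back the index to use.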
dask/dataframe/core.py (Outdated)
```
@@ -4555,14 +4556,80 @@ def sort_values(
        **kwargs,
    )

    def _check_index(self, other: Any) -> Optional[Any]:
```
Do the type annotations help here? I think there's an existing effort to annotate the dask/dask codebase, see #8854
As a rule of thumb, I always add annotations in the code I write, so that another contributor doesn't have to figure out the types later. In this particular case maybe it's not worth it, as `other` can be of several types, but in `set_index` I think it's definitely worth it.
I've added more type annotations to several methods in `shuffle.py` as a bonus.
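For context, the style of annotation under discussion might look something like this — a sketch only; the real `set_index` signature in `dask/dataframe/core.py` has more parameters and more precise types:

```python
from typing import Any, Optional, Sequence, Union


# Illustrative only: the real signature in dask/dataframe/core.py has
# more parameters (npartitions, shuffle, ...) than shown here.
def set_index(
    self,
    other: Union[str, Any],   # a column name, or a Series to use directly
    drop: bool = True,
    sorted: bool = False,
    divisions: Optional[Sequence[Any]] = None,
    **kwargs: Any,
) -> Any:                     # the DataFrame with its new index
    ...
```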
dask/dataframe/core.py (Outdated)
```
    )

    # Ensure there are no nulls
    if s.isna().any().compute():
```
I realize that it's probably not possible to check for `na` without doing a `compute()`. However, I just want to flag that this will eagerly compute the index, which may or may not be otherwise desired.
Yeah, I couldn't figure out any other way of doing this. Anyway, since it's an `any()`, it should be lightweight, because as soon as there's a `True` there's no need to calculate anything else. But I'm not really sure it stops there...
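For what it's worth, the doubt seems justified: as far as I know, dask reductions like `any()` are tree reductions over all partitions, with no cross-partition short-circuiting. A small illustration:

```python
import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series([None, 1.0, 2.0, 3.0]), npartitions=4)

# Even though the very first partition already contains a null, the
# isna() chunk of every partition is still part of the task graph and
# gets scheduled; an early True does not stop the other partitions.
has_nulls = s.isna().any().compute()
print(has_nulls)  # True
```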
@jsignell I'm not sure what happens here, and whether this `compute()` will be a problem (I'm thinking of the recent `vindex` issue). Do you have any insight?
Here it's not like the `len` case, where we know that if the index is being set to an existing column, the length doesn't need to be checked. Here we must always check, as there may be nulls in any column, or in the series provided directly by the user.

Anyway, AFAIK `set_index` always triggers computations (calculating quantiles, etc.), except for the very particular case where `sorted=True` and `divisions` are explicitly provided by the user. So I'd say it's reasonable that there's a `.compute()` here. Do you agree?
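A small illustration of that one exception, with made-up data — `sorted=True` plus explicit `divisions` is the path that skips computation:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": [1, 2, 3, 4], "y": list("abcd")})
ddf = dd.from_pandas(pdf, npartitions=2)

# The data are already sorted on "x" and divisions are given, so dask
# takes them on trust: no quantile calculation and no shuffle happen.
ddf2 = ddf.set_index("x", sorted=True, divisions=[1, 3, 4])

# Any other set_index call must compute quantiles to pick divisions.
ddf3 = ddf.set_index("x")  # triggers computation under the hood
```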
Just wanted to add that the failure on the macOS test suite appears to be a flaky test. I'm not sure why it's failing, but I've seen similar failures related to parquet tests in other PRs.
I'd say all points have been covered @bryanwweber (except for possibly flaky tests in other modules that remain unchanged). I've edited the PR description to be more accurate.
@jorloplaz can you rebase this on `main`?
@bryanwweber Not sure how I should do that. It still says that I changed 25 files, but I only changed 3. I tried:

```sh
git checkout main
git pull upstream main --tags
git checkout prevent_nans_in_index
git rebase main
```

and then solved all conflicts, etc., but it seems I'm doing something wrong. Any help is appreciated.
@jorloplaz I did those same steps and resolved all the conflicts in favor of the incoming change. This resulted in 4 files changed (the three you mentioned, plus one more).
I did as you said, but it still claims the 25 files (perhaps because my fork is out of date?). Anyway, you are right: my changes are only in those 3 files, and I also forgot to mention the fourth one. Another possibility is that I try another fork from the current `main`.
@jorloplaz Did you pull from upstream main first?

```sh
git switch main
git pull upstream main
git switch prevent_nans_in_index
git rebase main
```

Another option, if you know which files have changed, is to put them in a commit on a new branch and then move that over to this branch:
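One plausible sequence for that option — the branch name comes from the thread above, while the file paths and commit message are assumptions:

```sh
# Start a clean branch from the freshly updated main
git switch -c clean-branch main

# Bring over only the files this PR actually changes (paths assumed)
git checkout prevent_nans_in_index -- \
    dask/dataframe/core.py \
    dask/dataframe/shuffle.py \
    dask/dataframe/io/io.py
git commit -m "Prevent nulls in index for non-numeric dtypes"

# Point the PR branch at the clean history and force-push it
git branch -f prevent_nans_in_index clean-branch
git switch prevent_nans_in_index
git push --force-with-lease origin prevent_nans_in_index
```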
(force-pushed from 9772652 to 209aac7)
@bryanwweber Had to resort to force pushing eventually, but now it looks fine 🤞 I also changed some small things related to
@jorloplaz 🎉 Yes, I should have mentioned it would require a force-push. Any rewrite of the commit history like that needs a force-push 😄 I'll look at this again tomorrow morning Eastern US time, but I think it's really close to being good now.
No idea what's going on with
It broke with dask/distributed#6218.
@crusaderky Perhaps we should add
@jorloplaz yes, in a separate PR please.
I opened dask/distributed#6224 to remove it as a hard requirement.
dask/dataframe/core.py (Outdated)
```
    s = self[other]

    # Ensure there are no nulls
    if s.isna().any().compute():
```
If we think it's important to compute nulls, then I think we should compute them when we do min, max, and len. That will allow us to only read in the data once. If that means that we miss the case where the index is sorted and the divisions are passed, I think that's ok.
Not sure I'm following you. Currently it's when the code enters the divisions-figuring-out part in `shuffle.py` that things fail, because it's not robust to nulls. So I'm doing this check beforehand to prevent that from happening.

Of course, once that part is robust to nulls (and also to pandas extension dtypes that use `pd.NA`), we'll be able to remove this check safely.
Right, so could we fail within that method when we encounter a NaN? We could wrap this:

dask/dataframe/shuffle.py, line 42 at 138831b:
```python
divisions, sizes, mins, maxes = compute(divisions, sizes, mins, maxes)
```

dask/dataframe/partitionquantiles.py, line 443 at 138831b:
```python
def partition_quantiles(df, npartitions, upsample=1.0, random_state=None):
```
Wrapped the `compute` line into a `try`-`except`. Interestingly, that only fails when the column is non-numeric (`nan` is fine), so I was more flexible regarding what to accept in `from_pandas`. Now errors only happen when: 1) the index has some null, and 2) it is non-numeric.

Even so, I still think that even for numeric cases nulls shouldn't really be accepted as part of the index. Correct me if I'm wrong, but the main purpose of Dask's index is to know in which partition to look for a particular value (that is, for `loc`), right?

However, given that equality comparisons with null values don't behave as expected (`np.nan == np.nan` yields `False`!!!), and that you can't really tell whether `nan` is greater than or less than some non-null value, I think allowing them in the index can only bring trouble.
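A quick demonstration of the comparison behavior in question:

```python
import numpy as np

# NaN compares as unequal to everything, including itself, and is
# neither less than nor greater than any value, so it cannot be
# placed relative to division boundaries.
print(np.nan == np.nan)  # False
print(np.nan < 1.0)      # False
print(np.nan > 1.0)      # False
```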
Update: I slightly improved this so that `nulls_presence` is computed along with `sizes`, `mins`, `maxes`, etc. When an exception is raised, there are nulls, and the series `dtype` is non-numeric, we inform the user that those nulls are likely the cause of the problem. Hope you find this reasonable.
(force-pushed from 29946ba to bd69994)
This looks good from my side, thanks for considering all my feedback @jorloplaz! I believe @jsignell still had some concerns to resolve.
Again there was a mysterious error for macOS 3.8 (a flaky test?) and
I have a couple of last suggestions. Thanks so much for keeping with this @jorloplaz!
dask/dataframe/shuffle.py (Outdated)
```
    try:
        divisions, sizes, mins, maxes, nulls_presence = compute(
            divisions, sizes, mins, maxes, nulls_presence
        )
    except Exception as e:
        # Check if there are nulls and, if so, inform the user that they are probably the cause behind the error
        if nulls_presence.any().compute() and not is_numeric_dtype(partition_col.dtype):
```
Thanks for making this change @jorloplaz! I am wondering if we can tell from the error message whether there were NaNs, rather than having to do another compute before raising.
Not from the error message, because different things can happen. For example, we could have an error like:

```
File "/home/jorge.lopez/anaconda3/envs/AML-dev/lib/python3.9/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'float' and 'str'
```

But I've also found things like:

```
dask/array/percentile.py:40: in _percentile
    result[0] = min(result[0], values.min())
TypeError: Cannot convert NaTType to pandas._libs.tslibs.timestamps._Timestamp
```

So what I did is change the general `Exception` to a `TypeError`, without checking the message itself and also without computing the nulls explicitly.
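Putting that together, the narrowed handler might look roughly like this — a sketch based on the diff quoted above; the message text and exact structure are assumptions, not the merged code:

```python
# Inside the divisions calculation in dask/dataframe/shuffle.py;
# `compute`, `is_numeric_dtype`, and `partition_col` come from the
# surrounding code quoted earlier in this thread.
try:
    divisions, sizes, mins, maxes, nulls_presence = compute(
        divisions, sizes, mins, maxes, nulls_presence
    )
except TypeError as e:
    # Null comparisons in non-numeric columns surface as TypeError deep
    # inside the quantile machinery; re-raise with a clearer hint.
    if not is_numeric_dtype(partition_col.dtype):
        raise NotImplementedError(
            "Calculating divisions failed; if your column contains "
            "null values, they are likely the cause."
        ) from e
    raise
```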
```
        raise Exception()

    with dask.config.set(get=throw):
        ddf2 = ddf.set_index("x", divisions=[1, 3, 5])
```
I think you should be able to go back to the original on these tests now that the compute is only called from within the `_calculate_division` function.
In `test_shuffle.py` I could remove it, but `gpuCI` keeps failing unless I remove this. It doesn't like the `get` keyword and requires a specific `scheduler`.
This makes me a little worried, because this test is meant to ensure that no compute is happening in this case. What happens if you just change `get=` to `scheduler=`?
Replaced with `scheduler=`. Let's see what happens.
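For reference, the guarded test would then look roughly like this — a sketch; the fixture data here are made up:

```python
import dask
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]})
ddf = dd.from_pandas(pdf, npartitions=2)

def throw(*args, **kwargs):
    raise RuntimeError("no computation should happen in this test")

# Installing a scheduler that raises proves set_index stays lazy on
# the sorted-with-divisions path; any hidden compute() would blow up.
with dask.config.set(scheduler=throw):
    ddf2 = ddf.set_index("x", sorted=True, divisions=[1, 3, 5])
```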
dask/dataframe/core.py (Outdated)
```
        return self
    # Otherwise, check length matches when other isn't one of the data columns
    is_column = any(other._name == self[c]._name for c in self)
    if not is_column and len(other) != len(self):
```
This is another case where we are inadvertently computing, by using `len`. I don't think this is strictly related to the goal of the work, so can you remove this check?
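The reason this counts as an inadvertent compute: `len` on a dask collection is eager. A tiny illustration:

```python
import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series(range(10)), npartitions=2)

# __len__ cannot be lazy (Python requires an int), so this computes
# the length of every partition right away.
n = len(s)
print(n)  # 10
```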
Removed
Thanks for sticking with this @jorloplaz!
I'm just double-checking the failure in CI to make sure it is unrelated.
Known flaky test (#8795).
This is in!
More informative error messages in `set_index` when:

1. There are nulls (not necessarily `nan`, but also `None`, `pd.NaT`, etc.).
2. The series to become the index has a non-numeric `dtype`.

- Closes #8347: `set_index` on a column containing NaN raises confusing error (`partition_quantiles` not robust to NaNs)
- Passes `pre-commit run --all-files`
- Passes `mypy` for type annotations

Major changes

- Moved the checks of `index` to `DataFrame.set_index`, so that `shuffle.set_index` and `shuffle.set_sorted_index` use `index` directly with no checks.
- When the index contains nulls and its `dtype` is non-numeric, a `NotImplementedError` is raised.
- When lengths mismatch, a `ValueError` is raised.
- Added a `test_index_errors` test that checks former errors and mismatching lengths.
- Added a `test_index_nulls` test that ensures no `NaN`s are allowed, with different null values (`NaT`, `NA`, `None`, etc.).
- Added type annotations to `DataFrame.set_index` and `DataFrame.sort_values`, and to several methods in `shuffle.py`.
- Updated `io` for `from_pandas`, where the dataframe passed as argument is checked against this.

Minor changes

- Instead of only a `list` or a `tuple` of columns to index with, the more general `Sequence` is allowed.
- Replaced `get` as a keyword (a `DeprecationError` was raised otherwise).
- Removed the extra `compute()` call for the `NaN` checks.
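To illustrate the user-facing effect — a hypothetical example, not a snippet from the PR, and the exact error message differs:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": ["a", None, "c", "d"], "y": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Before this PR: an opaque TypeError from deep inside
# partition_quantiles. After this PR: an error explaining that nulls
# in the non-numeric column "x" are the likely cause.
ddf.set_index("x")
```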