Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Partially replace pd.Int64Index with pd.Index #2339

Merged
merged 29 commits into from Apr 4, 2022
Merged

[ENH] Partially replace pd.Int64Index with pd.Index #2339

merged 29 commits into from Apr 4, 2022

Conversation

khrapovs
Copy link
Contributor

@khrapovs khrapovs commented Mar 29, 2022

Reference Issues/PRs

Partially addresses #2332.

What does this implement/fix? Explain your changes.

I have replaced pd.Int64Index with pd.Index only in obvious and harmless places. More complicated like VALID_INDEX_TYPES in sktime.utils.validation.series are left for another ambitious PR.

Does your contribution introduce a new dependency? If yes, which one?

No

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
  • The PR title starts with either [ENH], [DOC] or [BUG] indicating wether the PR topic is related to enhancement, documentation or bug

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 29, 2022

not entirely harmless, is it? 😄
image

@khrapovs
Copy link
Contributor Author

not entirely harmless, is it? smile image

OMG, what did I got myself into! now I see why others abandoned it :D

Stanislav Khrapov added 2 commits March 29, 2022 13:47
@khrapovs
Copy link
Contributor Author

ha! how about that, @fkiraly?
image

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take that, Int64Index!

In-principle, this look ok, but - we need to deprecate!

Why: suppose some poor user's highly valuable deployed pipelines are relying entirely on Int64Index for some reason. This PR would completely break it, wouldn't it?

Breakage occurs wherever we do input checks and the index is no longer allowed.
In those cases, we need to raise a warning (deprecation since 0.12, removal in 0.13), while being defensive in the sense of making sktime robust against the pandas deprecation...

In the many cases where we create an index, there is no deprecation necessary.

@khrapovs
Copy link
Contributor Author

@fkiraly Yes, you are right. But, right now there is no way to say whether a user created the index using pd.Index or pd.Int64Index:

type(pd.Index([1, -1]))
<class 'pandas.core.indexes.numeric.Int64Index'>
type(pd.Int64Index([1, -1]))
<input>:1: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
<class 'pandas.core.indexes.numeric.Int64Index'>

Additionally, from https://pandas.pydata.org/docs/reference/api/pandas.Int64Index.html:

Deprecated since version 1.4.0: In pandas v2.0 Int64Index will be removed and NumericIndex used instead. Int64Index will remain fully functional for the duration of pandas 1.x.

But NumericIndex does not even exist yet!

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 30, 2022

@khrapovs, I see. That makes things difficult.
Is there any way to not make us break our interface contracts?
How about replacing checks for index types by uses of the pandas method Index.is_integer?

@khrapovs
Copy link
Contributor Author

very good suggestion! I completely overlooked that such a method exists. replaced all usages of is_numeric with is_integer. thanks!

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
I think this could almost go in, but I'm concerned about the maintainability of the type check, where we now repeatedly do sth like isinstance(obj, VALID_TYPES) or (isinstance(obj, Index) and obj.is_integer()).

"Why" is hard to understand to someone who hasn't followed our discussion, and it's also repeated in multiple places across the codebase now (which you probably found via "breakage", but will be difficult to collect again for any future maintainer).

I would hence recommend to bundle this in one function, perhaps
is_valid_fh_index(obj, type="any"), where type can be "relative" or "absolute" - what do you think?

@khrapovs
Copy link
Contributor Author

khrapovs commented Apr 1, 2022

@fkiraly

Very good! Added two new functions:

  • sktime.utils.validation.series.is_in_valid_index_types
  • sktime.utils.validation.series.is_integer_index

The check for RELATIVE_TYPES or ABSOLUTE_TYPES happens only a couple of times in the single file _fh.py. A combined check for either valid index or integer index only happens just two times. I do not think it makes sense to over-engineer for such rare use cases. But if you insist of a function with type argument, I would be happy to write a couple more lines of code.

# Conflicts:
#	sktime/forecasting/base/_fh.py
@fkiraly
Copy link
Collaborator

fkiraly commented Apr 1, 2022

😱
image
what is this now - seems unrelated?

Maybe let's record this as a sporadic failure?

@khrapovs
Copy link
Contributor Author

khrapovs commented Apr 2, 2022

scream image what is this now - seems unrelated?

Maybe let's record this as a sporadic failure?

Reported here: #2368

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, one last change request:

  • change is_in_valid_index_types as suggested to include the is_integer_index check
  • move it to _fh.py, next to the two other functions
  • then use it in the two other places where you check that manually

@fkiraly
Copy link
Collaborator

fkiraly commented Apr 2, 2022

(I'm not insistent about my idea is_valid_fh_index(obj, type="any"), where type can be "relative" or "absolute" - also happy with three loose functions - but just wanted to bring it up again in case it seems more natural now)

@khrapovs
Copy link
Contributor Author

khrapovs commented Apr 3, 2022

  • change is_in_valid_index_types as suggested to include the is_integer_index check

done

  • move it to _fh.py, next to the two other functions

is_in_valid_index_types is very generic and does not have much connection to ForecastingHorizon other than it is used once to check fh values. I would prefer to keep it next to where VALID_INDEX_TYPES are defined. This function is used much more frequently by other modules.

  • then use it in the two other places where you check that manually

done

(I'm not insistent about my idea is_valid_fh_index(obj, type="any"), where type can be "relative" or "absolute" - also happy with three loose functions - but just wanted to bring it up again in case it seems more natural now)

Yes, I remember about this proposal and even now would prefer to keep the functions as they are now.

fkiraly
fkiraly previously approved these changes Apr 3, 2022
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to approve now, although I leave the merge decision to you after considering my non-blocking comment below.

is_in_valid_index_types is very generic and does not have much connection to ForecastingHorizon other than it is used once to check fh values. I would prefer to keep it next to where VALID_INDEX_TYPES are defined. This function is used much more frequently by other modules.

Ih that case, I would move the other way round, i.e., rename is_relative_fh_type to is_in_valid_relative_index_types or is_in_valid_index_types_relative, and put the three functions together into the validation module because they are so very similar.

@khrapovs
Copy link
Contributor Author

khrapovs commented Apr 3, 2022

Ih that case, I would move the other way round, i.e., rename is_relative_fh_type to is_in_valid_relative_index_types or is_in_valid_index_types_relative, and put the three functions together into the validation module because they are so very similar.

Very good. I moved not only is_in_valid_relative_index_types and is_in_valid_absolute_index_types to utils/validation/series.py, but also RELATIVE_INDEX_TYPES and ABSOLUTE_INDEX_TYPES next to VALID_INDEX_TYPES in the same module. I like that!

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

I see that, in this newest version, you changed INDEX_TYPE_LOOKUP too, but did not address the consequences of that change.

I think we should address those ramifications, before we merge.

There are a number of places in which a check in INDEX_TYPE_LOOKUP.get(index_type) occurs, this should be using is_in_valid_index_types, no?

@khrapovs
Copy link
Contributor Author

khrapovs commented Apr 3, 2022

I see that, in this newest version, you changed INDEX_TYPE_LOOKUP too, but did not address the consequences of that change.

I think we should address those ramifications, before we merge.

There are a number of places in which a check in INDEX_TYPE_LOOKUP.get(index_type) occurs, this should be using is_in_valid_index_types, no?

Correct. Thanks for noticing. I have used is_integer_index in two places of test_fh where assert isinstance was testing index type. I have double checked other usages of INDEX_TYPE_LOOKUP and do not see anything I should change there.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Should it not be much shorter than the if/else?
Happy to approve if there is a good reason not to.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion revealed a good reason to leave as is, which resolves the last change request.

@fkiraly fkiraly merged commit 4b9003c into sktime:main Apr 4, 2022
@khrapovs khrapovs deleted the 2338-khrapovs-replace-int64index branch April 4, 2022 14:09
srggrs added a commit to Gridsight/sktime that referenced this pull request Apr 6, 2022
* upstream/main:
  [DOC] Added docstring examples to load data functions (sktime#2393)
  [ENH] Capability inference for transformer and classifier pipelines (sktime#2367)
  [ENH] Proba metric grid search integration (sktime#2234)
  [ENH] Faster classifier example parameters (sktime#2378)
  [ENH] Get rid of `pd.Int64Index` (sktime#2390)
  [ENH] Allow `pd.Timedelta` values in `ForecastingHorizon` (sktime#2333)
  [ENH] Partially replace `pd.Int64Index` with `pd.Index` (sktime#2339)
  relax name rules for multiindex (sktime#2384)
@lmmentel lmmentel added the maintenance Continuous integration, unit testing & package distribution label Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Continuous integration, unit testing & package distribution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants