Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add support for numpy 2's string dtype #58394

Draft
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

lithomas1
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Just submitting this for CI purposes, nothing to review yet.

@lithomas1 lithomas1 added Enhancement Strings String extension data type and string data labels Apr 23, 2024
ndarray[uint8_int64_object_t] out,
ndarray[object] arr,
ndarray out,
ndarray arr,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make iteration slower.

Look into numpy C API iteration functions.

(e.g. look at Validator._validate)

If this gives perf improvement, maybe can be split out.


@classmethod
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = False):
na_mask, any_na = libmissing.isnaobj(np.array(scalars, dtype=object), check_for_any_na=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe can use _lib.convert_nans_to_NA instead?

Would get rid of diff in missing.pyx.

return super()._str_endswith(pat, na)
pat = np.asarray(pat, dtype=get_numpy_string_dtype_instance())
result = np.strings.endswith(self._ndarray, pat)
return BooleanArray(result, isna(self._ndarray))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache na mask?


if ret.dtype == object and ret_dtype is not None:
# cast from object back to StringDType
return ret.astype(ret_dtype, copy=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In map_infer_mask there is a convert=True argument, that calls ``maybe_convert_objects on it.

Alternately, we could look at setting the dtype argument to an instance of StringDType, and hope that PyArray_SETITEM will work on it?


def get_numpy_string_dtype_instance(
na_object=libmissing.NA,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is na_object parameter used?

)
raise TypeError(msg)
dtype = self._inferred_dtype
if dtype not in allowed_types:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are numpy dtypes hashable?

Might just want to add to allowed_types

@@ -1479,7 +1481,7 @@ def find_common_type(types):
if t.kind in "iufc":
return np.dtype("object")

return np_find_common_type(*types)
return np.result_type(*types)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: This (other changes in this file) are causing test failures.

if self.values.dtype.kind == 'T':
return NumpyStringArray(self.values)
else:
return NumpyExtensionArray(self.values)
Copy link
Contributor

@ngoldbaum ngoldbaum Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because NumpyExtensionArray inherits from ObjectStringArrayMixin, so unless I make this change there's no way to call the string method overrides in NumpyStringArray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants