New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add support for numpy 2's string dtype #58394
base: main
Are you sure you want to change the base?
Conversation
ndarray[uint8_int64_object_t] out, | ||
ndarray[object] arr, | ||
ndarray out, | ||
ndarray arr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make iteration slower.
Look into numpy C API iteration functions.
(e.g. look at Validator._validate)
If this gives perf improvement, maybe can be split out.
|
||
@classmethod | ||
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = False): | ||
na_mask, any_na = libmissing.isnaobj(np.array(scalars, dtype=object), check_for_any_na=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can use _lib.convert_nans_to_NA
instead?
Would get rid of diff in missing.pyx.
return super()._str_endswith(pat, na) | ||
pat = np.asarray(pat, dtype=get_numpy_string_dtype_instance()) | ||
result = np.strings.endswith(self._ndarray, pat) | ||
return BooleanArray(result, isna(self._ndarray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache na mask?
|
||
if ret.dtype == object and ret_dtype is not None: | ||
# cast from object back to StringDType | ||
return ret.astype(ret_dtype, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In map_infer_mask
there is a convert=True argument, that calls ``maybe_convert_objects on it.
Alternately, we could look at setting the dtype argument to an instance of StringDType
, and hope that PyArray_SETITEM
will work on it?
|
||
def get_numpy_string_dtype_instance( | ||
na_object=libmissing.NA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is na_object parameter used?
) | ||
raise TypeError(msg) | ||
dtype = self._inferred_dtype | ||
if dtype not in allowed_types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are numpy dtypes hashable?
Might just want to add to allowed_types
@@ -1479,7 +1481,7 @@ def find_common_type(types): | |||
if t.kind in "iufc": | |||
return np.dtype("object") | |||
|
|||
return np_find_common_type(*types) | |||
return np.result_type(*types) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: This (other changes in this file) are causing test failures.
if self.values.dtype.kind == 'T': | ||
return NumpyStringArray(self.values) | ||
else: | ||
return NumpyExtensionArray(self.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because NumpyExtensionArray inherits from ObjectStringArrayMixin
, so unless I make this change there's no way to call the string method overrides in NumpyStringArray
.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Just submitting this for CI purposes, nothing to review yet.