
Dtype Brainstorming

Matti Picus edited this page Oct 22, 2018 · 4 revisions

Dtypes in NumPy

A brainstorming session at the SciPy 2018 sprints July 14, 2018

User Stories

Motivations

Replacing subclassing, which is quite fragile.

Dtype requirements

  • Hold data (e.g., categories, datetime64, units)

  • Needs to be able to override dtype-specific functionality:

    • Arithmetic
    • Ufuncs
    • Sorting
    • Coercion rules
  • Handle life cycle (e.g. GEOS/shapely)

  • Push API up to the ndarray

    • For example, can a unit dtype push a method .to up to the ndarray class to convert to a different unit?
    • Or can a datetime dtype push a .year or .dayofweek up to the ndarray class?
    • This can be done -- but should it be done?
  • Two use cases: writing high-level dtypes in Python and low-level dtypes in C:

    • We need new capabilities for C dtypes:
      • At the C level, the current interface is quite cumbersome. It would be nice to have something easier for use with C/C++/Cython.
      • At a low-level, ufunc loops need access to dtype metadata (e.g., this is why we don't have ufuncs for strings in NumPy)
      • A new primitive data type for pointers would be broadly useful (e.g., for managing strings or geometric objects).
    • We need to be able to write custom dtypes in Python
      • This would be particularly useful for high-level dtypes like units or categorical, which can be written in terms of a primitive data type plus some metadata.
      • Ideally, custom dtypes would reuse existing protocols for duck arrays, e.g., __array_ufunc__ and __array_function__.
  • Mechanism for extended dtypes to go from strings to dtypes

    • Parse dtype='my_dtype[options]' into the dtype constructor somehow.
    • DSL? Are we parsing?
    • Handle conflicting names by convention (maybe raise a warning)
    • Possibly need a registration mechanism (so np.array([1, 2, 3], dtype='my_dtype') would work)
  • Scalar types should not need to be NumPy scalars?

  • Should it allow for mix-in like paradigms (say have mydtype based off of np.float64)?

  • Should we have something like isinstance(dtype, (np.float64, np.float32))?

  • Should not require every .dtype attribute to be a NumPy dtype (e.g., pandas_series.dtype == np.dtype(np.float64) currently breaks)
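To make the registration idea above concrete, here is a minimal sketch of what a string-to-dtype registry might look like. The names `register_dtype`, `parse_dtype`, and `_DTYPE_REGISTRY` are hypothetical, not existing NumPy API:

```python
import warnings

# Hypothetical registry mapping dtype names to constructors.
_DTYPE_REGISTRY = {}


def register_dtype(name, constructor):
    # Handle conflicting names by convention: warn rather than error.
    if name in _DTYPE_REGISTRY:
        warnings.warn(f"dtype name {name!r} is already registered")
    _DTYPE_REGISTRY[name] = constructor


def parse_dtype(spec):
    # Parse 'my_dtype[options]' into a call on the registered constructor.
    if '[' in spec and spec.endswith(']'):
        name, _, options = spec[:-1].partition('[')
        return _DTYPE_REGISTRY[name](options)
    return _DTYPE_REGISTRY[spec]()
```

With such a registry, `np.array([1, 2, 3], dtype='my_dtype')` could resolve the string through `parse_dtype` before falling back to the built-in dtype parser.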

Suggestion: strawman proposal for what writing a dtype should look like.

Straw implementations

Units

From Nathan, based on unyt_array

import numpy as np


class float64_with_unit(np.dtype):
    array_dtype = np.float64
    unit = None

    def __init__(self, unit):
        self.unit = unit

    def __array_ufunc_proxy__(self, ufunc, method, *input_dtypes, **kwargs):
        # do ufunc dispatch
        raise NotImplementedError

    def __array_function_proxy__(self, function, *input_dtypes, **kwargs):
        # do function dispatch
        raise NotImplementedError

    def __setstate__(self, state):
        # do pickle serialization
        raise NotImplementedError

Comments

  • Do we want to give dtypes the ability to change all functions, or just ufuncs?
  • __array_function__ should call dtype.coerce

From Ryan, out of thin air:

import numpy as np

class UnitDType(np.dtype):
    _ndarray_api = ['convert']
    
    def __init__(self, unit, baseType=np.float64):
        self._unit = unit
        self._base = baseType

    def convert(self, unit):
        # astype()?
        pass

    def __add__(self, other):
        self._check(other)
        return self._base.add(self, other)

    def __mul__(self, other):
        result = self._base.mul(self, other)
        self._update_dimensionality(other)
        return result

    def _check(self, other):
        if self._dimensions != other._dimensions:
            raise UnitError

    def _update_dimensionality(self, other):
        self._dimensions[...]


a = np.ones((5,), dtype=UnitDType('meters'))

b = np.ones((5,), dtype=UnitDType('seconds'))

a + b  # UnitError

a * b == np.ones((5,), dtype=UnitDType('meters/seconds'))

Comments

  • .astype('units[ft]') could work, but it would be nice to specify just .convert('ft')
  • __add__ etc should be handled by __array_ufunc__
  • Units here might be a specific case of something more general
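To illustrate the comment that `__add__` and friends can be handled by `__array_ufunc__`, here is a toy stand-alone sketch that routes both operators through one hook. The `Quantity` container and its unit rules are invented for illustration; it is a plain Python class, not a real dtype:

```python
import numpy as np


class UnitError(ValueError):
    pass


class Quantity:
    # Toy stand-in for an array with a unit-aware dtype.
    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=np.float64)
        self.unit = unit

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # One hook handles add, multiply, etc. instead of per-operator
        # methods like __add__ and __mul__.
        units = [getattr(x, 'unit', None) for x in inputs]
        arrays = [getattr(x, 'values', x) for x in inputs]
        if ufunc is np.add:
            if units[0] != units[1]:
                raise UnitError(f"cannot add {units[0]} and {units[1]}")
            unit = units[0]
        elif ufunc is np.multiply:
            unit = '*'.join(u for u in units if u)
        else:
            return NotImplemented
        result = getattr(ufunc, method)(*arrays, **kwargs)
        return Quantity(result, unit)

    def __add__(self, other):
        # Delegates to the ufunc, which dispatches to __array_ufunc__.
        return np.add(self, other)

    def __mul__(self, other):
        return np.multiply(self, other)
```

Here `a + b` with mismatched units raises `UnitError` and `a * b` produces a combined unit, all from a single dispatch point.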

Strings

From Stephan:

class VariableLengthString(np.LogicalDtype):
    physical_dtype = np.object
    name = 'String'

    def __array_ufunc__(self, ufunc, method, args, **kwargs):
        if any(not isinstance(a.dtype, VariableLengthString)
               for a in args):
            return NotImplemented
        physical_args = tuple(a.astype(object) for a in args)
        result = getattr(ufunc, method)(*physical_args, **kwargs)
        return result.astype(VariableLengthString)

    def __array_function__(self, func, types, args, kwargs):
        # can't do it! types only exposes type information, not dtype
        return NotImplemented

    def __dtype_promote__(self, dtypes):
        if all(d in [VariableLengthString, np.unicode_, np.string_]
               for d in dtypes):
            return VariableLengthString()
        return NotImplemented

    def __array_coerce__(self, array, casting):
        if array.dtype.kind == 'U':
            result = array.astype(object)
            result.dtype = VariableLengthString()
        elif array.dtype.kind == 'S':
            # decode as ascii? raise?
            raise NotImplementedError
        elif array.dtype.kind == 'O':
            # check that every object is a string
            raise NotImplementedError
        else:
            raise TypeError
        return result

I used LogicalDtype above to say that this is based on another dtype, so that NumPy knows how to handle it; I just want to implement a little bit on top of that.

The __array_function__ protocol doesn't work that well here because the dtype isn't explicitly provided for all of the arrays.
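To make the `__dtype_promote__` idea concrete, a hypothetical resolver could consult the protocol before falling back to NumPy's built-in rules. The `promote_dtypes` function below is invented; only `np.result_type` is real API:

```python
import numpy as np


def promote_dtypes(dtypes):
    # Ask each extended dtype for its opinion; the first answer that is
    # not NotImplemented wins.
    for d in dtypes:
        hook = getattr(d, '__dtype_promote__', None)
        if hook is not None:
            result = hook(dtypes)
            if result is not NotImplemented:
                return result
    # No extended dtype claimed the combination: fall back to NumPy's
    # built-in promotion rules.
    return np.result_type(*dtypes)
```

This mirrors how `__array_ufunc__` dispatch works: each participant gets a chance to claim the operation, and `NotImplemented` passes control to the next candidate.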

Categorical

From Joris

import numpy as np


class CategoricalDtype():
    
    def __init__(self, categories=None, ordered=False):
        self.categories = categories
        self.ordered = ordered
    
    @classmethod
    def _construct_dtype_from_string(cls, string):
        # construct a dtype instance from its string form
        pass

    def _array_constructor(self, values):
        # convert values to codes
        codes = ...
        # update self to reflect values
        return np.array(codes, dtype=self)

    def _array_repr(self):
        # override the repr of the array with this dtype
        pass

    def _validate_scalar(self):
        # validate if a scalar can be stored in the array
        pass


np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype())
np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype(categories=['Red', 'Green', 'Blue', 'Yellow']))
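The "convert values to codes" step in `_array_constructor` could look roughly like this stand-alone sketch (the `encode_categorical` helper is hypothetical, not part of any proposal):

```python
import numpy as np


def encode_categorical(values, categories=None):
    # Infer the categories from the data when none are given, as in
    # CategoricalDtype() with no arguments.
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    # Codes are small integers pointing into the categories list;
    # -1 marks a value that is not among the categories.
    codes = np.array([index.get(v, -1) for v in values], dtype=np.int8)
    return codes, list(categories)
```

The physical storage is then just an integer array, while the categories live on the dtype instance as metadata.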

Comments

  • Don't need to implement __add__, etc., due to __array_ufunc__
  • Should we limit the functions that can go into __array_function__ for dtypes? Do we need __array_function__?
  • Mixins for units -- don't want to write a separate dtype for each variation.
  • Should dtypes specify a width?

Protocols, Inheritance, Duck Typing

  • protocols - __array_ufunc__, __array_function__
  • inheritance - subclassing dtype. Probably not a good idea
  • duck dtype - what are the minimum viable methods and attributes a dtype needs?
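One way to sharpen the duck-dtype question is to write down a candidate minimum attribute set, mirroring what np.dtype exposes today. This set is a guess, not a settled answer:

```python
class DuckDtype:
    # Candidate minimum attributes, mirroring np.dtype's public surface.
    # Which of these are truly required is exactly the open question.
    name = 'duck'        # string form, e.g. for repr and dtype= parsing
    kind = 'V'           # one-character type-category code
    itemsize = 8         # bytes per element, needed for allocation
    byteorder = '='      # native byte order

    def __eq__(self, other):
        # Compare equal to other instances and to the string form,
        # matching how np.dtype comparisons behave.
        return isinstance(other, type(self)) or other == self.name

    def __hash__(self):
        return hash(self.name)
```

Anything providing these attributes could then be accepted wherever a dtype is expected, without subclassing np.dtype at all.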

Misc

Themes

  • It would be great to get past subclassing
  • It would be nice to write something in Python
  • It would be nice to be able to interoperate between different array duck types using the same dtype