
Dtype Brainstorming

Matti Picus edited this page Oct 22, 2018 · 4 revisions

Dtypes in NumPy

A brainstorming session at the SciPy 2018 sprints July 14, 2018

User Stories

Motivations

Replacing subclassing, which is quite fragile.

Dtype requirements

  • Hold data (e.g., categories, datetime64, units)

  • Needs to be able to override dtype-specific functionality:

    • Arithmetic
    • Ufuncs
    • Sorting
    • Coercion rules
  • Handle life cycle (e.g. GEOS/shapely)

  • Push API up to the ndarray

    • For example, can a unit dtype push a method .to up to the ndarray class to convert to a different unit?
    • Or can a datetime dtype push a .year or .dayofweek up to the ndarray class?
    • This can be done -- but should it be done?
  • Two use cases: writing high-level dtypes in Python and low-level dtypes in C:

    • We need new capabilities for C dtypes:
      • At the C level, the current interface is quite cumbersome. It would be nice to have something easier for use with C/C++/Cython.
      • At a low-level, ufunc loops need access to dtype metadata (e.g., this is why we don't have ufuncs for strings in NumPy)
      • A new primitive data type for pointers would be broadly useful (e.g., for managing strings or geometric objects).
    • We need to be able to write custom dtypes in Python
      • This would be particularly useful for high-level dtypes like units or categorical, which can be written in terms of a primitive data type plus some metadata.
      • Ideally, custom dtypes would reuse existing protocols for duck arrays, e.g., __array_ufunc__ and __array_function__.
  • Mechanism for extended dtypes to go from strings to dtypes

    • Parse dtype='my_dtype[options]' into the dtype constructor somehow.
    • DSL? Are we parsing?
    • Handle conflicting names by convention (maybe raise a warning)
    • Possibly need a registration mechanism (so np.array([1, 2, 3], dtype='my_dtype') would work)
  • Scalar types should not need to be NumPy scalars?

  • Should it allow for mix-in like paradigms (say have mydtype based off of np.float64)?

  • Should we have something like isinstance(dtype, (np.float64, np.float32))?

  • Should not require every .dtype attribute to be a NumPy dtype (e.g., pandas_series.dtype == np.dtype(np.float64) currently breaks)
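To make the registration idea above concrete, here is a minimal sketch of what a string-to-dtype registry might look like. The names `register_dtype`, `parse_dtype`, and `_DTYPE_REGISTRY` are hypothetical, not existing NumPy API:

```python
import warnings

# Hypothetical registry mapping dtype names to constructors.
_DTYPE_REGISTRY = {}


def register_dtype(name, constructor):
    # Handle conflicting names by convention: warn rather than error.
    if name in _DTYPE_REGISTRY:
        warnings.warn(f"dtype name {name!r} is already registered")
    _DTYPE_REGISTRY[name] = constructor


def parse_dtype(spec):
    # Parse 'my_dtype[options]' into a call on the registered constructor.
    if '[' in spec and spec.endswith(']'):
        name, _, options = spec[:-1].partition('[')
        return _DTYPE_REGISTRY[name](options)
    return _DTYPE_REGISTRY[spec]()
```

With such a registry, `np.array([1, 2, 3], dtype='my_dtype')` could resolve the string through `parse_dtype` before falling back to the built-in dtype parser.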

Suggestion: strawman proposal for what writing a dtype should look like.

Straw implementations

Units

From Nathan, based on unyt_array

import numpy as np


class float64_with_unit(np.dtype):
    array_dtype = np.float64
    unit = None

    def __init__(self, unit):
        self.unit = unit

    def __array_ufunc_proxy__(self, ufunc, method, *input_dtypes, **kwargs):
        # do ufunc dispatch
        raise NotImplementedError

    def __array_function_proxy__(self, function, *input_dtypes, **kwargs):
        # do function dispatch
        raise NotImplementedError

    def __setstate__(self, state):
        # do pickle serialization
        raise NotImplementedError

Comments

  • Do we want to give dtypes the ability to change all functions, or just ufuncs?
  • __array_function__ should call dtype.coerce

From Ryan, out of thin air:

import numpy as np

class UnitDType(np.dtype):
    _ndarray_api = ['convert']
    
    def __init__(self, unit, baseType=np.float64):
        self._unit = unit
        self._base = baseType

    def convert(self, unit):
        # astype()?
        pass

    def __add__(self, other):
        self._check(other)
        return self._base.add(self, other)

    def __mul__(self, other):
        result = self._base.mul(self, other)
        self._update_dimensionality(other)
        return result

    def _check(self, other):
        if self._dimensions != other._dimensions:
            raise UnitError

    def _update_dimensionality(self, other):
        self._dimensions[...]


a = np.ones((5,), dtype=UnitDType('meters'))

b = np.ones((5,), dtype=UnitDType('seconds'))

a + b  # UnitError

a * b == np.ones((5,), dtype=UnitDType('meters/seconds'))

Comments

  • .astype('units[ft]') could work, but it would be nice to specify just .convert('ft')
  • __add__ etc should be handled by __array_ufunc__
  • Units here might be a specific case of something more general
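To illustrate the comment that `__add__` and friends can be handled by `__array_ufunc__`, here is a toy stand-alone sketch that routes both operators through one hook. The `Quantity` container and its unit rules are invented for illustration; it is a plain Python class, not a real dtype:

```python
import numpy as np


class UnitError(ValueError):
    pass


class Quantity:
    # Toy stand-in for an array with a unit-aware dtype.
    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=np.float64)
        self.unit = unit

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # One hook handles add, multiply, etc. instead of per-operator
        # methods like __add__ and __mul__.
        units = [getattr(x, 'unit', None) for x in inputs]
        arrays = [getattr(x, 'values', x) for x in inputs]
        if ufunc is np.add:
            if units[0] != units[1]:
                raise UnitError(f"cannot add {units[0]} and {units[1]}")
            unit = units[0]
        elif ufunc is np.multiply:
            unit = '*'.join(u for u in units if u)
        else:
            return NotImplemented
        result = getattr(ufunc, method)(*arrays, **kwargs)
        return Quantity(result, unit)

    def __add__(self, other):
        # Delegates to the ufunc, which dispatches to __array_ufunc__.
        return np.add(self, other)

    def __mul__(self, other):
        return np.multiply(self, other)
```

Here `a + b` with mismatched units raises `UnitError` and `a * b` produces a combined unit, all from a single dispatch point.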

Strings

From Stephan:

class VariableLengthString(np.LogicalDtype):
    physical_dtype = np.object
    name = 'String'

    def __array_ufunc__(self, ufunc, method, args, **kwargs):
        if any(not isinstance(a.dtype, VariableLengthString)
               for a in args):
            return NotImplemented
        physical_args = tuple(a.astype(object) for a in args)
        result = getattr(ufunc, method)(*physical_args, **kwargs)
        return result.astype(VariableLengthString)

    def __array_function__(self, func, types, args, kwargs):
        # can't do it! types only exposes type information, not dtype
        return NotImplemented

    def __dtype_promote__(self, dtypes):
        if all(d in [VariableLengthString, np.unicode_, np.string_]
               for d in dtypes):
            return VariableLengthString()
        return NotImplemented

    def __array_coerce__(self, array, casting):
        if array.dtype.kind == 'U':
            result = array.astype(object)
            result.dtype = VariableLengthString()
        elif array.dtype.kind == 'S':
            # decode as ascii? raise?
            raise NotImplementedError
        elif array.dtype.kind == 'O':
            # check that every object is a string
            raise NotImplementedError
        else:
            raise TypeError
        return result

I used LogicalDtype above to say that this is based on another dtype, so that NumPy knows how to handle it; I just want to implement a little bit on top of that.

The __array_function__ protocol doesn't work that well here because the dtype isn't explicitly provided for all of the arrays.
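To make the `__dtype_promote__` idea concrete, a hypothetical resolver could consult the protocol before falling back to NumPy's built-in rules. The `promote_dtypes` function below is invented; only `np.result_type` is real API:

```python
import numpy as np


def promote_dtypes(dtypes):
    # Ask each extended dtype for its opinion; the first answer that is
    # not NotImplemented wins.
    for d in dtypes:
        hook = getattr(d, '__dtype_promote__', None)
        if hook is not None:
            result = hook(dtypes)
            if result is not NotImplemented:
                return result
    # No extended dtype claimed the combination: fall back to NumPy's
    # built-in promotion rules.
    return np.result_type(*dtypes)
```

This mirrors how `__array_ufunc__` dispatch works: each participant gets a chance to claim the operation, and `NotImplemented` passes control to the next candidate.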

Categorical

From Joris

import numpy as np


class CategoricalDtype():
    
    def __init__(self, categories=None, ordered=False):
        self.categories = categories
        self.ordered = ordered
    
    @classmethod
    def _construct_dtype_from_string(cls, string):
        # construct a dtype instance from its string form
        pass

    def _array_constructor(self, values):
        # convert values to codes
        codes = ...
        # update self to reflect values
        return np.array(codes, dtype=self)

    def _array_repr(self):
        # override the repr of the array with this dtype
        pass

    def _validate_scalar(self):
        # validate if a scalar can be stored in the array
        pass


np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype())
np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype(categories=['Red', 'Green', 'Blue', 'Yellow']))
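The "convert values to codes" step in `_array_constructor` could look roughly like this stand-alone sketch (the `encode_categorical` helper is hypothetical, not part of any proposal):

```python
import numpy as np


def encode_categorical(values, categories=None):
    # Infer the categories from the data when none are given, as in
    # CategoricalDtype() with no arguments.
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    # Codes are small integers pointing into the categories list;
    # -1 marks a value that is not among the categories.
    codes = np.array([index.get(v, -1) for v in values], dtype=np.int8)
    return codes, list(categories)
```

The physical storage is then just an integer array, while the categories live on the dtype instance as metadata.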

Comments

  • Don't need to implement __add__, etc., due to __array_ufunc__
  • Should we limit the functions that can go into __array_function__ for dtypes? Do we need __array_function__?
  • Mixins for units -- don't want to write a separate dtype for each variation.
  • Should dtypes specify a width?

Protocols, Inheritance, Duck Typing

  • protocols - __array_ufunc__, __array_function__
  • inheritance - subclassing dtype. Probably not a good idea
  • duck dtype - what are the minimum viable methods and attributes a dtype needs?
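One way to sharpen the duck-dtype question is to write down a candidate minimum attribute set, mirroring what np.dtype exposes today. This set is a guess, not a settled answer:

```python
class DuckDtype:
    # Candidate minimum attributes, mirroring np.dtype's public surface.
    # Which of these are truly required is exactly the open question.
    name = 'duck'        # string form, e.g. for repr and dtype= parsing
    kind = 'V'           # one-character type-category code
    itemsize = 8         # bytes per element, needed for allocation
    byteorder = '='      # native byte order

    def __eq__(self, other):
        # Compare equal to other instances and to the string form,
        # matching how np.dtype comparisons behave.
        return isinstance(other, type(self)) or other == self.name

    def __hash__(self):
        return hash(self.name)
```

Anything providing these attributes could then be accepted wherever a dtype is expected, without subclassing np.dtype at all.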

Misc

Themes

  • It would be great to get past subclassing
  • It would be nice to write something in Python
  • It would be nice to be able to interoperate between different array duck types using the same dtype