General derived property columns #3584

andy-sweet · 2021-11-05T17:56:21Z

andy-sweet
Nov 5, 2021
Maintainer

In PR #3493 @jni | Juan Nunez-Iglesias suggested a different approach for encoding style values, one that moves those values closer to the property values they're derived from (and could also generalize to any derived column). Rather than use that PR as the place to start exploring ideas, I thought a separate discussion would be better.

One idea is to use pandas DataFrame multi-indexing as a way to effectively have different groups of columns (e.g. source and derived). A possible layer API, with the beginnings of an implementation, is something like the following.

class Points:
    _properties: DataFrame

    @property
    def properties(self) -> DataFrame:
        return self._properties['source']

    @properties.setter
    def properties(self, properties: DataFrameLike):
        self._properties['source'] = properties

    @property
    def face_color(self) -> ColorEncoding:
        return self._face_color

    @face_color.setter
    def face_color(self, face_color: ColorEncodingLike):
        self._properties[('derived', 'face_color')] = face_color.apply(self.properties)
        self._face_color = face_color

This way we don't expose multi-indexing directly, but still have a sensible API.

points = Points(
    data = np.random.rand(3, 2),
    properties = {'class': ['cat', 'cat', 'dog'], 'confidence': [0.1, 0.8, 0.2]},
)
points.face_color = {'property': 'class', 'colormap': {'cat': 'red', 'dog': 'blue'}}
# points._properties[('derived', 'face_color')] -> ['red', 'red', 'blue']
points.face_color = 'red'
# points._properties[('derived', 'face_color')] -> ['red', 'red', 'red']
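
As a rough illustration of the column grouping described above, the following sketch builds a two-level column index with plain pandas. The names `table` and `colormap` are illustrative only, and the encoding is reduced to a dictionary lookup; it is a sketch under those assumptions, not the implementation from the PR.

```python
import pandas as pd

# Source columns as a user would supply them.
source = pd.DataFrame(
    {'class': ['cat', 'cat', 'dog'], 'confidence': [0.1, 0.8, 0.2]}
)

# Nest the source columns under a top-level 'source' group.
table = pd.concat({'source': source}, axis=1)

# Derive a color column from 'class' and store it under 'derived'.
colormap = {'cat': 'red', 'dog': 'blue'}  # stand-in for a color encoding
table[('derived', 'face_color')] = table[('source', 'class')].map(colormap)

# Selecting a top-level group gives back a frame with plain column names,
# so the multi-index never has to be exposed to users.
print(list(table['source'].columns))              # ['class', 'confidence']
print(table[('derived', 'face_color')].tolist())  # ['red', 'red', 'blue']
```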

One benefit of an implementation like this is that indexing derived values by rows becomes very concise. For example:

class Points:
    def _view_face_colors(self) -> ArrayLike:
        return self._properties[('derived', 'face_color')][self._indices_view]

class VispyPointsLayer:
    ...
    def _on_data_changed(self, event):
        face_colors = self.layer._view_face_colors()
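
A small runnable version of that row lookup might look like the sketch below. It assumes that `_indices_view` holds integer row positions, so it uses `.iloc` to make the positional intent explicit; plain `[]` on a Series selects by label, which only coincides with position while the index is the default range. The `indices_view` array here is made up for the example.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('source', 'class'), ('derived', 'face_color')]
)
table = pd.DataFrame(
    [['cat', 'red'], ['cat', 'red'], ['dog', 'blue']], columns=columns
)

# Hypothetical positions of the points visible in the current slice.
indices_view = np.array([0, 2])

# Positional row selection on a single derived column.
view_colors = table[('derived', 'face_color')].iloc[indices_view].to_numpy()
print(view_colors.tolist())  # ['red', 'blue']
```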

I think the main problem is how to handle operations like append, refresh, and remove, in a way that can be reused by each Layer.

We can handle remove and maybe refresh by extending DataFrame and overriding its mutators to regenerate derived values as needed, which would also be great for responding to manual user changes (but might also have some dangers). One problem here is that I don't know a natural way of storing the encodings (i.e. the functions they define) in the data frame so that new values can be generated.

As _properties is a hidden implementation detail here, another alternative is to define our own PropertyTable type (which is not exposed publicly and users don't have to think about), which would store the encodings and define those operations as methods.
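
To make that alternative more concrete, here is one possible shape for such a type. Everything in it is hypothetical: the method names (`set_encoding`, `append`, `remove`, `refresh`), the choice to model an encoding as a plain callable from the source frame to a column, and the use of two separate frames instead of a multi-index. It is only meant to show how storing the encodings next to the table lets append and refresh regenerate derived values.

```python
from typing import Callable, Dict, List

import pandas as pd

# Assumption: an encoding is any callable mapping source rows to a column.
Encoding = Callable[[pd.DataFrame], pd.Series]


class PropertyTable:
    """Hypothetical private table holding source columns and encodings."""

    def __init__(self, source: pd.DataFrame):
        self._source = source.reset_index(drop=True)
        self._encodings: Dict[str, Encoding] = {}
        self._derived = pd.DataFrame(index=self._source.index)

    @property
    def derived(self) -> pd.DataFrame:
        return self._derived

    def set_encoding(self, name: str, encoding: Encoding) -> None:
        # Remember the function so new rows can be encoded later.
        self._encodings[name] = encoding
        self._derived[name] = encoding(self._source)

    def append(self, rows: pd.DataFrame) -> None:
        start = len(self._source)
        self._source = pd.concat([self._source, rows], ignore_index=True)
        new_source = self._source.iloc[start:]
        # Only the appended rows are run through the stored encodings.
        new_derived = pd.DataFrame(
            {name: enc(new_source) for name, enc in self._encodings.items()},
            index=new_source.index,
        )
        self._derived = pd.concat([self._derived, new_derived])

    def remove(self, indices: List[int]) -> None:
        # Derived values are dropped alongside their source rows.
        self._source = self._source.drop(index=indices).reset_index(drop=True)
        self._derived = self._derived.drop(index=indices).reset_index(drop=True)

    def refresh(self) -> None:
        # Recompute every derived column, e.g. after source values change.
        for name, enc in self._encodings.items():
            self._derived[name] = enc(self._source)


colormap = {'cat': 'red', 'dog': 'blue'}
props = PropertyTable(pd.DataFrame({'class': ['cat', 'cat', 'dog']}))
props.set_encoding('face_color', lambda df: df['class'].map(colormap))
props.append(pd.DataFrame({'class': ['dog']}))
props.remove([0])
print(props.derived['face_color'].tolist())  # ['red', 'blue', 'blue']
```

Because the layer would only talk to these methods, each Layer type could reuse the same append/remove/refresh logic without knowing how the columns are stored.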

I think we might need a combination of the two to get all the behavior we want, while also having a consistent implementation.

Another problem with DataFrame is that I don't think it can store a no-copy broadcasted column. I.e. you can do something like

properties[('derived', 'face_color')] = 'red'
# properties[('derived', 'face_color')] -> ['red', 'red', 'red']

but I'm pretty sure that allocates the memory needed for all three repeated 'red' values and copies the constant value in. As constants are often our default encodings, I think it's important to keep them performant, but maybe it's worth sacrificing?
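
For comparison, the sketch below contrasts a NumPy broadcast view with a pandas column assigned from a scalar. It uses float values instead of the `'red'` string so that byte counts are easy to reason about; that substitution, the column names, and the row count `n` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

n = 100_000
constant = np.array([1.0, 0.0, 0.0, 1.0])  # one RGBA color

# NumPy can repeat the row without copying: the row stride is zero and the
# view shares memory with the original 4-element array.
broadcasted = np.broadcast_to(constant, (n, 4))
print(broadcasted.strides[0])                    # 0
print(np.shares_memory(broadcasted, constant))   # True

# Assigning a scalar to a DataFrame column materializes one value per row.
df = pd.DataFrame({'confidence': np.zeros(n)})
df['alpha'] = 1.0
print(df.memory_usage(index=False)['alpha'])     # 800000 (n * 8 bytes)
```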

I also made some other comments in #3493 on where xarray might be more suitable/helpful than pandas, especially if we start to think of storing the main data using xarray too.

Anyway, I just thought I'd write some stuff down that I'd been thinking about and playing with. Other ideas and comments are very welcome.

jni · 2021-11-08T08:35:52Z

jni
Nov 8, 2021
Maintainer

> Another problem with DataFrame is that I don't think it can store a no-copy broadcasted column. I.e. you can do something like

I did a bit of experimentation locally and indeed I couldn't get pandas to not instantiate the full column, based on DataFrame.memory_usage.

> We can handle remove and maybe refresh by extending DataFrame and overriding its mutators to regenerate derived values as needed

Quite separately from this issue, @perlman and I are exploring the idea of "separating" slicing from the layer. The rough sketch of this is that the viewer (in the future, each canvas in the viewer) would combine a Layer and the Dims model to produce a SlicedLayer, which would be guaranteed to contain both the data and properties already sliced. That way all of the downstream processing can forget about indices and the like. Things like appending become a bit more complex but we think they can be managed again by methods that take in the current state of the canvas's Dims model as input.
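
A very rough sketch of that idea, for points only, might look like the following. None of these classes mirror real napari types: `Dims`, `PointsLayer`, `SlicedPoints`, and `slice_points` are hypothetical names, and selecting points by exact coordinate equality on the non-displayed axes is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np
import pandas as pd


@dataclass
class Dims:
    displayed: Tuple[int, ...]  # axes shown on the canvas
    point: Tuple[float, ...]    # current position along every axis


@dataclass
class PointsLayer:
    data: np.ndarray            # (N, D) coordinates
    properties: pd.DataFrame    # one row per point


@dataclass
class SlicedPoints:
    data: np.ndarray            # coordinates restricted to displayed axes
    properties: pd.DataFrame    # rows for the selected points only


def slice_points(layer: PointsLayer, dims: Dims) -> SlicedPoints:
    # Keep points whose non-displayed coordinates match the current position.
    mask = np.ones(len(layer.data), dtype=bool)
    for axis in range(layer.data.shape[1]):
        if axis not in dims.displayed:
            mask &= layer.data[:, axis] == dims.point[axis]
    return SlicedPoints(
        data=layer.data[mask][:, list(dims.displayed)],
        properties=layer.properties.loc[mask].reset_index(drop=True),
    )


layer = PointsLayer(
    data=np.array([[0.0, 1.0, 2.0], [0.0, 3.0, 4.0], [1.0, 5.0, 6.0]]),
    properties=pd.DataFrame({'class': ['cat', 'cat', 'dog']}),
)
sliced = slice_points(layer, Dims(displayed=(1, 2), point=(1.0, 0.0, 0.0)))
print(sliced.data.tolist())                 # [[5.0, 6.0]]
print(sliced.properties['class'].tolist())  # ['dog']
```

Downstream code that receives a `SlicedPoints` would never need the original row indices, which is the property the paragraph above is after.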

This is a huge lift that interacts quite closely with the properties+encoding work, as @sofroniewn wisely foretold at the start of this endeavour. 😂 So I'm wondering whether we should pause on this while we work on general slicing...


jni · 2021-11-08T08:37:15Z

jni
Nov 8, 2021
Maintainer

> So I'm wondering whether we should pause on this while we work on general slicing...

Having said this, I don't mind the alternative, which is to press on with this and derived columns, which should anyway be useful in the SlicedLayer class. My gut feeling is that derived columns + simpler encodings is the right path forward.
