General derived property columns #3584
I did a bit of experimentation locally and indeed I couldn't get pandas to not instantiate the full column, based on
Quite separately from this issue, @perlman and I are exploring the idea of "separating" slicing from the layer. The rough sketch is that the viewer (in the future, each canvas in the viewer) would combine a Layer and the Dims model to produce a SlicedLayer, which would be guaranteed to contain both the data and the properties, already sliced. That way all of the downstream processing can forget about indices and the like. Things like appending become a bit more complex, but we think they can be managed, again by methods that take the current state of the canvas's Dims model as input. This is a huge lift that interacts quite closely with the properties+encoding work, as @sofroniewn wisely foretold at the start of this endeavour. 😂 So I'm wondering whether we should pause on this while we work on general slicing...
Having said this, I don't mind the alternative, which is to press on with this and derived columns, which should anyway be useful in the SlicedLayer class. My gut feeling is that derived columns + simpler encodings is the right path forward.
In PR #3493 @jni suggested a different approach for encoding style values, one that moves those values closer to the property values they're derived from (and could also generalize to any derived columns). Instead of using that PR as a place to start exploring ideas, I thought this would be a better place for the discussion.
One idea is to use pandas DataFrame multi-indexing as a way to effectively have different groups of columns (e.g. source and derived). One layer API and beginnings of an implementation is something like the following.
This way we don't expose multi-indexing directly, but still have a sensible API.
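To make the column-group idea concrete, here is a minimal sketch of how a layer could keep source and derived columns in one frame with two-level column labels while only exposing flat views. `PointsLikeLayer`, `derive`, and the `"source"`/`"derived"` group names are hypothetical, chosen for illustration; this is not napari's actual API.

```python
import pandas as pd


class PointsLikeLayer:
    """Hypothetical layer that hides multi-indexed columns behind flat views."""

    def __init__(self, properties):
        frame = pd.DataFrame(properties)
        # Put every user-supplied column under a top-level "source" group.
        frame.columns = pd.MultiIndex.from_product([["source"], frame.columns])
        self._properties = frame

    @property
    def properties(self):
        # Selecting a top-level key drops that level, so callers see flat names.
        return self._properties["source"]

    @property
    def derived(self):
        if "derived" not in self._properties.columns.get_level_values(0):
            return pd.DataFrame(index=self._properties.index)
        return self._properties["derived"]

    def derive(self, name, source_column, func):
        # Store the derived values next to the source column they come from.
        source = self._properties[("source", source_column)]
        self._properties[("derived", name)] = func(source)


layer = PointsLikeLayer({"confidence": [0.1, 0.9, 0.5]})
layer.derive(
    "face_color",
    "confidence",
    lambda s: s.map(lambda c: "red" if c > 0.4 else "blue"),
)
```

Here `layer.properties` and `layer.derived` both come back as ordinary single-level frames, so the multi-index stays an implementation detail.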
One benefit of an implementation like this is that indexing derived values by rows becomes very concise. For example:
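As a rough illustration of that row indexing, assuming a frame whose columns are grouped under hypothetical `"source"` and `"derived"` labels (the column names and values here are made up):

```python
import pandas as pd

# Dict keys that are tuples become two-level (MultiIndex) column labels.
table = pd.DataFrame(
    {
        ("source", "confidence"): [0.1, 0.9, 0.5],
        ("derived", "face_color"): ["blue", "red", "red"],
    }
)

selected = [0, 2]

# One expression picks the rows and the whole derived group at once.
derived_rows = table.loc[selected, "derived"]

# A full (group, name) key picks a single derived column for those rows.
face_colors = table.loc[selected, ("derived", "face_color")]
```

The first lookup returns a frame containing only the derived columns for the selected rows, and the second returns a single series.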
I think the main problem is how to handle operations like append, refresh, and remove, in a way that can be reused by each Layer.
We can handle remove and maybe refresh by extending `DataFrame` and overriding its mutators to regenerate derived values as needed, which would also be great for responding to manual user changes (but might also have some dangers). One problem here is that I don't know a natural way of storing the encodings (i.e. the functions they define) in the data frame so that new values can be generated.
As `_properties` is a hidden implementation detail here, another alternative is to define our own `PropertyTable` type (which is not exposed publicly and users don't have to think about), which would store the encodings and define those operations as methods.
I think we might need a combination of the two to get all the behavior we want, while also having a consistent implementation.
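A minimal sketch of what such a private `PropertyTable` could look like, assuming each encoding is just a function from the source frame to a series of derived values. The method names (`add_encoding`, `append`, `remove`, `refresh`) and the two-frame layout are assumptions for illustration, not an existing implementation.

```python
import pandas as pd


class PropertyTable:
    """Hypothetical private table that stores encodings next to the data."""

    def __init__(self, source):
        self._source = source.reset_index(drop=True)
        self._encodings = {}  # derived column name -> function(frame) -> Series
        self._derived = pd.DataFrame(index=self._source.index)

    @property
    def source(self):
        return self._source

    @property
    def derived(self):
        return self._derived

    def _encode(self, frame):
        # Apply every stored encoding to the given rows.
        columns = {name: enc(frame) for name, enc in self._encodings.items()}
        return pd.DataFrame(columns, index=frame.index)

    def add_encoding(self, name, encoding):
        self._encodings[name] = encoding
        self._derived[name] = encoding(self._source)

    def append(self, rows):
        rows = rows.reset_index(drop=True)
        # Only the new rows are encoded; existing derived values are kept.
        new_derived = self._encode(rows)
        self._source = pd.concat([self._source, rows], ignore_index=True)
        self._derived = pd.concat([self._derived, new_derived], ignore_index=True)

    def remove(self, indices):
        self._source = self._source.drop(index=indices).reset_index(drop=True)
        self._derived = self._derived.drop(index=indices).reset_index(drop=True)

    def refresh(self):
        # Regenerate every derived column, e.g. after a manual edit to the source.
        self._derived = self._encode(self._source)


table = PropertyTable(pd.DataFrame({"confidence": [0.1, 0.9]}))
table.add_encoding(
    "face_color",
    lambda df: df["confidence"].map(lambda c: "red" if c > 0.4 else "blue"),
)
table.append(pd.DataFrame({"confidence": [0.5]}))
```

Keeping the encodings in a plain dict sidesteps the question of how to store functions inside a data frame, at the cost of not reacting automatically when a user mutates the frame directly.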
Another problem with `DataFrame` is that I don't think it can store a no-copy broadcasted column. I.e. you can assign a constant to a column, but I'm pretty sure that allocates the memory needed for all three repeated `'red'` values and copies the constant value in. As constants are often our default encodings, I think it's important to keep them performant, but maybe it's worth sacrificing?
I also made some other comments in #3493 on where `xarray` might be more suitable/helpful than `pandas`, especially if we start to think of storing the main data using `xarray` too.
Anyway, I just thought I'd write some stuff down that I'd been thinking about and playing with. Other ideas and comments are very welcome.
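On the broadcast-column concern above, here is a small sketch of the difference between a NumPy broadcast view and a constant assigned to a `DataFrame` column. The row count and the `face_color` column name are arbitrary choices for illustration, and the exact number of bytes pandas uses depends on its version and string dtype.

```python
import numpy as np
import pandas as pd

n = 100_000
constant = np.array(["red"], dtype=object)

# np.broadcast_to returns a read-only view with stride 0, so every "row"
# points at the same single element and no per-row storage is allocated.
broadcast = np.broadcast_to(constant, (n,))
is_view = np.shares_memory(broadcast, constant)

# Assigning a scalar to a column gives the frame its own storage, one slot
# per row, so the reported memory grows with the number of rows.
df = pd.DataFrame(index=range(n))
df["face_color"] = "red"
column_bytes = df["face_color"].memory_usage(index=False)
```

So a constant encoding kept outside the frame can stay O(1) in memory, while the same constant stored as a column is O(n).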