
Improve splitting arrays of time intervals #431

Open

jmosbacher opened this issue Apr 27, 2021 · 2 comments

@jmosbacher (Contributor)

What is the problem?

Currently a split is only allowed at times when no interval of data overlaps the split point. This avoids the problem of a plugin missing data from one of its dependencies that overlaps with an interval of another dependency, but it adds a lot of complexity when aligning chunks for plugins that take multiple inputs, as well as for windowed computations.
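To make the current constraint concrete, here is a minimal sketch. The `time`/`endtime` field names are illustrative assumptions, not strax's exact record layout:

```python
import numpy as np

# Toy intervals as a structured array with 'time' and 'endtime' fields.
intervals = np.array([(0, 5), (3, 8), (10, 12)],
                     dtype=[('time', np.int64), ('endtime', np.int64)])

def can_split_at(intervals, t):
    """Under the current rule, a split at time t is only allowed
    if no interval straddles t."""
    return not np.any((intervals['time'] < t) & (intervals['endtime'] > t))

print(can_split_at(intervals, 4))  # False: [3, 8) straddles t=4
print(can_split_at(intervals, 9))  # True: nothing overlaps t=9
```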

Proposed solution

A possible solution would be to split inclusively and concatenate exclusively. The rule for splitting at any given time is to include intervals that overlap the split time in both sides of the split; when concatenating two datasets, intervals are taken from each chunk only if they started within that chunk's half-open interval of validity. Intervals that overlap the split time will therefore be processed twice, but if the chunk size is reasonable the effect of one additional row on compute time should be negligible.

This approach would also eliminate the need for a special plugin type for windowing operations, since all plugins could potentially compute overlapping chunks. Each plugin can define how much overlap it wants on each side, and each of the overlapping chunks would be processed in parallel; the extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields defining the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk. This selection can be done when adjacent chunks are collected into local memory for the next processing step, ensuring that all data is included in at least one chunk.
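A rough sketch of the proposed rule. The `Chunk` container and the `time`/`endtime` field names are hypothetical, chosen for illustration rather than taken from strax's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    start: int        # half-open validity window [start, end)
    end: int
    data: np.ndarray  # structured array with 'time' and 'endtime' fields

def split_inclusive(intervals, t):
    """Split at time t; intervals overlapping t go to BOTH halves."""
    left = intervals[intervals['time'] < t]       # starts before the split
    right = intervals[intervals['endtime'] > t]   # ends after the split
    return left, right

def concat_exclusive(chunks):
    """Concatenate adjacent (possibly overlapping) chunks, taking from each
    only the intervals that START inside its validity window, so rows
    duplicated by an inclusive split are kept exactly once."""
    parts = [c.data[(c.data['time'] >= c.start) & (c.data['time'] < c.end)]
             for c in chunks]
    return np.concatenate(parts)
```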

@WenzDaniel (Collaborator) commented Apr 28, 2021

Hey Yossi, thanks, but could you maybe add an image for explanation? I think I got your idea, I just want to make sure. One question though:

> the extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields defining the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk

Do you mean we write the overlapping data twice to disk and only remove the duplicated data during loading?

@jmosbacher (Contributor, Author)

@WenzDaniel indeed, an image would explain it better. Here are the two scenarios:
[Image: strax_chunk_splitting]
[Image: strax_chunk_merging]

As for the delayed cutting of outputs scheme: if we go down that route, we should probably re-chunk before saving and use the chance to add a validation step, checking that we are not losing data on the cut when concatenating for the re-chunk step.
Alternatively, we can cut the outputs to their validity interval right after running the compute method. We would probably want a flag for this: start by validating everything, and once we see that no data would have been lost, remove the flag.
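A sketch of the second variant, cutting right after compute with an optional validation pass. The function name and the `time` field are illustrative assumptions:

```python
import numpy as np

def cut_after_compute(data, start, end, validate=False, neighbor=None):
    """Trim a plugin's output to its validity window [start, end) right
    after compute. With validate=True and the adjacent chunk's output
    supplied, check that every row we drop also appears in the neighbor,
    i.e. that the cut loses no data."""
    keep = (data['time'] >= start) & (data['time'] < end)
    if validate and neighbor is not None:
        dropped = data[~keep]
        if not np.isin(dropped['time'], neighbor['time']).all():
            raise ValueError("cut would drop rows not present in the adjacent chunk")
    return data[keep]
```

Once such a validation has run clean for a while, the `validate` flag could simply be switched off, as suggested above.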
