
Improve splitting arrays of time intervals #431

Open

jmosbacher opened this issue Apr 27, 2021 · 2 comments

@jmosbacher (Contributor)

What is the problem?

Currently a split is only allowed at times when no interval of data overlaps the split point. This avoids the problem of a plugin missing data from one of its dependencies that overlaps with an interval of another dependency, but it adds a lot of complexity when aligning chunks for plugins that take multiple inputs, as well as for windowed computations.
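To make the current constraint concrete, here is a minimal sketch. The `time`/`endtime` field names are illustrative assumptions, not strax's exact record layout:

```python
import numpy as np

# Toy intervals as a structured array with 'time' and 'endtime' fields.
intervals = np.array([(0, 5), (3, 8), (10, 12)],
                     dtype=[('time', np.int64), ('endtime', np.int64)])

def can_split_at(intervals, t):
    """Under the current rule, a split at time t is only allowed
    if no interval straddles t."""
    return not np.any((intervals['time'] < t) & (intervals['endtime'] > t))

print(can_split_at(intervals, 4))  # False: [3, 8) straddles t=4
print(can_split_at(intervals, 9))  # True: nothing overlaps t=9
```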

Proposed solution

A possible solution would be to split inclusively and concatenate exclusively. The rule for splitting at any given time is to include intervals that overlap the split time in both sides of the split; when concatenating two datasets, intervals are taken from each chunk only if they started within that chunk's half-open interval of validity. Intervals that overlap the split time will therefore be processed twice, but if the chunk size is reasonable the effect of one additional row on compute time should be negligible.

This approach would also eliminate the need for a special plugin type for windowing operations, since all plugins could potentially compute overlapping chunks. Each plugin can define how much overlap it wants on each side, and each of the overlapping chunks would be processed in parallel; the extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields defining the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk. This selection can be done when adjacent chunks are collected into local memory for the next processing step, ensuring that all data is included in at least one chunk.
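A rough sketch of the proposed rule. The `Chunk` container and the `time`/`endtime` field names are hypothetical, chosen for illustration rather than taken from strax's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Chunk:
    start: int        # half-open validity window [start, end)
    end: int
    data: np.ndarray  # structured array with 'time' and 'endtime' fields

def split_inclusive(intervals, t):
    """Split at time t; intervals overlapping t go to BOTH halves."""
    left = intervals[intervals['time'] < t]       # starts before the split
    right = intervals[intervals['endtime'] > t]   # ends after the split
    return left, right

def concat_exclusive(chunks):
    """Concatenate adjacent (possibly overlapping) chunks, taking from each
    only the intervals that START inside its validity window, so rows
    duplicated by an inclusive split are kept exactly once."""
    parts = [c.data[(c.data['time'] >= c.start) & (c.data['time'] < c.end)]
             for c in chunks]
    return np.concatenate(parts)
```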

@WenzDaniel (Collaborator) commented Apr 28, 2021

Hey Yossi, thanks, but could you maybe add an image for explanation? I think I got your idea, I just want to make sure. One question though:

> the extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields defining the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk

Do you mean we write the overlapping data twice to disk and only remove the duplicated data during loading?

@jmosbacher (Contributor, Author)

@WenzDaniel indeed, an image would explain it better. Here are the two scenarios:
[Image: strax_chunk_splitting]
[Image: strax_chunk_merging]

As for the delayed cutting of outputs scheme: if we go down that route, we should probably re-chunk before saving and use the chance to add a validation step, checking that we are not losing data on the cut when concatenating for the re-chunk step.
Alternatively, we can cut the outputs to their validity interval right after running the compute method. We would probably want a flag for this: start by validating everything, and once we see that no data would have been lost, remove the flag.
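A sketch of the second variant, cutting right after compute with an optional validation pass. The function name and the `time` field are illustrative assumptions:

```python
import numpy as np

def cut_after_compute(data, start, end, validate=False, neighbor=None):
    """Trim a plugin's output to its validity window [start, end) right
    after compute. With validate=True and the adjacent chunk's output
    supplied, check that every row we drop also appears in the neighbor,
    i.e. that the cut loses no data."""
    keep = (data['time'] >= start) & (data['time'] < end)
    if validate and neighbor is not None:
        dropped = data[~keep]
        if not np.isin(dropped['time'], neighbor['time']).all():
            raise ValueError("cut would drop rows not present in the adjacent chunk")
    return data[keep]
```

Once such a validation has run clean for a while, the `validate` flag could simply be switched off, as suggested above.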
