Missing last chunk in CHUNK_STORE #976

Open
atamkapoor opened this issue Dec 9, 2022 · 1 comment

@atamkapoor

Arctic Version

1.80.5

Arctic Store

# ChunkStore

Platform and version

Python 3.8.5

Description of problem and/or code sample that reproduces the issue

I noticed that if I save a dataframe whose timestamps cross into the next day once converted to UTC, most functions (reverse_iterator, get_chunk_ranges, get_info, ...) don't return the chunk for the new date. The following example makes this clear (Jupyter notebook attached in the zip file):

Set Up

import pandas as pd
from arctic import Arctic, CHUNK_STORE
store = Arctic("localhost")
store.initialize_library("scratch_lib", lib_type=CHUNK_STORE)

lib = store["scratch_lib"]

Create an Index with some times that will change dates when converted to UTC

ind = pd.Index([pd.Timestamp("20121208T16:00", tz="US/Eastern"), pd.Timestamp("20121208T18:00", tz="US/Eastern"), 
                pd.Timestamp("20121208T20:00", tz="US/Eastern"), pd.Timestamp("20121208T22:00", tz="US/Eastern")], name="date")
print(ind)

Output

DatetimeIndex(['2012-12-08 16:00:00-05:00', '2012-12-08 18:00:00-05:00', '2012-12-08 20:00:00-05:00', '2012-12-08 22:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name='date', freq=None)

print(ind.tz_convert("UTC"))

Output

DatetimeIndex(['2012-12-08 21:00:00+00:00', '2012-12-08 23:00:00+00:00', '2012-12-09 01:00:00+00:00', '2012-12-09 03:00:00+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)
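To make the rollover explicit, here is a minimal sketch that uses only pandas (no Arctic involved) and counts the distinct calendar dates before and after the conversion. The variable names `eastern_dates` and `utc_dates` are just illustrative. With daily chunking on UTC timestamps, I would expect one chunk per entry in `utc_dates`.

```python
import pandas as pd

# Same four timestamps as in the example above, localized to US/Eastern.
ind = pd.DatetimeIndex(
    ["2012-12-08 16:00", "2012-12-08 18:00", "2012-12-08 20:00", "2012-12-08 22:00"],
    name="date",
).tz_localize("US/Eastern")

utc = ind.tz_convert("UTC")

# Distinct calendar dates in local time and in UTC.
eastern_dates = sorted(set(ind.date))
utc_dates = sorted(set(utc.date))

print(eastern_dates)  # a single date: 2012-12-08
print(utc_dates)      # two dates: 2012-12-08 and 2012-12-09
```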

Create dataframe, write it to the library, and read it back out

df = pd.DataFrame([1, 2, 3, 4], index=ind, columns=["col"])
lib.write("example_df", df, chunk_size="D")
df_read = lib.read("example_df")
print(df_read)

Output

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2
2012-12-09 01:00:00 3
2012-12-09 03:00:00 4

This is different from what I expected. Is this behavior expected?

lib.get_info("example_df")

Output

{'chunk_count': 1,
'len': 4,
'appended_rows': 0,
'metadata': {'columns': ['date', 'col']},
'chunker': 'date',
'chunk_size': 'D',
'serializer': 'FrameToArray'}

>> expected chunk_count = 2, not 1

list(lib.get_chunk_ranges("example_df"))

Output

[(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000')]

>> expected [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000'), (b'2012-12-09 00:00:00', b'2012-12-09 23:59:59.999000')]
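For reference, this is roughly how I derived the expected ranges. It is a standalone sketch that only mimics the byte-string format printed by get_chunk_ranges above; `expected_daily_ranges` is a hypothetical helper of mine, not part of Arctic, and I am assuming each daily chunk runs from midnight to one millisecond before the next midnight, as the output above suggests.

```python
from datetime import datetime, time, timedelta


def expected_daily_ranges(utc_timestamps):
    """Build one (start, end) byte-string pair per UTC calendar day."""
    ranges = []
    for day in sorted({ts.date() for ts in utc_timestamps}):
        start = datetime.combine(day, time.min)
        # Assumed end of chunk: one millisecond before the next midnight.
        end = start + timedelta(days=1) - timedelta(milliseconds=1)
        ranges.append((str(start).encode("ascii"), str(end).encode("ascii")))
    return ranges


# The four timestamps from the example, already converted to naive UTC.
written = [
    datetime(2012, 12, 8, 21),
    datetime(2012, 12, 8, 23),
    datetime(2012, 12, 9, 1),
    datetime(2012, 12, 9, 3),
]

ranges = expected_daily_ranges(written)
for chunk_range in ranges:
    print(chunk_range)
```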

# Print every chunk, newest first
for data in lib.reverse_iterator("example_df"):
    print(data)

Output

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2

>> expected the following:
date col
2012-12-09 01:00:00 3
2012-12-09 03:00:00 4

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2

arctic_issue_example.zip

@atamkapoor
Author

@bmoscon #384 is probably related to this issue. Aside from the simple example above, I am saving 1-minute frequency data with a chunk size of D, similar to #384. I noticed that I could not get the data for the last day, where the UTC date had rolled over to the next day, and that chunk was missing from the reverse_iterator.
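In case it helps anyone check their own symbols, below is a small sketch of how the affected days could be detected. `uncovered_utc_dates` is a hypothetical helper, not an Arctic function, and it assumes the ranges are byte strings in the format shown in the get_chunk_ranges output above. The inputs here are copied from the example rather than read from a live library.

```python
from datetime import datetime


def uncovered_utc_dates(utc_timestamps, chunk_ranges):
    """Return the UTC dates of timestamps that fall outside every chunk range."""
    parsed = [
        (datetime.fromisoformat(start.decode()), datetime.fromisoformat(end.decode()))
        for start, end in chunk_ranges
    ]
    missing = set()
    for ts in utc_timestamps:
        if not any(start <= ts <= end for start, end in parsed):
            missing.add(ts.date())
    return sorted(missing)


# Naive UTC timestamps that were written in the example.
written = [
    datetime(2012, 12, 8, 21),
    datetime(2012, 12, 8, 23),
    datetime(2012, 12, 9, 1),
    datetime(2012, 12, 9, 3),
]

# Ranges as reported in the example output.
reported = [(b"2012-12-08 00:00:00", b"2012-12-08 23:59:59.999000")]

missing = uncovered_utc_dates(written, reported)
print(missing)
```

Against a real library, `reported` would presumably be `list(lib.get_chunk_ranges("example_df"))` and `written` the index of the dataframe after converting it to UTC.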
