reduce chunkstore memory footprint #747

Open · wants to merge 1 commit into master from feature/chunkstore-reduce-memory-footprint
Conversation

@TomTaylorLondon (Contributor) commented Apr 19, 2019

Two changes:

  1. Reduce memory footprint when reading data
  2. Handle duplicate columns in the filter.

Using a 1GB dataframe:

this PR: [plot_new image]

master: [plot_old image]

import numpy as np
import pandas as pd
from datetime import datetime as dt
from datetime import timedelta as td

days = 2000
secs = 15000

# Build ~1GB of test data: `secs` rows per day over `days` days.
a1 = [range(secs) for _ in range(days)]
a2 = [[dt(2000, 1, 1) + td(days=x)] * secs for x in range(days)]
a3 = [['foo'] * secs for _ in range(days)]
a4 = [np.random.rand(secs) for _ in range(days)]
a5 = [np.random.rand(secs) for _ in range(days)]
a6 = [['HOLIDAY INN WORLD CORP'] * secs for _ in range(days)]

now = dt.now()
result = []
for i in range(days):
    result.append(pd.DataFrame({'security_id': a1[i], 'date': a2[i], 'c': a3[i],
                                'd': a4[i], 'e': a5[i], 'f': a6[i]}, copy=True))
df = pd.concat(result)
print(df.shape)
print((dt.now() - now).total_seconds())  # seconds taken to build the frame
df = df.set_index(['date', 'security_id'])
print(df.memory_usage(index=True).sum() / 1e6)  # in-memory size in MB

from arctic import Arctic
import arctic
print(arctic.__file__)  # confirm which arctic checkout is being profiled

a = Arctic('localhost')
a.initialize_library('test', lib_type='ChunkStoreV1')
lib = a['test']

# Write, drop the local copy, then read back to profile read-path memory.
lib.write('test', df)
del df
df = lib.read('test')

@yschimke (Contributor):

What's the memory saving? Have you measured it? Is it 50%, i.e. one copy of the data instead of two?

Would be great to have a way to show the saving, and an automated test to avoid accidental regressions.

@bmoscon (Collaborator) left a comment:


No changes needed per se, just questions.

Review threads on arctic/serialization/numpy_arrays.py:
@@ -218,16 +222,19 @@ def deserialize(self, data, columns=None):
         if index:
             columns = columns[:]
             columns.extend(meta[INDEX])
-        if len(columns) > len(set(columns)):
-            raise Exception("Duplicate columns specified, cannot de-serialize")
+        columns = list(set(columns))
bmoscon (Collaborator):

I'm not sure I see this as a win. It seems like the caller may have a bug if they're specifying duplicate columns; we're just hiding the error now.

TomTaylorLondon (Contributor, author):

The current logic is confusing when subsetting data frames with indexes. For example, take a data frame with:
index: date, security
columns: price, volume

The logic works if the user passes ['price'], but raises a duplicate-columns error when passing ['date', 'security', 'price'].

I don't see the value in the check; it should just do the right thing.
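A minimal sketch of that surprise, assuming a local Arctic/MongoDB as in the benchmark script above (the 'demo' symbol and the column names are hypothetical, mirroring this example):

import pandas as pd
from arctic import Arctic

a = Arctic('localhost')
a.initialize_library('demo', lib_type='ChunkStoreV1')
lib = a['demo']

# Tiny frame matching the example: index (date, security), columns (price, volume).
df = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=3),
                   'security': ['A', 'B', 'C'],
                   'price': [1.0, 2.0, 3.0],
                   'volume': [10, 20, 30]}).set_index(['date', 'security'])
lib.write('demo', df)

lib.read('demo', columns=['price'])                      # works; the index comes back anyway
lib.read('demo', columns=['date', 'security', 'price'])  # on master: duplicate-columns error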

bmoscon (Collaborator):

Using pandas nomenclature, the columns and the index are separate. If there is an index, you always get it back, even if you specify a subset of columns (and even if they do not include the index columns). Maybe the documentation should be improved. If, for example, you specify price and security, you'll still get date as well as price and security, so your fix would only introduce more weirdness (in my opinion).

TomTaylorLondon (Contributor, author):

We could remove index columns from columns and then check for duplicates. This keeps the nomenclature but gives the user interface 'minimum surprise'. Alternatively, raise an error saying they have included index columns in the column list.
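A minimal sketch of that proposal (a hypothetical helper, not the PR's code):

def normalise_columns(requested, index_columns):
    """Drop index columns from the request (they come back regardless),
    then reject genuine duplicates among the remaining data columns."""
    data_columns = [c for c in requested if c not in set(index_columns)]
    if len(data_columns) > len(set(data_columns)):
        raise ValueError("Duplicate columns specified, cannot de-serialize")
    return data_columns

For example, normalise_columns(['date', 'security', 'price'], ['date', 'security']) would return ['price'] instead of raising.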

bmoscon (Collaborator):

The result would be the same though, no? You'd supply index columns and it won't complain. I foresee someone opening a bug complaining they only specified 1 of 3 index columns but still got all 3 back.

TomTaylorLondon (Contributor, author):

Sure

TomTaylorLondon (Contributor, author):

Actually, in retrospect, that means breaking the API for clients. How about we keep the fuzziness for clients and simply output a warning instead?
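A minimal sketch of the warning-based behaviour (hypothetical; assumes a module-level logger):

import logging

logger = logging.getLogger(__name__)

def dedupe_columns(columns):
    """De-duplicate the requested columns, warning instead of raising."""
    if len(columns) > len(set(columns)):
        logger.warning("Duplicate columns specified: %s; de-duplicating", columns)
    return list(set(columns))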

bmoscon (Collaborator):

OK, let's see a warning :) But I still think that get_info should change; otherwise, how would you ever know how to rid yourself of the warning?

TomTaylorLondon (Contributor, author):

Agreed on the get_info change.

bmoscon (Collaborator):

So it sounds like you just need to fix the broken tests and add the log, and we're all set :D

@TomTaylorLondon force-pushed the feature/chunkstore-reduce-memory-footprint branch from d4ccf47 to 036a89f on April 19, 2019 at 14:45.

        Returns
        -------
        pandas dataframe or series
        """
        if not data:
            return pd.DataFrame()
        if not inplace:
            data = data[:]
bmoscon (Collaborator):

There are some errors in the tests, so I'm thinking this will need to be tweaked a bit more.
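For context, a minimal sketch of the in-place idea the hunk above hints at (names hypothetical, not the PR's code): consume the chunk list while assembling the result, so at most one extra copy of the data stays alive.

import pandas as pd

def frames_to_df(frames, inplace=False):
    """Assemble a DataFrame from a list of chunk frames, optionally
    consuming the caller's list so chunks can be freed sooner."""
    if not frames:
        return pd.DataFrame()
    if not inplace:
        frames = frames[:]   # shallow copy: leave the caller's list intact
    result = pd.concat(frames, copy=False)
    del frames[:]            # drop chunk references so they can be collected
    return result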

@bmoscon (Collaborator) commented May 9, 2019

@TomTaylorLondon are you going to have the bandwidth to finish this or would you like me to resolve it?

@shashank88 (Contributor) commented:

Hi @TomTaylorLondon, any luck with this?

@bmoscon (Collaborator) commented Jul 5, 2019

@shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).

@shashank88 (Contributor) replied:

> @shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).

👍
