
Potential high memory usage at new sources rms measurements step #649

Open
ajstewart opened this issue Mar 31, 2022 · 0 comments
Labels
help wanted (Extra attention is needed) · low priority (Issue is not of immediate concern) · python (Pull requests that update Python code)

Comments


ajstewart commented Mar 31, 2022

When there is a very high number of new sources (likely single-epoch artefacts) combined with a lot of images, the new source analysis has the potential to become unwieldy.

The example error below came from a run of short-timescale images that are very susceptible to single-epoch artefacts. There were roughly 3000 images in the run, each with 1–10 measurements. Assuming half of the total measurements were single-epoch new sources, say an average of 5 per image, that is 3000 × 5 measurements that each need to be measured in the 2999 other images: just short of 45 million rms measurements required.
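The back-of-envelope arithmetic can be sketched as follows (the numbers are the rough estimates from this run, not measured values):

```python
# Rough scaling estimate for the new-source rms measurement step.
# All numbers are the approximate figures quoted above, not exact counts.
n_images = 3000              # images in the run
new_sources_per_image = 5    # assumed average of single-epoch new sources
other_images = n_images - 1  # each new source is measured in every other image

total_rms_measurements = n_images * new_sources_per_image * other_images
print(total_rms_measurements)  # 44,985,000 — just short of 45 million
```

The cost scales roughly with the square of the number of images, which is why short-timescale runs with many epochs are hit hardest.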

In particular, this became a problem at the stage of the new source analysis where the dataframes are merged after fetching the rms pixel measurements.
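The failure mode is a classic many-to-many merge blow-up: when both sides of a pandas merge repeat the join key, the join indexer grows with the product of the per-key duplicate counts, not the sum. A minimal, self-contained illustration with toy data (not the pipeline's actual dataframes):

```python
import pandas as pd

# Both frames repeat the join key "img": 3 rows on the left, 4 on the right.
left = pd.DataFrame({"img": ["a"] * 3, "source": [1, 2, 3]})
right = pd.DataFrame({"img": ["a"] * 4, "rms": [0.1, 0.2, 0.3, 0.4]})

# A left merge on a duplicated key yields the per-key cartesian product.
merged = left.merge(right, on="img", how="left")
print(len(left), len(right), len(merged))  # 3 4 12 — rows multiply (3 * 4)
```

At the scale above, that multiplication is what produces a join indexer with ~17 billion entries and the 129 GiB allocation attempt in the traceback.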

This could be reduced by addressing #327 and making the dataframes as lightweight as possible. There may also be scope to improve this dataframe stage of the new source analysis to avoid such a huge merge.
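One concrete way to make the dataframes lighter before the merge, in the spirit of #327, is to downcast numeric columns and store repeated strings as categoricals. A generic sketch (hypothetical column names, not the pipeline's actual schema):

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for a measurements table.
df = pd.DataFrame({
    "source_id": np.arange(100_000, dtype=np.int64),         # 8 bytes/value
    "image": np.tile(["img_a", "img_b"], 50_000),            # repeated strings
})

before = df.memory_usage(deep=True).sum()

# Downcast the int64 ids to the smallest integer dtype that fits (int32 here),
# and replace the repeated strings with a small dictionary plus integer codes.
df["source_id"] = pd.to_numeric(df["source_id"], downcast="integer")
df["image"] = df["image"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)  # memory drops substantially
```

Halving the width of the key columns also halves the size of the intermediate join indexers pandas allocates during the merge.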

This problem can also be mitigated by tweaking the pipeline settings: raising the new source minimum rms image threshold in the config to a high value effectively 'turns off' the new source stage. Source monitoring should probably be turned off as well. Basic association could also be employed to eliminate many-to-one and many-to-many associations.

Eventually some stages of the pipeline will have to be revisited in general to see how the pandas memory footprint can be reduced, either by refactoring or by bringing in other tools. The Dask cluster transition (#335) could also open up other avenues for how to process the data.

```
2022-03-30 20:49:24,538 new_sources INFO Starting new source analysis.
2022-03-30 21:40:15,414 runpipeline ERROR Processing error:
Unable to allocate 129. GiB for an array with shape (17295864709,) and data type int64
Traceback (most recent call last):
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/management/commands/runpipeline.py", line 340, in run_pipe
    pipeline.process_pipeline(p_run)
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/main.py", line 256, in process_pipeline
    new_sources_df = new_sources(
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/new_sources.py", line 413, in new_sources
    new_sources_df = parallel_get_rms_measurements(
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/new_sources.py", line 233, in parallel_get_rms_measurements
    df = df.merge(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 9339, in merge
    return merge(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 122, in merge
    return op.get_result()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 716, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 967, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 941, in _get_join_indexers
    return get_join_indexers(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1509, in get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)  # type: ignore[operator]
  File "pandas/_libs/join.pyx", line 101, in pandas._libs.join.left_outer_join
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 129. GiB for an array with shape (17295864709,) and data type int64
```