Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large memory footprint in astropy.io.votable.parse_single_table #8946

Open
manodeep opened this issue Jul 2, 2019 · 19 comments
Open

Large memory footprint in astropy.io.votable.parse_single_table #8946

manodeep opened this issue Jul 2, 2019 · 19 comments

Comments

@manodeep
Copy link
Contributor

manodeep commented Jul 2, 2019

I was trying to read two columns from an ~ 1 GB votable file (demo Gaia dr2 data). The file itself contains ~ 96 columns. The code I used was:

from astropy.io.votable import parse_single_table
columns = ['phot_g_mean_mag', 'parallax']
table = parse_single_table("async_20190630210155.vot", columns=columns)
print("Done reading table")

Here's the file info:

$ ls -alh async_20190630210155.vot 
-rw-rw-rw- 1 msinha 1195219923 1.1G Jul  1 14:01 async_20190630210155.vot

Looking at the memory footprint, I saw that python was taking ~12 GB during the read and I cancelled the kernel (this is within a Jupyter notebook). Here's my screenshot showing the memory usage:
Screen Shot 2019-07-02 at 10 13 50 am

While I know that there is a significant python overhead, it still seems like a lot of memory to read only 2 columns (out of 96). By my math, the (minimum) possible size is 2/96*1 GB ~ 0.02 GB

Since I am new to both astropy and votables, perhaps I am doing something incorrectly. Happy to provide further info or help debug, as necessary. In case there is something inherently wrong with the file itself, here's a dropbox link to the file.

Cheers,
Manodeep

@manodeep
Copy link
Contributor Author

manodeep commented Jul 3, 2019

I ran the memory profiler within the jupyter notebook (new jupyter session, new browser session) to see what might be going on.

%load_ext memory_profiler

%%file mprun_demo.py
def read_votable():
    from astropy.io.votable import parse_single_table
    columns = ['phot_g_mean_mag', 'parallax']
    table = parse_single_table("async_20190630210155.vot", columns=columns)
    print("Done reading table")
    return table.to_table(use_names_over_ids=True)

from mprun_demo import read_votable
%mprun -f read_votable read_votable()

The system activity monitor had the memory usage roughly constant at 11.6 GB for the ~20 mins that I let the code run (the read did not complete, that's why the last two lines were not executed). The function memory profile accounts for less than 1 GB.

Line #    Mem usage    Increment   Line Contents
================================================
     1     53.4 MiB     53.4 MiB   def read_votable():
     2     65.6 MiB     12.2 MiB       from astropy.io.votable import parse_single_table
     3     65.6 MiB      0.0 MiB       columns = ['phot_g_mean_mag', 'parallax']
     4    796.5 MiB    730.9 MiB       table = parse_single_table("async_20190630210155.vot", columns=columns)
     5                                 print("Done reading table")
     6                                 return table.to_table(use_names_over_ids=True)

I am not sure how to include the memory profile for all the called functions, but it seems that the excessive memory is coming from some function within the votable routines.

@manodeep
Copy link
Contributor Author

manodeep commented Jul 3, 2019

For completeness, here's the output from line_profiler:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def read_votable():
     2         1        244.0    244.0      0.0      from astropy.io.votable import parse_single_table
     3         1          4.0      4.0      0.0      columns = ['phot_g_mean_mag', 'parallax']
     4         1  233305074.0 233305074.0    100.0      table = parse_single_table("async_20190630210155.vot", columns=columns)
     5                                               print("Done reading table")
     6                                               return table.to_table(use_names_over_ids=True)

Nothing surprising. All of the runtime is in the parse_single_table function.

@pllim
Copy link
Member

pllim commented Jul 9, 2019

Thanks for reporting this. Sounds like a bug, perhaps due to memory leak. 🤔

To profile the underlying calls, for the memory part, might need to call them directly, which is not very fun. For the performance part, might have to fall back to cProfile.

It all starts here:

def parse_single_table(source, **kwargs):
"""
Parses a VOTABLE_ xml file (or file-like object), reading and
returning only the first `~astropy.io.votable.tree.Table`
instance.
See `parse` for a description of the keyword arguments.
Returns
-------
votable : `~astropy.io.votable.tree.Table` object
"""
if kwargs.get('table_number') is None:
kwargs['table_number'] = 0
votable = parse(source, **kwargs)
return votable.get_first_table()

@manodeep
Copy link
Contributor Author

manodeep commented Jul 9, 2019

Thanks @pllim. Since topcat read in the file just fine, I can confirm that this is indeed a bug within astropy and not something peculiar about the file.

@bsipocz bsipocz added the Bug label Jul 10, 2019
@pllim
Copy link
Member

pllim commented Aug 13, 2019

Finally got around to downloading this file. I just fired up volint and immediately I see one of the CPU at 100% capacity (probably expected, though the CPU switches from one to the other once in a while) and memory use quickly ramped up to about over 20 GB and one point (whaaaat?) and then ramp down a bit and is staying much or less around 16-17 GB. 😱 Given how long you reported that it takes to parse the file (20 mins), I am gonna have to kill this job. Something is obviously not right...

When I kill the job, here is the traceback it spitted out before it died:

  File ".../bin/volint", line 11, in <module>
    load_entry_point('astropy==4.0.dev25415', 'console_scripts', 'volint')()
  File ".../astropy/io/votable/volint.py", line 18, in main
    table.validate(args.filename[0])
  File ".../astropy/io/votable/table.py", line 283, in validate
    votable = parse(content_buffer, verify='warn', filename=filename)
  File ".../astropy/utils/decorators.py", line 521, in wrapper
    return function(*args, **kwargs)
  File ".../astropy/io/votable/table.py", line 167, in parse
    config=config, pos=(1, 1)).parse(iterator, config)
  File ".../astropy/io/votable/tree.py", line 3391, in parse
    iterator, tag, data, config, pos)
  File ".../astropy/io/votable/tree.py", line 3320, in _add_resource
    resource.parse(self, iterator, config)
  File ".../astropy/io/votable/tree.py", line 3142, in parse
    iterator, tag, data, config, pos)
  File ".../astropy/io/votable/tree.py", line 3093, in _add_table
    table.parse(iterator, config)
  File ".../astropy/io/votable/tree.py", line 2396, in parse
    2, iterator, colnumbers, fields, config, pos)
  File ".../astropy/io/votable/tree.py", line 2631, in _parse_binary
    value, value_mask = binparse(careful_read)
  File ".../astropy/io/votable/converters.py", line 649, in binparse
    return result[0], self.is_null(result[0])

This appears to be a BINARY2 format.

Note to self: What is this undocumented _debug_python_based_parser option in table.py? 🤔

@pllim

This comment has been minimized.

@manodeep
Copy link
Contributor Author

Probably wouldn't be the case in the immigrant library ;-)

@pllim

This comment has been minimized.

@manodeep
Copy link
Contributor Author

@pllim Here you go async_subset.vot, 22MB

@pllim
Copy link
Member

pllim commented Mar 13, 2020

Thanks! However, something isn't quite right. The file isn't binary anymore. And while the parsing is instantaneous this time without using much memory, the table looks invalid.

>>> table = parse_single_table("async_subset.vot", columns=columns)         
>>> table                                                                   
<Table length=10000>
solution_id designation source_id ... datalink_url epoch_photometry_url
                                  ...                                  
   int64       object     int64   ...    object           object       
----------- ----------- --------- ... ------------ --------------------
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --

Also, it looks like columns is ignored?

@pllim
Copy link
Member

pllim commented Mar 13, 2020

Going back to the 1GB file, I ran this, walked away, came back an hour later, and amazingly it finished. Just posting this here for future reference. So for sure it did not get stuck in a loop somewhere, just eats up lots of memory and takes up to an hour to parse. 😱

$ volint async_20190630210155.vot 
Validation report for async_20190630210155.vot

5: W35: 'value' attribute required for INFO elements
<INFO name="QUERY" value=""><![CDATA[select * from gaiadr2.gaia_sou...
^

6: W35: 'value' attribute required for INFO elements
<INFO name="CAPTION" value=""><![CDATA[How to cite and acknowledge ...
^

7: W35: 'value' attribute required for INFO elements
<INFO name="PAGE" value=""></INFO>
^

8: W35: 'value' attribute required for INFO elements
<INFO name="PAGE_SIZE" value=""></INFO>
^

10: W35: 'value' attribute required for INFO elements
<INFO name="JOBNAME" value=""></INFO>
^

47: W06: Invalid UCD 'arith.ratio': Secondary word 'arith.ratio' is
  not valid as a primary word
<FIELD datatype="float" name="parallax_over_error" ucd="arith.ratio">
^

50: W50: Invalid unit string 'mas.yr**-1'
<FIELD datatype="double" name="pmra" ucd="pos.pm;pos.eq.ra" unit="m...
^

53: W50: Invalid unit string 'mas.yr**-1'
<FIELD datatype="double" name="pmra_error" ucd="stat.error;pos.pm;p...
^

56: W50: Invalid unit string 'mas.yr**-1'
<FIELD datatype="double" name="pmdec" ucd="pos.pm;pos.eq.dec" unit=...
^

59: W50: Invalid unit string 'mas.yr**-1'
<FIELD datatype="double" name="pmdec_error" ucd="stat.error;pos.pm;...
^

122: W50: Invalid unit string 'mas**-2'
<FIELD datatype="float" name="astrometric_weight_al" ucd="stat.weig...
^

125: W50: Invalid unit string 'um**-1'
<FIELD datatype="double" name="astrometric_pseudo_colour" ucd="em.w...
^

128: W06: Invalid UCD 'em.wavenumber;stat.error': Primary word
  'stat.error' is not valid as a secondary word
<FIELD datatype="double" name="astrometric_pseudo_colour_error" ucd...
^

128: W50: Invalid unit string 'um**-1'
<FIELD datatype="double" name="astrometric_pseudo_colour_error" ucd...
^

140: W06: Invalid UCD 'pos.errorEllipse;stat.max': Secondary word
  'pos.errorEllipse' is not valid as a primary word
<FIELD datatype="float" name="astrometric_sigma5d_max" ucd="pos.err...
^

155: W50: Invalid unit string ''electron'.s**-1'
<FIELD datatype="double" name="phot_g_mean_flux" ucd="phot.flux;sta...
^

158: W50: Invalid unit string ''electron'.s**-1'
<FIELD datatype="double" name="phot_g_mean_flux_error" ucd="stat.er...
^

161: W06: Invalid UCD 'arith.ratio': Secondary word 'arith.ratio' is
  not valid as a primary word
<FIELD datatype="float" name="phot_g_mean_flux_over_error" ucd="ari...
^

170: W50: Invalid unit string ''electron'.s**-1' (suppressing further
  warnings of this type...)
<FIELD datatype="double" name="phot_bp_mean_flux" ucd="phot.flux;st...
^

176: W06: Invalid UCD 'arith.ratio': Secondary word 'arith.ratio' is
  not valid as a primary word
<FIELD datatype="float" name="phot_bp_mean_flux_over_error" ucd="ar...
^

191: W06: Invalid UCD 'arith.ratio': Secondary word 'arith.ratio' is
  not valid as a primary word
<FIELD datatype="float" name="phot_rp_mean_flux_over_error" ucd="ar...
^

203: W06: Invalid UCD 'phot.color': Unknown word 'phot.color'
<FIELD datatype="float" name="bp_rp" ucd="phot.color" unit="mag">
^

206: W06: Invalid UCD 'phot.color': Unknown word 'phot.color'
<FIELD datatype="float" name="bp_g" ucd="phot.color" unit="mag">
^

209: W06: Invalid UCD 'phot.color': Unknown word 'phot.color'
<FIELD datatype="float" name="g_rp" ucd="phot.color" unit="mag">
^

281: W06: Invalid UCD 'phys.size.radius;stat.error': Primary word
  'stat.error' is not valid as a secondary word (suppressing further
  warnings of this type...)
<FIELD datatype="float" name="radius_percentile_lower" ucd="phys.si...
^

@pllim

This comment has been minimized.

@manodeep
Copy link
Contributor Author

Thanks! However, something isn't quite right. The file isn't binary anymore. And while the parsing is instantaneous this time without using much memory, the table looks invalid.

>>> table = parse_single_table("async_subset.vot", columns=columns)         
>>> table                                                                   
<Table length=10000>
solution_id designation source_id ... datalink_url epoch_photometry_url
                                  ...                                  
   int64       object     int64   ...    object           object       
----------- ----------- --------- ... ------------ --------------------
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --
         --          --        -- ...           --                   --

Also, it looks like columns is ignored?

Huh - that's weird. If I remember correctly, I had created that file on my laptop last year after reading in the entire file. Regardless, I am going to blame past me :)

Glad that you could manage to read in the entire file.

@pllim
Copy link
Member

pllim commented Mar 18, 2020

I tried memory profiling by extracting the code that I thought was relevant into its own file for the profiler to crawl through. Rename this to run_profiler.py:
run_profiler.py.txt

Then I ran the command python -m memory_profiler run_profiler.py using memory-profiler 0.55.0. It took a good few hours! Here are the results that I got but I'll have to come back and contemplate this later:

VOTableFile.parse()
Line #    Mem usage    Increment   Line Contents
================================================
   164     69.4 MiB     69.4 MiB       @profile
   165                                 def parse(self, iterator, config):
   166     69.4 MiB      0.0 MiB           config['_current_table_number'] = 0
   167                             
   168     69.5 MiB      0.2 MiB           for start, tag, data, pos in iterator:
   169     69.5 MiB      0.0 MiB               if start:
   170     69.5 MiB      0.0 MiB                   if tag == 'xml':
   171     69.5 MiB      0.0 MiB                       pass
   172     69.5 MiB      0.0 MiB                   elif tag == 'VOTABLE':
   173     69.5 MiB      0.0 MiB                       if 'version' not in data:
   174                                                     warn_or_raise(W20, W20, self.version, config, pos)
   175                                                     config['version'] = self.version
   176                                                 else:
   177     69.5 MiB      0.0 MiB                           config['version'] = self._version = data['version']
   178     69.5 MiB      0.0 MiB                           if config['version'].lower().startswith('v'):
   179                                                         warn_or_raise(
   180                                                             W29, W29, config['version'], config, pos)
   181                                                         self._version = config['version'] = config['version'][1:]  # noqa
   182     69.5 MiB      0.0 MiB                           if config['version'] not in ('1.1', '1.2', '1.3',
   183                                                                                  '1.4'):
   184                                                         vo_warn(W21, config['version'], config, pos)
   185                             
   186     69.5 MiB      0.0 MiB                       if 'xmlns' in data:
   187                                                     # Starting with VOTable 1.3, namespace URIs stop
   188                                                     # incrementing with minor version changes.  See
   189                                                     # this IVOA note for more info:
   190                                                     # http://www.ivoa.net/documents/Notes/XMLVers/20180529/
   191                                                     #
   192                                                     # If this policy is in place for major version 2,
   193                                                     # then this logic will need tweaking.
   194     69.5 MiB      0.0 MiB                           if config['version'] in ('1.3', '1.4'):
   195     69.5 MiB      0.0 MiB                               ns_version = '1.3'
   196                                                     else:
   197                                                         ns_version = config['version']
   198                                                     correct_ns = (
   199     69.5 MiB      0.0 MiB                               'http://www.ivoa.net/xml/VOTable/v{}'.format(
   200     69.5 MiB      0.0 MiB                                   ns_version))
   201     69.5 MiB      0.0 MiB                           if data['xmlns'] != correct_ns:
   202                                                         vo_warn(
   203                                                             W41, (correct_ns, data['xmlns']), config, pos)
   204                                                 else:
   205                                                     vo_warn(W42, (), config, pos)
   206                             
   207     69.5 MiB      0.0 MiB                       break
   208                                             else:
   209                                                 vo_raise(E19, (), config, pos)
   210                                     config['version_1_1_or_later'] = \
   211     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.1') >= 0
   212                                     config['version_1_2_or_later'] = \
   213     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.2') >= 0
   214                                     config['version_1_3_or_later'] = \
   215     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.3') >= 0
   216                                     config['version_1_4_or_later'] = \
   217     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.4') >= 0
   218                             
   219                                     tag_mapping = {
   220     69.5 MiB      0.0 MiB               'PARAM': self._add_param,
   221     69.5 MiB      0.0 MiB               'RESOURCE': self._add_resource,
   222     69.5 MiB      0.0 MiB               'COOSYS': self._add_coosys,
   223     69.5 MiB      0.0 MiB               'TIMESYS': self._add_timesys,
   224     69.5 MiB      0.0 MiB               'INFO': self._add_info,
   225     69.5 MiB      0.0 MiB               'DEFINITIONS': self._add_definitions,
   226     69.5 MiB      0.0 MiB               'DESCRIPTION': self._ignore_add,
   227     69.5 MiB      0.0 MiB               'GROUP': self._add_group}
   228                             
   229   4260.0 MiB      0.0 MiB           for start, tag, data, pos in iterator:
   230   4260.0 MiB      0.0 MiB               if start:
   231     69.5 MiB      0.0 MiB                   tag_mapping.get(tag, self._add_unknown_tag)(
   232   4260.0 MiB   4190.4 MiB                       iterator, tag, data, config, pos)
   233   4260.0 MiB      0.0 MiB               elif tag == 'DESCRIPTION':
   234                                             if self.description is not None:
   235                                                 warn_or_raise(W17, W17, 'VOTABLE', config, pos)
   236                                             self.description = data or None
   237                             
   238   4260.0 MiB      0.0 MiB           if not len(self.resources) and config['version_1_2_or_later']:
   239                                         warn_or_raise(W53, W53, (), config, pos)
   240                             
   241   4260.0 MiB      0.0 MiB           return self
Higher level parse()
Line #    Mem usage    Increment   Line Contents
================================================
   500     69.2 MiB     69.2 MiB   @profile
   501                             def parse(source, columns=None, invalid='exception', verify=None,
   502                                       chunk_size=tree.DEFAULT_CHUNK_SIZE, table_number=None,
   503                                       table_id=None, filename=None, unit_format=None,
   504                                       datatype_mapping=None, _debug_python_based_parser=False):
...
   585     69.2 MiB      0.0 MiB       from astropy.io.votable import conf
   586                             
   587     69.2 MiB      0.0 MiB       invalid = invalid.lower()
   588     69.2 MiB      0.0 MiB       if invalid not in ('exception', 'mask'):
   589                                     raise ValueError("accepted values of ``invalid`` are: "
   590                                                      "``'exception'`` or ``'mask'``.")
   591                             
   592     69.2 MiB      0.0 MiB       if verify is None:
   593                             
   594                                     # NOTE: since the pedantic argument isn't fully deprecated yet, we need
   595                                     # to catch the deprecation warning that occurs when accessing the
   596                                     # configuration item, but only if it is for the pedantic option in the
   597                                     # [io.votable] section.
   598     69.2 MiB      0.0 MiB           with warnings.catch_warnings():
   599     69.2 MiB      0.0 MiB               warnings.filterwarnings(
   600     69.2 MiB      0.0 MiB                   "ignore",
   601     69.2 MiB      0.0 MiB                   r"Config parameter \'pedantic\' in section \[io.votable\]",
   602     69.2 MiB      0.0 MiB                   AstropyDeprecationWarning)
   603     69.2 MiB      0.0 MiB               conf_verify_lowercase = conf.verify.lower()
   604                             
   605                                     # We need to allow verify to be booleans as strings since the
   606                                     # configuration framework doesn't make it easy/possible to have mixed
   607                                     # types.
   608     69.2 MiB      0.0 MiB           if conf_verify_lowercase in ['false', 'true']:
   609                                         verify = conf_verify_lowercase == 'true'
   610                                     else:
   611     69.2 MiB      0.0 MiB               verify = conf_verify_lowercase
   612                             
   613     69.2 MiB      0.0 MiB       if isinstance(verify, bool):
   614                                     verify = 'exception' if verify else 'warn'
   615     69.2 MiB      0.0 MiB       elif verify not in VERIFY_OPTIONS:
   616                                     raise ValueError('verify should be one of {}'.format(
   617                                         '/'.join(VERIFY_OPTIONS)))
   618                             
   619     69.2 MiB      0.0 MiB       if datatype_mapping is None:
   620     69.2 MiB      0.0 MiB           datatype_mapping = {}
   621                             
   622                                 config = {
   623     69.2 MiB      0.0 MiB           'columns': columns,
   624     69.2 MiB      0.0 MiB           'invalid': invalid,
   625     69.2 MiB      0.0 MiB           'verify': verify,
   626     69.2 MiB      0.0 MiB           'chunk_size': chunk_size,
   627     69.2 MiB      0.0 MiB           'table_number': table_number,
   628     69.2 MiB      0.0 MiB           'filename': filename,
   629     69.2 MiB      0.0 MiB           'unit_format': unit_format,
   630     69.2 MiB      0.0 MiB           'datatype_mapping': datatype_mapping
   631                                 }
   632                             
   633     69.2 MiB      0.0 MiB       if filename is None and isinstance(source, str):
   634     69.2 MiB      0.0 MiB           config['filename'] = source
   635                             
   636     69.2 MiB      0.0 MiB       with iterparser.get_xml_iterator(
   637     69.2 MiB      0.0 MiB               source,
   638     69.4 MiB      0.2 MiB               _debug_python_based_parser=_debug_python_based_parser) as iterator:
   639     69.4 MiB      0.0 MiB           return VOTableFile(
   640   4260.0 MiB   4190.6 MiB               config=config, pos=(1, 1)).parse(iterator, config)

@pllim
Copy link
Member

pllim commented Mar 23, 2020

Still drilling down to the relevant bits but I think I am getting close. Note to self: Original provided input is a gzipped file. Before gunzip is 1.1 GB. After gunzip is 2.4 GB.

$ file async_20190630210155.vot
async_20190630210155.vot: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

This gives a human-readable VOTable. Turns out the content is indeed BINARY2.

gunzip -d < async_20190630210155.vot > async_20190630210155.raw

Thanks, @jhunkeler !

@pllim
Copy link
Member

pllim commented Mar 25, 2020

Alas, it traced to iterparse.c and I lost the trail. I tried valgrind but it didn't give me anything useful. I also tried to stare the code down but my C-extension-fu is weak. Anyone else interested to pursue this is welcome to do so.

@manodeep , and just out of curiosity, how much memory did topcat use to parse that same file?

@manodeep
Copy link
Contributor Author

@pllim Don't remember but definitely remember it being absurdly high enough that typical laptops would crash. Based on that memory, I will guess in the 8-12 GB range, and less than 16 GB (that's what I have on this laptop).

@manodeep
Copy link
Contributor Author

Btw, the implementation in these lines only work for 32 bits (reference). If Py_ssize_t is 64 bit, then the implementation will fail. Just guessing based on symmetry that you need this extra line for 64-bits:

    n |= n >> 16;
    if(sizeof(Py_ssize_t) == 8) {
        n |= n >> 32;
    }

@pllim
Copy link
Member

pllim commented Mar 26, 2020

@manodeep , you are saying that topcat (which you claimed could read this catalog successfully) also used a lot of memory? Just want to be sure. If so, then maybe astropy's performance isn't as bad as I thought.

Also, do you want to submit a PR for the 64-bit fix to get the credit? 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants