Large memory footprint in astropy.io.votable.parse_single_table #8946
Comments
I ran the memory profiler within the jupyter notebook (new jupyter session, new browser session) to see what might be going on.

%load_ext memory_profiler

%%file mprun_demo.py
def read_votable():
    from astropy.io.votable import parse_single_table
    columns = ['phot_g_mean_mag', 'parallax']
    table = parse_single_table("async_20190630210155.vot", columns=columns)
    print("Done reading table")
    return table.to_table(use_names_over_ids=True)

from mprun_demo import read_votable
%mprun -f read_votable read_votable()

The system activity monitor had the memory usage roughly constant at 11.6 GB for the ~20 mins that I let the code run (the read did not complete, that's why the last two lines were not executed). The function memory profile accounts for less than 1 GB.

Line #    Mem usage    Increment   Line Contents
================================================
     1     53.4 MiB     53.4 MiB   def read_votable():
     2     65.6 MiB     12.2 MiB       from astropy.io.votable import parse_single_table
     3     65.6 MiB      0.0 MiB       columns = ['phot_g_mean_mag', 'parallax']
     4    796.5 MiB    730.9 MiB       table = parse_single_table("async_20190630210155.vot", columns=columns)
     5                                 print("Done reading table")
     6                                 return table.to_table(use_names_over_ids=True)

I am not sure how to include the memory profile for all the called functions, but it seems that the excessive memory is coming from some function within the |
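One way to get a view across all called functions is the standard-library `tracemalloc` module. The following is only a sketch: `top_allocations` is a hypothetical helper name, and `tracemalloc` only sees memory requested through Python's allocators, so memory grabbed directly by a C extension may not show up here at all.

```python
import tracemalloc


def top_allocations(func, *args, limit=5, **kwargs):
    """Run func and report where Python-level memory was allocated.

    Returns (result, peak_bytes, stats), where stats holds the `limit`
    largest allocation sites grouped by source line.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        snapshot = tracemalloc.take_snapshot()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    stats = snapshot.statistics("lineno")[:limit]
    return result, peak, stats


def build_blobs(n):
    # Stand-in workload; in this thread it would be the VOTable read.
    return [bytes(1000) for _ in range(n)]


result, peak, stats = top_allocations(build_blobs, 1000)
print(f"peak traced memory: {peak / 2**20:.2f} MiB")
for stat in stats:
    print(stat)
```

For the case in this thread, the call would presumably look like `top_allocations(parse_single_table, "async_20190630210155.vot", columns=columns)`. If the reported peak stays far below what the system monitor shows, that would hint that the memory is allocated outside Python's allocators.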
For completeness, here's the output from

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def read_votable():
     2         1        244.0    244.0      0.0      from astropy.io.votable import parse_single_table
     3         1          4.0      4.0      0.0      columns = ['phot_g_mean_mag', 'parallax']
     4         1  233305074.0 233305074.0    100.0      table = parse_single_table("async_20190630210155.vot", columns=columns)
     5                                               print("Done reading table")
     6                                               return table.to_table(use_names_over_ids=True)

Nothing surprising. All of the runtime is in the |
Thanks for reporting this. Sounds like a bug, perhaps due to a memory leak. 🤔 To profile the underlying calls, for the memory part, we might need to call them directly, which is not very fun. For the performance part, we might have to fall back to cProfile. It all starts here: astropy/astropy/io/votable/table.py Lines 170 to 187 in 7e751f1
|
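A minimal sketch of the cProfile fallback, using only the standard library. `profile_call` is a hypothetical helper name; the workload below is a stand-in so the snippet runs without the 1 GB file.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time, which
    shows the callees where most of the time is spent.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def workload(n):
    # Stand-in for the slow call being investigated.
    return sum(i * i for i in range(n))


result, report = profile_call(workload, 100_000)
print(report)
```

For this issue the call would presumably be `profile_call(parse_single_table, "async_20190630210155.vot", columns=columns)`. Note that cProfile attributes time spent inside a C extension to the extension function as a whole, so it cannot break down what happens inside compiled code.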
Thanks @pllim. Since |
Finally got around to downloading this file. I just fired up When I kill the job, here is the traceback it spat out before it died:
This appears to be a BINARY2 format. Note to self: What is this undocumented |
Probably wouldn't be the case in the immigrant library ;-) |
@pllim Here you go async_subset.vot, 22MB |
Thanks! However, something isn't quite right. The file isn't binary anymore. And while the parsing is instantaneous this time without using much memory, the table looks invalid.

>>> table = parse_single_table("async_subset.vot", columns=columns)
>>> table
Also, it looks like |
Going back to the 1GB file, I ran this, walked away, came back an hour later, and amazingly it finished. Just posting this here for future reference. So for sure it did not get stuck in a loop somewhere, just eats up lots of memory and takes up to an hour to parse. 😱
|
Huh - that's weird. If I remember correctly, I had created that file on my laptop last year after reading in the entire file. Regardless, I am going to blame past me :) Glad that you could manage to read in the entire file. |
I tried memory profiling by extracting the code that I thought was relevant into its own file for the profiler to crawl through. Rename this to Then I ran the command

VOTableFile.parse()

Line #    Mem usage    Increment   Line Contents
================================================
   164     69.4 MiB     69.4 MiB       @profile
   165                                     def parse(self, iterator, config):
   166     69.4 MiB      0.0 MiB           config['_current_table_number'] = 0
   167
   168     69.5 MiB      0.2 MiB           for start, tag, data, pos in iterator:
   169     69.5 MiB      0.0 MiB               if start:
   170     69.5 MiB      0.0 MiB                   if tag == 'xml':
   171     69.5 MiB      0.0 MiB                       pass
   172     69.5 MiB      0.0 MiB                   elif tag == 'VOTABLE':
   173     69.5 MiB      0.0 MiB                       if 'version' not in data:
   174                                                         warn_or_raise(W20, W20, self.version, config, pos)
   175                                                         config['version'] = self.version
   176                                                     else:
   177     69.5 MiB      0.0 MiB                           config['version'] = self._version = data['version']
   178     69.5 MiB      0.0 MiB                           if config['version'].lower().startswith('v'):
   179                                                             warn_or_raise(
   180                                                                 W29, W29, config['version'], config, pos)
   181                                                             self._version = config['version'] = config['version'][1:]  # noqa
   182     69.5 MiB      0.0 MiB                           if config['version'] not in ('1.1', '1.2', '1.3',
   183                                                                                      '1.4'):
   184                                                             vo_warn(W21, config['version'], config, pos)
   185
   186     69.5 MiB      0.0 MiB                       if 'xmlns' in data:
   187                                                         # Starting with VOTable 1.3, namespace URIs stop
   188                                                         # incrementing with minor version changes. See
   189                                                         # this IVOA note for more info:
   190                                                         # http://www.ivoa.net/documents/Notes/XMLVers/20180529/
   191                                                         #
   192                                                         # If this policy is in place for major version 2,
   193                                                         # then this logic will need tweaking.
   194     69.5 MiB      0.0 MiB                           if config['version'] in ('1.3', '1.4'):
   195     69.5 MiB      0.0 MiB                               ns_version = '1.3'
   196                                                         else:
   197                                                             ns_version = config['version']
   198                                                         correct_ns = (
   199     69.5 MiB      0.0 MiB                               'http://www.ivoa.net/xml/VOTable/v{}'.format(
   200     69.5 MiB      0.0 MiB                                   ns_version))
   201     69.5 MiB      0.0 MiB                           if data['xmlns'] != correct_ns:
   202                                                             vo_warn(
   203                                                                 W41, (correct_ns, data['xmlns']), config, pos)
   204                                                     else:
   205                                                         vo_warn(W42, (), config, pos)
   206
   207     69.5 MiB      0.0 MiB                       break
   208                                                 else:
   209                                                     vo_raise(E19, (), config, pos)
   210                                         config['version_1_1_or_later'] = \
   211     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.1') >= 0
   212                                         config['version_1_2_or_later'] = \
   213     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.2') >= 0
   214                                         config['version_1_3_or_later'] = \
   215     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.3') >= 0
   216                                         config['version_1_4_or_later'] = \
   217     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.4') >= 0
   218
   219                                         tag_mapping = {
   220     69.5 MiB      0.0 MiB               'PARAM': self._add_param,
   221     69.5 MiB      0.0 MiB               'RESOURCE': self._add_resource,
   222     69.5 MiB      0.0 MiB               'COOSYS': self._add_coosys,
   223     69.5 MiB      0.0 MiB               'TIMESYS': self._add_timesys,
   224     69.5 MiB      0.0 MiB               'INFO': self._add_info,
   225     69.5 MiB      0.0 MiB               'DEFINITIONS': self._add_definitions,
   226     69.5 MiB      0.0 MiB               'DESCRIPTION': self._ignore_add,
   227     69.5 MiB      0.0 MiB               'GROUP': self._add_group}
   228
   229   4260.0 MiB      0.0 MiB           for start, tag, data, pos in iterator:
   230   4260.0 MiB      0.0 MiB               if start:
   231     69.5 MiB      0.0 MiB                   tag_mapping.get(tag, self._add_unknown_tag)(
   232   4260.0 MiB   4190.4 MiB                       iterator, tag, data, config, pos)
   233   4260.0 MiB      0.0 MiB               elif tag == 'DESCRIPTION':
   234                                                 if self.description is not None:
   235                                                     warn_or_raise(W17, W17, 'VOTABLE', config, pos)
   236                                                 self.description = data or None
   237
   238   4260.0 MiB      0.0 MiB           if not len(self.resources) and config['version_1_2_or_later']:
   239                                             warn_or_raise(W53, W53, (), config, pos)
   240
   241   4260.0 MiB      0.0 MiB           return self

Higher level parse()

Line #    Mem usage    Increment   Line Contents
================================================
   500     69.2 MiB     69.2 MiB   @profile
   501                             def parse(source, columns=None, invalid='exception', verify=None,
   502                                       chunk_size=tree.DEFAULT_CHUNK_SIZE, table_number=None,
   503                                       table_id=None, filename=None, unit_format=None,
   504                                       datatype_mapping=None, _debug_python_based_parser=False):
...
   585     69.2 MiB      0.0 MiB       from astropy.io.votable import conf
   586
   587     69.2 MiB      0.0 MiB       invalid = invalid.lower()
   588     69.2 MiB      0.0 MiB       if invalid not in ('exception', 'mask'):
   589                                     raise ValueError("accepted values of ``invalid`` are: "
   590                                                      "``'exception'`` or ``'mask'``.")
   591
   592     69.2 MiB      0.0 MiB       if verify is None:
   593
   594                                     # NOTE: since the pedantic argument isn't fully deprecated yet, we need
   595                                     # to catch the deprecation warning that occurs when accessing the
   596                                     # configuration item, but only if it is for the pedantic option in the
   597                                     # [io.votable] section.
   598     69.2 MiB      0.0 MiB           with warnings.catch_warnings():
   599     69.2 MiB      0.0 MiB               warnings.filterwarnings(
   600     69.2 MiB      0.0 MiB                   "ignore",
   601     69.2 MiB      0.0 MiB                   r"Config parameter \'pedantic\' in section \[io.votable\]",
   602     69.2 MiB      0.0 MiB                   AstropyDeprecationWarning)
   603     69.2 MiB      0.0 MiB               conf_verify_lowercase = conf.verify.lower()
   604
   605                                     # We need to allow verify to be booleans as strings since the
   606                                     # configuration framework doesn't make it easy/possible to have mixed
   607                                     # types.
   608     69.2 MiB      0.0 MiB           if conf_verify_lowercase in ['false', 'true']:
   609                                         verify = conf_verify_lowercase == 'true'
   610                                     else:
   611     69.2 MiB      0.0 MiB               verify = conf_verify_lowercase
   612
   613     69.2 MiB      0.0 MiB       if isinstance(verify, bool):
   614                                     verify = 'exception' if verify else 'warn'
   615     69.2 MiB      0.0 MiB       elif verify not in VERIFY_OPTIONS:
   616                                     raise ValueError('verify should be one of {}'.format(
   617                                         '/'.join(VERIFY_OPTIONS)))
   618
   619     69.2 MiB      0.0 MiB       if datatype_mapping is None:
   620     69.2 MiB      0.0 MiB           datatype_mapping = {}
   621
   622                                 config = {
   623     69.2 MiB      0.0 MiB           'columns': columns,
   624     69.2 MiB      0.0 MiB           'invalid': invalid,
   625     69.2 MiB      0.0 MiB           'verify': verify,
   626     69.2 MiB      0.0 MiB           'chunk_size': chunk_size,
   627     69.2 MiB      0.0 MiB           'table_number': table_number,
   628     69.2 MiB      0.0 MiB           'filename': filename,
   629     69.2 MiB      0.0 MiB           'unit_format': unit_format,
   630     69.2 MiB      0.0 MiB           'datatype_mapping': datatype_mapping
   631                                 }
   632
   633     69.2 MiB      0.0 MiB       if filename is None and isinstance(source, str):
   634     69.2 MiB      0.0 MiB           config['filename'] = source
   635
   636     69.2 MiB      0.0 MiB       with iterparser.get_xml_iterator(
   637     69.2 MiB      0.0 MiB           source,
   638     69.4 MiB      0.2 MiB           _debug_python_based_parser=_debug_python_based_parser) as iterator:
   639     69.4 MiB      0.0 MiB           return VOTableFile(
   640   4260.0 MiB   4190.6 MiB               config=config, pos=(1, 1)).parse(iterator, config)
|
Still drilling down to the relevant bits but I think I am getting close. Note to self: Original provided input is a gzipped file. Before
This gives a human-readable VOTable. Turns out the content is indeed BINARY2.
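For anyone wanting to check the serialization without unpacking the whole file by hand, here is a rough sketch using only the standard library. `detect_serialization` is a hypothetical helper, not an astropy function, and it assumes the data element (`TABLEDATA`, `BINARY`, `BINARY2` or `FITS`) appears unprefixed within the first `max_bytes` of the decompressed XML.

```python
import gzip
import re

# BINARY2 is listed before BINARY so the longer tag name wins.
_SERIALIZATION_TAG = re.compile(rb"<(TABLEDATA|BINARY2|BINARY|FITS)[\s>/]")


def detect_serialization(path, max_bytes=1 << 20):
    """Return the name of the first data serialization tag found, or None.

    Only the first max_bytes of (decompressed) content are inspected, so
    the large data stream itself is never loaded.
    """
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as fh:
        head = fh.read(max_bytes)
    match = _SERIALIZATION_TAG.search(head)
    return match.group(1).decode("ascii") if match else None
```

Usage would presumably be `detect_serialization("async_20190630210155.vot")`. The regular expression is a heuristic, not an XML parser, so a namespace-prefixed tag or a very long header would need a different approach.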
Thanks, @jhunkeler ! |
Alas, it traced to iterparse.c and I lost the trail. I tried @manodeep , and just out of curiosity, how much memory did |
@pllim Don't remember but definitely remember it being absurdly high enough that typical laptops would crash. Based on that memory, I will guess in the 8-12 GB range, and less than 16 GB (that's what I have on this laptop). |
Btw, the implementation in these lines only works for 32 bits (reference). If

n |= n >> 16;
if (sizeof(Py_ssize_t) == 8) {
    n |= n >> 32;
}
|
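To illustrate why the extra shift matters, here is a Python sketch of the usual "smear the bits, then add one" way of rounding up to a power of two. It is an assumption that the C code in question follows this pattern; `next_power_of_2` and the `wide` flag are hypothetical names used only for this demonstration.

```python
def next_power_of_2(n, wide=True):
    """Round n (>= 1) up to the nearest power of two by bit smearing.

    With wide=False only the shifts up to 16 are applied, which mimics
    the 32-bit-only version and breaks for values above 2**32.
    """
    n -= 1
    for shift in (1, 2, 4, 8, 16):
        n |= n >> shift
    if wide:
        n |= n >> 32  # needed once n can exceed 32 bits
    return n + 1


value = 2**33 + 1
print(next_power_of_2(value, wide=True) == 2**34)       # True
print(next_power_of_2(value, wide=False) == 2**34 - 3)  # True: not a power of two
```

The shifts up to 16 can only propagate the highest set bit across 32 positions, so for larger values the low bits stay unset and the result is no longer a power of two.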
@manodeep , you are saying that Also, do you want to submit a PR for the 64-bit fix to get the credit? 😄 |
I was trying to read two columns from an ~ 1 GB votable file (demo Gaia dr2 data). The file itself contains ~ 96 columns. The code I used was:
Here's the file info:
Looking at the memory footprint, I saw that python was taking ~12 GB during the read and I cancelled the kernel (this is within a Jupyter notebook). Here's my screenshot showing the memory usage:
While I know that there is a significant python overhead, it still seems like a lot of memory to read only 2 columns (out of 96). By my math, the (minimum) possible size is
2/96*1 GB ~ 0.02 GB
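Spelled out as a tiny sketch, under the crude assumption that every column takes an equal share of the file (`naive_subset_size_gb` is a hypothetical helper name):

```python
def naive_subset_size_gb(file_size_gb, n_selected, n_total):
    """Estimate the size of a column subset, assuming equal-sized columns."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return file_size_gb * n_selected / n_total


estimate = naive_subset_size_gb(1.0, 2, 96)
print(f"{estimate:.3f} GB (~{estimate * 1000:.0f} MB)")  # prints: 0.021 GB (~21 MB)
```

Against that estimate, the ~12 GB observed during the read is several hundred times larger.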
Since I am new to both astropy and votables, perhaps I am doing something incorrectly. Happy to provide further info or help debug, as necessary. In case there is something inherently wrong with the file itself, here's a dropbox link to the file.
Cheers,
Manodeep