Better handle IA caching #7

Open
cdrini opened this issue May 14, 2019 · 2 comments

cdrini commented May 14, 2019

We often get an error when requesting too much data; we should handle this more elegantly, because things slow WAY down when we can't cache IA metadata.

```
2019-05-14 00:09:10,175 [ERROR] Error while caching IA
Traceback (most recent call last):
  File "solr_builder_main.py", line 149, in solr_builder.solr_builder_main.LocalPostgresDataProvider.cache_ia_metadata
    for doc in self._get_lite_metadata(b, rows=batch_size)['docs']:
  File "solr_builder_main.py", line 139, in solr_builder.solr_builder_main.LocalPostgresDataProvider._get_lite_metadata
    return simplejson.loads(resp_str)['response']
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 373, in decode
    raise JSONDecodeError("Extra data", s, end, len(s))
JSONDecodeError: Extra data: line 3 column 1 - line 3 column 19998 (char 61 - 20058)
```
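
For illustration, a minimal sketch of one way the failing call could degrade gracefully: catch the decode error and retry the span with smaller requests. `_get_lite_metadata`, `batch_size`, and the `['docs']` shape come from the traceback above; `ia_cache` and the halving strategy are assumptions, not the actual fix.

```python
import simplejson

def cache_ia_metadata(self, ocaids, batch_size=1000):
    """Sketch: cache IA metadata, shrinking the request size whenever
    the response isn't valid JSON (the "Extra data" error above)."""
    i = 0
    while i < len(ocaids):
        batch = ocaids[i:i + batch_size]
        try:
            for doc in self._get_lite_metadata(batch, rows=batch_size)['docs']:
                self.ia_cache[doc['identifier']] = doc  # ia_cache is hypothetical
            i += len(batch)
        except simplejson.JSONDecodeError:
            if batch_size == 1:
                i += 1  # a single unparseable item; skip it rather than loop forever
            else:
                batch_size //= 2  # retry this span with smaller requests
```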
cdrini added this to To Do in Solr Builder on May 14, 2019
tfmorris commented

I have a fix for this on my branch. It:

  • caches all ocaids regardless of length
  • queries IA in chunks to keep the URL size manageable (see the sketch after this list)
  • doesn't retry anything that's not in the cache (because it'll just fail again, like it did the first time)
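
A minimal sketch of the chunking idea, assuming the IA search endpoint takes an OR-query of identifiers; the helper name and the length budget are illustrative, not taken from the branch:

```python
def chunk_ocaids_by_url_length(ocaids, max_query_len=4000):
    """Yield batches of ocaids whose joined OR-query stays under a
    length budget, so the request URL never grows too large.
    max_query_len is an illustrative value, not a documented IA limit."""
    batch, length = [], 0
    for ocaid in ocaids:
        cost = len(ocaid) + len(' OR ')
        if batch and length + cost > max_query_len:
            yield batch
            batch, length = [], 0
        batch.append(ocaid)
        length += cost
    if batch:
        yield batch
```

Each yielded batch would then go through `_get_lite_metadata` as a single request.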

tfmorris commented

Having said that, I don't think we should be querying an API during a bulk load operation at all. I don't think we actually need this data, but if we do, we should get it from a bulk dump from IA and read it from that file as part of the indexing process.

Requiring a low-latency network with 100% availability is too fragile.
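
To make the alternative concrete, a sketch of reading such a dump at index time, assuming a gzipped newline-delimited JSON file with one metadata record per line (the format, field names, and path are assumptions; IA's actual dump format may differ):

```python
import gzip
import simplejson

def load_ia_metadata_dump(path):
    """Load a (hypothetical) gzipped newline-delimited JSON dump of IA
    metadata into a dict keyed by identifier, so indexing never has to
    call the live API."""
    metadata = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            doc = simplejson.loads(line)
            metadata[doc['identifier']] = doc
    return metadata
```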
