PDFs with GROBID 'success' but empty GROBID fulltext body lack any access option on scholar.archive.org (eg, presentation slides) #93

NeoTheThird · 2022-02-08T15:42:09Z

Edit: was "Some works archived on fatcat appear as unarchived in fatcat-scholar"

I have been unable to find a pattern, but some works archived on fatcat appear as unarchived in fatcat-scholar. They are also not shown in search results on scholar unless filter_availability=everything is set. In the fatcat search, they are listed as bright archives.

The archived files were created by the savepaper-now bot a couple of days ago, so I don't think it's an issue of invalid data or of fatcat and fatcat-scholar being out of sync.

bnewbold · 2022-02-08T21:47:06Z

What seems to be happening in this particular corner case is that GROBID parses the PDF successfully, but the "body" is empty. Here is what the "intermediate" hydrated object looks like: https://archive.org/~bnewbold/tmp/release_hneyekhayrdwvpvzdlnbzqalfu.scholar_intermediate.json

Because the body is empty, we don't create a "fulltext" sub-object in the indexed document here: https://github.com/internetarchive/fatcat-scholar/blob/master/fatcat_scholar/transform.py#L250

There are a couple of ways we could try harder here. We could link to the file even though the extracted text is empty (aka, add access options even if the fulltext object is empty). We could detect the empty GROBID body earlier in the pipeline and substitute raw extracted text ("pdftotext") at that point, so there would be at least something. We could try to improve GROBID extraction for slides, or detect that case and always use a different tool.
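To make the first two options concrete, here is a rough sketch of the fallback logic. It is not the actual `transform.py` code: `ExtractedFile`, `build_fulltext`, and all field names are hypothetical stand-ins, and the real intermediate object is shaped differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedFile:
    """Hypothetical stand-in for one archived PDF and its extraction results."""

    access_url: str
    grobid_status: str                    # e.g. "success"
    grobid_body: Optional[str] = None     # GROBID body text, may be empty
    pdftotext_raw: Optional[str] = None   # raw pdftotext output, if available


def build_fulltext(f: ExtractedFile) -> Optional[dict]:
    """Return a fulltext sub-object, or None if GROBID did not succeed."""
    if f.grobid_status != "success":
        return None

    body = (f.grobid_body or "").strip()
    source = "grobid"
    if not body:
        # Empty GROBID body (e.g. slides): fall back to raw pdftotext output.
        body = (f.pdftotext_raw or "").strip()
        source = "pdftotext" if body else "none"

    # Keep the access option even when no text could be recovered at all.
    return {"access_url": f.access_url, "body": body or None, "text_source": source}
```

With something like this, a slide deck whose GROBID body is empty would still get a fulltext object (and so an access link), with `text_source` recording where the text came from.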

Slides are in-scope for both fatcat and scholar, and it would be good to fix this. I think this would be on the backburner for me to fix in the near future, but if you (or somebody else) would like to dig in and try to improve the behavior, I would be happy to review and give pointers.

I'm going to edit the title of this issue; I hope that is ok with you.

bnewbold added the bug Something isn't working label Feb 8, 2022

bnewbold changed the title ~~Some works archived on fatcat appear as unarchived in fatcat-scholar~~ PDFs with GROBID 'success' but empty GROBID fulltext body lack any access option on scholar.archive.org (eg, presentation slides) Feb 8, 2022
