
GSoC 2021 Improve pdf support in JabRef

Benedikt Tutzer edited this page Aug 23, 2021 · 10 revisions
Student: Benedikt Tutzer
Organization: JabRef
Primary repository: JabRef/jabref
Project name: Improve pdf support in JabRef
Project mentors: Oliver Kopp and Carl Christian Snethlage
Project page: Google Summer of Code 2021 Project Page
Status: Complete

Project summary

JabRef had only limited support for interacting with PDFs: it could read XMP metadata and open linked PDFs, but nothing more. Since PDF is a common format for sharing scientific papers, this needed to be improved. Thanks to the features implemented by Benedikt Tutzer during Google Summer of Code 2021, JabRef users can now:

  • write XMP metadata to PDFs from the command line
  • extract PDF metadata
    • by sending the PDF to JabRef's Grobid server
    • by importing embedded BibTeX files
    • by importing a verbatim BibTeX entry given on the first page of the PDF
    • by merging the metadata obtained from the methods above, either automatically or using a merge dialog
  • search the contents of all linked PDF documents

Pull requests to main branch

Project-related work

7814 CLI option to write XMP metadata to pdfs

This expands JabRef's CLI to allow users to write XMP metadata of selected entries in their database to linked PDFs.

2838 Search in PDF Files

Started in May 2017 by Linus Dietz, this PR implements a full-text search feature based on Apache Lucene. The PR was taken over by Benedikt Tutzer as part of this GSoC project. Tasks done by Benedikt:

  • Fix and update dependencies
  • Redefine what fields are indexed
  • Synchronization of the index with the bib database
    • At startup:
      • Add all PDFs to the index that were not indexed before
      • Update all index entries for PDFs that changed since they were indexed
      • Remove all index entries for PDFs that were removed
    • During use:
      • Add PDFs that are linked by the user
      • Remove PDFs that are unlinked by the user
  • Interface to search in the index
  • Presentation of search results
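
The startup synchronization listed above boils down to comparing two sets of files: the PDFs currently linked in the library and the PDFs already present in the index. The following is a minimal sketch of that comparison, not the code from the PR. The class name IndexSyncPlan, the use of file paths as keys, and the use of last-modified timestamps to detect changes are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of JabRef): plans the index changes needed at startup. */
final class IndexSyncPlan {
    final Set<String> toAdd = new TreeSet<>();
    final Set<String> toUpdate = new TreeSet<>();
    final Set<String> toRemove = new TreeSet<>();

    /**
     * @param linked  path -> last-modified time of every PDF currently linked in the library
     * @param indexed path -> last-modified time recorded when the PDF was indexed
     */
    static IndexSyncPlan compute(Map<String, Long> linked, Map<String, Long> indexed) {
        IndexSyncPlan plan = new IndexSyncPlan();
        for (Map.Entry<String, Long> file : linked.entrySet()) {
            Long indexedAt = indexed.get(file.getKey());
            if (indexedAt == null) {
                // Linked but never indexed: add it.
                plan.toAdd.add(file.getKey());
            } else if (file.getValue() > indexedAt) {
                // Modified after it was indexed: re-index it.
                plan.toUpdate.add(file.getKey());
            }
        }
        for (String path : indexed.keySet()) {
            if (!linked.containsKey(path)) {
                // Indexed but no longer linked: drop it from the index.
                plan.toRemove.add(path);
            }
        }
        return plan;
    }
}
```

The same three sets also cover the "during use" case: linking a file corresponds to a single entry in toAdd, unlinking to a single entry in toRemove.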

7931 Fix broken GroupDialog

This PR fixes an issue introduced with the full-text search feature.

7980 Fulltext Index: Only index local pdf files

This PR makes sure only local PDF files are added to the index.

7981 Improved progress indication for fulltext-index operations

This PR improves how indexing progress is presented.

7989 Improve presentation of fulltext search results

This PR improves how results are presented to the users.

7947 Implement more pdf importers

This PR adds multiple importers that can be used to determine metadata from PDF files:

  • PdfVerbatimBibTextImporter looks for a verbatim BibTeX entry on the first page of the PDF
  • PdfEmbeddedBibFileImporter looks for an embedded BibTeX file in the PDF
  • PdfGrobidMetadataImporter sends the PDF to the Web API at http://grobid.jabref.org to determine the metadata using the deep-learning library Grobid
  • PdfMergeMetadataImporter merges the metadata found by the other importers. If identifiers (DOI or ISBN) were found, metadata is fetched for the identifier as well.
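
To illustrate what merging the results of several importers can look like, here is a minimal sketch. It assumes each importer returns its result as a simple field-to-value map and that importers are ordered by priority, so the first non-blank value for a field wins. The class MetadataMergeSketch and this map-based representation are hypothetical and are not JabRef's actual data model.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not JabRef code): priority-ordered merge of importer results. */
final class MetadataMergeSketch {

    /**
     * @param candidates one field -> value map per importer, highest priority first
     * @return merged fields, where each field takes the first non-blank value found
     */
    static Map<String, String> merge(List<Map<String, String>> candidates) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> candidate : candidates) {
            for (Map.Entry<String, String> field : candidate.entrySet()) {
                String value = field.getValue();
                if (value != null && !value.isBlank()) {
                    // Keep the value from the higher-priority importer if one exists.
                    merged.putIfAbsent(field.getKey(), value);
                }
            }
        }
        return merged;
    }
}
```

With this strategy, a lower-priority importer only fills in fields that higher-priority importers left empty, which is why the order of the importers matters.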

7963 Remove DOI lookup from PdfContentImporter

As the PdfMergeMetadataImporter now looks up DOI and ISBN anyway, there is no need to do that in the individual importers any more. This PR removes the DOI lookup from the previously existing PdfContentImporter.

7929 Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...)

This implements an n-way merge dialog to allow the user to extract metadata from multiple sources and then select what metadata to store in the database.

8001 Reordered Pdf-Importer priorities

This PR reorders the priorities of the PDF importers.

8002 Preferences for Grobid

This PR makes all interaction with the Grobid server opt-in. This ensures that JabRef does not send PDFs to the web service without the user's clear intent to do so.

8003 Refactor processCitation in GrobidService to match processPdf

Follow-up that improves the unit tests.

More than core scope

7797 Added auto-key-generation task to task-progress

7804 JournalAbbreviation search feature

7907 Removed references to apache commons logging

8006 [PoC] Introduced read/write interface for preferences

This is a proof of concept to change how the passing of preferences objects is handled in JabRef.

Pull requests related to project in other repositories

The Grobid API returns TEI for most requests. We added BibTeX support for the request we use for metadata extraction.

800 Accept application/x-bibtex for processHeaderDocument
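
As an illustration of how a client can ask for BibTeX instead of TEI, the sketch below builds such a request with Java's built-in HTTP client. It assumes the Grobid REST layout in which the endpoint lives at /api/processHeaderDocument and the PDF is sent as the multipart form field "input". The class name GrobidHeaderRequestSketch, the fixed boundary string, and the base URL passed in by the caller are illustrative choices, not taken from JabRef.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not JabRef code): request BibTeX from Grobid's header extraction. */
final class GrobidHeaderRequestSketch {

    static HttpRequest build(String baseUrl, String fileName, byte[] pdfBytes) {
        // Fixed boundary for brevity; a real client should generate a random one.
        String boundary = "----sketch-boundary";
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        // Assemble the multipart body: part header, raw PDF bytes, closing boundary.
        byte[] body = new byte[head.length + pdfBytes.length + tail.length];
        System.arraycopy(head, 0, body, 0, head.length);
        System.arraycopy(pdfBytes, 0, body, head.length, pdfBytes.length);
        System.arraycopy(tail, 0, body, head.length + pdfBytes.length, tail.length);

        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/processHeaderDocument"))
                // The Accept header selects BibTeX; without it Grobid answers with TEI.
                .header("Accept", "application/x-bibtex")
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

The request can then be sent with java.net.http.HttpClient, for example HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the response body parsed as a BibTeX entry.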

Before GSoC

6469 Fix bracket collisions

6443 Implement task progress indicator (and dialog) in the toolbar

6437 Fixed entry duplication on file download

6436 Cleanup dead code

6381 Added a download checkbox to the import dialog

Statistics

Total commits: 14
Lines added: 3273
Lines removed: 505

(For commits made by Benedikt Tutzer during GSoC 2021 to JabRef's main branch only. Commits were squashed before counting.)

Blog posts

Project blogpost: July 04, 2021 – JabRef GSoC’21 Projects
