ScienceBeam Parser allows you to parse scientific documents. It started out as a partial Python variation of GROBID and allows you to re-use some of its models. However, it may deviate more in the future.
This currently only supports Linux due to the binaries used (pdfalto).
Other platforms are supported via Docker.
It may also be used on other platforms without Docker, provided matching binaries are configured.
make dev-venv
There is no implicit "grobid-home" directory. The only configuration file is config.yml.
Paths may point to local or remote files. Remote files are downloaded and cached locally (URLs are assumed to be versioned, so a cached copy is not re-downloaded).
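The caching behaviour described above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function names (`is_remote_path`, `get_cache_path`, `resolve_path`) and the cache file naming scheme are assumptions made for illustration. It relies on the stated assumption that URLs are versioned, so an existing cached file is never refreshed.

```python
import hashlib
import os
import urllib.request
from urllib.parse import urlparse


def is_remote_path(path: str) -> bool:
    """Treat http(s) URLs as remote; everything else is a local path."""
    return urlparse(path).scheme in {"http", "https"}


def get_cache_path(url: str, cache_dir: str) -> str:
    """Derive a stable local file name from the URL (hypothetical scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    basename = os.path.basename(urlparse(url).path) or "download"
    return os.path.join(cache_dir, f"{digest}-{basename}")


def resolve_path(path: str, cache_dir: str) -> str:
    """Return a local path, downloading remote files once into cache_dir."""
    if not is_remote_path(path):
        return path
    local_path = get_cache_path(path, cache_dir)
    # URLs are assumed to be versioned, so an existing file is reused as-is.
    if not os.path.exists(local_path):
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(path, local_path)
    return local_path
```

Because the cache key is derived from the URL alone, publishing changed content under the same URL would leave stale files in the cache. That is why versioned URLs matter here.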
You may override config values using environment variables.
Environment variables should start with SCIENCEBEAM_PARSER__ and use __ as a section separator after that prefix.
For example, SCIENCEBEAM_PARSER__LOGGING__HANDLERS__LOG_FILE__LEVEL would override logging.handlers.log_file.level.
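The naming rule above can be sketched in a few lines of Python. This only illustrates the mapping and is not the project's own code: the helper names `env_var_to_config_keys` and `apply_env_overrides` are hypothetical, and lowercasing the key names is inferred from the example above. Values are kept as strings; any type conversion the real parser performs is not modelled here.

```python
import copy
from typing import Any, Dict, List, Mapping, Optional

ENV_PREFIX = "SCIENCEBEAM_PARSER__"
SECTION_SEPARATOR = "__"


def env_var_to_config_keys(name: str) -> Optional[List[str]]:
    """Return the nested config keys for an override variable, else None."""
    if not name.startswith(ENV_PREFIX):
        return None
    remainder = name[len(ENV_PREFIX):]
    if not remainder:
        return None
    # Assumption: config keys are the lowercased variable sections.
    return [part.lower() for part in remainder.split(SECTION_SEPARATOR)]


def apply_env_overrides(
    config: Dict[str, Any], environ: Mapping[str, str]
) -> Dict[str, Any]:
    """Return a copy of config with matching environment overrides applied."""
    result = copy.deepcopy(config)
    for name, value in environ.items():
        keys = env_var_to_config_keys(name)
        if not keys:
            continue
        section = result
        for key in keys[:-1]:
            child = section.get(key)
            if not isinstance(child, dict):
                # Create (or replace) intermediate sections as needed.
                child = {}
                section[key] = child
            section = child
        section[keys[-1]] = value
    return result
```

For instance, passing `os.environ` as `environ` would apply every variable carrying the prefix and ignore all others.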
make dev-test
make dev-start
Run the server in debug mode (including auto-reload and debug logging):
make dev-debug
Run the server with auto reload but no debug logging:
make dev-start-no-debug-logging-auto-reload
curl --fail --show-error \
--form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
--silent "http://localhost:8080/api/pdfalto"
The following output formats are supported:
output_format | description |
---|---|
raw_data | generated data (without using the model) |
data | generated data with predicted labels |
xml | using simple xml elements for predicted labels |
json | json of prediction |
curl --fail --show-error \
--form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
--silent "http://localhost:8080/api/models/header?first_page=1&last_page=1&output_format=xml"
curl --fail --show-error \
--form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
--silent "http://localhost:8080/api/models/name-header?first_page=1&last_page=1&output_format=xml"
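When calling the model endpoints from a script, it can help to build the request URL programmatically. The sketch below is an illustration only: `build_model_url` is a hypothetical helper, not part of ScienceBeam Parser. The endpoint path, query parameter names, and output formats are taken from the curl examples and the table above.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Output formats listed in the table above.
SUPPORTED_OUTPUT_FORMATS = {"raw_data", "data", "xml", "json"}


def build_model_url(
    base_url: str,
    model_name: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    output_format: Optional[str] = None,
) -> str:
    """Build a /api/models/<model> URL with optional query parameters."""
    if output_format is not None and output_format not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    params: Dict[str, str] = {}
    if first_page is not None:
        params["first_page"] = str(first_page)
    if last_page is not None:
        params["last_page"] = str(last_page)
    if output_format is not None:
        params["output_format"] = output_format
    url = f"{base_url.rstrip('/')}/api/models/{model_name}"
    return f"{url}?{urlencode(params)}" if params else url


# Reproduces the URL used in the header model example.
print(build_model_url("http://localhost:8080", "header", 1, 1, "xml"))
```

The PDF itself still has to be sent as a multipart form field named `file`, as in the curl commands.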
curl --fail --show-error \
--form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
--silent "http://localhost:8080/api/processFulltextDocument?first_page=1&last_page=1"
docker pull elifesciences/sciencebeam-parser
docker run --rm \
-p 8070:8070 \
elifesciences/sciencebeam-parser