Microservice that transforms the tourism data of Toerisme Vlaanderen to the semantic Logies model, as described by the Logies Basis application profile, and loads it into a triplestore. It also publishes a DCAT dataset.
To add the service to your stack, add the following snippet to docker-compose.yml:

    services:
      import:
        image: redpencil/logies-data-converter-service:1.3.2
        volumes:
          - ./data/input:/input
          - ./data/output:/output
          - ./data/files:/share

- /input contains the input data for the mapping.
- /output contains intermediate files generated during the mapping. They are removed at the end of the process.
- /share contains the TTL files that are published as a DCAT dataset. If the files must be downloadable by the end user, the volume mounted here must be the same volume as the one mounted in the file-service.
The conversion tasks are configured in ./config/tasks.js. Each task has the following properties:
- enabled: boolean flag indicating whether the task is enabled
- title: title of the task, used to construct file names
- query: SQL query to fetch the input data
- mapper: function to use for mapping
- translations (optional): additional translations to be mapped. The translations object has the following properties:
  - query: function generating the SQL query to fetch the translations data. The function receives the language as parameter.
  - languages: array of languages to map
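As an illustration of the properties listed above, a task entry might look roughly like the following sketch. The table names, column names, language codes, and the mapper function with its signature and return shape are all hypothetical; the real mappers and queries are defined by the service itself.

```javascript
// Hypothetical mapper: turns one SQL record into a list of statements.
// The actual mapper signature used by the service is an assumption here.
function mapLodgings(record) {
  return [
    { subject: `lodging-${record.id}`, predicate: 'name', object: record.name }
  ];
}

const tasks = [
  {
    enabled: true,
    title: 'lodgings', // used to construct file names
    query: 'SELECT id, name FROM lodgings', // hypothetical table and columns
    mapper: mapLodgings,
    translations: {
      // Receives the language and returns the SQL query for that language
      query: (lang) => `SELECT id, name FROM lodgings_${lang}`,
      languages: ['en', 'fr', 'de']
    }
  }
];

// Only tasks with the enabled flag set are expected to be executed
const enabledTasks = tasks.filter((task) => task.enabled);
```

Because translations.query is a function rather than a fixed string, the same task definition can produce one query per entry in languages.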
The following environment variables can be configured:
- CRON_PATTERN (default: 0 0 2 * * *): cron pattern defining when the data conversion is triggered
- SQL_USER: username for the SQL database
- SQL_PASSWORD: password for the SQL database
- SQL_SERVER (default: sqldb): hostname of the SQL server
- SQL_DATABASE: name of the SQL database
- SQL_PORT (default: 1143): port to connect to the SQL database
- SQL_BATCH_SIZE (default: 10000): page size to fetch data from the SQL database
- MAPPED_PUBLIC_GRAPH (default: http://mu.semte.ch/graphs/mapped/public): graph to write public mapped data to
- MAPPED_PRIVATE_GRAPH_BASE (default: http://mu.semte.ch/graphs/mapped/private/): base URI of the graphs to write private mapped data to
- PUBLIC_GRAPH (default: http://mu.semte.ch/graphs/public): graph containing public static data
- HOST_DOMAIN (default: https://linked.toerismevlaanderen.be): host domain used as base to generate resource URIs
- DCAT_CATALOG (default: http://linked.toerismevlaanderen.be/id/catalogs/c62b30ce-7486-4199-a177-def7e1772a53): URI of the Toerisme Vlaanderen DCAT catalog
- DCAT_DATASET_TYPE (default: http://linked.toerismevlaanderen.be/id/dataset-types/ca82a1e3-8a7c-438e-ba37-cf36063ba060): URI of the tourist attractions dataset type
- INPUT_DIRECTORY (default: /input): directory to write input files with SQL data to
- OUTPUT_DIRECTORY (default: /output): directory to write intermediate files to. They are removed once the mapping has finished.
- PUBLICATION_DIRECTORY (default: /share): directory to write dataset TTL files to
- RUN_ON_STARTUP (default: false): whether the conversion must be triggered on startup
- LOAD_EXTERNAL_SQL_SOURCES (default: true): whether input data must be fetched from the SQL database. Can be disabled during development when input files are already provided in INPUT_DIRECTORY.
- GENERATE_STABLE_URIS (default: true): whether URIs must be stable between multiple runs. Disabling speeds up service startup during development. Must be enabled in production.
- BATCH_SIZE (default: 1000): batch size to use in SPARQL update queries
- RETRY_TIMEOUT_MS (default: 1000): number of milliseconds between two SPARQL query retries
- RECORDS_CHUNK_SIZE (default: 1000): number of records to map in one batch. Intermediate results are written to a TTL file.
- DIRECT_SPARQL_ENDPOINT (default: http://triplestore:8890/sparql): endpoint to execute SPARQL queries directly on Virtuoso instead of via mu-authorization. Typically used for data operations on temporary graphs.
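To make the defaults above concrete, the following sketch shows one way such variables could be read in Node.js. It is not the service's actual code: the helper names (intFromEnv, boolFromEnv, loadConfig) are hypothetical, only a subset of the variables is covered, and the rule that booleans are recognised only as the strings true and false is an assumption.

```javascript
// Read an integer variable, falling back to the default when unset or invalid
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name], 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Read a boolean variable; only 'true' and 'false' are recognised (assumption)
function boolFromEnv(env, name, fallback) {
  const value = env[name];
  if (value === 'true') return true;
  if (value === 'false') return false;
  return fallback;
}

// Build a config object from the environment, using the documented defaults
function loadConfig(env = process.env) {
  return {
    cronPattern: env.CRON_PATTERN || '0 0 2 * * *',
    sqlServer: env.SQL_SERVER || 'sqldb',
    sqlBatchSize: intFromEnv(env, 'SQL_BATCH_SIZE', 10000),
    runOnStartup: boolFromEnv(env, 'RUN_ON_STARTUP', false),
    generateStableUris: boolFromEnv(env, 'GENERATE_STABLE_URIS', true),
    batchSize: intFromEnv(env, 'BATCH_SIZE', 1000),
    retryTimeoutMs: intFromEnv(env, 'RETRY_TIMEOUT_MS', 1000)
  };
}
```

Passing the environment as a parameter keeps the sketch easy to try out, for example loadConfig({ RUN_ON_STARTUP: 'true' }).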
The conversion process consists of the following steps:
For each task configured in ./config/tasks.js:
- Fetch input data from the SQL database and write it to a JSON file
- Convert the JSON records to triple statements, grouped in graphs
- Write the triples to TTL files per graph
Once all tasks are executed:
- Merge the TTL files of the different tasks per graph
- Load the data in a temporary graph in the triplestore
- Update the published data in the http://mu.semte.ch/graphs/mapped/... graphs based on the data in the temporary graph
- Remove the data from the temporary graph
- Publish a DCAT dataset
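The per-task grouping and the merge per graph can be sketched in memory as follows. This is a simplified illustration rather than the service's implementation: the real process writes JSON and TTL files and talks to the triplestore, whereas the hypothetical functions chunk, runTask, and mergePerGraph below only operate on arrays, and the mapper is assumed to return objects of the form { graph, statement }.

```javascript
// Split records into chunks of a fixed size (cf. RECORDS_CHUNK_SIZE)
function chunk(records, size) {
  const chunks = [];
  for (let i = 0; i < records.length; i += size) {
    chunks.push(records.slice(i, i + size));
  }
  return chunks;
}

// Map the records of one task to statements grouped by target graph.
// `mapper` is assumed to return an array of { graph, statement } objects.
function runTask(records, mapper, chunkSize) {
  const byGraph = new Map();
  for (const part of chunk(records, chunkSize)) {
    for (const record of part) {
      for (const { graph, statement } of mapper(record)) {
        if (!byGraph.has(graph)) byGraph.set(graph, []);
        byGraph.get(graph).push(statement);
      }
    }
  }
  return byGraph;
}

// Merge the per-task results into one list of statements per graph
function mergePerGraph(taskResults) {
  const merged = new Map();
  for (const byGraph of taskResults) {
    for (const [graph, statements] of byGraph) {
      merged.set(graph, [...(merged.get(graph) || []), ...statements]);
    }
  }
  return merged;
}
```

Each entry of the merged map corresponds to what would be loaded into one graph after all tasks have run.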
For each Linked Data resource, a URI and a random UUID are generated. To ensure stable URIs across conversion runs, the identifiers from the SQL database are mapped to the Linked Data URIs. The mapping is stored in the http://mu.semte.ch/graphs/uri-mapping graph as follows:
    <resource> <http://mu.semte.ch/vocabularies/hasUuid> <generated-mu-uuid> ;
               <http://mu.semte.ch/vocabularies/hasTvlId> <generated-hash-based-on-SQL-id> .
On startup of the service, the mappings are loaded in memory to improve lookup performance during the conversion process.
To disable the generation of stable URIs, set the GENERATE_STABLE_URIS environment variable to false. This is typically useful to speed up service startup during development. The flag MUST be enabled in production.
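The lookup behind stable URIs can be sketched as follows, assuming an in-memory map keyed by a hash of the SQL identifier. The hash algorithm (SHA-256), the URI pattern, and the function names tvlId and createUriGenerator are assumptions made for illustration; the service's actual hashing and URI scheme may differ.

```javascript
const nodeCrypto = require('crypto');

// Hash of the SQL identifier, used as lookup key (algorithm is an assumption)
function tvlId(type, sqlId) {
  return nodeCrypto.createHash('sha256').update(`${type}:${sqlId}`).digest('hex');
}

// `existing` stands in for the mappings loaded from the uri-mapping graph on startup.
// With `stable` set to false, a fresh URI is generated on every call.
function createUriGenerator(baseUri, existing = new Map(), stable = true) {
  const cache = existing;
  return function generate(type, sqlId) {
    const key = tvlId(type, sqlId);
    if (stable && cache.has(key)) return cache.get(key);
    const uuid = nodeCrypto.randomUUID();
    // Hypothetical URI pattern: <base>/id/<type>/<uuid>
    const entry = { uri: `${baseUri}/id/${type}/${uuid}`, uuid };
    if (stable) cache.set(key, entry);
    return entry;
  };
}
```

With stable generation, two calls for the same type and SQL identifier return the same URI and UUID; without it, each call yields a new pair, which is why the flag must stay enabled in production.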