Datacatalogue

This is the Data Catalogue for the eFlows4HPC project.

This work has been supported by the eFlows4HPC project, contract #955558. This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Spain, Germany, France, Italy, Poland, Switzerland, Norway.

The project has received funding from the German Federal Ministry of Education and Research under agreement no. 16GPC016K.

The architecture documentation can be found in the arch folder.

Frontend Server for the Data Catalogue

This part is the frontend for the Data Catalogue. It provides the user interface, so that no one is forced to make HTTP calls to the API manually. Since the content is managed by the API-server, the frontend can be deployed as a static website containing only HTML, CSS and JavaScript. To keep the different pages uniform and avoid duplicate code, the static pages are generated with the jinja2 template engine.

To compile the static pages to the ./site/ directory (will be created if required), simply run

pip install -r requirements.txt
python frontend/createStatic.py
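
The page generation itself is handled by frontend/createStatic.py. The following is only a rough sketch of how jinja2-based static generation typically works; the template directory, template names and render context shown here are placeholders, not the repository's actual values:

from pathlib import Path
from jinja2 import Environment, FileSystemLoader

# Placeholder paths; the real template location and context live in frontend/createStatic.py
env = Environment(loader=FileSystemLoader("frontend/templates"))
out_dir = Path("site")
out_dir.mkdir(exist_ok=True)

for name in env.list_templates(extensions=["html"]):
    # Render each template and write the result into the static output directory
    target = out_dir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(env.get_template(name).render(title="Data Catalogue"))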

The site can then be deployed to any webserver capable of serving static files, as no other server functionality is strictly required. In a proper deployment, however, access control and TLS certificates should be considered.

For development (and only for development), an easy way to deploy a local server is

python -m http.server <localport> --directory site/

The python http.server module should not be used for deployment; it does not ensure that current security standards are met and is only intended for local testing.

API-Server for the Data Catalogue

This part is the API-server for the Data Catalogue, which provides the backend functionality.

It is implemented with FastAPI and provides API documentation via OpenAPI.

For deployment via docker, a docker image is included.

Configuration

Some server settings can be changed. This can be used during testing, so that a test API server can be launched with test data, or for deployment, if the application data or the user database is not in the default location.

These settings can be set via environment variables, changed in the apiserver/config.env file, or placed in a different .env file configured via the DATACATALOG_API_DOTENV_FILE_PATH environment variable.

At the moment, the settings are only read at launch and cannot be updated while the server is running.

Variable Name | Default Value | Description
DATACATALOG_API_DOTENV_FILE_PATH | apiserver/config.env | Location of the .env file considered at launch
DATACATALOG_APISERVER_JSON_STORAGE_PATH | ./app/data | Directory where the data (i.e. dataset info) is stored
DATACATALOG_APISERVER_USERDB_PATH | ./app/userdb.json | Location of the .json file containing the accounts
DATACATALOG_APISERVER_CLIENT_ID | (none) | Client ID for a configured OIDC server
DATACATALOG_APISERVER_CLIENT_SECRET | (none) | Client secret for a configured OIDC server
DATACATALOG_APISERVER_SERVER_METADATA_URL | (none) | Metadata URL for a configured OIDC server
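
As an illustration, a config.env that overrides the storage locations might look like the following (the paths are placeholders, not recommended values):

DATACATALOG_APISERVER_JSON_STORAGE_PATH=/srv/datacatalog/data
DATACATALOG_APISERVER_USERDB_PATH=/srv/datacatalog/userdb.json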

There is also the logging configuration to consider:

The apiserver/log_conf.yaml contains the settings for the loggers. Information on how to change these settings can be found here.
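
The file is read via uvicorn's --log-config option and typically follows the dictConfig format of the Python logging module. A minimal sketch of that format (not the repository's actual configuration) could look like this:

version: 1
disable_existing_loggers: false
formatters:
  default:
    format: "%(asctime)s %(levelname)s %(name)s %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    formatter: default
root:
  level: INFO
  handlers: [console]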

Security

Certain operations are only possible if the request is authenticated. The API has an endpoint at /token where a username/password login is possible. The endpoint returns a token which is valid for 1 hour. This token has to be provided with every API call that requires authentication. Currently, these calls are GET /me, PUT /dataset, PUT /dataset/dataset-id and DELETE /dataset/dataset-id. The passwords are stored as bcrypt hashes and are not visible to anyone.

A CLI is provided for server admins to add new users. It will soon be extended to allow direct hash entry, so that users do not have to provide their passwords in clear text.

For testing, a default userdb.json is provided with a single user "testuser" with the password "test".
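
As an illustration, a token can be obtained and used roughly as follows. This is only a sketch: it assumes the usual OAuth2 password form and a JSON access_token field, and a server running locally on port 8000.

import requests

BASE_URL = "http://localhost:8000"  # adjust to the actual deployment

# Log in with the test credentials from the default userdb.json
resp = requests.post(f"{BASE_URL}/token", data={"username": "testuser", "password": "test"})
resp.raise_for_status()
token = resp.json()["access_token"]

# Use the token for an authenticated call, e.g. GET /me
me = requests.get(f"{BASE_URL}/me", headers={"Authorization": f"Bearer {token}"})
print(me.json())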

API Documentation

If the api-server is running, you can see the documentation at <server-url>/docs or <server-url>/redoc.

These pages can also be used as a clunky frontend, allowing the authentication and execution of all api functions.

Running without docker

First ensure that your python version is 3.6 or newer.

Then, if they are not yet installed on your machine, install the requirements via pip:

pip install -r requirements.txt

To start the server, run

uvicorn apiserver:app --reload --reload-dir apiserver

while in the project root directory.

Without any other options, this starts your server on localhost:8000. The --reload --reload-dir apiserver options ensure that any changes to files in the apiserver directory cause an immediate reload of the server, which is especially useful during development. If this is not required, just omit these options.

If you want more detailed logs and/or want to store the logs in a file, add the --log-level (debug|info|...) --log-config=./apiserver/log_conf.yaml options. The details of the logging behavior can be changed via the ./apiserver/log_conf.yaml file, and most logging entries are at debug or info level.
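
For example, a full development invocation combining the reload and logging options described above could look like this:

uvicorn apiserver:app --reload --reload-dir apiserver --log-level debug --log-config=./apiserver/log_conf.yaml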

More information about uvicorn settings (including information about how to bind to other network interfaces or ports) can be found here.

Testing

First ensure that the pytest package is installed (It is included in the testing_requirements.txt).

Tests are located in the apiserver_tests directory. They can be executed by simply running pytest while in the project folder. You can also use nose for testing (also included in testing_requirements.txt); for instance, to run the tests with a coverage report in HTML format, run the following:

nosetests --with-coverage --cover-package=apiserver --cover-html

If more test files are added, they should be named with a test_ prefix and put into a similarly named folder, so that they can be auto-detected.

The context.py file helps with importing the apiserver packages, so that the tests work independently of the local python path setup.
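
A minimal new test might look roughly like this. It is only a sketch: the exact import shim depends on how context.py is written, and the endpoint used here is just the auto-generated documentation page.

# apiserver_tests/test_example.py (hypothetical)
from fastapi.testclient import TestClient

from context import apiserver  # context.py adjusts the path so the apiserver package is importable

client = TestClient(apiserver.app)

def test_docs_are_served():
    # The auto-generated OpenAPI docs should be reachable without authentication
    assert client.get("/docs").status_code == 200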

Using the docker image

Building the docker image

To build the docker image of the current version, simply run

docker build -t datacatalog-apiserver -f ./apiserver/Dockerfile .

while in the project root directory.

datacatalog-apiserver is a local tag to identify the built docker image. You can change it if you want.

Running the docker image

To run the docker image in a local container, run

docker run -d --name <container_name> -p <local_port>:8000 datacatalog-apiserver

<container_name> is the name of your container, which can be used to refer to it in other docker commands.

<local_port> is the port of your local machine, which will be forwarded to the docker container. For example, if it is set to 8080, you will be able to reach the api-server at http://localhost:8080.

For more production-ready deployments, consider using the --restart=always flag, as well as mounting a path for the data:

docker run -d --name <container_name> --restart=always -v /localvol/:/app/data/ -p <local_port>:8000 datacatalog-apiserver

Stopping the docker image

To stop the docker image, run

docker stop <container name>

Note that this will only stop the container and does not fully delete it. To delete it, run

docker rm <container name>

For more information about docker, please see the docker docs

CI/CD

The gitlab repository is set up to automatically build the datacat image and deploy it to the production and testing environments. The pipeline and jobs for this are defined in the .gitlab-ci.yml file. In general, pushes to the master branch update the testing deployment, and tags containing "stable" update the production deployment.

To avoid unneeded downtime, the VMs hosting the deployments are usually not re-created; instead, only the updated docker image and the updated configuration are uploaded to the VM. After this, the docker containers are restarted. If a full deployment is required (i.e. the VMs should be newly created), the pipeline has to be started with the variable MANUAL_FULL_DEPLOY=true. This can be done when starting the pipeline via the web interface.