madminer-tool/madminer-workflow-ml

Madminer workflow ML

CI/CD Status Docker pulls MIT license Code style

About

This repository defines a Machine Learning workflow that uses the Madminer package as the base for performing ML analysis. To learn more about Madminer and its ML tools, check the Madminer documentation.

The workflow requires a file of Pythia-generated, Delphes-showered events as input. This file can be produced by running the Madminer physics workflow, which was originally linked to this one as a preliminary stage.

For debugging purposes, a dummy data file is provided in the data folder.

Workflow definition

The workflow specification is composed of 4 hierarchical layers. From top to bottom:

  1. Workflow spec: description of how steps are coordinated.
  2. Shell scripts: entry points for each of the steps.
  3. MLproject rules: entry point for parametrized steps.
  4. Python scripts: set of actions to interact with Madminer.

The division into multiple layers is very useful for debugging. It gives developers an easy way to test individual steps before testing the coordination of the full workflow.

The workflow steps are coordinated as follows:

flowchart TD
    init[Init] --> sampling[Sampling]
    sampling --> train_0[Training 0]
    sampling --> train_1[Training 1]
    sampling --> train_n[Training ...]
    train_0 --> eval_0[Evaluation 0]
    train_1 --> eval_1[Evaluation 1]
    train_n --> eval_n[Evaluation ...]
    eval_0 --> plot[Plot]
    eval_1 --> plot[Plot]
    eval_n --> plot[Plot]
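
The dependency structure in the flowchart can be sketched in Python as a directed acyclic graph. This is only an illustration, not how the workflow is actually coordinated: the step names mirror the diagram, two training branches stand in for the arbitrary number the workflow supports, and the standard library `graphlib` module (Python 3.9+) is used to derive a valid execution order.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps it depends on.
# Step names mirror the flowchart; using two branches is an illustrative choice.
dependencies = {
    "sampling": {"init"},
    "train_0": {"sampling"},
    "train_1": {"sampling"},
    "eval_0": {"train_0"},
    "eval_1": {"train_1"},
    "plot": {"eval_0", "eval_1"},
}


def execution_stages(deps):
    """Group steps into stages; steps within one stage could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(dependencies)
for number, stage in enumerate(stages):
    print(number, stage)
```

Each training and evaluation branch only depends on its own predecessor, so branches are independent of each other until the final plotting step collects their results.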

MLFlow

For hyper-parameter tuning and model tracking, MLFlow has been introduced in certain parts of the workflow.

Follow the MLFlow guide to review the implications.

Formatting

All Python files are formatted using Black:

make check

Execution

When executing the workflow (either fully or in part), it is important to keep in mind that each individual step receives inputs and generates outputs. Outputs are usually files, which need to be stored temporarily so they can serve as input for later steps.

The shell script layer provides an easy way to redirect outputs to what is called a WORKDIR. The WORKDIR is just a folder where the output files of each step are stored for use by later steps. In addition, this layer provides a way of specifying the path where the project code and scripts are stored. This makes it possible to run the workflow both locally and inside Docker.
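
The WORKDIR idea can be illustrated with a small Python sketch. Everything in it is hypothetical: the `run_step` helper, the step names, and the `.out` file names are made up for illustration and do not correspond to the files the real steps produce. It only shows the pattern of one step writing into a shared folder and a later step checking for that file before running.

```python
import tempfile
from pathlib import Path


def run_step(name, workdir, inputs=()):
    """Hypothetical step: check earlier outputs in workdir, then write its own."""
    workdir = Path(workdir)
    # Fail early if a previous step has not produced an expected file.
    for required in inputs:
        if not (workdir / required).exists():
            raise FileNotFoundError(f"{name} needs {required} in {workdir}")
    output = workdir / f"{name}.out"  # made-up naming scheme
    output.write_text(f"output of {name}\n")
    return output


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp) / ".workdir"
    workdir.mkdir()
    first = run_step("sampling", workdir)
    second = run_step("training", workdir, inputs=[first.name])
    produced = sorted(path.name for path in workdir.iterdir())
    print(produced)
```

Running a step before its predecessor raises `FileNotFoundError` in this sketch, which mirrors why the real steps have to be executed in order.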

When executing the workflow there are 2 alternatives:

A) Individual steps

Individual steps can be launched using their shell script. Be aware that a step may depend on the outputs of previous steps, so the steps must be run in order.

Example:

scripts/1_sampling.sh \
    --project_path . \
    --data_file data/dummy_data.h5 \
    --input_file workflow/input.yml \
    --output_dir .workdir
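
When a step needs to be launched from Python (for instance, from a test harness), the same invocation can be assembled as an argument list. The sketch below is an assumption-laden convenience, not part of the repository: `sampling_command` is a hypothetical helper, and only the script path and flags from the example above are used. The command is printed rather than executed.

```python
import shlex


def sampling_command(
    project_path=".",
    data_file="data/dummy_data.h5",
    input_file="workflow/input.yml",
    output_dir=".workdir",
):
    """Build the argument list for the sampling step shown above."""
    return [
        "scripts/1_sampling.sh",
        "--project_path", project_path,
        "--data_file", data_file,
        "--input_file", input_file,
        "--output_dir", output_dir,
    ]


command = sampling_command(output_dir="/tmp/madminer_workdir")
print(shlex.join(command))

# To actually run the step from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues if any of the paths contain spaces.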

B) Coordinated

The full workflow can be launched using Yadage, a YAML specification language backed by a set of utilities for coordinating workflows. Note that defining Yadage workflows can be hard because the Yadage documentation is incomplete. To learn about undocumented Yadage features, contact Lukas Heinrich, the creator of Yadage.

Yadage requires the Docker image used as the execution environment to be available on DockerHub. To push the Docker image for this workflow, see the Docker section.

Once the Docker image has been pushed:

pip3 install yadage
make yadage-run

Docker

To build a new Docker image:

make build

To push a new Docker image, bump the VERSION number and execute:

export DOCKERUSER=<your_dockerhub_username>
export DOCKERPASS=<your_dockerhub_password>
make push