Workflow design choices¶

The choices described here are the ones currently implemented in the RTMet workflow. They are subject to change, and could be brought up for discussion.

Following Cylc’s best practices¶

Our workflow generally follows Cylc’s Workflow Design Guide.

Some notable exceptions are:

Self-Contained Workflows: RTMet relies on a user-wide (or system-wide) conda installation to handle most of its dependencies. This makes the workflow vulnerable to changes made to that installation from outside the workflow.
Workflow Housekeeping: Not implemented yet.
Automating Failure Recovery: Not implemented yet.

Jinja2 templating¶

Jinja2 templating is used extensively in the workflow definition file, flow.cylc. It allows text to be generated dynamically, based on the values of variables passed to the template.

Because the workflow source code in flow.cylc is essentially plain text, Jinja2 templating gives Cylc a way to support logic such as conditionals and loops without implementing a full-fledged programming language of its own.

User configuration options are passed down from the rose-suite.conf file to the workflow as Jinja2 variables. Some of these variables are used for branching logic:

Switching between input strategies¶

[scheduling]
    cycling mode = integer
    initial cycle point = 0
    [[xtriggers]]
{% if cfg__input_strategy == 'internal' %}
        catch_raw = catch_raw_internal('%(point)s', '%(workflow_run_dir)s')
{% elif cfg__input_strategy == 'local' %}
        catch_raw = catch_raw_local('%(point)s', '%(workflow_run_dir)s', {{ cfg__local_runs_dir }})
{% endif %}

Whole parts of the workflow can be enabled or disabled based on the value of a variable:

Enabling InfluxDB support¶

[scheduling]
    [[graph]]
        ...
        R1/+P3 = quantify => compute_fluxes
        +P4/P1 = quantify & compute_fluxes[-P1] => compute_fluxes
{% if cfg__toggle_influxdb %}
        R1/^ = validate_cfg => create_bucket => is_setup
        +P1/P1 = """
            annotate => upload_features
            quantify => upload_concentrations
        """
{% endif %}

Other Jinja2 variables are used to define environment variables for tasks:

Allowing the user to set the number of scans to trim¶

[runtime]
    [[trim_spectra]]
        inherit = None, CONDA_OPENMS
        script = """
            trimms ${mzml} ${n_start} ${n_end}
        """
        [[[environment]]]
            mzml = ${MAIN_RESULTS_DIR}/${RAWFILE_STEM}.mzML
            n_start = {{ cfg__trim_values[0] }}
            n_end = {{ cfg__trim_values[1] }}
        [[[meta]]]
            title = Trim Spectra
            description = """
                Remove the first `n_start` and last `n_end` scans from the mzML file. This is useful
                if the shape of the flowgram is not stable at the beginning or end of the run.
            """
            categories = bioinformatics

Rose for configuration management¶

Rose is used for its Suite Configuration capabilities. It interfaces with our workflow through the Cylc Rose plugin. In effect, workflow configuration is delegated to a separate package, since Cylc does not have it built in (yet?).

User configuration options are stored in the rose-suite.conf file at the root of the workflow directory. They are in the [template variables] section, which means they are passed down to the workflow as Jinja2 variables.

The chosen naming convention for configuration items is cfg__<item_name>. The prefix both avoids name conflicts with other variables used in the workflow and makes it clear that these are configuration items.
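To make the convention concrete, here is a minimal Python sketch that reads the cfg__ items from a [template variables] section. It is only an illustration: Rose itself does the real parsing, the file contents and values below are made up, and the sketch assumes values are written as Python literals and that the standard configparser module is a close enough approximation of the Rose format for such a simple file.

```python
import ast
import configparser

# Hypothetical rose-suite.conf contents; the values are illustrative only.
ROSE_SUITE_CONF = """
[template variables]
cfg__input_strategy='local'
cfg__local_runs_dir='/data/runs'
cfg__toggle_influxdb=True
cfg__trim_values=[10, 10]
"""


def load_template_variables(text):
    """Return the cfg__ items of [template variables] as Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["template variables"]
    return {
        key: ast.literal_eval(value)  # assumes each value is a Python literal
        for key, value in section.items()
        if key.startswith("cfg__")
    }


cfg = load_template_variables(ROSE_SUITE_CONF)
print(cfg["cfg__input_strategy"], cfg["cfg__trim_values"])
```

Filtering on the cfg__ prefix is what the naming convention buys: configuration items can be told apart from any other variable by name alone.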

Task inheritance to avoid code duplication¶

Workflow tasks can inherit from other tasks, which means that script blocks ([script], [pre-script] and [post-script]) as well as [environment] variables are taken from the parent task. Our workflow uses this feature for:

Conda environment activation (see below)
Sharing InfluxDB configuration (URL, token, organization, etc.)
Formatting some of the intermediary tables in a [post-script] block (adding datetime, cycle and instrument_id columns).
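The table formatting step in the last item can be pictured with the following Python sketch. It is not the workflow's actual post-script, which is not shown here: the function name, sample table and column values are hypothetical, and only the three column names come from the description above.

```python
import csv
import io


def add_run_columns(table_text, datetime_str, cycle, instrument_id):
    """Append datetime, cycle and instrument_id columns to a semicolon-separated table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=";")
    fieldnames = list(reader.fieldnames) + ["datetime", "cycle", "instrument_id"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # The same run-level values are repeated on every row.
        row.update(datetime=datetime_str, cycle=cycle, instrument_id=instrument_id)
        writer.writerow(row)
    return out.getvalue()


# Made-up table standing in for an intermediary result.
table = "compound;intensity\nglucose;1200.5\nlactate;830.0\n"
print(add_run_columns(table, "2024-01-01T12:00:00", 3, "instrument_A"))
```

Putting such a step in a shared parent task means every child task gets the same extra columns without repeating the code.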

Run setup is done at the first cyclepoint¶

This includes user configuration validation, input data validation, and other tasks that need to be done before the main workflow starts:

[validate_cfg]
[validate_compounds_db]
[validate_met_model] (to be implemented)
[[INFLUXDB][create_bucket]]

Cyclepoint 0 is reserved for setup tasks. Processing of .raw files starts at cyclepoint 1.

Tasks can run in specific conda environments¶

Conda environment activation is handled by a pre-script. The envs/conda.cylc file defines family tasks, one for each conda environment:

flow.cylc¶

# Create task families for conda environments.
%include 'envs/conda.cylc'

conda.cylc¶

{% set conda_envs = {
    'CONDA_TRFP': 'wf-trfp',
    'CONDA_BINNER': 'wf-binner',
    'CONDA_DATAMUNGING': 'wf-datamunging',
    'CONDA_INFLUX': 'wf-influx',
    'CONDA_OPENMS': 'wf-pyopenms',
    } %}

[runtime]
{% for env, conda_env_name in conda_envs.items() %}
    [[{{env}}]]
        pre-script = """
            set +eu
            conda activate {{ conda_env_name }}
            set -eu
        """
{% endfor %}

Individual tasks in the workflow can then inherit from these families to run in the desired conda environment:

flow.cylc¶

[runtime]
    [[trim_spectra]]
        inherit = None, CONDA_OPENMS
        script = """
            trimms ${mzml} ${n_start} ${n_end}
        """
        [[[environment]]]
            mzml = ${MAIN_RESULTS_DIR}/${RAWFILE_STEM}.mzML
            n_start = {{ cfg__trim_values[0] }}
            n_end = {{ cfg__trim_values[1] }}

Warning

If you override the pre-script in a task while inheriting from a conda family task, you will lose the conda environment activation.

`dataflow/` and `qc/` directories for results¶

Our workflow follows the convention described in Shared Task IO Paths. In addition, the share/cycle/{n} directories are further divided into dataflow/ and qc/.

dataflow/ contains the results of the main workflow tasks. It is used to pass data between tasks.
qc/ contains quality control results to be analyzed by the user: plots, statistics, etc.
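As a small illustration of this layout, the sketch below builds the two directories for a given cycle point. The helper name and the run directory are hypothetical; only the share/cycle/{n}/dataflow and share/cycle/{n}/qc structure comes from the description above.

```python
from pathlib import Path


def cycle_result_dirs(workflow_run_dir, cycle_point):
    """Return the dataflow/ and qc/ directories for one cycle point."""
    cycle_dir = Path(workflow_run_dir) / "share" / "cycle" / str(cycle_point)
    return cycle_dir / "dataflow", cycle_dir / "qc"


# Placeholder run directory; a real one is chosen by Cylc at install time.
dataflow_dir, qc_dir = cycle_result_dirs("/path/to/run", 1)
print(dataflow_dir.as_posix())
print(qc_dir.as_posix())
```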

Data tables are stored in plain text CSV files¶

Intermediary results in dataflow/ are stored in a delimiter-separated format, using semicolons as separators. This allows for easy inspection and debugging, as well as compatibility with most spreadsheet software.

Furthermore, they can easily be edited using awk/sed/grep or csvkit without the need to load them as dataframes in Python or R.
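When a Python script does need to read one of these tables, the standard library is enough as long as the semicolon delimiter is passed explicitly. The sketch below assumes a made-up table with compound and intensity columns; the real column names depend on the task that produced the file.

```python
import csv
import io

# Made-up contents standing in for a file under dataflow/.
table = "compound;intensity\nglucose;1200.5\nlactate;830.0\n"


def rows_above(table_text, column, threshold):
    """Return the rows whose numeric `column` exceeds `threshold`."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=";")
    return [row for row in reader if float(row[column]) > threshold]


for row in rows_above(table, "intensity", 1000):
    print(row["compound"], row["intensity"])
```

For a file on disk, the same reader works on a handle opened with `open(path, newline="")`.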

Libraries/packages to be favored¶

Data wrangling: csvtk (CLI), pandas (Python) and tidyverse (R).
Data validation: frictionless
Editing/Querying mzML files: pyopenms

InfluxDB is an optional dependency¶

InfluxDB is used for real-time visualization of the results. It is not a strict requirement for the workflow to run. It can be enabled by setting cfg__toggle_influxdb to True in the [template variables] section of rose-suite.conf.

Data is uploaded to InfluxDB using its Python API. influx_utils.py contains functions to convert our CSV files into the correct upload format.
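The functions in influx_utils.py are not reproduced here. As a rough idea of what such a conversion involves, the sketch below turns the rows of a semicolon-separated table into InfluxDB line protocol strings, the text format InfluxDB uses for writes. The function names, measurement name, column names and timestamp are all assumptions made for the example, and no upload is performed.

```python
import csv
import io

# Characters that must be backslash-escaped in line protocol.
TAG_SPECIAL = ", ="          # tag keys, tag values and field keys
MEASUREMENT_SPECIAL = ", "   # measurement names


def escape(value, special):
    """Backslash-escape the characters that line protocol treats as separators."""
    for char in special:
        value = value.replace(char, "\\" + char)
    return value


def csv_to_lines(table_text, measurement, tag_columns, field_columns, timestamp_ns):
    """Turn each row of a semicolon-separated table into one line protocol record."""
    lines = []
    for row in csv.DictReader(io.StringIO(table_text), delimiter=";"):
        tags = "".join(
            f",{escape(col, TAG_SPECIAL)}={escape(row[col], TAG_SPECIAL)}"
            for col in tag_columns
        )
        # Field values are written as floats in this sketch.
        fields = ",".join(
            f"{escape(col, TAG_SPECIAL)}={float(row[col])}" for col in field_columns
        )
        lines.append(
            f"{escape(measurement, MEASUREMENT_SPECIAL)}{tags} {fields} {timestamp_ns}"
        )
    return lines


# Hypothetical table and names, for illustration only.
table = "compound;concentration\nglucose;1.5\nlactic acid;2.0\n"
for line in csv_to_lines(table, "concentrations", ["compound"], ["concentration"], 1700000000000000000):
    print(line)
```

Text columns become tags and numeric columns become fields here, which is a common split but not necessarily the one the workflow uses.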
