Workflow design choices

The choices described here are the ones currently implemented in the RTMet workflow. They are subject to change, and could be brought up for discussion.

Following Cylc’s best practices

Our workflow generally follows Cylc’s Workflow Design Guide.

Some notable exceptions are:

Jinja2 templating

Jinja2 templating is used extensively in the workflow definition file, flow.cylc. It allows text to be generated dynamically, based on the values of variables passed to the template.

Since the workflow definition in flow.cylc is essentially plain text, Jinja2 templating gives Cylc a way to support logic (conditionals, loops, variables) without having to provide a full-fledged programming language of its own.

User configuration options are passed down from the rose-suite.conf file to the workflow as Jinja2 variables. Some of these variables are used for branching logic:

Switching between input strategies
[scheduling]
    cycling mode = integer
    initial cycle point = 0
    [[xtriggers]]
{% if cfg__input_strategy == 'internal' %}
        catch_raw = catch_raw_internal('%(point)s', '%(workflow_run_dir)s')
{% elif cfg__input_strategy == 'local' %}
        catch_raw = catch_raw_local('%(point)s', '%(workflow_run_dir)s', {{ cfg__local_runs_dir }})
{% endif %}
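
To illustrate what sits behind such an xtrigger, here is a minimal sketch of a function with the same signature as catch_raw_local. It is not the RTMet implementation: the file-matching rule (cycle N maps to the N-th .raw file, sorted by name) and the keys of the returned dictionary are assumptions made for the example. The only part taken from Cylc itself is the return convention, a (satisfied, results) tuple.

```python
from pathlib import Path


def catch_raw_local(point, workflow_run_dir, local_runs_dir):
    """Hypothetical xtrigger: satisfied once a .raw file exists for this cycle.

    Assumption: processing cycles start at 1, and cycle N maps to the N-th
    .raw file (sorted by name) found in local_runs_dir.
    workflow_run_dir is kept only to mirror the call in flow.cylc.
    """
    index = int(point)
    if index < 1:
        # No .raw file is associated with cycle points below 1 in this sketch.
        return (False, {})
    raw_files = sorted(Path(local_runs_dir).glob("*.raw"))
    if len(raw_files) < index:
        # The file for this cycle has not arrived yet; Cylc will call again.
        return (False, {})
    rawfile = raw_files[index - 1]
    # Result keys are illustrative names, not the ones used by RTMet.
    return (True, {"rawfile": str(rawfile), "rawfile_stem": rawfile.stem})
```

Cylc calls the function repeatedly until the first element of the tuple is True; the dictionary is then made available to the tasks that depend on the xtrigger.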

Whole parts of the workflow can be enabled or disabled based on the value of a variable:

Enabling InfluxDB support
[scheduling]
    [[graph]]
        ...
        R1/+P3 = quantify => compute_fluxes
        +P4/P1 = quantify & compute_fluxes[-P1] => compute_fluxes
{% if cfg__toggle_influxdb %}
        R1/^ = validate_cfg => create_bucket => is_setup
        +P1/P1 = """
            annotate => upload_features
            quantify => upload_concentrations
        """
{% endif %}

Other Jinja2 variables are used to define environment variables for tasks:

Allowing the user to set the number of scans to trim
[runtime]
    [[trim_spectra]]
        inherit = None, CONDA_OPENMS
        script = """
            trimms ${mzml} ${n_start} ${n_end}
        """
        [[[environment]]]
            mzml = ${MAIN_RESULTS_DIR}/${RAWFILE_STEM}.mzML
            n_start = {{ cfg__trim_values[0] }}
            n_end = {{ cfg__trim_values[1] }}
        [[[meta]]]
            title = Trim Spectra
            description = """
                Remove the first `n_start` and last `n_end` scans from the mzML file. This is useful
                if the shape of the flowgram is not stable at the beginning or end of the run.
            """
            categories = bioinformatics

See also

Jinja2 in Cylc’s documentation.

Rose for configuration management

Rose is used for its Suite Configuration capabilities. It interfaces with our workflow through the Cylc Rose plugin. Think of it as workflow configuration being outsourced to another package, since Cylc doesn’t have it built in (yet?).

User configuration options are stored in the rose-suite.conf file at the root of the workflow directory. They are in the [template variables] section, which means they are passed down to the workflow as Jinja2 variables.

The chosen naming convention for configuration items is cfg__<item_name>. This is both to avoid conflicts with other environment variables and to make it clear that these are configuration items.
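
As a rough illustration of how these items look and how they map to Jinja2 values, the sketch below reads a [template variables] section and evaluates each cfg__ item as a Python literal. It uses Python's configparser as a stand-in for Rose's own parser, so Rose-specific syntax is not handled. The item names come from the examples on this page, but the values shown are made up.

```python
import ast
import configparser

# Illustrative content only; the values are assumptions for this example.
EXAMPLE_CONF = """\
[template variables]
cfg__input_strategy='local'
cfg__toggle_influxdb=True
cfg__trim_values=[10, 10]
"""


def read_template_variables(conf_text):
    """Return the cfg__ items of [template variables] as Python values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep item names case-sensitive
    parser.read_string(conf_text)
    section = parser["template variables"]
    return {
        name: ast.literal_eval(value)
        for name, value in section.items()
        if name.startswith("cfg__")
    }


variables = read_template_variables(EXAMPLE_CONF)
```

In this sketch, variables["cfg__trim_values"][0] corresponds to what {{ cfg__trim_values[0] }} would render in flow.cylc.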

Task inheritance to avoid code duplication

Workflow tasks can inherit from other tasks, which means that script blocks ([script], [pre-script] and [post-script]) as well as [environment] variables are taken from the parent task. Our workflow uses this feature for:

  • Conda environment activation (see below)

  • Sharing InfluxDB configuration (URL, token, organization, etc.)

  • Formatting some of the intermediary tables in a [post-script] block (adding datetime, cycle and instrument_id columns).
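
The table formatting step can be pictured with the following sketch. It is written in Python for illustration only; the actual [post-script] is a shell block whose content is not shown here, and the input column names in the usage note are hypothetical.

```python
import csv
import io


def add_run_columns(table_text, datetime_str, cycle, instrument_id):
    """Append datetime, cycle and instrument_id columns to a semicolon-separated table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=";")
    fieldnames = list(reader.fieldnames or []) + ["datetime", "cycle", "instrument_id"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Every row of one table gets the same run-level values.
        row.update(datetime=datetime_str, cycle=cycle, instrument_id=instrument_id)
        writer.writerow(row)
    return out.getvalue()
```

For example, calling add_run_columns("compound;concentration\nglucose;1.5\n", "2024-01-01T00:00:00Z", 3, "orbitrap-1") returns the same table with the three extra columns filled in on each row.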

See also

Sharing By Inheritance in Cylc’s documentation.

Run setup is done at the first cyclepoint

This includes user configuration validation, input data validation, and other tasks that need to be done before the main workflow starts:

  • [validate_cfg]

  • [validate_compounds_db]

  • [validate_met_model] (to be implemented)

  • [[INFLUXDB][create_bucket]]

Cyclepoint 0 is reserved for setup tasks. Processing of .raw files starts at cyclepoint 1.

Tasks can run in specific conda environments

Conda environment activation is handled by a pre-script. envs/conda.cylc defines family tasks, one for each conda environment:

flow.cylc
# Create task families for conda environments.
%include 'envs/conda.cylc'
conda.cylc
{% set conda_envs = {
    'CONDA_TRFP': 'wf-trfp',
    'CONDA_BINNER': 'wf-binner',
    'CONDA_DATAMUNGING': 'wf-datamunging',
    'CONDA_INFLUX': 'wf-influx',
    'CONDA_OPENMS': 'wf-pyopenms',
    } %}

[runtime]
{% for env, conda_env_name in conda_envs.items() %}
    [[{{env}}]]
        pre-script = """
            set +eu
            conda activate {{ conda_env_name }}
            set -eu
        """
{% endfor %}

Individual tasks in the workflow can then inherit from these families to run in the desired conda environment:

flow.cylc
[runtime]
    [[trim_spectra]]
        inherit = None, CONDA_OPENMS
        script = """
            trimms ${mzml} ${n_start} ${n_end}
        """
        [[[environment]]]
            mzml = ${MAIN_RESULTS_DIR}/${RAWFILE_STEM}.mzML
            n_start = {{ cfg__trim_values[0] }}
            n_end = {{ cfg__trim_values[1] }}

Warning

If you override the pre-script in a task while inheriting from a conda family task, you will lose the conda environment activation.

dataflow/ and qc/ directories for results

Our workflow follows the convention described in Shared Task IO Paths. In addition, the share/cycle/{n} directories are further divided into dataflow/ and qc/.

  • dataflow/ contains the results of the main workflow tasks. It is used to pass data between tasks.

  • qc/ contains quality control results to be analyzed by the user: plots, statistics, etc.
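
The layout above can be sketched as a small path helper. The function name is hypothetical; it only shows how the two directories derive from the share directory and the cycle point.

```python
from pathlib import Path


def cycle_dirs(share_dir, cycle_point):
    """Return the (dataflow, qc) directories for one cycle point."""
    cycle_dir = Path(share_dir) / "cycle" / str(cycle_point)
    return cycle_dir / "dataflow", cycle_dir / "qc"
```

Inside a running task, the two arguments could be taken from the CYLC_WORKFLOW_SHARE_DIR and CYLC_TASK_CYCLE_POINT environment variables that Cylc sets for each job.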

Data tables are stored in plain text CSV files

Intermediary results in dataflow/ are stored in a delimiter-separated format, using semicolons as separators. This allows for easy inspection and debugging, as well as compatibility with most spreadsheet software.

Furthermore, they can easily be edited using awk/sed/grep or csvkit without the need to load them as dataframes in Python or R.

Libraries/packages to be favored

InfluxDB is an optional dependency

InfluxDB is used for real-time visualization of the results. It is not a strict requirement for the workflow to run. It can be enabled by setting rose-suite.conf[template variables]cfg__toggle_influxdb to True.

Data is uploaded to InfluxDB using its Python API. influx_utils.py contains functions to convert our CSV files into the correct upload format.
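
To give an idea of what such a conversion involves, the sketch below turns a semicolon-separated table into InfluxDB line protocol strings using only the standard library. It is not the content of influx_utils.py, which relies on the Python API: the function names, the measurement name and the column names are assumptions, and all fields are treated as floats for simplicity.

```python
import csv
import io


def _escape(text):
    """Escape commas, equals signs and spaces, which are special in tags and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def rows_to_line_protocol(table_text, measurement, tag_columns, field_columns):
    """Convert each row of a semicolon-separated table into one line protocol record."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=";")
    lines = []
    for row in reader:
        tags = ",".join(f"{_escape(c)}={_escape(row[c])}" for c in tag_columns)
        fields = ",".join(f"{_escape(c)}={float(row[c])}" for c in field_columns)
        # Tags are optional: only add the comma separator when there are some.
        head = f"{measurement},{tags}" if tags else measurement
        lines.append(f"{head} {fields}")
    return lines
```

No timestamp is appended in this sketch, in which case InfluxDB assigns one when the record is written.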