Adding a task to the workflow

In this tutorial, we will add a new task to the workflow. As an example, the task extracts the number of scans from an mzML file using the pyOpenMS library.

Adding a python script to the workflow executables

In cylc-src/bioreactor-workflow/bin/, create a new file named get-scans-number and paste the following content:

bin/get-scans-number
#!/usr/bin/env python

import os
import sys
from pathlib import Path

from pyopenms import MzMLFile, MSExperiment

MZML = os.getenv("mzml")


def main():
    """
    Usage:
        ./get-scans-number

    Get number of scans from mzML file. `$mzml` shell
    environment variable must be set to the path of the file.
    """
    exp = MSExperiment()
    MzMLFile().load(MZML, exp)
    sys.stdout.write(str(exp.getNrSpectra()))


if __name__ == "__main__":
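    if len(sys.argv) > 1:
        # The script takes no arguments: print the usage and stop.
        sys.stderr.write(main.__doc__)
        sys.exit(1)
    elif not MZML:
        sys.stderr.write("$mzml environment variable not set.\n")
        # Exit with a non-zero code so the calling task fails.
        sys.exit(1)
    elif not Path(MZML).exists():
        sys.stderr.write(f"mzML file not found: {MZML}\n")
        sys.exit(1)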

    main()

From the bin/ directory, make the script executable:

$ chmod +x get-scans-number
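
Before wiring the script into the workflow, it can be useful to try it on its own. The following is a minimal sketch, assuming that pyOpenMS is available in the active environment; the helper names build_env and count_scans, and the sample.mzML path in the usage note, are hypothetical and not part of the workflow.

```python
import os
import subprocess


def build_env(mzml_path):
    """Return a copy of the current environment with `mzml` set."""
    env = dict(os.environ)
    env["mzml"] = str(mzml_path)
    return env


def count_scans(script_path, mzml_path):
    """Run get-scans-number and return the scan count as an int."""
    result = subprocess.run(
        [str(script_path)],
        env=build_env(mzml_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return int(result.stdout.strip())
```

For example, count_scans("cylc-src/bioreactor-workflow/bin/get-scans-number", "sample.mzML") would return the number printed by the script, mirroring how the workflow passes the file path through the $mzml variable.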

Creating a new task in the [runtime] section

Open cylc-src/bioreactor-workflow/flow.cylc and add the following task definition at the end:

flow.cylc
[runtime]
    # ...
    [[get_scans_number]]
        # The task will run in the wf-openms conda environment
        # Adding None makes the task appear at the root in the TUI/GUI
        inherit = None, CONDA_OPENMS
        script = """
            echo "The script launched by this task will extract the number of scans from the mzML file."

            get-scans-number > ${output_file}

            echo "The number of scans has been saved to ${output_file}"
            echo "Number of scans: $(cat ${output_file})"
        """
        [[[environment]]]
            # The python script will use the $mzml environment
            # variable to get the path of the file.
            mzml = ${MAIN_RESULTS_DIR}/${RAWFILE_STEM}.mzML
            output_file = ${MAIN_RESULTS_DIR}/scans_number.txt

This task runs the get-scans-number script and saves its output to a file named scans_number.txt in the main results directory. This directory (share/cycle/n/dataflow/) is specific to each cycle point n.
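
To make the per-cycle layout concrete, here is a small sketch that builds the expected output path for a given cycle point. It assumes the results directory follows the share/cycle/n/dataflow/ pattern described above; the function name scans_number_path and the run directory used in the example are illustrative only.

```python
from pathlib import Path


def scans_number_path(run_dir, cycle_point):
    """Build the expected scans_number.txt path for one cycle point."""
    return (
        Path(run_dir)
        / "share"
        / "cycle"
        / str(cycle_point)   # each cycle point gets its own directory
        / "dataflow"
        / "scans_number.txt"
    )


# Hypothetical run directory, for illustration only.
print(scans_number_path("cylc-run/your_run_name", 1).as_posix())
# cylc-run/your_run_name/share/cycle/1/dataflow/scans_number.txt
```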

Adding the task to the graph

Add a new graph string to the +P1/P1 recurrence, inside the [[graph]] section (under [scheduling]) of the workflow definition:

flow.cylc
[[graph]]
        R1/^ = validate_cfg => validate_compounds_db & validate_met_model => is_setup
        R1/+P1 = convert_raw => get_instrument => extract_features
        +P1/P1 = """
            is_setup[^] => _catch_raw
            @catch_raw => _catch_raw => convert_raw => get_timestamp &
                trim_spectra => extract_features => annotate => quantify
            convert_raw => get_scans_number
        """

The task will be executed at every cycle point (/P1), starting from the second one (+P1). It runs after the convert_raw task because it depends on the mzML file that task generates. No other task depends on the one we just added.
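
The way the +P1/P1 recurrence selects cycle points can be illustrated with a short sketch. It assumes integer cycling with an initial cycle point of 0, which is not stated here but is consistent with the task first running at cycle point 1; recurrence_points is a hypothetical helper, not a Cylc function.

```python
from itertools import count, islice


def recurrence_points(initial, offset, interval):
    """Iterate over integer cycle points: initial + offset, then every `interval`."""
    return count(initial + offset, interval)


# +P1/P1 with an assumed initial cycle point of 0:
# start one period after the initial point, then repeat every period.
points = list(islice(recurrence_points(initial=0, offset=1, interval=1), 4))
print(points)  # [1, 2, 3, 4]
```

Under that assumption, the first get_scans_number job belongs to cycle point 1, and the initial cycle point is left to the one-off R1/^ setup tasks.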

You can check that the task has been added correctly by running:

$ cylc graph bioreactor-workflow 0 1
Graph with the new task added

Testing the new task

Install and start a new run of the workflow, then add an mzML file to the raws/ directory. The task should start as soon as the convert_raw task succeeds and generate a scans_number.txt file in the cylc-run/your_run_name/share/cycle/1/dataflow/ directory.

job.out in logs
Workflow : bioreactor-workflow/task-added
Job : 1/get_scans_number/01 (try 1)
User@Host: elliotfontaine@MBP-Elliot.local

2024-07-22T14:18:50+02:00 INFO - started
The script launched by this task will extract the number of scans from the mzML file.
The number of scans has been saved to /Users/elliotfontaine/cylc-run/bioreactor-workflow/task-added/share/cycle/1/dataflow/scans_number.txt
Number of scans: 35
2024-07-22T14:18:52+02:00 INFO - succeeded
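
Beyond reading the job log, the result can be checked by reading the generated file back. This is a minimal sketch, assuming the file contains only the integer written by the script; read_scans_number is a hypothetical helper name.

```python
from pathlib import Path


def read_scans_number(path):
    """Read the integer written by the get_scans_number task."""
    # strip() tolerates a trailing newline, should one be present.
    return int(Path(path).read_text().strip())
```

Calling it on the scans_number.txt file of the run shown above should return 35, matching the "Number of scans" line in job.out.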