Draft — changes from all commits (22 commits)
15 changes: 15 additions & 0 deletions bin/opusfilter-slurm
@@ -0,0 +1,15 @@
#!/usr/bin/env python3

import sys
import warnings

from opusfilter.cli.slurm import main

warnings.warn(
    "opusfilter-slurm is experimental. Please report issues at "
    "https://github.com/Helsinki-NLP/OpusFilter/issues",
    UserWarning,
    stacklevel=2
)

sys.exit(main())
287 changes: 287 additions & 0 deletions docs/slurm_integration.md
@@ -0,0 +1,287 @@
# OpusFilter SLURM Integration

OpusFilter now supports running workflows on SLURM clusters with resource-optimized job scheduling.

## Overview

The `opusfilter-slurm` command converts OpusFilter workflows into SLURM jobs, handling:
- Automatic dependency management
- Per-step resource allocation
- Concurrent execution of independent steps
- Job monitoring and status tracking
- Resume capability

## Installation

The SLURM integration is part of the standard OpusFilter installation. No additional dependencies are required.

## Quick Start

1. Create a configuration with SLURM settings:

```yaml
common:
  output_directory: /scratch/myproject/output
  slurm:
    account: myaccount
    partition: cpu
    mail_user: user@institution.edu
    resources:
      filter:
        time: 04:00:00
        mem: 8G
        array_size: 8
      score:
        partition: gpu
        time: 08:00:00
        mem: 32G
        gres: gpu:1
        n_jobs: 4

steps:
  - type: opus_read
    parameters:
      # ...
  - type: filter
    parameters:
      # ...
```

2. Run the workflow:

```bash
# Basic execution
opusfilter-slurm config.yaml

# With options
opusfilter-slurm config.yaml \
    --resume \
    --max-concurrent 10 \
    --workdir /scratch/myproject-work
```
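
For orientation, a generated batch script for the `filter` step above might look roughly like the following. The exact contents, file names, and the way steps are invoked are assumptions, not the tool's guaranteed output; inspect the scripts written to your `workdir` for the real thing:

```shell
#!/bin/bash
#SBATCH --account=myaccount
#SBATCH --partition=cpu
#SBATCH --time=04:00:00
#SBATCH --mem=8G
#SBATCH --array=0-7
#SBATCH --mail-user=user@institution.edu

# Hypothetical invocation: run only the filter step (step 2) of the workflow
opusfilter --single 2 config.yaml
```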

## Configuration

### SLURM Section

Add a `slurm` section under `common` in your YAML configuration:

#### Global Settings

- `account`: SLURM account name
- `partition`: Default partition
- `qos`: Quality of Service
- `mail_type`: Email notification types
- `mail_user`: Email for notifications
- `workdir`: Directory for SLURM scripts and logs
- `modules`: Modules to load for all jobs
- `default`: Default resource specifications

#### Per-Step Resources

Under `slurm.resources`, specify resources per step type:

```yaml
resources:
  filter:
    time: 04:00:00        # Wall time limit
    mem: 8G               # Memory requirement
    cpus-per-task: 4      # CPU cores
    partition: gpu        # Override default partition
    gres: gpu:1           # GPU resources
    array_size: 8         # For parallel steps
    modules:              # Additional modules
      - cuda/11.8
```
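
How such a mapping could be rendered into batch directives can be sketched in a few lines of Python. This is a simplified illustration, not OpusFilter's actual implementation; in particular, the handling of `array_size` and `modules` here is an assumption:

```python
def sbatch_directives(resources):
    """Turn a per-step resource mapping into a list of #SBATCH lines.

    'modules' is skipped (it would become 'module load ...' commands in
    the script body), 'array_size' becomes a job array range, and every
    other key is passed through as an sbatch long option.
    """
    lines = []
    for key, value in resources.items():
        if key == "modules":
            continue  # not a directive; loaded in the script body
        if key == "array_size":
            lines.append(f"#SBATCH --array=0-{int(value) - 1}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return lines

print(sbatch_directives({"time": "04:00:00", "mem": "8G", "array_size": 8}))
# ['#SBATCH --time=04:00:00', '#SBATCH --mem=8G', '#SBATCH --array=0-7']
```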

## Features

### Dependency Management

- Automatic detection based on input/output file matching
- Support for complex workflows with branch dependencies
- No manual job ID tracking required
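
The matching idea can be sketched as follows (a simplified illustration of input/output matching, not the actual OpusFilter code):

```python
def detect_dependencies(steps):
    """Map each step index to the indices of the steps it depends on.

    Each step is a dict with optional 'inputs' and 'outputs' lists of
    file names; a step depends on whichever step produces its inputs.
    """
    producers = {}  # output file -> index of the step that produces it
    for i, step in enumerate(steps):
        for out in step.get("outputs", []):
            producers[out] = i
    deps = {}
    for i, step in enumerate(steps):
        deps[i] = sorted({producers[f] for f in step.get("inputs", [])
                          if f in producers and producers[f] != i})
    return deps

steps = [
    {"outputs": ["raw.en.gz", "raw.es.gz"]},               # 0: opus_read
    {"inputs": ["raw.en.gz"], "outputs": ["en.arpa.gz"]},  # 1: train_ngram
    {"inputs": ["raw.en.gz", "raw.es.gz"],
     "outputs": ["filtered.en.gz", "filtered.es.gz"]},     # 2: filter
]
print(detect_dependencies(steps))  # {0: [], 1: [0], 2: [0]}
```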

### Concurrent Execution

- Independent steps run simultaneously (up to `--max-concurrent`)
- Efficient resource utilization
- Reduced queue wait times
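
A bounded scheduling loop of this kind can be sketched as follows (illustrative only; the real scheduler also polls SLURM for job states rather than running batches in lockstep):

```python
def schedule(deps, max_concurrent):
    """Yield batches of step indices, at most max_concurrent per batch,
    releasing a step only after all of its dependencies have completed.

    `deps` maps each step index to the list of indices it depends on.
    """
    done, pending = set(), set(deps)
    while pending:
        ready = sorted(i for i in pending if set(deps[i]) <= done)
        ready = ready[:max_concurrent]
        if not ready:
            raise RuntimeError("dependency cycle detected")
        yield ready
        done.update(ready)
        pending.difference_update(ready)

# Steps 1 and 2 both depend on 0 and run concurrently; 3 waits for both.
print(list(schedule({0: [], 1: [0], 2: [0], 3: [1, 2]}, 2)))
# [[0], [1, 2], [3]]
```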

### Resource Optimization

- Per-step resource allocation
- Array job support for parallelizable steps
- GPU allocation for GPU-intensive tasks

### Monitoring and Resumption

- `--resume`: Skip completed steps, continue from failures
- Automatic cleanup of failed outputs

## Best Practices

1. **Estimate Resources**
- Start with conservative time/memory limits
- Check actual usage with `seff` after completion
- Adjust based on historical data

2. **Organize Workflows**
- Place I/O-intensive steps early (opus_read, concatenate)
- Group similar resource requirements
- Avoid unnecessary dependencies

3. **Use Arrays**
- Enable array_size for filter/score steps
- Parallelizes within step, not just between steps

4. **Monitor Progress**
- Check logs in `${workdir}/logs/`
- Set up email notifications

## Example Workflow

```yaml
# Complete example with all features
common:
  output_directory: corpus/processed
  slurm:
    account: nlp_project
    partition: gpu
    mail_type: END,FAIL,TIME_LIMIT_90
    mail_user: researcher@university.edu
    workdir: ${SLURM_TMP}/opusfilter
    modules:
      - python/3.10
      - cuda/11.8
    default:
      time: 02:00:00
      mem: 4G
      cpus-per-task: 2
    resources:
      opus_read:
        time: 01:00:00
        mem: 2G
      train_ngram:
        time: 24:00:00
        mem: 32G
        cpus-per-task: 16
      filter:
        time: 12:00:00
        mem: 16G
        array_size: 16
      score:
        time: 06:00:00
        mem: 64G
        partition: gpu
        gres: gpu:2
        n_jobs: 8

steps:
  - type: opus_read
    parameters:
      corpus_name: OpenSubtitles
      source_language: en
      target_language: es
      src_output: raw.en.gz
      tgt_output: raw.es.gz

  - type: train_ngram
    parameters:
      data: raw.en.gz
      parameters:
        norder: 5
      model: en.arpa.gz

  - type: filter
    parameters:
      inputs: [raw.en.gz, raw.es.gz]
      outputs: [filtered.en.gz, filtered.es.gz]
      filters: &common_filters
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LanguageIDFilter:
            languages: [en, es]

  - type: score
    parameters:
      inputs: [filtered.en.gz, filtered.es.gz]
      output: scores.jsonl.gz
      filters: *common_filters
```

## Troubleshooting

### Jobs Not Submitting
- Check account name and partition availability
- Verify module names
- Ensure workdir is writable
- Check with `--dry-run` first

### Jobs Failing
- Check logs in `${workdir}/logs/`
- Verify input/output paths
- Use `scontrol show job <id>` for details
- Ensure requested resources are available

### Performance Issues
- Reduce `array_size` if jobs are too small
- Increase memory if OOM errors occur
- Use appropriate partition (cpu/gpu)

## Integration with Other Tools

The SLURM integration outputs standard OpusFilter files that can be used with:
- `opusfilter-diagram`: Visualize workflow
- `opusfilter-scores`: Analyze score files produced by `score` steps
- Custom monitoring scripts

## Advanced Usage

### Explicit Dependencies (depends_on)

Some steps have implicit file dependencies that are not tracked through standard input/output fields. For example, `LMClassifierFilter` loads model files specified in `lm_params.*.filename`, but these are not automatically detected as dependencies.

Use the `depends_on` field to explicitly declare these dependencies:

```yaml
steps:
  # Step that produces model files
  - type: train_ngram
    parameters:
      data: data.txt.gz
      model: model.arpa.gz

  # Step that uses the model (implicit dependency via lm_params)
  - type: filter
    parameters:
      inputs: [data.txt.gz]
      outputs: [filtered.txt.gz]
      filters:
        - LMClassifierFilter:
            lm_params:
              en: {filename: model.arpa.gz}
    depends_on:
      - model.arpa.gz
```

The `depends_on` field supports:
- Single file (string): `depends_on: model.arpa.gz`
- Multiple files (list): `depends_on: [file1.gz, file2.gz]`
- Variable expansion: `depends_on: ['!varstr "{lang}.arpa.gz"']`

This ensures the filter step waits for the train_ngram step to complete before starting.
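
The accepted value shapes can be illustrated with a small normalization helper (a sketch of the idea, not the actual OpusFilter code):

```python
def normalize_depends_on(value):
    """Accept a single file name, a list of file names, or None,
    and always return a list of file names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

print(normalize_depends_on("model.arpa.gz"))   # ['model.arpa.gz']
print(normalize_depends_on(["a.gz", "b.gz"]))  # ['a.gz', 'b.gz']
```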

### Resource Usage Collection
Track actual resource usage:

```bash
# After jobs complete, list their IDs and show efficiency with seff
for jobid in $(sacct -u "$USER" -n -X --format=JobID); do
    seff "$jobid"
done
```

This enables better resource estimation for future runs.
58 changes: 56 additions & 2 deletions docs/usage.md
@@ -12,19 +12,24 @@ sections:

The syntax for the `opusfilter` command is
```
opusfilter [--overwrite] [--last LAST] [--single SINGLE] [--n-jobs N_JOBS] CONFIG
opusfilter [--overwrite] [--last LAST] [--single SINGLE] [--substep SUBSTEP] [--n-jobs N_JOBS] CONFIG
```
where `CONFIG` is path to the configuration file.
The script will run the steps one by one and stops when the final step
has been processed (if no exceptions were raised). The script has
options for setting the last step to run (`--last`) and running
only a single step (`--single`). It the latter, the user has to
only a single step (`--single`). In the latter case, the user has to
make sure that all input files for the step already exist. The first
step has number 1, and -1 points to the last step, -2 to the second to
last, and so on. The `--n-jobs` option indicates the number of processes to
use when running the `score`, `filter`, and `preprocess` steps. This value
overrides `default_n_jobs` in the `common` section.

When using `--single` with a step that has [variables](#variables-and-constants),
use `--substep` to run a specific variant. For example, if a step has
`variables: {lang: [de, en, fr]}`, then `--substep 2` runs the English
variant (1 for the first, 2 for the second, etc.).
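
The way a 1-based substep index selects one value from each variable list can be sketched as follows (illustrative only; `select_substep` is a hypothetical helper, not part of OpusFilter):

```python
def select_substep(variables, substep):
    """Return the variable assignment for a 1-based substep index.

    `variables` maps each variable name to a list of values; all lists
    must have equal length, and substep N picks the N-th value of each.
    """
    lengths = {len(values) for values in variables.values()}
    if len(lengths) != 1:
        raise ValueError("variable value lists must have equal length")
    if not 1 <= substep <= lengths.pop():
        raise ValueError("substep index out of range")
    return {name: values[substep - 1] for name, values in variables.items()}

print(select_substep({"lang": ["de", "en", "fr"]}, 2))  # {'lang': 'en'}
```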

By default, existing output files will be re-used, and the steps
producing them skipped. The `--overwrite` option will force overwrite
for all the steps.
@@ -240,6 +245,55 @@ there are values in the lists. Note that if you need to use the
same lists of variable values in multiple steps, you can exploit
the standard YAML node anchors and references.

When running steps with the `opusfilter` command, all variants are
executed sequentially. Use the `--single` and `--substep` options
to run a specific variant. For example, `opusfilter config.yaml --single 3 --substep 2`
runs the second variant of step 3.

## Step Dependencies

By default, OpusFilter detects dependencies between steps automatically
by matching input files to output files. However, some steps have
implicit file dependencies that are not captured through standard
input/output parameters.

For example, `LMClassifierFilter` loads model files specified in
`lm_params.*.filename`, but these are not automatically detected as
dependencies because they are nested in filter configurations.

Use the `depends_on` field to explicitly declare these dependencies:

```yaml
steps:
  # Step that produces model files
  - type: train_ngram
    parameters:
      data: data.txt.gz
      model: model.arpa.gz

  # Step that uses the model (implicit dependency via lm_params)
  - type: filter
    parameters:
      inputs: [data.txt.gz]
      outputs: [filtered.txt.gz]
      filters:
        - LMClassifierFilter:
            lm_params:
              en: {filename: model.arpa.gz}
    depends_on:
      - model.arpa.gz
```

The `depends_on` field supports:
- Single file: `depends_on: model.arpa.gz`
- Multiple files: `depends_on: [file1.gz, file2.gz]`
- Variable expansion: `depends_on: ['!varstr "{lang}.arpa.gz"']`

Note: The `depends_on` field is primarily used by `opusfilter-slurm`
for SLURM job scheduling and `opusfilter-diagram` for visualization.
The standard `opusfilter` command runs steps sequentially and does
not require explicit dependency declaration.

## Running a single command

If you need to run a single OpusFilter function without the need of