Draft — changes from all commits (22 commits)
15 changes: 15 additions & 0 deletions bin/opusfilter-slurm
@@ -0,0 +1,15 @@
#!/usr/bin/env python3

import sys
import warnings

from opusfilter.cli.slurm import main

warnings.warn(
    "opusfilter-slurm is experimental. Please report issues at "
    "https://github.com/Helsinki-NLP/OpusFilter/issues",
    UserWarning,
    stacklevel=2
)

sys.exit(main())
287 changes: 287 additions & 0 deletions docs/slurm_integration.md
@@ -0,0 +1,287 @@
# OpusFilter SLURM Integration

OpusFilter now supports running workflows on SLURM clusters with resource-optimized job scheduling.

## Overview

The `opusfilter-slurm` command converts OpusFilter workflows into SLURM jobs, handling:
- Automatic dependency management
- Per-step resource allocation
- Concurrent execution of independent steps
- Job monitoring and status tracking
- Resume capability

## Installation

The SLURM integration is part of the standard OpusFilter installation. No additional dependencies are required.

## Quick Start

1. Create a configuration with SLURM settings:

```yaml
common:
  output_directory: /scratch/myproject/output
  slurm:
    account: myaccount
    partition: cpu
    mail_user: user@institution.edu
    resources:
      filter:
        time: 04:00:00
        mem: 8G
        array_size: 8
      score:
        partition: gpu
        time: 08:00:00
        mem: 32G
        gres: gpu:1
        n_jobs: 4

steps:
  - type: opus_read
    parameters:
      # ...
  - type: filter
    parameters:
      # ...
```

2. Run the workflow:

```bash
# Basic execution
opusfilter-slurm config.yaml

# With options
opusfilter-slurm config.yaml \
    --resume \
    --max-concurrent 10 \
    --workdir /scratch/myproject-work
```
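
For orientation, a generated batch script for the `filter` step above might look roughly like the following. The exact contents, file names, and the way steps are invoked are assumptions, not the tool's guaranteed output; inspect the scripts written to your `workdir` for the real thing:

```shell
#!/bin/bash
#SBATCH --account=myaccount
#SBATCH --partition=cpu
#SBATCH --time=04:00:00
#SBATCH --mem=8G
#SBATCH --array=0-7
#SBATCH --mail-user=user@institution.edu

# Hypothetical invocation: run only the filter step (step 2) of the workflow
opusfilter --single 2 config.yaml
```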

## Configuration

### SLURM Section

Add a `slurm` section under `common` in your YAML configuration:

#### Global Settings

- `account`: SLURM account name
- `partition`: Default partition
- `qos`: Quality of Service
- `mail_type`: Email notification types
- `mail_user`: Email for notifications
- `workdir`: Directory for SLURM scripts and logs
- `modules`: Modules to load for all jobs
- `default`: Default resource specifications

#### Per-Step Resources

Under `slurm.resources`, specify resources per step type:

```yaml
resources:
  filter:
    time: 04:00:00        # Wall time limit
    mem: 8G               # Memory requirement
    cpus-per-task: 4      # CPU cores
    partition: gpu        # Override default partition
    gres: gpu:1           # GPU resources
    array_size: 8         # For parallel steps
    modules:              # Additional modules
      - cuda/11.8
```
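
How such a mapping could be rendered into batch directives can be sketched in a few lines of Python. This is a simplified illustration, not OpusFilter's actual implementation; in particular, the handling of `array_size` and `modules` here is an assumption:

```python
def sbatch_directives(resources):
    """Turn a per-step resource mapping into a list of #SBATCH lines.

    'modules' is skipped (it would become 'module load ...' commands in
    the script body), 'array_size' becomes a job array range, and every
    other key is passed through as an sbatch long option.
    """
    lines = []
    for key, value in resources.items():
        if key == "modules":
            continue  # not a directive; loaded in the script body
        if key == "array_size":
            lines.append(f"#SBATCH --array=0-{int(value) - 1}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return lines

print(sbatch_directives({"time": "04:00:00", "mem": "8G", "array_size": 8}))
# ['#SBATCH --time=04:00:00', '#SBATCH --mem=8G', '#SBATCH --array=0-7']
```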

## Features

### Dependency Management

- Automatic detection based on input/output file matching
- Support for complex workflows with branch dependencies
- No manual job ID tracking required
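
The matching idea can be sketched as follows (a simplified illustration of input/output matching, not the actual OpusFilter code):

```python
def detect_dependencies(steps):
    """Map each step index to the indices of the steps it depends on.

    Each step is a dict with optional 'inputs' and 'outputs' lists of
    file names; a step depends on whichever step produces its inputs.
    """
    producers = {}  # output file -> index of the step that produces it
    for i, step in enumerate(steps):
        for out in step.get("outputs", []):
            producers[out] = i
    deps = {}
    for i, step in enumerate(steps):
        deps[i] = sorted({producers[f] for f in step.get("inputs", [])
                          if f in producers and producers[f] != i})
    return deps

steps = [
    {"outputs": ["raw.en.gz", "raw.es.gz"]},               # 0: opus_read
    {"inputs": ["raw.en.gz"], "outputs": ["en.arpa.gz"]},  # 1: train_ngram
    {"inputs": ["raw.en.gz", "raw.es.gz"],
     "outputs": ["filtered.en.gz", "filtered.es.gz"]},     # 2: filter
]
print(detect_dependencies(steps))  # {0: [], 1: [0], 2: [0]}
```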

### Concurrent Execution

- Independent steps run simultaneously (up to `--max-concurrent`)
- Efficient resource utilization
- Reduced queue wait times
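
A bounded scheduling loop of this kind can be sketched as follows (illustrative only; the real scheduler also polls SLURM for job states rather than running batches in lockstep):

```python
def schedule(deps, max_concurrent):
    """Yield batches of step indices, at most max_concurrent per batch,
    releasing a step only after all of its dependencies have completed.

    `deps` maps each step index to the list of indices it depends on.
    """
    done, pending = set(), set(deps)
    while pending:
        ready = sorted(i for i in pending if set(deps[i]) <= done)
        ready = ready[:max_concurrent]
        if not ready:
            raise RuntimeError("dependency cycle detected")
        yield ready
        done.update(ready)
        pending.difference_update(ready)

# Steps 1 and 2 both depend on 0 and run concurrently; 3 waits for both.
print(list(schedule({0: [], 1: [0], 2: [0], 3: [1, 2]}, 2)))
# [[0], [1, 2], [3]]
```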

### Resource Optimization

- Per-step resource allocation
- Array job support for parallelizable steps
- GPU allocation for GPU-intensive tasks

### Monitoring and Resumption

- `--resume`: Skip completed steps, continue from failures
- Automatic cleanup of failed outputs

## Best Practices

1. **Estimate Resources**
- Start with conservative time/memory limits
- Check actual usage with `seff` after completion
- Adjust based on historical data

2. **Organize Workflows**
- Place I/O-intensive steps early (opus_read, concatenate)
- Group similar resource requirements
- Avoid unnecessary dependencies

3. **Use Arrays**
- Enable array_size for filter/score steps
- Parallelizes within step, not just between steps

4. **Monitor Progress**
- Check logs in `${workdir}/logs/`
- Set up email notifications

## Example Workflow

```yaml
# Complete example with all features
common:
  output_directory: corpus/processed
  slurm:
    account: nlp_project
    partition: gpu
    mail_type: END,FAIL,TIME_LIMIT_90
    mail_user: researcher@university.edu
    workdir: ${SLURM_TMP}/opusfilter
    modules:
      - python/3.10
      - cuda/11.8
    default:
      time: 02:00:00
      mem: 4G
      cpus-per-task: 2
    resources:
      opus_read:
        time: 01:00:00
        mem: 2G
      train_ngram:
        time: 24:00:00
        mem: 32G
        cpus-per-task: 16
      filter:
        time: 12:00:00
        mem: 16G
        array_size: 16
      score:
        time: 06:00:00
        mem: 64G
        partition: gpu
        gres: gpu:2
        n_jobs: 8

steps:
  - type: opus_read
    parameters:
      corpus_name: OpenSubtitles
      source_language: en
      target_language: es
      src_output: raw.en.gz
      tgt_output: raw.es.gz

  - type: train_ngram
    parameters:
      data: raw.en.gz
      parameters:
        norder: 5
      model: en.arpa.gz

  - type: filter
    parameters:
      inputs: [raw.en.gz, raw.es.gz]
      outputs: [filtered.en.gz, filtered.es.gz]
      filters: &common_filters
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LanguageIDFilter:
            languages: [en, es]

  - type: score
    parameters:
      inputs: [filtered.en.gz, filtered.es.gz]
      output: scores.jsonl.gz
      filters: *common_filters
```

## Troubleshooting

### Jobs Not Submitting
- Check account name and partition availability
- Verify module names
- Ensure workdir is writable
- Check with `--dry-run` first

### Jobs Failing
- Check logs in `${workdir}/logs/`
- Verify input/output paths
- Use `scontrol show job <id>` for details
- Ensure requested resources are available

### Performance Issues
- Reduce `array_size` if jobs are too small
- Increase memory if OOM errors occur
- Use appropriate partition (cpu/gpu)

## Integration with Other Tools

The SLURM integration outputs standard OpusFilter files that can be used with:
- `opusfilter-diagram`: Visualize workflow
- `opusfilter-scores`: Analyze score files produced by `score` steps
- Custom monitoring scripts

## Advanced Usage

### Explicit Dependencies (depends_on)

Some steps have implicit file dependencies that are not tracked through standard input/output fields. For example, `LMClassifierFilter` loads model files specified in `lm_params.*.filename`, but these are not automatically detected as dependencies.

Use the `depends_on` field to explicitly declare these dependencies:

```yaml
steps:
  # Step that produces model files
  - type: train_ngram
    parameters:
      data: data.txt.gz
      model: model.arpa.gz

  # Step that uses the model (implicit dependency via lm_params)
  - type: filter
    parameters:
      inputs: [data.txt.gz]
      outputs: [filtered.txt.gz]
      filters:
        - LMClassifierFilter:
            lm_params:
              en: {filename: model.arpa.gz}
    depends_on:
      - model.arpa.gz
```

The `depends_on` field supports:
- Single file (string): `depends_on: model.arpa.gz`
- Multiple files (list): `depends_on: [file1.gz, file2.gz]`
- Variable expansion: `depends_on: ['!varstr "{lang}.arpa.gz"']`

This ensures the filter step waits for the train_ngram step to complete before starting.
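
The accepted value shapes can be illustrated with a small normalization helper (a sketch of the idea, not the actual OpusFilter code):

```python
def normalize_depends_on(value):
    """Accept a single file name, a list of file names, or None,
    and always return a list of file names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

print(normalize_depends_on("model.arpa.gz"))   # ['model.arpa.gz']
print(normalize_depends_on(["a.gz", "b.gz"]))  # ['a.gz', 'b.gz']
```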

### Resource Usage Collection
Track actual resource usage:

```bash
# After jobs complete, list their IDs and show efficiency with seff
for jobid in $(sacct -u "$USER" -n -X --format=JobID); do
    seff "$jobid"
done
```

This enables better resource estimation for future runs.
58 changes: 56 additions & 2 deletions docs/usage.md
@@ -12,19 +12,24 @@ sections:

The syntax for the `opusfilter` command is
```
opusfilter [--overwrite] [--last LAST] [--single SINGLE] [--n-jobs N_JOBS] CONFIG
opusfilter [--overwrite] [--last LAST] [--single SINGLE] [--substep SUBSTEP] [--n-jobs N_JOBS] CONFIG
```
where `CONFIG` is path to the configuration file.
The script will run the steps one by one and stops when the final step
has been processed (if no exceptions were raised). The script has
options for setting the last step to run (`--last`) and running
only a single step (`--single`). It the latter, the user has to
only a single step (`--single`). In the latter case, the user has to
make sure that all input files for the step already exist. The first
step has number 1, and -1 points to the last step, -2 to the second to
last, and so on. The `--n-jobs` option indicates the number of processes to
use when running the `score`, `filter`, and `preprocess` steps. This value
overrides `default_n_jobs` in the `common` section.

When using `--single` with a step that has [variables](#variables-and-constants),
use `--substep` to run a specific variant. For example, if a step has
`variables: {lang: [de, en, fr]}`, then `--substep 2` runs the English
variant (1 for the first, 2 for the second, etc.).
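
The way a 1-based substep index selects one value from each variable list can be sketched as follows (illustrative only; `select_substep` is a hypothetical helper, not part of OpusFilter):

```python
def select_substep(variables, substep):
    """Return the variable assignment for a 1-based substep index.

    `variables` maps each variable name to a list of values; all lists
    must have equal length, and substep N picks the N-th value of each.
    """
    lengths = {len(values) for values in variables.values()}
    if len(lengths) != 1:
        raise ValueError("variable value lists must have equal length")
    if not 1 <= substep <= lengths.pop():
        raise ValueError("substep index out of range")
    return {name: values[substep - 1] for name, values in variables.items()}

print(select_substep({"lang": ["de", "en", "fr"]}, 2))  # {'lang': 'en'}
```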

By default, existing output files will be re-used, and the steps
producing them skipped. The `--overwrite` option will force overwrite
for all the steps.
@@ -240,6 +245,55 @@ there are values in the lists. Note that if you need to use the
same lists of variable values in multiple steps, you can exploit
the standard YAML node anchors and references.

When running steps with the `opusfilter` command, all variants are
executed sequentially. Use the `--single` and `--substep` options
to run a specific variant. For example, `opusfilter config.yaml --single 3 --substep 2`
runs the second variant of step 3.

## Step Dependencies

By default, OpusFilter detects dependencies between steps automatically
by matching input files to output files. However, some steps have
implicit file dependencies that are not captured through standard
input/output parameters.

For example, `LMClassifierFilter` loads model files specified in
`lm_params.*.filename`, but these are not automatically detected as
dependencies because they are nested in filter configurations.

Use the `depends_on` field to explicitly declare these dependencies:

```yaml
steps:
  # Step that produces model files
  - type: train_ngram
    parameters:
      data: data.txt.gz
      model: model.arpa.gz

  # Step that uses the model (implicit dependency via lm_params)
  - type: filter
    parameters:
      inputs: [data.txt.gz]
      outputs: [filtered.txt.gz]
      filters:
        - LMClassifierFilter:
            lm_params:
              en: {filename: model.arpa.gz}
    depends_on:
      - model.arpa.gz
```

The `depends_on` field supports:
- Single file: `depends_on: model.arpa.gz`
- Multiple files: `depends_on: [file1.gz, file2.gz]`
- Variable expansion: `depends_on: ['!varstr "{lang}.arpa.gz"']`

Note: The `depends_on` field is primarily used by `opusfilter-slurm`
for SLURM job scheduling and `opusfilter-diagram` for visualization.
The standard `opusfilter` command runs steps sequentially and does
not require explicit dependency declaration.

## Running a single command

If you need to run a single OpusFilter function without the need of