Changes from 18 commits
15 changes: 15 additions & 0 deletions .dockerignore
@@ -0,0 +1,15 @@
/data
/misc
/model
/config
/tests
.venv
.flake8
.git
.github
.mypy_cache
.pytest_cache
.coverage
__pycache__
s3-creds.env
.vscode
56 changes: 56 additions & 0 deletions .flake8
@@ -0,0 +1,56 @@
# use .flake8 until we can move this config to pyproject.toml (not possible yet (27/02/2024) according to issue below)
# https://github.com/PyCQA/flake8/issues/234

[flake8]
select =
# B: bugbear warnings
B,

# B950: bugbear max-linelength warning
# as suggested in the black docs
# https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
B950,

# C: currently only C901, mccabe code complexity
C,

# E: pycodestyle errors
E,

# F: flake8 codes for pyflakes
F,

# W: pycodestyle warnings
W,

extend-ignore =
# E203: pycodestyle's "whitespace before ',', ';' or ':'" error
# ignored as suggested in the black docs
# https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#slices
E203,

# E501: pycodestyle's "line too long (82 > 79) characters" error
# ignored in favor of B950 as suggested in the black docs
# https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
E501,

# W503 line break before binary operator
W503,

# set max-line-length to be black compatible, as suggested in the black docs
# https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
max-line-length = 88

# set max cyclomatic complexity for mccabe plugin
max-complexity = 10

# show total number of errors, set exit code to 1 if tot is not empty
count = True

# show the source generating each error or warning
show-source = True

# count errors and warnings
statistics = True
exclude =
.venv
8 changes: 8 additions & 0 deletions .gitignore

Add data and config dirs

@@ -0,0 +1,8 @@
__pycache__/
*.wav
*.mp4
*_audio_segments
*_video_segments
*.txt
temp_outputs
speaker_transcript.txt
27 changes: 27 additions & 0 deletions Dockerfile

Please remove anything you have not worked on from the PR/branch

@@ -0,0 +1,27 @@
FROM docker.io/python:3.10

# Create dirs for:
# - Injecting config.yml: /root/.DANE
# - Mount point for input & output files: /mnt/dane-fs
# - Storing the source code: /src
# - Storing the input file to be used while testing: /src/data
RUN mkdir /root/.DANE /mnt/dane-fs /src /data

WORKDIR /src

ENV POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_IN_PROJECT=1 \
POETRY_VIRTUALENVS_CREATE=1 \
POETRY_CACHE_DIR=/tmp/poetry_cache

RUN pip install poetry==1.8.2

COPY pyproject.toml poetry.lock ./
RUN poetry install --without dev --no-root && rm -rf $POETRY_CACHE_DIR

# Write provenance info about software versions to file
RUN echo "dane-example-worker;https://github.com/beeldengeluid/dane-example-worker/commit/$(git rev-parse HEAD)" >> /software_provenance.txt

COPY . /src

ENTRYPOINT ["./docker-entrypoint.sh"]
47 changes: 46 additions & 1 deletion README.md
@@ -1 +1,46 @@
# dane-speaker-diarisation-worker

## File description
The worker files are copied from the example worker without modification. Below is a description of the main additions:

### **helpers.py**
- Vocal_extraction: Performs vocal extraction with htdemucs. Takes the path to the input audio file.
- Text_speaker_map: Matches the predicted speaker label of each speech segment with the word timestamps predicted by the transcription, and writes a readable file listing each text segment with its speaker label. Takes the path to the input audio file and the collection of Whisper results, of which only the word timestamps are used.
- Transcribe: Performs speech-to-text with faster_whisper. This function will no longer be needed once the ASR worker is completed. Takes the path to the input audio file, the language of the input audio (if known), the model version, the compute type and the device.
- Cleanup: Removes all temporary results of vocal_extraction, transcription and diarization, but not the final output generated by the "text_speaker_map" function. Takes the path to the directory holding the temporary results, "temp_outputs".
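
The matching step described for Text_speaker_map can be illustrated with a minimal sketch. It is not the code in helpers.py: the `Word` and `SpeakerTurn` types, the function names and the "largest overlap, else nearest turn" rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with its timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerTurn:
    """A diarization segment attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(word, turn):
    """Length in seconds of the time shared by a word and a speaker turn."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def pick_speaker(word, turns):
    """Pick the turn with the largest overlap, else the nearest turn."""
    best = max(turns, key=lambda t: overlap(word, t))
    if overlap(word, best) > 0:
        return best.speaker
    mid = (word.start + word.end) / 2
    nearest = min(turns, key=lambda t: min(abs(mid - t.start), abs(mid - t.end)))
    return nearest.speaker


def map_words_to_speakers(words, turns):
    """Group consecutive words by speaker into readable 'speaker: text' lines."""
    lines = []  # each entry is (speaker, list of word texts)
    for word in words:
        speaker = pick_speaker(word, turns)
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(word.text)
        else:
            lines.append((speaker, [word.text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


if __name__ == "__main__":
    demo_words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.2, 1.5)]
    demo_turns = [SpeakerTurn("speaker_0", 0.0, 1.0), SpeakerTurn("speaker_1", 1.0, 2.0)]
    for line in map_words_to_speakers(demo_words, demo_turns):
        print(line)
    # speaker_0: hello there
    # speaker_1: hi
```

Assigning each word by overlap, rather than by start time alone, keeps words that straddle a speaker change with the speaker who covers most of them.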

### **diarize.py**
- Config_setup: Configures the MSDD module. It specifies the domain type (telephonic, meeting or general) and the parameter settings for the speaker embedding models, clustering and VAD. Takes the output directory, namely "temp_outputs".
- Diarize: Performs speaker diarization using the MSDD module. First the audio is converted to a single channel for NeMo compatibility; then diarization is run with the configuration defined by Config_setup. Takes the path to the directory for temporary results and the vocal target, which is the vocals extracted with htdemucs rather than the raw input audio file.
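
The single-channel conversion mentioned for Diarize could look roughly like the sketch below. It uses only the standard library and assumes 16-bit PCM WAV input. The function name `to_mono` is hypothetical, and the worker itself may well do this differently, for example with pydub's `AudioSegment.set_channels(1)`, since pydub is among the listed dependencies.

```python
import array
import sys
import wave


def to_mono(src_path, dst_path):
    """Downmix a 16-bit PCM WAV file to one channel by averaging its channels."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved (L, R, L, R, ...); average each frame's channels.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```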

### **torun.py**
This file runs the speaker diarization pipeline on an input audio file, whose path must be set in the "audio_path" variable inside the file. Other settings that can be changed here include whether to perform vocal extraction, and faster_whisper's model version, language and device.

The code is not yet adapted to run on the server, but it can be run locally. It runs, in order: vocal extraction, transcription, speaker diarization, text-to-speaker mapping and finally a cleanup of temporary results.
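
As a rough illustration of that ordering, the sketch below chains the stages and always removes the temporary directory at the end. It is not the contents of torun.py: `run_pipeline`, the `stages` dictionary and the stage signatures are hypothetical, and passing the raw audio to transcription while passing the extracted vocals to diarization is an assumption based on the descriptions above.

```python
import shutil
from pathlib import Path


def run_pipeline(audio_path, stages, temp_dir, extract_vocals=True):
    """Run the stages in order; `stages` maps stage names to callables.

    Expected keys (all hypothetical): "vocal_extraction", "transcribe",
    "diarize" and "text_speaker_map".
    """
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    try:
        # Optional step: fall back to the raw audio when extraction is disabled.
        target = stages["vocal_extraction"](audio_path) if extract_vocals else audio_path
        words = stages["transcribe"](audio_path)
        turns = stages["diarize"](temp_dir, target)
        return stages["text_speaker_map"](words, turns)
    finally:
        # Cleanup runs even if a stage fails; the final output is returned
        # (or written elsewhere), so it is not stored under temp_dir.
        shutil.rmtree(temp_dir, ignore_errors=True)
```

Putting the cleanup in a `finally` block means a failed run does not leave stale intermediate files behind for the next one.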

### **Transcript_diarize.ipynb**
This notebook contains the whole pipeline and can be run on Google Colab, for instance, which gives access to a GPU if none is available locally. Using the notebook can also circumvent possible dependency issues when running the pipeline locally, allowing for quick tests.

## **Package Installation**
The following commands should install the pipeline's dependencies:
```
pip install torch
pip install faster_whisper
pip install pydub
pip install wget
pip install nemo_toolkit[asr]==1.22.0
pip install -U git+https://github.com/facebookresearch/demucs#egg=demucs
pip install cython
pip install transformers -U
```

Faster Whisper requires a tokenizers version >=0.13 and <0.16:

`pip install tokenizers==0.15.2`
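
To confirm that the installed tokenizers version falls in that range before running the pipeline, a small check along these lines may help. The function names are hypothetical and the parser only looks at the major and minor version numbers.

```python
import re
from importlib import metadata


def tokenizers_version_ok(version):
    """Return True if `version` satisfies >=0.13, <0.16 (major.minor only)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    return (0, 13) <= major_minor < (0, 16)


def installed_tokenizers_ok():
    """Check the installed tokenizers package; False if it is not installed."""
    try:
        return tokenizers_version_ok(metadata.version("tokenizers"))
    except metadata.PackageNotFoundError:
        return False
```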

For issues related to libstdc++.so.6.0.30, see https://stackoverflow.com/questions/73317676/importerror-usr-lib-aarch64-linux-gnu-libstdc-so-6-version-glibcxx-3-4-30. The first answer there solved my error.

## **Usage**
The pipeline can be run locally with `python torun.py`.

