beeldengeluid · Martaesplo · Apr 16, 2024
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,15 @@
+/data
+/misc
+/model
+/config
+/tests
+.venv
+.flake8
+.git
+.github
+.mypy_cache
+.pytest_cache
+.coverage
+__pycache__
+s3-creds.env
+.vscode
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,56 @@
+# use .flake8 until we can move this config to pyproject.toml (not possible yet (27/02/2024) according to issue below)
+# https://github.com/PyCQA/flake8/issues/234
+
+[flake8]
+select =
+    # B: bugbear warnings
+    B,
+
+    # B950: bugbear max-linelength warning
+    # as suggested in the black docs
+    # https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
+    B950,
+
+    # C: currently only C901, mccabe code complexity
+    C,
+
+    # E: pycodestyle errors
+    E,
+
+    # F: flake8 codes for pyflakes
+    F,
+
+    # W: pycodestyle warnings
+    W,
+
+extend-ignore =
+    # E203: pycodestyle's "whitespace before ',', ';' or ':'" error
+    # ignored as suggested in the black docs
+    # https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#slices
+    E203,
+
+    # E501: pycodestyle's "line too long (82 > 79 characters)" error
+    # ignored in favor of B950 as suggested in the black docs
+    # https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
+    E501,
+
+    # W503 line break before binary operator
+    W503,
+
+# set max-line-length to be black compatible, as suggested in the black docs
+# https://github.com/psf/black/blob/d038a24ca200da9dacc1dcb05090c9e5b45b7869/docs/the_black_code_style/current_style.md#line-length
+max-line-length = 88
+
+# set max cyclomatic complexity for mccabe plugin
+max-complexity = 10
+
+# show total number of errors, set exit code to 1 if the total is not zero
+count = True
+
+# show the source generating each error or warning
+show-source = True
+
+# count errors and warnings
+statistics = True
+exclude =
+    .venv
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,8 @@
+__pycache__/
+*.wav
+*.mp4
+*_audio_segments
+*_video_segments
+*.txt
+temp_outputs
+speaker_transcript.txt
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,27 @@
+FROM docker.io/python:3.10
+
+# Create dirs for:
+# - Injecting config.yml: /root/.DANE
+# - Mount point for input & output files: /mnt/dane-fs
+# - Storing the source code: /src
+# - Storing the input file to be used while testing: /data
+RUN mkdir /root/.DANE /mnt/dane-fs /src /data
+
+WORKDIR /src
+
+ENV POETRY_NO_INTERACTION=1 \
+    POETRY_VIRTUALENVS_IN_PROJECT=1 \
+    POETRY_VIRTUALENVS_CREATE=1 \
+    POETRY_CACHE_DIR=/tmp/poetry_cache
+
+RUN pip install poetry==1.8.2
+
+COPY pyproject.toml poetry.lock ./
+RUN poetry install --without dev --no-root && rm -rf $POETRY_CACHE_DIR
+
+# Write provenance info about software versions to file
+RUN echo "dane-example-worker;https://github.com/beeldengeluid/dane-example-worker/commit/$(git rev-parse HEAD)" >> /software_provenance.txt
+
+COPY . /src
+
+ENTRYPOINT ["./docker-entrypoint.sh"]
diff --git a/README.md b/README.md
@@ -1 +1,46 @@
-# dane-speaker-diarisation-worker
+# dane-speaker-diarisation-worker
+
+## File description
+The worker files are copied unmodified from the example worker. The main additions are described below:
+
+### **helpers.py**
+- Vocal_extraction: Performs vocal extraction with htdemucs. Takes as input the path to the input audio file.
+- Text_speaker_map: Reads the predicted speaker label for each speech segment and the word timestamps predicted by the transcription, and matches them, generating a readable file with each text segment and its speaker label. Takes as input the path to the input audio file and the collection of Whisper results, of which only the word timestamps are used.
+- Transcribe: Performs speech-to-text with faster_whisper. This function will no longer be needed once the ASR worker is completed. Takes as input the path to the input audio file, the language of the input audio (if known), the model version, the compute type and the device.
+- Cleanup: Removes all temporary results of vocal_extraction, transcription and diarization, but not the final output generated by the "text_speaker_map" function. Takes as input the path to the directory with the temporary results, "temp_outputs".
+
+### **diarize.py**
+- Config_setup: Configures the MSDD module. It specifies the domain type (telephonic, meeting or general) and the parameter settings for the speaker embedding models, clustering and VAD. Takes as input the output directory, namely "temp_outputs".
+- Diarize: Performs speaker diarization using the MSDD module. First, the audio is converted to a single channel for NeMo compatibility; then diarization is performed using the configuration defined by Config_setup. Takes as input the path to the directory for temporary results and the vocal target, which is the vocals extracted with htdemucs rather than the raw input audio file.
+
+### **torun.py**
+This file runs the speaker diarization pipeline on the input audio file, which must be specified inside the file in the "audio_path" variable. Other settings that can be modified in this file include whether to perform vocal extraction, and faster_whisper's model version, language and device.
+
+The code is not adapted to run on the server; it can be run locally. It runs, in order: vocal extraction, transcription, speaker diarization, text-to-speaker mapping and, finally, a cleanup of the temporary results.
+
+### **Transcript_diarize.ipynb**
+This notebook contains the whole pipeline, to be run on Google Colab for instance, which gives access to a GPU if none is available locally. Using the notebook can also circumvent possible dependency issues when running the pipeline locally, allowing for quick tests.
+
+## **Package Installation**
+The following list should take care of the pipeline's dependencies:
+```
+pip install torch
+pip install faster_whisper
+pip install pydub
+pip install wget
+pip install nemo_toolkit[asr]==1.22.0
+pip install -U git+https://github.com/facebookresearch/demucs#egg=demucs
+pip install cython
+pip install transformers -U
+```
+
+Faster Whisper requires a tokenizers version >=0.13, <0.16:
+
+`pip install tokenizers==0.15.2`
+
+For issues related to libstdc++.so.6.0.30: https://stackoverflow.com/questions/73317676/importerror-usr-lib-aarch64-linux-gnu-libstdc-so-6-version-glibcxx-3-4-30. The first answer solved my error.
+
+## **Usage**
+The pipeline can be run locally with `python torun.py`.
+
+
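Review note on the `Text_speaker_map` step described in the README above: the matching of word timestamps to speaker labels could be sketched roughly as follows. This is only an illustration and not the code in `helpers.py`. The names `Word`, `SpeakerTurn`, `assign_speakers` and `to_transcript` are hypothetical, and the sketch assumes the word timestamps and the diarization output have already been parsed into plain start/end times in seconds. It assigns each word to the speaker turn it overlaps most, falling back to the nearest turn when there is no overlap, which may differ from the rule the worker actually uses.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    """Hypothetical container for one transcribed word (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerTurn:
    """Hypothetical container for one diarized speaker segment (times in seconds)."""
    start: float
    end: float
    speaker: str


def _overlap(word: Word, turn: SpeakerTurn) -> float:
    """Length of the time interval shared by a word and a speaker turn."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def _distance(word: Word, turn: SpeakerTurn) -> float:
    """Gap between the word's midpoint and the turn (0 if the midpoint is inside)."""
    mid = (word.start + word.end) / 2
    return max(turn.start - mid, mid - turn.end, 0.0)


def assign_speakers(words, turns):
    """Pair each word with the turn it overlaps most; ties go to the nearest turn."""
    labelled = []
    for word in words:
        best = max(turns, key=lambda t: (_overlap(word, t), -_distance(word, t)))
        labelled.append((best.speaker, word.text))
    return labelled


def to_transcript(labelled):
    """Merge consecutive words from the same speaker into one readable line."""
    lines = []
    for speaker, group in groupby(labelled, key=lambda pair: pair[0]):
        lines.append(f"{speaker}: {' '.join(text for _, text in group)}")
    return lines
```

Calling `to_transcript(assign_speakers(words, turns))` yields one line per uninterrupted run of words by the same speaker, which is roughly the shape of the readable output file the README describes.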