Commit b008610

Merge pull request #1295 from PyThaiNLP/copilot/add-bleu-rouge-metrics
Add pure Python BLEU, ROUGE, WER, and CER metrics with automatic Thai tokenization
2 parents ffdaf3b + 1ee0d5b commit b008610

File tree

5 files changed: +883 additions, -7 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,8 @@ See PR for prompt and details.
 - Ensure thread-safety for tokenizers #1213
 - Add Thai-NNER integration with top-level entity filtering #1221
 - Reorganize noauto test suite by dependency groups
-  (torch, tensorflow, onnx, cython, network) #935
+  (torch, tensorflow, onnx, cython, network) #1290
+- Add BLEU, ROUGE, WER, and CER metrics to pythainlp.benchmarks #1295
 - Improved documentation; code cleanup; more tests

 ## Version 5.1.2 -> 5.2.0

docs/api/benchmarks.rst

Lines changed: 93 additions & 4 deletions
@@ -6,7 +6,7 @@ pythainlp.benchmarks
 Introduction
 ------------

-The `pythainlp.benchmarks` module is a collection of utility functions designed for benchmarking tasks related to Thai Natural Language Processing (NLP). Currently, the module includes tools for word tokenization benchmarking. Please note that additional benchmarking tasks will be incorporated in the future.
+The `pythainlp.benchmarks` module is a collection of utility functions for benchmarking tasks in Thai Natural Language Processing (NLP). The module includes tools for word tokenization benchmarking and evaluation metrics for text generation tasks (BLEU, ROUGE, WER, and CER).

 Tokenization
 ------------
@@ -23,8 +23,8 @@ The quality of word tokenization can significantly impact the accuracy of downst

 Qualitative evaluation of word tokenization.

-Functions
----------
+Tokenization Functions
+^^^^^^^^^^^^^^^^^^^^^^

 .. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats

@@ -38,7 +38,96 @@ Functions

 Preprocessing is a crucial step in NLP tasks. The `preprocessing` function assists in preparing text data for tokenization, which is essential for accurate and consistent benchmarking.

+Evaluation Metrics
+------------------
+
+The module provides pure Python implementations of common evaluation metrics (BLEU, ROUGE, WER, and CER) that automatically handle Thai text tokenization. These metrics are essential for evaluating machine translation, text summarization, speech recognition, and other text generation tasks.
+
+BLEU Score
+^^^^^^^^^^
+
+BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. It compares the generated text against one or more reference translations by measuring n-gram precision with a brevity penalty.
+
+.. autofunction:: pythainlp.benchmarks.bleu_score
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import bleu_score
+
+   # Single reference
+   references = ["สวัสดีครับ วันนี้อากาศดีมาก"]
+   hypotheses = ["สวัสดีค่ะ วันนี้อากาศดี"]
+   score = bleu_score(references, hypotheses)
+   print(f"BLEU: {score['bleu']:.2f}")
+
+   # Multiple references per hypothesis
+   references = [
+       ["สวัสดีครับ", "สวัสดีค่ะ"],
+       ["ลาก่อนครับ", "ลาก่อนค่ะ"],
+   ]
+   hypotheses = ["สวัสดี", "ลาก่อน"]
+   score = bleu_score(references, hypotheses)
+   print(f"BLEU: {score['bleu']:.2f}")
+
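The arithmetic behind a BLEU-style score can be sketched without pythainlp. The snippet below is illustrative only: the function names, the pre-tokenized list interface, and the no-smoothing choice are this example's assumptions, not the library's API. It combines clipped n-gram precisions with a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hypothesis) >= len(reference) \
        else math.exp(1 - len(reference) / len(hypothesis))
    return bp * geo_mean
```

Production implementations typically add smoothing, corpus-level aggregation, and multi-reference support; this sketch keeps only the core formula.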
+ROUGE Score
+^^^^^^^^^^^
+
+ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It measures the overlap between the generated text and the reference text(s).
+
+.. autofunction:: pythainlp.benchmarks.rouge_score
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import rouge_score
+
+   reference = "สวัสดีครับ วันนี้อากาศดีมาก"
+   hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
+   scores = rouge_score(reference, hypothesis)
+
+   for rouge_type, (precision, recall, fmeasure) in scores.items():
+       print(f"{rouge_type}: P={precision:.4f}, R={recall:.4f}, F={fmeasure:.4f}")
+
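Conceptually, ROUGE-N reduces to clipped n-gram overlap reported as precision, recall, and F1. A minimal sketch follows; the function name and the pre-tokenized interface are this example's own, not pythainlp's API:

```python
from collections import Counter

def rouge_n(reference_tokens, hypothesis_tokens, n=1):
    """ROUGE-N sketch: clipped n-gram overlap as (precision, recall, F1)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, hyp = grams(reference_tokens), grams(hypothesis_tokens)
    # Each n-gram is counted at most min(ref count, hyp count) times
    overlap = sum((ref & hyp).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```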
+Word Error Rate (WER)
+^^^^^^^^^^^^^^^^^^^^^
+
+Word Error Rate is a common metric for evaluating speech recognition and machine translation systems. It measures the minimum number of word-level edits (insertions, deletions, and substitutions) needed to transform the hypothesis into the reference.
+
+.. autofunction:: pythainlp.benchmarks.word_error_rate
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import word_error_rate
+
+   reference = "สวัสดีครับ วันนี้อากาศดีมาก"
+   hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
+   wer = word_error_rate(reference, hypothesis)
+   print(f"WER: {wer:.4f}")
+
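Under the hood, WER is a Levenshtein (edit) distance computed over words rather than characters, divided by the number of reference words. The sketch below substitutes plain whitespace splitting for pythainlp's automatic Thai tokenization, so both the names and the tokenization are this example's assumptions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def simple_wer(reference, hypothesis):
    """WER sketch: word-level edits divided by reference word count."""
    ref_words = reference.split()   # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```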
+Character Error Rate (CER)
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Character Error Rate is a metric for evaluating speech recognition and optical character recognition (OCR) systems. It measures the minimum number of character-level edits (insertions, deletions, and substitutions) needed to transform the hypothesis into the reference.
+
+.. autofunction:: pythainlp.benchmarks.character_error_rate
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import character_error_rate
+
+   reference = "สวัสดีครับ"
+   hypothesis = "สวัสดีค่ะ"
+   cer = character_error_rate(reference, hypothesis)
+   print(f"CER: {cer:.4f}")
+
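CER applies the same dynamic-programming edit distance to characters instead of words, normalized by the reference length. A self-contained sketch (``simple_cer`` is an illustrative name, not the library function):

```python
def simple_cer(reference, hypothesis):
    """CER sketch: character-level Levenshtein edits / reference length."""
    # Rolling single-row dynamic programming over the edit-distance table
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (ref_ch != hyp_ch)))  # substitution/match
        prev = curr
    return prev[-1] / max(len(reference), 1)
```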
 Usage
 -----

-To make use of these benchmarking functions, you can follow the provided examples and guidelines in the official PyThaiNLP documentation. These tools are invaluable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods.
+To make use of these benchmarking functions, follow the examples and guidelines in the official PyThaiNLP documentation. These tools are valuable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods and text generation systems.

pythainlp/benchmarks/__init__.py

Lines changed: 13 additions & 1 deletion
@@ -3,6 +3,18 @@
 # SPDX-License-Identifier: Apache-2.0
 """Performance benchmarking."""

-__all__: list[str] = ["benchmark"]
+__all__: list[str] = [
+    "benchmark",
+    "bleu_score",
+    "character_error_rate",
+    "rouge_score",
+    "word_error_rate",
+]

+from pythainlp.benchmarks.metrics import (
+    bleu_score,
+    character_error_rate,
+    rouge_score,
+    word_error_rate,
+)
 from pythainlp.benchmarks.word_tokenization import benchmark
