Commit b008610

Merge pull request #1295 from PyThaiNLP/copilot/add-bleu-rouge-metrics
Add pure Python BLEU, ROUGE, WER, and CER metrics with automatic Thai tokenization
2 parents ffdaf3b + 1ee0d5b commit b008610

File tree

5 files changed: +883 additions, -7 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,8 @@ See PR for prompt and details.
 - Ensure thread-safety for tokenizers #1213
 - Add Thai-NNER integration with top-level entity filtering #1221
 - Reorganize noauto test suite by dependency groups
-  (torch, tensorflow, onnx, cython, network) #935
+  (torch, tensorflow, onnx, cython, network) #1290
+- Add BLEU, ROUGE, WER, and CER metrics to pythainlp.benchmarks #1295
 - Improved documentation; code cleanup; more tests

 ## Version 5.1.2 -> 5.2.0

docs/api/benchmarks.rst

Lines changed: 93 additions & 4 deletions
@@ -6,7 +6,7 @@ pythainlp.benchmarks
 Introduction
 ------------

-The `pythainlp.benchmarks` module is a collection of utility functions designed for benchmarking tasks related to Thai Natural Language Processing (NLP). Currently, the module includes tools for word tokenization benchmarking. Please note that additional benchmarking tasks will be incorporated in the future.
+The `pythainlp.benchmarks` module is a collection of utility functions for benchmarking tasks in Thai Natural Language Processing (NLP). The module includes tools for word tokenization benchmarking and evaluation metrics for text generation tasks (BLEU, ROUGE, WER, and CER).

 Tokenization
 ------------
@@ -23,8 +23,8 @@ The quality of word tokenization can significantly impact the accuracy of downst

 Qualitative evaluation of word tokenization.

-Functions
----------
+Tokenization Functions
+^^^^^^^^^^^^^^^^^^^^^^

 .. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats

@@ -38,7 +38,96 @@ Functions

 Preprocessing is a crucial step in NLP tasks. The `preprocessing` function assists in preparing text data for tokenization, which is essential for accurate and consistent benchmarking.

+Evaluation Metrics
+------------------
+
+The module provides pure Python implementations of common evaluation metrics (BLEU, ROUGE, WER, and CER) that automatically handle Thai text tokenization. These metrics are essential for evaluating machine translation, text summarization, speech recognition, and other text generation tasks.
+
+BLEU Score
+^^^^^^^^^^
+
+BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-translated text. It compares the generated text against one or more reference translations by measuring n-gram precision with a brevity penalty.
+
+.. autofunction:: pythainlp.benchmarks.bleu_score
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import bleu_score
+
+   # Single reference
+   references = ["สวัสดีครับ วันนี้อากาศดีมาก"]
+   hypotheses = ["สวัสดีค่ะ วันนี้อากาศดี"]
+   score = bleu_score(references, hypotheses)
+   print(f"BLEU: {score['bleu']:.2f}")
+
+   # Multiple references per hypothesis
+   references = [
+       ["สวัสดีครับ", "สวัสดีค่ะ"],
+       ["ลาก่อนครับ", "ลาก่อนค่ะ"],
+   ]
+   hypotheses = ["สวัสดี", "ลาก่อน"]
+   score = bleu_score(references, hypotheses)
+   print(f"BLEU: {score['bleu']:.2f}")
+
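The arithmetic behind a BLEU-style score can be sketched without pythainlp. The snippet below is illustrative only: the function names, the pre-tokenized list interface, and the no-smoothing choice are this example's assumptions, not the library's API. It combines clipped n-gram precisions with a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hypothesis) >= len(reference) \
        else math.exp(1 - len(reference) / len(hypothesis))
    return bp * geo_mean
```

Production implementations typically add smoothing, corpus-level aggregation, and multi-reference support; this sketch keeps only the core formula.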
+ROUGE Score
+^^^^^^^^^^^
+
+ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It measures the overlap between the generated text and the reference text(s).
+
+.. autofunction:: pythainlp.benchmarks.rouge_score
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import rouge_score
+
+   reference = "สวัสดีครับ วันนี้อากาศดีมาก"
+   hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
+   scores = rouge_score(reference, hypothesis)
+
+   for rouge_type, (precision, recall, fmeasure) in scores.items():
+       print(f"{rouge_type}: P={precision:.4f}, R={recall:.4f}, F={fmeasure:.4f}")
+
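Conceptually, ROUGE-N reduces to clipped n-gram overlap reported as precision, recall, and F1. A minimal sketch follows; the function name and the pre-tokenized interface are this example's own, not pythainlp's API:

```python
from collections import Counter

def rouge_n(reference_tokens, hypothesis_tokens, n=1):
    """ROUGE-N sketch: clipped n-gram overlap as (precision, recall, F1)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, hyp = grams(reference_tokens), grams(hypothesis_tokens)
    # Each n-gram is counted at most min(ref count, hyp count) times
    overlap = sum((ref & hyp).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```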
+Word Error Rate (WER)
+^^^^^^^^^^^^^^^^^^^^^
+
+Word Error Rate is a common metric for evaluating speech recognition and machine translation systems. It measures the minimum number of word-level edits (insertions, deletions, and substitutions) needed to transform the hypothesis into the reference.
+
+.. autofunction:: pythainlp.benchmarks.word_error_rate
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import word_error_rate
+
+   reference = "สวัสดีครับ วันนี้อากาศดีมาก"
+   hypothesis = "สวัสดีค่ะ วันนี้อากาศดี"
+   wer = word_error_rate(reference, hypothesis)
+   print(f"WER: {wer:.4f}")
+
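Under the hood, WER is a Levenshtein (edit) distance computed over words rather than characters, divided by the number of reference words. The sketch below substitutes plain whitespace splitting for pythainlp's automatic Thai tokenization, so both the names and the tokenization are this example's assumptions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def simple_wer(reference, hypothesis):
    """WER sketch: word-level edits divided by reference word count."""
    ref_words = reference.split()   # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```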
+Character Error Rate (CER)
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Character Error Rate is a metric for evaluating speech recognition and optical character recognition (OCR) systems. It measures the minimum number of character-level edits (insertions, deletions, and substitutions) needed to transform the hypothesis into the reference.
+
+.. autofunction:: pythainlp.benchmarks.character_error_rate
+
+**Example:**
+
+.. code-block:: python
+
+   from pythainlp.benchmarks import character_error_rate
+
+   reference = "สวัสดีครับ"
+   hypothesis = "สวัสดีค่ะ"
+   cer = character_error_rate(reference, hypothesis)
+   print(f"CER: {cer:.4f}")
+
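CER applies the same dynamic-programming edit distance to characters instead of words, normalized by the reference length. A self-contained sketch (``simple_cer`` is an illustrative name, not the library function):

```python
def simple_cer(reference, hypothesis):
    """CER sketch: character-level Levenshtein edits / reference length."""
    # Rolling single-row dynamic programming over the edit-distance table
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (ref_ch != hyp_ch)))  # substitution/match
        prev = curr
    return prev[-1] / max(len(reference), 1)
```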
 Usage
 -----

-To make use of these benchmarking functions, you can follow the provided examples and guidelines in the official PyThaiNLP documentation. These tools are invaluable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods.
+To make use of these benchmarking functions, follow the examples and guidelines in the official PyThaiNLP documentation. These tools are valuable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods and text generation systems.

pythainlp/benchmarks/__init__.py

Lines changed: 13 additions & 1 deletion
@@ -3,6 +3,18 @@
 # SPDX-License-Identifier: Apache-2.0
 """Performance benchmarking."""

-__all__: list[str] = ["benchmark"]
+__all__: list[str] = [
+    "benchmark",
+    "bleu_score",
+    "character_error_rate",
+    "rouge_score",
+    "word_error_rate",
+]

+from pythainlp.benchmarks.metrics import (
+    bleu_score,
+    character_error_rate,
+    rouge_score,
+    word_error_rate,
+)
 from pythainlp.benchmarks.word_tokenization import benchmark
