acoli-repo/hammy


Hammy. HMM tagging for languages without training corpora

Hammy was created for didactic purposes and thus does not provide any of the optimizations commonly found in real-world HMM implementations. If you would nevertheless like to use it, please cite the following paper:

 Christian Chiarcos (2026), Towards the Morphological Annotation of North Markian (Low German), LREC-2026.

The innovative idea is to use an unannotated corpus and to compare it with related languages to obtain transition probabilities:

  • extrapolate emission probability from frequency
  • observe transition probabilities from related languages
  • optionally: refine emission probabilities with prefix- / suffix-matching
  • optionally: tag top k expressions manually

We include data for an experiment on Middle High German, using ReM as the gold standard and UD corpora as the basis.

Build the data with:

$> make 
  • target/mhd/UD_split: gold corpus, automatically mapped from the original annotation and split 80-10-10. Note that we split by files, not by lines, so the orthographies differ between the splits.

So far, we have a vanilla HMM implementation:

training (over train split, UD version)

$> python3 train.py rem_train.model target/mhd/UD_split/train.conllu

tagging (and evaluation)

$> python3 tag.py rem_train.model target/mhd/UD_split/test.conllu -e 3

To train and tag on full tags (incl. morphosyntactic features), specify the columns:

$> python3 train.py rem_train.xpos.model target/mhd/UD_split/train.conllu -c 4
$> python3 tag.py rem_train.xpos.model target/mhd/UD_split/test.conllu -e 4

Use adopt.py to port a trained model to another language:

  • input data should be pre-annotated against the same tagset, e.g., from a dictionary
    • input data can contain ambiguities, marked by |; in that case, both tags are counted, weighted by their current probabilities for UNKNOWN
    • input data can contain gaps
  • replace emission probabilities and word2freq
    • keep transition probabilities
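
The weighted counting of ambiguous entries described above can be sketched as follows. This is a minimal illustration, not the code of adopt.py: the function name `emission_counts`, the plain-dict layout, the use of `_` as the gap marker, the normalization of the weights, and the toy words are all assumptions made for the example.

```python
from collections import defaultdict

def emission_counts(entries, unknown_probs):
    """Collect (possibly fractional) tag counts per word.

    entries: (word, tag_string) pairs; tag_string may be ambiguous
             ('NOUN|VERB') or a gap (assumed here to be '' or '_').
    unknown_probs: current probability of each tag for UNKNOWN words.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for word, tag_string in entries:
        if tag_string in ("", "_"):
            continue  # gap: this entry carries no tag information
        tags = tag_string.split("|")
        weights = [unknown_probs.get(tag, 0.0) for tag in tags]
        total = sum(weights)
        for tag, weight in zip(tags, weights):
            # Split one observation across the candidate tags; fall back to
            # an even split if no candidate has any UNKNOWN probability.
            counts[word][tag] += weight / total if total > 0 else 1.0 / len(tags)
    return counts

# Toy input (hypothetical words and probabilities)
unknown_probs = {"NOUN": 0.6, "VERB": 0.2, "ADJ": 0.2}
entries = [("hus", "NOUN"), ("leven", "NOUN|VERB"), ("unde", "_")]
counts = emission_counts(entries, unknown_probs)
```

In this toy input, the ambiguous entry contributes 0.75 to NOUN and 0.25 to VERB, and the gap contributes nothing.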

Use merge.py to merge two or more models:

  • average transition probabilities
  • normalize initialization probabilities
  • average emission probabilities (if provided by at least one of the models)
  • add word2freq scores (if provided by at least one of the models)
  • apply lowercasing to non-lowercased models if merged with a lowercased model
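
As an illustration of the first step, averaging transition probabilities over several models could look like the sketch below. It assumes a hypothetical dict-of-dicts layout (`trans[prev][next]`) and renormalizes each row after averaging; whether merge.py stores or renormalizes its tables this way is not stated here.

```python
from collections import defaultdict

def average_transitions(models):
    """Average P(next_tag | prev_tag) over models, then renormalize each row."""
    summed = defaultdict(lambda: defaultdict(float))
    for trans in models:
        for prev, row in trans.items():
            for nxt, p in row.items():
                summed[prev][nxt] += p / len(models)
    merged = {}
    for prev, row in summed.items():
        total = sum(row.values())
        # Rows missing from some models sum to less than 1, so rescale them.
        merged[prev] = {nxt: p / total for nxt, p in row.items()}
    return merged

# Two toy models (hypothetical values)
model_a = {"DET": {"NOUN": 0.8, "ADJ": 0.2}}
model_b = {"DET": {"NOUN": 0.6, "ADJ": 0.4}, "ADJ": {"NOUN": 1.0}}
merged = average_transitions([model_a, model_b])
```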

Notes:

  • if trained on ReM train corpus, test corpus accuracy (UD) is 87.8%
  • if trained on dev corpus, test corpus accuracy (UD) is 85.8%
  • we don't benefit much from more data ;)
  • tagging with the full tagset is rather slow. Note that the morphosyntactic features include aspects that are actually morphological rather than morphosyntactic, e.g., inflection class. For these, context-based disambiguation is likely to fail.

The fancy part is combining transition probabilities from one or more other models with dictionary-based or bootstrapped emission probabilities.

Known issues

The code works OK on small tagsets, e.g., UD. On tagsets with a few thousand tags, it is prohibitively slow, because full Viterbi decoding is quadratic in the number of states. (For the 3183 ReM tags, this means around 10 million multiplications per word, since 3183² ≈ 10.1 million.)

Ways to address this:

  • beam search: keep the top k current states only, not all
  • model as matrices and multiply via NumPy
  • use a NumPy HMM implementation instead ;)
    • a fragment based on Numpy-ML is included in tag.py, but commented out, because Numpy-ML hasn't been updated for Python 3.10 yet
    • maybe https://pypi.org/project/seqlearn/ (fork from SciKit-Learn), but this is even less properly maintained
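
The first two options can be combined. The sketch below is a log-space Viterbi over NumPy arrays that optionally expands only the top k states of the previous step. It is an illustration under assumed conventions, not the code in tag.py: the function name `viterbi_beam`, the array layout (`log_trans[i, j]` = log P(j | i), `log_emit[state, word_id]`), and the toy model are all made up for the example.

```python
import numpy as np

def viterbi_beam(log_init, log_trans, log_emit, obs, k=None):
    """Return the best state path for obs (a list of word ids).

    If k is given, only the k best-scoring states of the previous step are
    expanded, so each word costs k * S instead of S * S operations.
    """
    n_states = log_init.shape[0]
    cols = np.arange(n_states)
    score = log_init + log_emit[:, obs[0]]
    back = np.zeros((len(obs), n_states), dtype=int)
    for t in range(1, len(obs)):
        if k is not None and k < n_states:
            keep = np.argpartition(score, -k)[-k:]  # indices of the k best states
        else:
            keep = cols                             # full Viterbi
        # cand[i, j]: score of reaching state j from kept state keep[i]
        cand = score[keep][:, None] + log_trans[keep, :]
        best = cand.argmax(axis=0)
        back[t] = keep[best]
        score = cand[best, cols] + log_emit[:, obs[t]]
    # Follow the back-pointers from the best final state
    path = [int(score.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model with 2 states and 2 word types (hypothetical values)
log_init = np.log(np.array([0.6, 0.4]))
log_trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
obs = [0, 1, 1]

full_path = viterbi_beam(log_init, log_trans, log_emit, obs)
beam_path = viterbi_beam(log_init, log_trans, log_emit, obs, k=1)
```

With a beam, the result is no longer guaranteed to be the globally best path, so k trades accuracy for speed.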

Use log probabilities instead of plain probabilities: less loss of precision.
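
A small demonstration of why this matters, using made-up numbers: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays finite.

```python
import math

# 100 per-step probabilities along one long path (hypothetical values)
probs = [1e-5] * 100

plain = 1.0
for p in probs:
    plain *= p  # the true value 1e-500 is below the float64 range

# Sum of logs instead of product of probabilities
log_score = sum(math.log(p) for p in probs)

print(plain)
print(log_score)
```

Once the plain product has collapsed to zero, all paths look equally (im)probable, whereas log scores can still be compared.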

History

  • 2024-01-25: changed tag.py to evaluate a token only if its gold tag isn't blank or '_'. This is because we might decide not to evaluate certain sentences, e.g., because they are in another language. So, if tokens are systematically ambiguous between foreign language X and their in-language tag, using one in favour of the other shouldn't count against a tagger.
  • 2024-01-25: added adopt.py
  • 2024-12-17: added merge.py
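
The evaluation rule from the first 2024-01-25 entry can be sketched as follows. This is only an illustration of the rule, not the code in tag.py; the function name `accuracy` and the toy tags are assumptions.

```python
def accuracy(gold_tags, predicted_tags):
    """Accuracy over tokens whose gold tag is neither blank nor '_'."""
    scored = [
        (gold, pred)
        for gold, pred in zip(gold_tags, predicted_tags)
        if gold.strip() not in ("", "_")
    ]
    if not scored:
        return None  # nothing to evaluate
    return sum(gold == pred for gold, pred in scored) / len(scored)

# Toy example: tokens 3 and 5 have no usable gold tag and are skipped
gold = ["DET", "NOUN", "_", "VERB", ""]
pred = ["DET", "ADJ", "X", "VERB", "NOUN"]
result = accuracy(gold, pred)
```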
