Hammy was created for didactic purposes and therefore does not provide the optimizations commonly found in real-world HMM implementations. If you would nevertheless like to use it, please cite the following paper:
Christian Chiarcos (2026), Towards the Morphological Annotation of North Markian (Low German), LREC-2026.
The innovative idea is to start from an unannotated corpus and to obtain transition probabilities by comparison with related languages:
- extrapolate emission probability from frequency
- observe transition probabilities from related languages
- optionally: refine emission probabilities with prefix- / suffix-matching
- optionally: tag top k expressions manually
We include data for an experiment on Middle High German, using ReM as gold standard and UD corpora as basis.
Build the data with:
$> make
target/mhd/UD_split: gold corpus, automatically mapped from the original annotation and split 80-10-10. Note that we split by files, not by lines, so the orthographies differ between the splits.
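Splitting by files rather than by lines can be sketched as follows. This is only an illustration of the idea, not the logic of the Makefile: the function name `split_by_file`, the seed, and the file names are hypothetical.

```python
import random


def split_by_file(files, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole files to train/dev/test so that no file is shared between splits."""
    files = sorted(files)
    random.Random(seed).shuffle(files)  # fixed seed keeps the split reproducible
    n_train = int(len(files) * ratios[0])
    n_dev = int(len(files) * ratios[1])
    return {
        "train": files[:n_train],
        "dev": files[n_train:n_train + n_dev],
        "test": files[n_train + n_dev:],
    }


# hypothetical file names, for illustration only
splits = split_by_file([f"doc{i:02d}.conllu" for i in range(20)])
```

Because every file ends up in exactly one split, spelling conventions that are specific to one manuscript never appear in both training and test data.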
So far, we have a vanilla HMM implementation:
training (over train split, UD version)
$> python3 train.py rem_train.model target/mhd/UD_split/train.conllu
tagging (and evaluation)
$> python3 tag.py rem_train.model target/mhd/UD_split/test.conllu -e 3
To train and tag on full tags (incl. morphosyntactic features), specify the columns
$> python3 train.py rem_train.xpos.model target/mhd/UD_split/train.conllu -c 4
$> python3 tag.py rem_train.xpos.model target/mhd/UD_split/test.conllu -e 4
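The column numbers in the commands above are consistent with 0-based CoNLL-U column indexes (3 = UPOS, 4 = XPOS), although that is an assumption on our side. Under that assumption, picking the tag column could look like the sketch below; `read_column` and the sample rows are made up for illustration and are not taken from train.py or from ReM.

```python
def read_column(lines, col):
    """Yield sentences as lists of (form, tag), taking the tag from 0-based column `col`."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):    # skip CoNLL-U comment lines
            fields = line.split("\t")
            sentence.append((fields[1], fields[col]))
    if sentence:
        yield sentence


# toy rows with invented tags: ID, FORM, LEMMA, UPOS, XPOS, FEATS
sample = [
    "# sent_id = 1\n",
    "1\tw1\tl1\tPRON\tX1\t_\n",
    "2\tw2\tl2\tVERB\tX2\t_\n",
    "\n",
]
upos = list(read_column(sample, 3))
xpos = list(read_column(sample, 4))
```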
Use adopt.py to port a trained model to another language:
- input data should be pre-annotated against the same tagset, e.g., from a dictionary
- input data can contain ambiguities, marked by |; both readings are then counted, weighted by the current probabilities for UNKNOWN
- input data can contain gaps
- replace emission probabilities and word2freq
- keep transition probabilities
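The steps above can be sketched roughly as follows. The model layout (a dict with "transitions", "emissions" and "word2freq"), the function name `adopt`, and the fallback to equal weights are assumptions for illustration; the actual format used by adopt.py may differ.

```python
from collections import defaultdict


def adopt(model, entries, unknown="UNKNOWN"):
    """Rebuild emissions and word2freq from (word, tags) entries; keep transitions."""
    old = model["emissions"]
    counts = defaultdict(lambda: defaultdict(float))
    word2freq = defaultdict(float)
    for word, tags in entries:
        word2freq[word] += 1
        if tags in ("", "_"):
            continue  # gap: the word is counted but contributes no emission
        candidates = tags.split("|")
        # weight ambiguous readings by the old model's UNKNOWN emission per tag
        weights = [old.get(t, {}).get(unknown, 0.0) for t in candidates]
        if sum(weights) == 0:
            weights = [1.0] * len(candidates)  # assumption: fall back to equal weights
        total = sum(weights)
        for tag, weight in zip(candidates, weights):
            counts[tag][word] += weight / total
    emissions = {
        tag: {w: c / sum(row.values()) for w, c in row.items()}
        for tag, row in counts.items()
    }
    return {
        "transitions": model["transitions"],  # kept unchanged
        "emissions": emissions,
        "word2freq": dict(word2freq),
    }


# toy model and dictionary entries, invented for illustration
source = {
    "transitions": {"NOUN": {"VERB": 1.0}},
    "emissions": {"NOUN": {"UNKNOWN": 0.75}, "VERB": {"UNKNOWN": 0.25}},
    "word2freq": {},
}
ported = adopt(source, [("w1", "NOUN"), ("w2", "NOUN|VERB"), ("w3", "_")])
```

In this toy run, the ambiguous entry w2 contributes 0.75 of a count to NOUN and 0.25 to VERB, and w3 only appears in word2freq.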
Use merge.py to merge two or more models:
- average transition probabilities
- normalize initialization probabilities
- average emission probabilities (if provided by at least one of the models)
- add word2freq scores (if provided by at least one of the models)
- apply lowercasing to non-lowercased models if merged with a lowercased model
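A minimal sketch of the averaging and normalization steps is given below. It assumes a hypothetical dict-based model layout ("init", "transitions", "emissions", "word2freq") and treats entries missing from one model as probability 0; lowercasing is left out. None of this is claimed to match merge.py in detail.

```python
from collections import Counter


def average_tables(tables):
    """Average {state: {key: prob}} tables over the models that provide one."""
    tables = [t for t in tables if t]  # skip models without this table
    merged = {}
    for table in tables:
        for state, row in table.items():
            target = merged.setdefault(state, {})
            for key, p in row.items():
                # entries missing from a model implicitly count as 0 (assumption)
                target[key] = target.get(key, 0.0) + p / len(tables)
    return merged


def normalize(dist):
    """Rescale a distribution so that its values sum to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()} if total else dict(dist)


def merge(models):
    init, word2freq = Counter(), Counter()
    for m in models:
        init.update(m.get("init", {}))            # summed, then normalized below
        word2freq.update(m.get("word2freq", {}))  # frequencies are added up
    return {
        "init": normalize(init),
        "transitions": average_tables([m.get("transitions") for m in models]),
        "emissions": average_tables([m.get("emissions") for m in models]),
        "word2freq": dict(word2freq),
    }


# two toy models, invented for illustration; model_b has no emissions
model_a = {
    "init": {"NOUN": 0.6, "VERB": 0.4},
    "transitions": {"NOUN": {"VERB": 1.0}},
    "emissions": {"NOUN": {"w1": 1.0}},
    "word2freq": {"w1": 2},
}
model_b = {
    "init": {"NOUN": 0.2, "VERB": 0.8},
    "transitions": {"NOUN": {"NOUN": 0.5, "VERB": 0.5}},
}
merged = merge([model_a, model_b])
```

Since only model_a provides emissions here, they are carried over unchanged, while the transition row for NOUN becomes the mean of both models.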
Notes:
- if trained on ReM train corpus, test corpus accuracy (UD) is 87.8%
- if trained on dev corpus, test corpus accuracy (UD) is 85.8%
- we don't benefit much from more data ;)
- tagging with the full tagset is rather slow. Note that the morphosyntactic features include aspects that are actually morphological rather than morphosyntactic, e.g., inflection class. For these, context-based disambiguation is likely to fail.
The fancy part comes now: combining transition probabilities from one or more other models with dictionary-based or bootstrapped emission probabilities.
The code works OK on small tagsets, e.g., UD. On tagsets with a few thousand tags, it is prohibitively slow, because full Viterbi decoding is quadratic in the number of states. (For 3183 ReM tags, this means around 10 million multiplications per word.)
Ways to address this:
- beam search: keep the top k current states only, not all
- model as matrices and multiply via NumPy
- use a NumPy-based HMM implementation instead ;)
- a fragment based on Numpy-ML is included in (but commented out from) tag.py: Numpy-ML hasn't been updated for Python 3.10, yet - maybe https://pypi.org/project/seqlearn/ (fork from SciKit-Learn), but this is even less properly maintained
- use log instead of plain probabilities, less loss of precision
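Two of the ideas above, beam search and log probabilities, can be combined in a small pure-Python sketch. The function names, the nested-dict model layout, the `FLOOR` constant standing in for log(0), and the toy model are all assumptions for illustration; this is not the code in tag.py.

```python
import math

FLOOR = -1e9  # stand-in for log(0); an assumption, not the repository's smoothing


def lookup(table, outer, inner):
    """Return a log-probability from a nested dict, or FLOOR if it is missing."""
    return table.get(outer, {}).get(inner, FLOOR)


def viterbi_beam(words, states, log_init, log_trans, log_emit, beam=None):
    """Log-space Viterbi; with `beam` set, only the best `beam` predecessors are expanded."""
    scores = {s: log_init.get(s, FLOOR) + lookup(log_emit, s, words[0]) for s in states}
    backpointers = []
    for word in words[1:]:
        # beam=None keeps every state, i.e. full (quadratic) Viterbi
        active = sorted(scores, key=scores.get, reverse=True)[:beam]
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(active, key=lambda p: scores[p] + lookup(log_trans, p, s))
            pointers[s] = prev
            new_scores[s] = scores[prev] + lookup(log_trans, prev, s) + lookup(log_emit, s, word)
        backpointers.append(pointers)
        scores = new_scores
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# toy model with invented probabilities, for illustration only
log = math.log
states = ["DET", "NOUN", "VERB"]
log_init = {"DET": log(0.8), "NOUN": log(0.1), "VERB": log(0.1)}
log_trans = {
    "DET": {"NOUN": log(0.9), "VERB": log(0.1)},
    "NOUN": {"NOUN": log(0.2), "VERB": log(0.8)},
    "VERB": {"DET": log(0.5), "NOUN": log(0.5)},
}
log_emit = {
    "DET": {"the": log(1.0)},
    "NOUN": {"dog": log(0.7), "barks": log(0.3)},
    "VERB": {"dog": log(0.1), "barks": log(0.9)},
}
words = ["the", "dog", "barks"]
full = viterbi_beam(words, states, log_init, log_trans, log_emit)
pruned = viterbi_beam(words, states, log_init, log_trans, log_emit, beam=1)
```

With a beam of size k, each word costs roughly k times the number of states instead of the number of states squared, at the price of possibly missing the globally best path. Adding log probabilities instead of multiplying plain ones also avoids underflow on long sentences.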
- 2024-01-25: change tag.py to only evaluate if gold tag isn't blank or '_'. This is because we might decide to not evaluate certain sentences, e.g., because they are part of another language. So, if they are systematically ambiguous between foreign language X and their in-language tag, using one in favour of the other shouldn't count against a tagger.
- 2024-01-25: added adopt.py
- 2024-12-17: added merge.py