Hammy. HMM tagging for languages without training corpora

Hammy was created for didactic purposes and therefore provides none of the optimizations commonly found in real-world HMM implementations. If you would nevertheless like to use it, please cite the following paper:

 Christian Chiarcos (2026), Towards the Morphological Annotation of North Markian (Low German), LREC-2026.

The innovative idea is to use an unannotated corpus and to compare it with related languages to obtain transition probabilities:

extrapolate emission probability from frequency
observe transition probabilities from related languages
optionally: refine emission probabilities with prefix- / suffix-matching
optionally: tag top k expressions manually
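
As a minimal sketch of why this division of labour can work: a first-order HMM scores a tagged sentence as the product of one initialization probability, the transition probabilities between adjacent tags, and the emission probabilities of words given tags. Only the emission table refers to word forms, so the transition table can in principle be taken from a related language. The function name, the dictionary layout and the toy numbers below are illustrative assumptions, not the data structures used by train.py or tag.py.

```python
def sequence_probability(words, tags, init, trans, emit):
    """Joint probability P(words, tags) under a first-order HMM.

    init:  tag -> P(tag at sentence start)
    trans: (previous tag, tag) -> P(tag | previous tag)
    emit:  (tag, word) -> P(word | tag)
    Missing entries are treated as probability 0 (an assumption of this sketch).
    """
    prob = init.get(tags[0], 0.0) * emit.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        prob *= trans.get((tags[i - 1], tags[i]), 0.0)  # language-independent part
        prob *= emit.get((tags[i], words[i]), 0.0)      # target-language part
    return prob


# Toy model: transitions could come from a related language,
# emissions from frequencies or a dictionary of the target language.
init = {"DET": 0.6, "NOUN": 0.4}
trans = {("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
         ("NOUN", "DET"): 0.3, ("NOUN", "NOUN"): 0.7}
emit = {("DET", "der"): 0.5, ("NOUN", "man"): 0.2}

score = sequence_probability(["der", "man"], ["DET", "NOUN"], init, trans, emit)
print(score)  # 0.6 * 0.5 * 0.9 * 0.2
```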

We include data for an experiment on Middle High German, using ReM as the gold standard and UD corpora as the basis.

Build the data with:

$> make

target/mhd/UD_split: gold corpus, automatically mapped from the original annotation and split 80-10-10 into train, dev and test. Note that we split by files, not by lines, so the orthographies differ between the splits.
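
A file-level 80-10-10 split can be sketched as follows. This is not the logic of the Makefile, which may select files differently; the function name, the fixed seed and the file names are assumptions made for illustration. The point is only that whole files, and thus whole texts with their own orthography, end up in exactly one split.

```python
import random


def split_by_files(files, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Assign whole files to train/dev/test (hypothetical helper)."""
    files = sorted(files)                  # deterministic starting order
    random.Random(seed).shuffle(files)     # reproducible shuffle
    n_train = int(len(files) * ratios[0])
    n_dev = int(len(files) * ratios[1])
    return {
        "train": files[:n_train],
        "dev": files[n_train:n_train + n_dev],
        "test": files[n_train + n_dev:],   # remainder goes to test
    }


# Hypothetical file names
files = ["doc%02d.conllu" % i for i in range(10)]
splits = split_by_files(files)
print({name: len(members) for name, members in splits.items()})
```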

So far, we have a vanilla HMM implementation:

training (over train split, UD version)

$> python3 train.py rem_train.model target/mhd/UD_split/train.conllu

tagging (and evaluation)

$> python3 tag.py rem_train.model target/mhd/UD_split/test.conllu -e 3

To train and tag on full tags (incl. morphosyntactic features), specify the columns:

$> python3 train.py rem_train.xpos.model target/mhd/UD_split/train.conllu -c 4
$> python3 tag.py rem_train.xpos.model target/mhd/UD_split/test.conllu -e 4

Use adopt.py to port a trained model to another language:

input data should be pre-annotated against the same tagset, e.g., from a dictionary
- input data can contain ambiguities, marked by |; in that case, both tags are counted, weighted by the current probabilities for UNKNOWN
- input data can contain gaps
replace emission probabilities and word2freq
- keep transition probabilities
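
The treatment of ambiguous entries can be sketched as follows. This is one possible reading of the description above, not the code of adopt.py: it assumes that the "current probabilities for UNKNOWN" are available as a mapping from tag to the model's emission probability for unknown words, and that a gap is encoded as an empty tag or "_". All names are hypothetical.

```python
from collections import defaultdict


def weighted_tag_counts(entries, unknown_prob):
    """Turn (word, tags) pairs into fractional tag -> word counts.

    An ambiguous entry such as "NOUN|PRON" contributes one count in total,
    distributed over its tags in proportion to unknown_prob[tag].
    """
    counts = defaultdict(lambda: defaultdict(float))
    for word, tag_field in entries:
        if not tag_field or tag_field == "_":
            continue  # gap: no annotation for this word
        tags = tag_field.split("|")
        weights = [unknown_prob.get(tag, 0.0) for tag in tags]
        total = sum(weights)
        for tag, weight in zip(tags, weights):
            # fall back to a uniform split if no weight is known
            share = weight / total if total > 0 else 1.0 / len(tags)
            counts[tag][word] += share
    return counts


entries = [("man", "NOUN|PRON"), ("der", "DET"), ("x", "_")]
unknown_prob = {"NOUN": 0.3, "PRON": 0.1, "DET": 0.05}
counts = weighted_tag_counts(entries, unknown_prob)
print({tag: dict(words) for tag, words in counts.items()})
```

New emission probabilities could then be obtained by normalizing the counts per tag, while the transition probabilities of the original model stay untouched.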

Use merge.py to merge two or more models:

average transition probabilities
normalize initialization probabilities
average emission probabilities (if provided by at least one of the models)
add word2freq scores (if provided by at least one of the models)
apply lowercasing to non-lowercased models if merged with a lowercased model
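
The first two steps can be sketched as follows. This is an illustration under assumptions, not the code of merge.py: transition tables are assumed to be dictionaries keyed by (previous tag, tag), a transition missing from a model is assumed to count as probability 0, and initialization probabilities are assumed to be summed before normalization. All function names are hypothetical.

```python
def average_transitions(models):
    """Average P(tag | previous tag) over several transition tables."""
    keys = set()
    for model in models:
        keys.update(model)
    return {key: sum(model.get(key, 0.0) for model in models) / len(models)
            for key in keys}


def merge_init(inits):
    """Sum initialization probabilities and renormalize them to 1."""
    summed = {}
    for init in inits:
        for tag, prob in init.items():
            summed[tag] = summed.get(tag, 0.0) + prob
    total = sum(summed.values())
    return {tag: prob / total for tag, prob in summed.items()} if total > 0 else summed


trans_a = {("DET", "NOUN"): 0.8, ("DET", "ADJ"): 0.2}
trans_b = {("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.4}
merged_trans = average_transitions([trans_a, trans_b])
merged_init = merge_init([{"DET": 0.5, "NOUN": 0.5}, {"DET": 1.0}])
print(merged_trans)
print(merged_init)
```

If every input table is normalized per previous tag, the averaged table is normalized as well, which is why only the initialization probabilities need a separate normalization step in this sketch.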

Notes:

if trained on ReM train corpus, test corpus accuracy (UD) is 87.8%
if trained on dev corpus, test corpus accuracy (UD) is 85.8%
we don't benefit much from more data ;)
tagging with the full tagset is rather slow. Note that the morphosyntactic features include aspects that are actually morphological rather than morphosyntactic, e.g., inflection class. For these, context-based disambiguation is likely to fail.

The fancy part comes now: combining transition probabilities from one or more other models with dictionary-based or bootstrapped emission probabilities.

Known issues

The code works OK on small tagsets, e.g., UD. On tagsets with a few thousand tags, it is prohibitively slow, because full Viterbi decoding is quadratic in the number of states. (For 3183 ReM tags, this means around 10 million multiplications per word.)
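
The quadratic growth is easy to check. The snippet below counts one transition step per pair of (previous tag, current tag), which is what an unpruned Viterbi update has to consider for every word; the function name is hypothetical, and 17 is the number of universal POS tags in UD.

```python
def transition_steps_per_word(n_tags):
    """Number of (previous tag, current tag) pairs per Viterbi update."""
    return n_tags * n_tags


for name, n_tags in [("UD UPOS", 17), ("ReM full tags", 3183)]:
    print(name, transition_steps_per_word(n_tags))
# UD UPOS 289
# ReM full tags 10131489
```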

Possible ways to address this:

beam search: keep the top k current states only, not all
model as matrices and multiply via NumPy
use a NumPy HMM implementation instead ;)
- a fragment based on Numpy-ML is included in (but commented out from) tag.py: Numpy-ML hasn't been updated for Python 3.10 yet
- maybe https://pypi.org/project/seqlearn/ (fork from SciKit-Learn), but this is even less properly maintained

use log probabilities instead of plain probabilities: sums of logs lose less precision than long products of small numbers
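
Two of these ideas, beam search and log probabilities, can be combined in a small sketch. This is not the decoder in tag.py; the function names, the dictionary-based model layout and the toy numbers are assumptions. With a beam of size k, each update considers k previous states per tag instead of all of them, at the price of possibly missing the best path.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_beam(words, tags, init, trans, emit, beam=None):
    """Log-space Viterbi; if beam is set, only the top `beam` previous
    states are considered at each step (hypothetical helper)."""
    scores = {t: log(init.get(t, 0.0)) + log(emit.get((t, words[0]), 0.0))
              for t in tags}
    backpointers = []
    for word in words[1:]:
        active = sorted(scores, key=scores.get, reverse=True)
        if beam is not None:
            active = active[:beam]          # prune unlikely previous states
        new_scores, pointers = {}, {}
        for t in tags:
            best_prev = max(
                active, key=lambda p: scores[p] + log(trans.get((p, t), 0.0)))
            new_scores[t] = (scores[best_prev]
                             + log(trans.get((best_prev, t), 0.0))
                             + log(emit.get((t, word), 0.0)))
            pointers[t] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # follow the backpointers from the best final state
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[path[-1]]


tags = ["DET", "NOUN"]
init = {"DET": 0.6, "NOUN": 0.4}
trans = {("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
         ("NOUN", "DET"): 0.3, ("NOUN", "NOUN"): 0.7}
emit = {("DET", "der"): 0.5, ("NOUN", "der"): 0.01,
        ("DET", "man"): 0.01, ("NOUN", "man"): 0.2}

full_path, full_score = viterbi_beam(["der", "man"], tags, init, trans, emit)
beam_path, beam_score = viterbi_beam(["der", "man"], tags, init, trans, emit, beam=1)
print(full_path, math.exp(full_score))
print(beam_path, math.exp(beam_score))
```

The log representation matters for long sentences: a product of 200 probabilities of 0.01 each underflows to 0.0 in double precision, whereas the corresponding sum of logs is simply about -921.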

History

2024-01-25: changed tag.py to only evaluate if the gold tag isn't blank or '_'. This is because we might decide not to evaluate certain sentences, e.g., because they are part of another language. So, if they are systematically ambiguous between foreign language X and their in-language tag, using one in favour of the other shouldn't count against a tagger.
2024-01-25: added adopt.py
2024-12-17: added merge.py
