
Initial support for CoNLL-U like constructions #11

Draft
harisont wants to merge 2 commits into main from conlluc
harisont commented Sep 3, 2025

Collaborator

Initial support for the format used here (i.e. @ellepannitto's format).

The parser for this format is under development. It will be possible to fetch newer versions from here.

Support is currently very partial, missing:

  • support for the IDENTITY column (format: COLNAME=ID), which specifies that a certain field of a given token is the same as a certain field of another token with id ID, e.g. "day by day", "year by year" (as far as I understand, the current backend can already do this)
  • support for the ADJACENCY field, which is used to add order constraints (@Niklas-Deworetzki: am I right that this isn't supported by the backend yet?)
  • possibility to underspecify dependency structures
  • support for the WITHOUT field: this is the most challenging (at least when it comes to structural properties, e.g. CHILDREN:DEPREL=advmod) because it requires universal quantification, a bit like TREE in deptreepy
  • support for the REQUIRED field of a token, which, if set to 0, makes the token optional (not sure about this one).
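
To make the COLNAME=ID shape of the IDENTITY column concrete, here is a minimal sketch of how a single value could be split. The function name `parse_identity` and the example value `LEMMA=1` are assumptions made for illustration; this is not the parser under development, and the real column may allow more than one entry.

```python
from typing import Tuple


def parse_identity(value: str) -> Tuple[str, str]:
    """Split a hypothetical IDENTITY value of the form COLNAME=ID.

    The ID is kept as a string, since token ids are not assumed
    to be plain integers.
    """
    column, sep, token_id = value.partition("=")
    if not sep or not column or not token_id:
        raise ValueError(f"expected COLNAME=ID, got {value!r}")
    return column, token_id


# Assumed example: this token's LEMMA must equal the LEMMA of token 1,
# as in "day by day".
parsed = parse_identity("LEMMA=1")
print(parsed)
```
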

@Niklas-Deworetzki

Owner

It is possible to add order AND distance constraints.
The Query constructor has a constraints parameter, which accepts a collection of these constraints.

# Create some identifiers
a, b, c = Identifier(), Identifier(), Identifier()

examples = [
  # a has to be before b
  Constraint(a, b, enforces_order=True),
  # The distance between a and c will be 3 tokens
  Constraint(a, c, distance=3),
]
Query(tokens=[Token(i) for i in (a, b, c)], constraints=examples)

By default, enforces_order is False and distance is Constraint.ARBITRARY_DISTANCE (which inserts []*, a sequence of arbitrary tokens), so you can omit either argument when the default is what you want.

The distance handling is probably a bit broken: the distance constraint is only applied where the two tokens directly follow each other in the generated query. For this example, you would get the following two alternatives (using a, b and c as placeholders for the tokens):

a []* b []* c
a [] [] [] c []* b

Note that a is always before b and that there are exactly 3 tokens between a and c in the second line.
I will think about the distance handling a bit and see how this can be fixed in the general case, where you want a fixed distance between two tokens but there might be other tokens in between.
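
The adjacency-only behaviour described above can be modelled with a small standalone sketch. It does not use the backend's code: `PairConstraint` and `render` are hypothetical names introduced here, and the function only renders one given token ordering rather than enumerating the alternatives. Under those assumptions, it reproduces the two patterns shown above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PairConstraint:
    """Hypothetical stand-in for a constraint; not the backend's Constraint."""
    first: str
    second: str
    distance: Optional[int] = None  # None models an arbitrary distance


def render(order: Sequence[str], constraints: Sequence[PairConstraint]) -> str:
    """Render one token ordering as a pattern string.

    A fixed distance is only honoured when the constrained pair is
    adjacent in `order`; every other gap becomes `[]*`.
    """
    fixed = {
        (c.first, c.second): c.distance
        for c in constraints
        if c.distance is not None
    }
    parts = [order[0]]
    for left, right in zip(order, order[1:]):
        distance = fixed.get((left, right))
        if distance is None:
            parts.append("[]*")               # arbitrary tokens in between
        else:
            parts.extend(["[]"] * distance)   # exactly `distance` tokens
        parts.append(right)
    return " ".join(parts)


constraints = [PairConstraint("a", "b"), PairConstraint("a", "c", distance=3)]
first = render(["a", "b", "c"], constraints)
second = render(["a", "c", "b"], constraints)
print(first)
print(second)
```

In the first ordering, a and c are separated by b, so the distance of 3 is silently dropped; that is the limitation a general fix would need to address.
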

harisont commented Sep 5, 2025

Collaborator Author

Seems promising!

Unfortunately we don't have the time or mental space to make any more progress on this this week (we implemented this during a break from the validator, and we're not done with that yet), but I'll try to get back to it as soon as I can once I'm back in Sweden (so, late September at the earliest).
