Skip to content

Extend analytic rank with standard tie strategies #641

@noahboerger

Description

@noahboerger

Motivation

The current rank operator assigns the same rank to tied values and leaves gaps in the sequence (standard competition ranking). This is sometimes desirable, but users also need other well-known tie strategies.

Example:

DS_1

Id_1 Id_2 Me_1
IDENTIFIER IDENTIFIER MEASURE
Integer Integer Integer
1 1 1
1 2 1
1 3 2
1 4 3
  • DS_r := DS_1[ calc Me2 := rank( over( partition by Id_1 order by Me_1 ) ) ];
  • Current rank, with DS_1 ranked by Me_1 within Id_1:
Id_1 Id_2 Me_1 Me_2
IDENTIFIER IDENTIFIER MEASURE MEASURE
Integer Integer Integer Integer
1 1 1 1
1 2 1 1
1 3 2 3
1 4 3 4
  • In practice, users often need different behaviors:
    • dense (“1,1,2,3”)
    • first (break ties deterministically by a secondary rule - in SQL/Pandas based on the ordering)
    • max (highest rank in the tied group)
    • average (average rank over the tied group)
  • These options are standard in popular data tools (e.g., Pandas) and would avoid verbose workarounds (secondary sorts, additional calculations, or post-processing).

Proposal

  • Extend the rank analytic operator with an optional tie-resolution policy

Signature (abstract)

  • rank ( over ( { partitionClause } orderClause ) { tie_resolution method } )
  • method ∈ { min | max | avg | firstByIdentifiers | dense }, default min.

Small naming note for “first”

  • Because VTL datasets are not physically ordered, the “first” policy could not be realized as in Pandas/SQL
  • Therefore firstByIdentifiers/byIdentifiers can be introduced
    • Deterministically break ties by extending the order with all Identifier components (implicitly, unless they are already present in orderBy)
    • Appending all Identifier components to the orderBy clause in a well-specified way (lexicographic by identifier name, ascending by value)
    • This produces a fully deterministic total order within each partition, independent of any physical row order. Ties are fully resolved, and ranks increase without gaps (1,2,3,4) when identical ordering-key values occur.
    • If all identifiers are already included in orderBy, no implicit addition is made

Backward compatability

Fully backward compatible: The current VTL behavior remains the default (min)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions