New enums syntax to allow enum-level properties (doc / doc-ref, required underlying type) #1288

@generalmimon

Since the current enums syntax contains the enum values directly under the name of the enum, it doesn't allow for any "global" properties for the entire enum.

#358 discusses that this prevents support for doc / doc-ref keys to document the enum (if we disregard the ugly hack of mixing doc and integer keys on the same YAML level, as shown in #358 (comment), which IMO shouldn't even be considered). These would be optional properties that have no impact on the overall behavior of the language, though.

What's more important is that in #862 (comment), @GreyCat pointed out that the "untyped enums" that we have right now are not a very good idea. Pretty much all statically typed target languages require or encourage specifying the underlying integer type that the enum is based on. Sometimes you can technically omit specifying it explicitly, but that just means that some underlying type will be decided by the target language instead, which is often not what we want.

For example, #862 shows that C# uses int (a 32-bit signed integer type) by default, which effectively limits our C# enum support to signed 32-bit integers. In C++, an enum without a fixed underlying type can legally hold only the values representable in the smallest bit-field that fits all of its enum constants. If you try to convert an integer outside that range to the enum type (#778 is related), you are "loading out-of-range values to enums without fixed underlying type", which is undefined behavior - see #959. The page https://pvs-studio.com/en/docs/warnings/v1016/ referenced in that issue has a good example:

Example 3.

enum EN { low = 2, high = 4 }; // Uses 3 bits, range: [0; 7]
EN a1 = static_cast<EN>(7);    // ok

According to the standard, the underlying type for this enum is 'int'. Inside this type, the compiler uses the minimum width of the bit field that can fit all the values of enum constants.

In this case, you will need at least 3 bits to fit all the values (2 = 0b010 and 4 = 0b100), so an EN variable can fit numbers from 0 (0b000) to 7 (0b111) inclusively. The number 8 already occupies four bits (0b1000), so it no longer fits in the EN type:

EN a2 = static_cast<EN>(8);    // UB

UndefinedBehaviorSanitizer also finds an error in this example: https://godbolt.org/z/GGYo7z.
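For contrast, here is a minimal sketch of the same enum declared with a fixed underlying type. The choice of std::uint8_t and the helper names from_raw / to_raw / demo_fixed_underlying_type are assumptions made up for this illustration, not anything taken from the generated code. With a fixed underlying type, any value representable in that type can be converted to the enum without undefined behavior:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Same enumerators as in the PVS-Studio example, but with a fixed underlying
// type (std::uint8_t is an arbitrary choice for this sketch).
enum class EN : std::uint8_t { low = 2, high = 4 };

static_assert(std::is_same<std::underlying_type<EN>::type, std::uint8_t>::value,
              "EN is backed by exactly std::uint8_t");

// Hypothetical helpers that convert between the raw integer and the enum.
inline EN from_raw(std::uint8_t raw) { return static_cast<EN>(raw); }
inline std::uint8_t to_raw(EN e) { return static_cast<std::uint8_t>(e); }

inline void demo_fixed_underlying_type() {
    EN a2 = from_raw(8);      // well-defined: 8 is representable in std::uint8_t
    assert(to_raw(a2) == 8);  // the raw value round-trips unchanged
    assert(a2 != EN::low && a2 != EN::high);  // it just has no named constant
}
```

The value 8 still has no named constant, but it is a valid value of EN, so the generated code would be free to carry unknown values around as long as they fit the declared type.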

Therefore, "untyped enums" are not really a thing - the compiler should always know the exact underlying type of each enum and declare it in the generated code, if the target language supports that. The to_i method of enums is another illustration of why this matters, see the User Guide:

6.4.5. Enums

| Method name | Return type | Description |
| --- | --- | --- |
| to_i | Integer | Converts an enum into the corresponding integer representation |

It says that the return type is "integer", but that's not specific enough - the compiler must know the exact type. In theory, this should be the enum's underlying type, but since the compiler doesn't know that at the moment, it uses CalcIntType, i.e. signed 32-bit integer:

TypeDetector.scala:220-222

      case et: EnumType =>
        attr.name match {
          case "to_i" => CalcIntType

So in statically typed languages, the result of the enum.to_i method is currently truncated to a signed 32-bit integer, regardless of the actual underlying type.
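To make the effect concrete, here is a rough sketch in C++ of what forcing the result into a signed 32-bit integer does to a 64-bit enum value. The enum Flags and the functions to_i_as_int32 / to_i_exact are hypothetical names invented for this illustration; they are not what the compiler actually emits. Note that the exact result of the narrowing conversion is implementation-defined before C++20, so the sketch only checks that the original value is lost:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical enum whose only constant does not fit into 32 bits.
enum class Flags : std::uint64_t { big = 0x100000001ULL };

// Mimics a to_i whose return type is forced to a signed 32-bit integer.
inline std::int32_t to_i_as_int32(Flags f) {
    return static_cast<std::int32_t>(static_cast<std::uint64_t>(f));
}

// A to_i that returns the enum's own underlying type instead.
inline std::underlying_type<Flags>::type to_i_exact(Flags f) {
    return static_cast<std::underlying_type<Flags>::type>(f);
}

inline void demo_to_i() {
    // The exact variant preserves the value.
    assert(to_i_exact(Flags::big) == 0x100000001ULL);
    // No int32 value can represent 2^32 + 1, so the narrow variant loses it.
    assert(static_cast<std::uint64_t>(to_i_as_int32(Flags::big)) != 0x100000001ULL);
}
```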


Therefore, I propose a new syntax for enum definitions (which combines @GreyCat's suggestions in #358 (comment) and #862 (comment)):

enums:
  animals:
    doc: Animals you might find at the office.
    type: u4
    values:
      1: dogs
      2: cats

Obviously, the enum values themselves need to be moved to the values key to allow specifying other enum properties. That's a relatively trivial change. A bigger change is the new type key, which will be REQUIRED and will explicitly declare the enum's underlying type.

To address @GreyCat's concerns in #862 (comment), the underlying type will be enforced. First, this means that only integer constants representable by the underlying type will be allowed in the enum definition. Second, it will no longer be possible to directly convert an integer to an enum whose underlying type has a different width or signedness. So for the u4-based enum animals defined above, trying to use it on a s4 field would be a compile-time error:

seq:
  - id: animal
    type: s4  # error: 's4' does not match the underlying type 'u4' of the enum 'animals'
    enum: animals

Of course, this is a breaking change, but I believe it's the only future-proof solution that ensures consistent behavior across languages. What do you think?
