Reject encoding: UTF-16 or encoding: UTF-32 with an error (instead of just a warning) #1302

@generalmimon

As part of #1290, @GreyCat wrote the following in the newly added Encoding name (encoding) section of the style guide - ksy_style_guide.adoc:291-292:

UTF-16 and UTF-32 (without an explicit byte order suffix) are deliberately not accepted.

Technically, the compiler does accept them and only issues a warning (which, BTW, isn't even visible in the Web IDE, because the JS build of the compiler only throws the first error as an exception and ignores warnings, see kaitai-io/kaitai_struct_compiler@8913518):

utf16.ksy

meta:
  id: utf16
seq:
  - id: foo
    type: strz
    encoding: UTF-16

$ kaitai-struct-compiler -t python utf16.ksy
utf16.ksy: /seq/0/encoding:
        warning: unrecognized encoding name 'UTF-16'

Perhaps UTF-16 and UTF-32 shouldn't be just "unrecognized", but the compiler should know about them and explicitly ban them? This was suggested in the past, see #391 (comment):

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies on current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).
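To make the endianness problem concrete, here is a small Python illustration (not taken from the compiler or any Kaitai Struct runtime): the same two bytes decode to different characters depending on which byte order is assumed.

```python
# The same byte sequence, interpreted under both explicit byte orders.
data = b"\x41\x00"

as_le = data.decode("utf-16-le")  # little-endian: code unit 0x0041
as_be = data.decode("utf-16-be")  # big-endian: code unit 0x4100

print(repr(as_le))      # 'A'
print(hex(ord(as_be)))  # 0x4100
```

A bare `UTF-16` label says nothing about which of the two readings is intended, so the result depends on whatever default the target language's decoder happens to pick.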

In #393, they would fall into the "black list" (note: I'm not entirely sure why this box was checked, because we don't have any explicit list of prohibited encodings):

  • add a "black list" of encodings that should not be allowed; using these should break the build and clear error should be given to the user

Personally, I wouldn't block all unknown encodings by default (because that could screw someone over if they're using a legitimate encoding that isn't on our list), but if we recognize that it's specifically an unwanted encoding like UTF-16 (or its alias like utf16), then I think it's fine to throw an error. We should probably have an error message specifically for UTF-16 and UTF-32 that includes a clear explanation for the ban, because people who try to use these names most likely have no idea that they are ambiguous.
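A rough sketch of the behavior I have in mind, written in Python purely for illustration (the actual compiler is not written in Python, and every name here, including `check_encoding`, `AMBIGUOUS`, `AmbiguousEncodingError` and the `known` set, is hypothetical):

```python
import re

# Hypothetical list of explicitly banned names, keyed by normalized spelling.
AMBIGUOUS = {"UTF16": "UTF-16", "UTF32": "UTF-32"}


class AmbiguousEncodingError(ValueError):
    """Raised for encoding names that do not specify a byte order."""


def check_encoding(name, known):
    """Return a warning string (or None) and raise for banned names.

    `known` is an assumed set of recognized encoding names.
    """
    # Normalize aliases such as "utf16", "utf_16" or "Utf-16" to "UTF16".
    key = re.sub(r"[^A-Za-z0-9]", "", name).upper()
    if key in AMBIGUOUS:
        canonical = AMBIGUOUS[key]
        raise AmbiguousEncodingError(
            f"encoding '{name}' is ambiguous: it does not specify a byte order; "
            f"use '{canonical}LE' or '{canonical}BE' instead"
        )
    if name not in known:
        # Unknown encodings still only produce a warning, as they do today.
        return f"unrecognized encoding name '{name}'"
    return None
```

With `known = {"UTF-8", "UTF-16LE", "UTF-16BE"}`, `check_encoding("UTF-16LE", known)` returns `None`, `check_encoding("FOO", known)` returns the existing warning text, and `check_encoding("utf16", known)` raises with a message that explains the ban.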

As I mentioned in #1290 (comment), this will affect some users: their specs will have to be updated to keep working once we implement this. Code search results: -org:kaitai-io path:*.ksy /encoding: utf-16$/
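For anyone who wants to check their own specs ahead of time, a minimal Python sketch along the lines of that search could look like this. The regex and the `find_ambiguous_encodings` helper are my own assumptions, not an existing tool, and the pattern is a plain line match rather than a real YAML parse:

```python
import re

# Matches lines such as "encoding: UTF-16" or "encoding: utf32",
# but not names with an explicit byte order suffix like "UTF-16LE".
AMBIGUOUS_LINE = re.compile(
    r"^\s*encoding:\s*['\"]?utf-?(16|32)['\"]?\s*$", re.IGNORECASE
)


def find_ambiguous_encodings(ksy_text):
    """Return (line_number, stripped_line) pairs for suspicious lines."""
    return [
        (number, line.strip())
        for number, line in enumerate(ksy_text.splitlines(), start=1)
        if AMBIGUOUS_LINE.match(line)
    ]


sample = """meta:
  id: utf16
seq:
  - id: foo
    type: strz
    encoding: UTF-16
"""
print(find_ambiguous_encodings(sample))  # [(6, 'encoding: UTF-16')]
```

Each reported line would then need to be changed to the byte order the format actually uses, e.g. `UTF-16LE` or `UTF-16BE`.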
