feat: add support for fine-grained assembly representation #1527

Draft
ThinkOpenly wants to merge 4 commits into riscv:main from ThinkOpenly:operands

Conversation

@ThinkOpenly
Collaborator

Currently, assembly syntax is represented as a simple string of comma-separated operands with a heuristic naming convention to indicate their respective type and purpose.

Make this more rigorous by adding a schema which supports:

  • registers
  • the registers' register file (GPR, FPR, VR, CSR, etc.)
  • dereference syntax "(reg)"
  • dereference+offset syntax "offset(reg)"
  • immediates
  • floating-point rounding mode and possible values
  • FENCE scopes
  • register lists (for POP/PUSH)

The new "operands" YAML field is currently optional and coexists with the existing "assembly" field. This allows support to be added incrementally, both to the YAML files and to the infrastructure that generates actual assembly syntax where needed (for example, documentation), until the "assembly" field is no longer needed.
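As a rough illustration of how structured operands could be turned back into an assembly string, here is a minimal Python sketch. Every class, field, and function name in it is hypothetical and is not taken from the schema in this PR; it only covers registers, immediates, and the "(reg)" and "offset(reg)" forms.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical operand model; these names are NOT the PR's schema.
@dataclass
class Register:
    name: str     # operand name as written in the syntax, e.g. "rd"
    regfile: str  # register file, e.g. "GPR", "FPR", "VR", "CSR"


@dataclass
class Immediate:
    name: str


@dataclass
class Memory:
    base: Register
    offset: Optional[Immediate] = None  # None renders as "(reg)"


def render_operand(op) -> str:
    """Render one operand in assembly syntax."""
    if isinstance(op, Memory):
        prefix = op.offset.name if op.offset is not None else ""
        return f"{prefix}({op.base.name})"
    return op.name


def render_assembly(mnemonic: str, operands: list) -> str:
    """Join the mnemonic and its comma-separated operands."""
    if not operands:
        return mnemonic
    return f"{mnemonic} " + ", ".join(render_operand(o) for o in operands)


lw_operands = [
    Register("rd", "GPR"),
    Memory(Register("rs1", "GPR"), Immediate("imm")),
]
print(render_assembly("lw", lw_operands))
```

With a model like this, the existing "assembly" string could be derived from "operands" rather than maintained by hand, which is the transition the description above aims for.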

@ThinkOpenly
Collaborator Author

Not terribly worried about this, but DO NOT MERGE.

I had to specify all required fields and could not depend on the "default" values being used when trying to get past pre-commit run check-jsonschema.

@lenary
Member

lenary commented Feb 3, 2026

Thank you for getting this ball rolling, Paul.

I would like to talk about testing unified-db against existing tools, and how we want to go about it. Currently, we test LLVM by inspecting the internal data descriptions (TableGen), but I'm not sure this is sustainable long-term. I would prefer to do the testing end-to-end, i.e. given a set of extensions (+parameters?), generate a random valid assembly string and check whether the assembler turns it into the expected instruction (+relocations). Similarly, generate a random invalid assembly string and check it is rejected. We should also be able to do the same by generating random encoding bits for the disassembler (this is easier today, I think), for both valid and invalid cases. While these all seem like they should mirror each other, they don't quite, because the output of a disassembler is never meant to be assembled again, so in reality you need an oracle to work out what the correct assembly should have been. In our case, that likely means writing an assembler/disassembler that can do so based entirely on UDB information. These are not small undertakings, but I think they would put us in a better position long-term.

I think you're right that we need to see how to join this up with the field information at some point. I'm keen for us to have a side table of "these are operands others have used, please re-use them", which hopefully would cover some combo of fields and operands, but it's hard to know how this would look today, and will need iteration.

One kind of operand you've missed out is Symbol Expressions (foo, foo+1, %pcrel_hi(foo)) - exactly which specifiers (the <name> in %<name>(expr)) are allowed depends on the instruction/field, and corresponds to available relocations on the field.
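To make the `%<name>(expr)` shape concrete, here is a small Python sketch that splits a symbol expression into its specifier and inner expression and checks the specifier against a per-field allow-list. The function names and the allow-list contents are illustrative assumptions; the real set of permitted specifiers would have to come from the relocations available on each field.

```python
import re

# Matches "%<name>(<expr>)". Anything else is treated as a bare expression.
_SPECIFIER = re.compile(r"^%(?P<name>[A-Za-z_][A-Za-z0-9_]*)\((?P<expr>.+)\)$")


def split_symbol_expr(text: str):
    """Return (specifier_or_None, inner_expression)."""
    stripped = text.strip()
    m = _SPECIFIER.match(stripped)
    if m:
        return m.group("name"), m.group("expr")
    return None, stripped


def specifier_allowed(text: str, allowed: set) -> bool:
    """Check the specifier against a field's allow-list.

    `allowed` is a hypothetical per-field set; include None in it to
    permit bare expressions such as "foo" or "foo+1".
    """
    name, _ = split_symbol_expr(text)
    return name in allowed


# Illustrative allow-list only, using the specifier from the example above.
example_field = {"pcrel_hi"}
print(split_symbol_expr("%pcrel_hi(foo)"))
print(specifier_allowed("%lo(foo)", example_field))
```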

I started writing a much longer comment, which I'm going to hide below, because I think it's useful to write down these cases, but they start to look like the longer tail of things. I do think it's important to think about the more complex cases earlier, though, rather than designing something that works for simpler cases like add rd, rs1, rs2 and needs to be entirely changed later for complex cases.

More Hard Cases

Sorry for the stream-of-consciousness thoughts. I want to provide some degree of "here are the complex cases we need to get right", rather than just thinking about simple cases like `add rd, rs1, rs2`:

One thing to be really careful about right now is PC-relative immediate operands, such as the offset in beq and jal. These are treated differently by GCC and by Clang -- beq a0, a1, 28 in GCC means branch to address 28, and in clang means branch to address pc+28. Fixing this incompatibility is not something we should seek to do with the specification at this time, as fixing one of the assemblers is not a very easy thing to do. Note that beq a0, a1, symbol is treated identically by both.
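The difference between the two readings of a numeric branch operand can be written down in a few lines of Python. The function and the convention labels are hypothetical; the two behaviours are the ones described in the paragraph above.

```python
def branch_target(pc: int, literal: int, convention: str) -> int:
    """Resolve a numeric branch operand under one of two conventions.

    "absolute":    the literal is the target address (GCC, as described above).
    "pc_relative": the literal is an offset from pc (Clang, as described above).
    """
    if convention == "absolute":
        return literal
    if convention == "pc_relative":
        return pc + literal
    raise ValueError(f"unknown convention: {convention}")


# The same source line "beq a0, a1, 28", placed at address 0x100.
print(hex(branch_target(0x100, 28, "absolute")))
print(hex(branch_target(0x100, 28, "pc_relative")))
```

Any end-to-end test that feeds numeric branch operands to both assemblers would need to know which convention applies before comparing encodings.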

Another "fun" snare is the xlen-dependent operands, which come up in shifts, where the immediate range accepted depends on xlen.
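For the shift case, the accepted shift amount for the base immediate shifts is 0 to xlen-1, so a range check has to take xlen as an input. A minimal Python sketch, with a hypothetical function name:

```python
def shamt_is_valid(shamt: int, xlen: int) -> bool:
    """Check a shift-amount immediate against the xlen-dependent range."""
    if xlen not in (32, 64):
        raise ValueError(f"unsupported xlen: {xlen}")
    return 0 <= shamt < xlen


# "slli a0, a0, 40" is acceptable for xlen=64 but not for xlen=32.
print(shamt_is_valid(40, 64))
print(shamt_is_valid(40, 32))
```

An operand description with a single fixed immediate range cannot express this, so the range would need to be parameterised on xlen in some way.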

We probably also want to be careful with "which registers are valid to write", but that's quite difficult to do right now, and is closely connected to "what is encodable". This is especially the case when we want to say "this operand actually represents a GPR Pair, not a GPR", such as in Zilsd, but something similar also occurs in C/Zca instructions.

Zfinx/Zdinx are going to be a nightmare on the "which registers are valid to write", especially rv32 zdinx. I haven't looked at how these are represented, but they're a clear case of "one mnemonic can mean a bunch of different instructions depending on the operands", which is a joy to deal with.

We eventually want to cover pseudos (which expand to sequences of instructions). call and tail are probably good places to start; lw <reg>, <sym> and sw <reg>, <sym>, <reg> are harder instances of similar things, as are the la.tls.ie and la.tls.gd pseudos.

We probably also need to cover optional operands somewhere along the way. The 0 in 0(reg) is optional, and in the atomic instructions it does not correspond to any encoding bits (the offset in these can only be 0). There are similar complexities in vsetvli.
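One way to picture the optional offset is a parser that treats a missing offset as 0 and, for instructions like the atomics, rejects anything else. This Python sketch is an assumption-laden illustration: the function name and the `offset_must_be_zero` flag are hypothetical, and it only handles numeric literal offsets, not symbol expressions.

```python
import re

# "<offset>(<reg>)" where the offset part may be empty.
_MEM = re.compile(r"^(?P<offset>[^()]*)\((?P<reg>[^()]+)\)$")


def parse_mem_operand(text: str, offset_must_be_zero: bool = False):
    """Parse "(reg)" or "offset(reg)" into (offset, reg).

    A missing offset is read as 0. With offset_must_be_zero=True
    (hypothetical flag for instructions with no offset encoding bits),
    any non-zero offset is rejected.
    """
    m = _MEM.match(text.strip())
    if not m:
        raise ValueError(f"not a memory operand: {text!r}")
    offset_text = m.group("offset").strip()
    offset = int(offset_text, 0) if offset_text else 0
    if offset_must_be_zero and offset != 0:
        raise ValueError(f"offset must be 0 here, got {offset}")
    return offset, m.group("reg").strip()


print(parse_mem_operand("(a0)"))
print(parse_mem_operand("8(sp)"))
```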

We also made the decision in the toolchains recently that MOP/HINT-compatible instructions can always be written as their non-hint variant - i.e. c.sspush ra (from zicfiss) can always be written if you have c.mop.1 available (from zcmop) - see discussion here: riscv-non-isa/riscv-elf-psabi-doc#474 - this ends up adding some complexity to udb, but we don't believe that MOPs/HINTs can be re-allocated anyway.

@codecov

codecov bot commented Feb 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.94%. Comparing base (e6fdeeb) to head (3cb2886).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1527   +/-   ##
=======================================
  Coverage   71.94%   71.94%           
=======================================
  Files          54       54           
  Lines       27976    27976           
  Branches     6183     6183           
=======================================
  Hits        20128    20128           
  Misses       7848     7848           
Flag Coverage Δ
idlc 75.90% <ø> (ø)


@dhower-qc
Collaborator

This PR is timely -- see #1435, which is just about ready. It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

@ThinkOpenly
Collaborator Author

> This PR is timely -- see #1435

Indeed. Looks like we're headed in the same rough direction.

> which is just about ready.

Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

> It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

OK. Looking...

Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic.

@dhower-qc
Collaborator

> > which is just about ready.
>
> Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

The versioning is supposed to indicate that we are nearing 1.0 of a schema. I was thinking that we'd get 0.9 in main and then start a review process.

> Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic.

Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

@ThinkOpenly
Collaborator Author

> Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

I opened discussion #1532, since time zones can make getting everyone to a meeting challenging. I understand the gist of #1435, but I'm going to spend more time with it.
