feat: add support for fine-grained assembly representation by ThinkOpenly · Pull Request #1527 · riscv/riscv-unified-db

ThinkOpenly · 2026-02-02T21:47:02Z

Currently, assembly syntax is represented as a simple string of comma-separated operands with a heuristic naming convention to indicate their respective type and purpose.

Make this more rigorous by adding a schema which supports:

- registers
- the registers' register file (GPR, FPR, VR, CSR, etc.)
- dereference syntax "(reg)"
- dereference+offset syntax "offset(reg)"
- immediates
- floating-point rounding mode and possible values
- FENCE scopes
- register lists (for POP/PUSH)

The new "operands" YAML field is currently optional and coexists with the existing "assembly" field. So, support can be added over time to both the YAML files and the infrastructure to support generation of actual assembly syntax where needed (documentation) until the "assembly" field is no longer needed.
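As a rough illustration of how structured operands could be turned back into the existing "assembly" string, here is a minimal Python sketch. The record layout (the `kind`, `name`, `regfile`, `base`, `offset`, and `registers` keys) and the operand names are assumptions made for this example; they are not the schema added in this PR.

```python
# Hypothetical operand records; key names are illustrative only and do not
# reflect the actual "operands" schema in this PR.
def render_operand(op):
    """Render one structured operand as assembly text."""
    kind = op["kind"]
    if kind in ("register", "immediate", "rounding_mode", "fence_scope"):
        return op["name"]
    if kind == "deref":
        return f"({op['base']})"
    if kind == "deref_offset":
        return f"{op['offset']}({op['base']})"
    if kind == "register_list":
        return "{" + ", ".join(op["registers"]) + "}"
    raise ValueError(f"unknown operand kind: {kind}")


def render_assembly(mnemonic, operands):
    """Join the rendered operands into a comma-separated assembly string."""
    rendered = ", ".join(render_operand(op) for op in operands)
    return f"{mnemonic} {rendered}".strip()


lw_operands = [
    {"kind": "register", "name": "xd", "regfile": "GPR"},
    {"kind": "deref_offset", "offset": "imm", "base": "xs1"},
]
print(render_assembly("lw", lw_operands))  # lw xd, imm(xs1)
```

With something along these lines, documentation generators could derive the display string from the structured form, which is what would eventually make the "assembly" field redundant.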

ThinkOpenly · 2026-02-02T21:48:45Z

Not terribly worried about this, but DO NOT MERGE.

I had to specify all required fields and could not depend on the "default" values being used when trying to get past pre-commit run check-jsonschema.

lenary · 2026-02-03T04:52:52Z

Thank you for getting this ball rolling, Paul.

I would like to talk about testing unified-db against existing tools, and how we want to go about it. Currently, we test LLVM by inspecting the internal data descriptions (TableGen), but I'm not sure this is sustainable long-term. I would prefer to do the testing end-to-end, i.e. given a set of extensions (plus parameters?), generate a random valid assembly string and check whether the assembler turns it into the expected instruction (plus relocations). Similarly, generate a random invalid assembly string and check that it is rejected. We should also be able to do the same for disassembly by generating random encoding bits (this is easier today, I think), for both valid and invalid cases. While these all seem like they should mirror each other, they don't quite, because the output of a disassembler is never meant to be assembled again. In practice you need an oracle to work out what the correct assembly should have been; in our case, that likely means writing an assembler/disassembler that works entirely from UDB information. These are not small undertakings, but I think they would put us in a better position long-term.
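To make the "generate a random valid or invalid assembly string" idea concrete, here is a small Python sketch for `addi`. The operand description format is an assumption for illustration only, and the step that feeds the strings to a real assembler and compares against an oracle is deliberately left out.

```python
import random

GPRS = [f"x{i}" for i in range(32)]

# Hypothetical operand description for `addi rd, rs1, imm`; the dict layout is
# an assumption, not UDB's schema. The range is the 12-bit signed immediate.
ADDI = {
    "mnemonic": "addi",
    "operands": [
        {"kind": "register", "regfile": "GPR"},
        {"kind": "register", "regfile": "GPR"},
        {"kind": "immediate", "min": -2048, "max": 2047},
    ],
}


def random_operand(op, rng, valid=True):
    """Pick a random operand value, or a deliberately invalid one."""
    if op["kind"] == "register":
        return rng.choice(GPRS) if valid else "x32"  # x32 is not a GPR
    if op["kind"] == "immediate":
        if valid:
            return str(rng.randint(op["min"], op["max"]))
        return str(op["max"] + rng.randint(1, 1000))  # out of range
    raise ValueError(f"unknown operand kind: {op['kind']}")


def random_assembly(inst, rng, valid=True):
    """Build an assembly string; when valid=False, corrupt exactly one operand."""
    ops = inst["operands"]
    bad = None if valid else rng.randrange(len(ops))
    parts = [random_operand(op, rng, valid=(i != bad)) for i, op in enumerate(ops)]
    return f"{inst['mnemonic']} " + ", ".join(parts)


rng = random.Random(0)  # fixed seed so failures are reproducible
print(random_assembly(ADDI, rng))
print(random_assembly(ADDI, rng, valid=False))
```

A test harness would pass the first kind of string to the assembler and expect the encoding the oracle predicts, and pass the second kind and expect a rejection.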

I think you're right that we need to see how to join this up with the field information at some point. I'm keen for us to have a side table of "these are operands others have used, please re-use them", which hopefully would cover some combo of fields and operands, but it's hard to know how this would look today, and will need iteration.

One kind of operand you've missed out is Symbol Expressions (foo, foo+1, %pcrel_hi(foo)) - exactly which specifiers (the <name> in %<name>(expr)) are allowed depends on the instruction/field, and corresponds to available relocations on the field.
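A minimal Python sketch of what checking such an operand could look like: it parses `foo`, `foo+1`, and `%<name>(expr)` forms and rejects specifiers that are not in a caller-supplied allow-list. The allow-list passed in below is a placeholder; the real per-field sets would have to come from the relocations available on each field.

```python
import re

# Either "%<spec>(<inner>)" or a plain expression with no specifier.
SYMBOL_EXPR = re.compile(
    r"^(?:%(?P<spec>[a-z_0-9]+)\((?P<inner>[^()]+)\)|(?P<plain>[^%()]+))$"
)
# A symbol with an optional decimal addend, e.g. "foo", "foo+1", "foo-4".
TERM = re.compile(
    r"^(?P<sym>[A-Za-z_.$][\w.$]*)(?:(?P<sign>[+-])(?P<addend>\d+))?$"
)


def parse_symbol_expr(text, allowed_specifiers):
    """Parse a symbol expression and validate its specifier, if any."""
    m = SYMBOL_EXPR.match(text.strip())
    if not m:
        raise ValueError(f"not a symbol expression: {text!r}")
    spec = m.group("spec")
    if spec is not None and spec not in allowed_specifiers:
        raise ValueError(f"specifier %{spec} not allowed on this field")
    body = m.group("inner") if spec is not None else m.group("plain")
    t = TERM.match(body.replace(" ", ""))
    if not t:
        raise ValueError(f"unsupported expression: {body!r}")
    addend = int(t.group("addend") or 0)
    if t.group("sign") == "-":
        addend = -addend
    return {"specifier": spec, "symbol": t.group("sym"), "addend": addend}


# Placeholder allow-list for a single hypothetical field.
print(parse_symbol_expr("%pcrel_hi(foo)", {"pcrel_hi"}))
print(parse_symbol_expr("foo+1", {"pcrel_hi"}))
```

This only handles a single symbol with a constant addend; full assembler expressions are considerably richer.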

I started writing a much longer comment, which I'm going to hide below, because I think it's useful to write down these cases, but they start to look like the longer tail of things. I do think it's important to think about the more complex cases earlier, though, rather than designing something that works for simpler cases like add rd, rs1, rs2 and needs to be entirely changed later for complex cases.

More Hard Cases

Sorry for the stream-of-consciousness thoughts; I want to provide some degree of "here are complex cases we need to get right", rather than just thinking about simple cases like `add rd, rs1, rs2`:

One thing to be really careful about right now is PC-relative immediate operands, such as the offset in `beq` and `jal`. These are treated differently by GCC and by Clang: `beq a0, a1, 28` in GCC means branch to address 28, and in Clang means branch to address pc+28. Fixing this incompatibility is not something we should try to do through the specification at this time, as changing either assembler is not easy. Note that `beq a0, a1, symbol` is treated identically by both.
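The difference between the two readings is plain arithmetic, sketched here in Python. The function names and the example `pc` value are made up for illustration.

```python
def branch_target(pc, operand, pc_relative):
    """Absolute target address for a numeric branch operand.

    pc_relative=False models the "operand is an address" reading;
    pc_relative=True models the "operand is an offset from pc" reading.
    """
    return pc + operand if pc_relative else operand


def encoded_offset(pc, operand, pc_relative):
    """Offset that ends up in the branch encoding under each reading."""
    return branch_target(pc, operand, pc_relative) - pc


pc = 0x100  # arbitrary example address of the branch instruction
print(branch_target(pc, 28, pc_relative=False))   # 28
print(encoded_offset(pc, 28, pc_relative=False))  # -228
print(branch_target(pc, 28, pc_relative=True))    # 284
print(encoded_offset(pc, 28, pc_relative=True))   # 28
```

So the same source line yields different encoded offsets unless the branch happens to sit at address 0, which is why any operand description for these fields needs to say which reading a numeric literal gets.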

Another "fun" snare is the xlen-dependent operands, which come up in shifts, where the immediate range accepted depends on xlen.
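For the base shift-immediate instructions (`slli`, `srli`, `srai`), the accepted shift amount is 0 to 31 on RV32 and 0 to 63 on RV64. A minimal Python sketch of a range check parameterized on xlen follows; the function names are illustrative, and the `*w` variants on RV64 (which always take 0 to 31) are not covered.

```python
def shamt_range(xlen):
    """Valid shift amounts for slli/srli/srai at the given xlen."""
    if xlen not in (32, 64):
        raise ValueError(f"unsupported xlen: {xlen}")
    return range(0, xlen)


def is_valid_shamt(value, xlen):
    """True if `value` is an encodable shift amount for this xlen."""
    return value in shamt_range(xlen)


print(is_valid_shamt(32, 64))  # True
print(is_valid_shamt(32, 32))  # False
```

An operand description therefore cannot carry a single fixed immediate range for these instructions; the range has to be expressed as a function of xlen.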

We probably also want to be careful with "which registers are valid to write", but that's quite difficult to do right now, and is closely connected to "what is encodable". This is especially the case when we want to say "this operand actually represents a GPR pair, not a GPR", such as in Zilsd, but something similar also occurs in C/Zca instructions.

Zfinx/Zdinx are going to be a nightmare on the "which registers are valid to write", especially rv32 zdinx. I haven't looked at how these are represented, but they're a clear case of "one mnemonic can mean a bunch of different instructions depending on the operands", which is a joy to deal with.

We eventually want to cover pseudos (which expand to sequences of instructions). `call` and `tail` are probably good places to start; `lw <reg>, <sym>` and `sw <reg>, <sym>, <reg>` are harder instances of similar things, as are the `la.tls.ie` and `la.tls.gd` pseudos.

We probably also need to cover optional operands somewhere along the way. The `0` in `0(reg)` is optional, and in the atomic instructions it does not correspond to any encoding bits (the offset in these can only be 0). There are similar complexities in `vsetvli`.
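A small Python sketch of the optional-offset case: the parser accepts both `(reg)` and `offset(reg)`, defaults a missing offset to 0, and can be told that only 0 is allowed, as for the atomics. The function and flag names are hypothetical, and only numeric offsets are handled here (no symbol expressions).

```python
import re

# Optional offset text followed by a parenthesized base register name.
MEM_OPERAND = re.compile(
    r"^(?P<offset>[^()\s]*)\((?P<base>[A-Za-z][A-Za-z0-9]*)\)$"
)


def parse_mem_operand(text, offset_must_be_zero=False):
    """Parse "(reg)" or "offset(reg)" into an (offset, base) pair."""
    m = MEM_OPERAND.match(text.strip())
    if not m:
        raise ValueError(f"not a memory operand: {text!r}")
    offset_text = m.group("offset")
    # A missing offset is treated as 0; base 0 lets int() accept 0x.. literals.
    offset = int(offset_text, 0) if offset_text else 0
    if offset_must_be_zero and offset != 0:
        raise ValueError("this instruction only accepts an offset of 0")
    return offset, m.group("base")


print(parse_mem_operand("(a0)"))                              # (0, 'a0')
print(parse_mem_operand("0(a0)", offset_must_be_zero=True))   # (0, 'a0')
print(parse_mem_operand("-8(sp)"))                            # (-8, 'sp')
```

The `offset_must_be_zero` flag stands in for whatever the schema would use to say that the offset is syntactically present but has no encoding bits.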

We also made the decision in the toolchains recently that MOP/HINT-compatible instructions can always be written as their non-hint variant - i.e. c.sspush ra (from zicfiss) can always be written if you have c.mop.1 available (from zcmop) - see discussion here: riscv-non-isa/riscv-elf-psabi-doc#474 - this ends up adding some complexity to udb, but we don't believe that MOPs/HINTs can be re-allocated anyway.

codecov · 2026-02-03T13:28:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.94%. Comparing base (e6fdeeb) to head (3cb2886).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1527   +/-   ##
=======================================
  Coverage   71.94%   71.94%           
=======================================
  Files          54       54           
  Lines       27976    27976           
  Branches     6183     6183           
=======================================
  Hits        20128    20128           
  Misses       7848     7848

Flag	Coverage Δ
idlc	`75.90% <ø> (ø)`


dhower-qc · 2026-02-03T14:48:40Z

This PR is timely -- see #1435, which is just about ready. It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

ThinkOpenly · 2026-02-03T17:49:42Z

> This PR is timely -- see #1435

Indeed. Looks like we're headed in the same rough direction.

> which is just about ready.

Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

> It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

OK. Looking...

Should we discuss this here, there, in a meeting, or in a GitHub "discussion"? This is a fairly big topic.

dhower-qc · 2026-02-03T17:53:25Z

> > which is just about ready.
>
> Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

The versioning is supposed to indicate that we are nearing 1.0 of a schema. I was thinking that we'd get 0.9 in main and then start a review process.

> Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic.

Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

ThinkOpenly · 2026-02-03T20:42:45Z

> Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

I opened discussion #1532, since time zones can make getting everyone to a meeting challenging. I understand the gist of #1435, but I'm going to spend more time with it.

ThinkOpenly requested a review from dhower-qc as a code owner February 2, 2026 21:47

ThinkOpenly mentioned this pull request Feb 10, 2026

fix: update c.nop encoding and assembly to support HINTs (#1177) #1289

Draft

ThinkOpenly marked this pull request as draft March 19, 2026 22:11

ThinkOpenly added 3 commits March 23, 2026 14:18

v2

1b041bf

support fence_scope operands

907ea07

ThinkOpenly force-pushed the operands branch from 27aef65 to 907ea07 Compare March 24, 2026 03:37

[autofix.ci] apply automated fixes

3cb2886

ThinkOpenly wants to merge 4 commits into riscv:main from ThinkOpenly:operands
