feat: add support for fine-grained assembly representation #1527
ThinkOpenly wants to merge 4 commits into riscv:main from
Not terribly worried about this, but DO NOT MERGE. I had to specify all required fields and could not depend on the "default" values being used when trying to get past
Thank you for getting this ball rolling, Paul.

I would like to talk about testing unified-db against existing tools, and how we want to go about it. Currently, we test LLVM by inspecting the internal data descriptions (TableGen), but I'm not sure this is sustainable long-term. I would prefer to do the testing end-to-end - i.e. given a set of extensions (+parameters?), generate a random valid assembly string and check whether the assembler turns it into the expected instruction (+relocations). Similarly, generate a random invalid assembly string and check that it is rejected. We should also be able to do the same by generating random encoding bits for disassembling (this is easier today, I think), for both valid and invalid cases.

While these all seem like they should mirror each other, they don't quite, because the output of a disassembler is never meant to be assembled again. In reality you need an oracle to work out what the correct assembly should have been - in our case, that likely means writing an assembler/disassembler that can do so based entirely on UDB information. These are not small undertakings, but I think they would put us in a better position long-term.

I think you're right that we need to see how to join this up with the field information at some point. I'm keen for us to have a side table of "these are operands others have used, please re-use them", which hopefully would cover some combo of fields and operands, but it's hard to know how this would look today, and it will need iteration.

One kind of operand you've missed out is Symbol Expressions (

I started writing a much longer comment, which I'm going to hide below, because I think it's useful to write down these cases, but they start to look like the longer tail of things. I do think it's important to think about the more complex cases earlier, though, rather than designing something that works for simpler cases like

More Hard Cases

Sorry for the stream-of-consciousness thoughts. I want to provide some degree of "here's complex cases we need to get right", rather than just thinking about simple cases like `add rd, rs1, rs2`:

One thing to be really careful about right now is PC-relative immediate operands, such as the offset in

Another "fun" snare is the xlen-dependent operands, which come up in shifts, where the immediate range accepted depends on xlen.

We probably also want to be careful with "which registers are valid to write", but that's quite difficult to do right now, and is closely connected to "what is encodable". This is especially the case when we want to say "this operand actually represents a GPR Pair, not a GPR", such as in Zilsd, but something similar also occurs in C/Zca instructions.

Zfinx/Zdinx are going to be a nightmare on the "which registers are valid to write" front, especially RV32 Zdinx. I haven't looked at how these are represented, but they're a clear case of "one mnemonic can mean a bunch of different instructions depending on the operands", which is a joy to deal with.

We eventually want to cover pseudos (which expand to sequences of instructions).

We probably also need to cover optional operands, somewhere along the way. the

We also made the decision in the toolchains recently that MOP/HINT-compatible instructions can always be written as their non-hint variant - i.e.
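The end-to-end idea and the xlen-dependent shift case raised in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not UDB or LLVM code: `shamt_is_valid`, `random_slli`, and `parse_shamt` are hypothetical helper names, and the only ISA fact relied on is that `slli` accepts shift amounts from 0 to xlen-1.

```python
import random


def shamt_is_valid(shamt: int, xlen: int) -> bool:
    """Oracle (hypothetical): a shift amount is legal when 0 <= shamt < xlen."""
    return 0 <= shamt < xlen


def random_slli(xlen: int, valid: bool, rng: random.Random) -> str:
    """Build one `slli` assembly string whose immediate is valid or invalid for xlen."""
    if valid:
        shamt = rng.randrange(0, xlen)
    else:
        # Pick a value just outside the legal range, on either side.
        shamt = rng.choice([rng.randrange(-xlen, 0), rng.randrange(xlen, 2 * xlen)])
    rd = f"x{rng.randrange(32)}"
    rs1 = f"x{rng.randrange(32)}"
    return f"slli {rd}, {rs1}, {shamt}"


def parse_shamt(asm: str) -> int:
    """Recover the immediate from a generated string (last comma-separated operand)."""
    return int(asm.rsplit(",", 1)[1].strip())


if __name__ == "__main__":
    rng = random.Random(0)
    for xlen in (32, 64):
        print(xlen, random_slli(xlen, True, rng), "|", random_slli(xlen, False, rng))
```

In a real harness, each generated string would be handed to the assembler under test, and its acceptance or rejection compared against the oracle. The same immediate can be valid on RV64 and invalid on RV32, which is why the generator takes xlen as an input.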
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main    #1527   +/-   ##
=======================================
  Coverage   71.94%   71.94%
=======================================
  Files          54       54
  Lines       27976    27976
  Branches     6183     6183
=======================================
  Hits        20128    20128
  Misses       7848     7848
This PR is timely -- see #1435, which is just about ready. It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.
Indeed. Looks like we're headed in the same rough direction.
Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?
OK. Looking... Should we discuss this here, there, in a meeting, or in a GitHub "discussion"? This is a fairly big topic.
The versioning is supposed to indicate that we are nearing 1.0 of a schema. I was thinking that we'd get 0.9 in main and then start a review process.
Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?
Currently, assembly syntax is represented as a simple string of comma-separated operands with a heuristic naming convention to indicate their respective type and purpose.

Make this more rigorous by adding a schema which supports:

- registers
- the registers' register file (GPR, FPR, VR, CSR, etc.)
- dereference syntax "(reg)"
- dereference+offset syntax "offset(reg)"
- immediates
- floating-point rounding mode and possible values
- FENCE scopes
- register lists (for POP/PUSH)

The new "operands" YAML field is currently optional and coexists with the existing "assembly" field. This allows support to be added incrementally, both to the YAML files and to the infrastructure that generates actual assembly syntax where needed (for example, documentation), until the "assembly" field is no longer needed.
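As a rough illustration of how structured operands could be turned back into an assembly string, here is a minimal sketch. It does not reflect the schema in this PR: the `Register` and `Immediate` classes, their field names, and `render_assembly` are all hypothetical, chosen only to mirror the register-file, "(reg)", and "offset(reg)" cases listed in the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass(frozen=True)
class Register:
    name: str                     # operand name, e.g. "rd" or "rs1"
    regfile: str                  # register file, e.g. "GPR", "FPR", "VR", "CSR"
    deref: bool = False           # render as "(reg)" when True
    offset: Optional[str] = None  # render as "offset(reg)" when set


@dataclass(frozen=True)
class Immediate:
    name: str                     # operand name, e.g. "imm"


Operand = Union[Register, Immediate]


def render_operand(op: Operand) -> str:
    """Render a single operand in assembly syntax."""
    if isinstance(op, Immediate):
        return op.name
    if op.offset is not None:
        return f"{op.offset}({op.name})"
    if op.deref:
        return f"({op.name})"
    return op.name


def render_assembly(mnemonic: str, operands: Sequence[Operand]) -> str:
    """Join a mnemonic and its operands into one assembly string."""
    if not operands:
        return mnemonic
    return f"{mnemonic} " + ", ".join(render_operand(op) for op in operands)


if __name__ == "__main__":
    load = [Register("rd", "GPR"), Register("rs1", "GPR", offset="imm")]
    print(render_assembly("lw", load))
```

Keeping the register file on each operand, rather than inferring it from the operand name, is the kind of information the heuristic naming convention cannot express reliably.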