Skip to content

Don't canonicalise file keys#1250

Open
lionel- wants to merge 1 commit into
oak/salsa-13-scan-asyncfrom
oak/salsa-14-canonicalisation
Open

Don't canonicalise file keys#1250
lionel- wants to merge 1 commit into
oak/salsa-13-scan-asyncfrom
oak/salsa-14-canonicalisation

Conversation

@lionel-

@lionel- lionel- commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Branched from #1245
Progress towards #1212

While I was working on breakpoints I ran into difficulties matching files from different locations: frontend URIs via DAP requests and source references from R whose paths went through normalizePath(). The latter does file canonicalisation, which means that in addition to resolving . and .., it also follows symlinks. The frontend URIs are not canonicalised, which caused CI failures when running tests in projects storedc in temporary folders, with mismatches between /tmp and /private/tmp. The fix I implemented was to canonicalise everywhere and encode that canonicalisation happened with the UrlId file, which we stored as file keys in our maps tracking state (e.g. breakpoints) by file.

The UrlId normalisation also dealt with other issues that surfaced over time, such as windows letter drive casing because to fix issues with lowercase on one end and uppercase on the other.

The same sort of issue came up with the LSP: we get URIs from the frontend, but we also encounter source() calls that may be relative and we need to resolve them. So I reached for the same UrlId approach to make the different file paths agree with each other.

It turns out that doing any sort of normalisation that requires I/O is a red flag. If a file gets deleted on disk, and we get an event about that file, there is literally no way to match it back to what we know. For the LSP that's a real issue because we get file deletion events after the file is deleted.

The other thing to watch out for: Always report back the unormalised URI/path to the frontend, because that's what they use to map files on their end. Any changes might cause the frontend to treat the file as different.

With this PR:

  • We no longer canonicalise files that we store as keys. We only do lexical normalisation: resolve . and ... That matches r-a and ty.
  • We still do canonicalisation at comparison time, in a secondary lookup. This way we can still match file paths coming from different sources.
  • We keep track of original file URI/paths for communication with the frontend. For now only in the DAP, this will be done for the LSP in another PR.

@DavisVaughan

Copy link
Copy Markdown
Contributor

It turns out that doing any sort of normalisation that requires I/O is a red flag. If a file gets deleted on disk, and we get an event about that file, there is literally no way to match it back to what we know.

LOL OUCH

@DavisVaughan DavisVaughan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simultaneously:

  • am sad we dont have a canonical uri everywhere, but get it, especially with the deleted file example
  • am happy we dont have to do IO for uri normalization

Comment thread crates/ark_test/src/dummy_frontend.rs
Comment thread crates/ark_test/src/dummy_frontend.rs
Comment on lines +20 to +22
/// What we normalise: drive-letter casing on Windows; percent-encoding
/// of `:` (decoded via `Url -> PathBuf -> Url` round-trip). No I/O:
/// `std::fs::canonicalize()`, no symlink resolution. The same input URI

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont we also normalize ..?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mention it in the pr, but i dont see it anywhere here, so thats why im asking

We only do lexical normalisation: resolve . and ...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to see a test in this file either way!

@lionel- lionel- Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find it either! Must have made a rebase mistake along the way or other... I distinctly remember component parsing code.

Claude says "Url::parse already collapses dot segments for file: URLs, but Url::from_file_path doesn't". Maybe I missed that the agent decided to remove that code at some point because of the parse behaviour 😬

Comment thread crates/aether_url/src/lib.rs
Comment thread crates/ark/src/dap/dap_state.rs
/// only. Lookups try the primary first (cheap, no fs touch), then
/// canonicalise the incoming URI and consult the secondary. DAP wire
/// output (`debugInfo` source paths, breakpoint events) round-trips the
/// primary key unchanged, so the frontend always sees the URI it sent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea IIUC this isn't actually what is happening

In handle_set_breakpoints() we do this

let uri = match UrlId::from_file_path(path)

and we store that uri in by_url here.

So I'm not entirely sure "the frontend sees the uri it sent". It might be lexically normalized?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is either fine, and needs a doc comment update, or isn't fine, and we need a third map that gets us back to the original path: String the DAP provided us 😬

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And either way, a test could be good

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be consistent and send the verbatim path back here, thanks for catching this

@lionel- lionel- Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a third map, I've just included the verbatim path in the first map. I wonder if I should store the verbatim in the path variant of FilePath, in the next PR.

@lionel- lionel- force-pushed the oak/salsa-13-scan-async branch 2 times, most recently from 8f13f9c to 5f3ecf2 Compare June 10, 2026 14:23
@lionel- lionel- force-pushed the oak/salsa-14-canonicalisation branch from eb2c844 to 85cdf4f Compare June 10, 2026 14:31
@lionel- lionel- force-pushed the oak/salsa-13-scan-async branch from ec1435d to a2787c2 Compare June 10, 2026 20:10
Only canonicalise when comparing against paths from external sources
(e.g. normalised by R). Canonicalisation may corrupt the identity of
paths sent by frontends, and may prevent matching files at all when they
are deleted (at which point a non-canonicalised path sent by the
frontend can't be matched against canonicalised paths we've stored as
keys)
@lionel- lionel- force-pushed the oak/salsa-14-canonicalisation branch from 85cdf4f to 9463329 Compare June 11, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants