Skip to content

Add oak_scan crate with initial library scanning#1243

Merged
lionel- merged 11 commits into
mainfrom
oak/salsa-11-scan-library
Jun 10, 2026
Merged

Add oak_scan crate with initial library scanning#1243
lionel- merged 11 commits into
mainfrom
oak/salsa-11-scan-library

Conversation

@lionel-

@lionel- lionel- commented May 29, 2026

Copy link
Copy Markdown
Contributor

Branched from #1226
Progress towards #1212

Introduces oak_scan with a library scanner. The LSP layer calls (in the next PR) the scanner on startup with .libPaths(). Each libpath is set as a library root in oak_db.

oak_scan is the only writer of Salsa inputs in oak_db, and is in charge of maintaining invariants, such as making sure backpointers are up-to-date.

Library roots are static for now. Watching them for changes (package removed or installed) would require our own filesystem watcher. Since we don't need one for the workspace (we'll use LSP file events), the machinery would be a bit too heavy for what we would gain for it at this stage. For now the LSP can be restarted to register changes to libpaths.

Comment on lines +88 to +100
fn test_scan_multiple_library_paths_preserve_order() {
let tmp1 = tempfile::tempdir().unwrap();
let tmp2 = tempfile::tempdir().unwrap();
write_package(&tmp1.path().join("pkg1"), "pkg1", &[]);
write_package(&tmp2.path().join("pkg2"), "pkg2", &[]);
let mut db = OakDatabase::new();

let paths: Vec<PathBuf> = vec![tmp1.path().to_path_buf(), tmp2.path().to_path_buf()];
db.set_library_paths(&paths);

let roots = db.library_roots().roots(&db).clone();
assert_eq!(roots.len(), 2);
assert_eq!(roots[0].packages(&db)[0].name(&db), "pkg1");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random thought

We have quite a few things in oak that are ordering dependent, but I always forget which is and is not order dependent

It feels like it would be cool to have SortedVec be returned here, or something kind of like it, where its clear that the result is sorted (then, of course, the question becomes "sorted by what"?).

But maybe LibraryRoots suffices, since its docs state that the order invariant is upheld there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roots are not sorted lexicographically though, or against a well defined sort key that can be easily used from elsewhere, so SortedVec would be a bit weird. It's more like they are ordered than sorted. I agree LibraryRoots suffices as the upholder of the invariant.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets teach claude to always do tests/integration/ for all integration tests!

Comment on lines +32 to +44
/// Description of one R file the scanner wants to register.
///
/// `contents` is the on-disk snapshot at scan time. It's used as the
/// initial content whenever the helper mints a new `File` entity, i.e.
/// the first time a URL is seen, whether at the initial scan or on a
/// later rescan that discovers a newly-created file.
///
/// If a `File` already exists at this URL (scanner-created from an
/// earlier scan, or VFS-created via `didOpen`), the helpers reuse that
/// entity and leave its content alone. `set_contents` (driven by the
/// VFS) is the authoritative way to update content.
#[derive(Clone, Debug)]
pub struct FileEntry {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the thought "is it always an R file?"

It seems to be, so it might be nice if the name reflected this, maybe just RFile? Or RFileEntry.

But the generic FileEntry left me wondering if it was also used for other things besides just R files

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand there is nothing constraining the type to an R file, it's just that we only use it for R ones.

Comment thread crates/oak_scan/src/inputs.rs Outdated

/// Extension methods on the database for scanner orchestration and
/// placement-aware updates that don't have a natural `Root` receiver.
pub trait DbExt: Db + DbInputs {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to advocate for a more informative trait name than DbExt

Perhaps DbLibraryPaths or DbScan

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I like DbScan after the crate name.

Comment thread crates/oak_scan/src/library.rs Outdated
Comment on lines +109 to +110
for &pkg in &packages {
pkg.set_files(db).to(Vec::new());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for &pkg in &packages {
pkg.set_files(db).to(Vec::new());
for &package in &packages {
package.set_files(db).to(Vec::new());

The naming inconsistency hurts me 😆

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I know, sorry. Using short names in small scopes and longer names in larger ones is my writing style (https://www.lysator.liu.se/c/pikestyle.html). Not sure how to resolve our difference on this :-)

Comment thread crates/oak_db/src/db.rs Outdated
/// dropped on a previous `set_*_paths` call, so the entity gets
/// reused on re-add. Analysis paths should not call this — they use
/// [`Db::package_by_name`] which is stale-blind.
fn package_by_url(&self, url: &UrlId) -> Option<Package>;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea if I should use this vs package_by_name based on the names alone 😢

I wonder if it can be package_by_url_including_stale() or something to really discourage usage of it

Or separate into package_by_url() and stale_package_by_url() for use by the scanner

I wonder if we are going to have to expose some ugly internals like this by having the scanner and db in separate crates :/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm good point.

I moved the by-url lookup to oak_scan, which is the only one that should be concerned with stale roots.

If an analysis crate has a URL for a package, they can look up the name on disk and use package_by_name(), which resolves the effective package for that name.

Comment thread crates/oak_db/src/db.rs
Comment on lines +90 to +92
/// Stale roots are not included. Salsa-cached and invalidates when one of
/// `workspace_roots` / `library_roots` / `orphan_root` changes.
fn live_roots(&self) -> &[LiveRoot];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also accept Dead root instead of Stale root since we have terminology like:

  • live roots
  • "resurrect"ing of dead/stale roots

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd be fine, just a bit grim to my taste 😄

Comment thread crates/oak_db/src/db.rs Outdated
Comment on lines +152 to +160
pub fn package_by_url_query(db: &dyn Db, url: &UrlId) -> Option<Package> {
for &root in db.live_roots() {
if let LiveRoot::Workspace(r) | LiveRoot::Library(r) = root {
if let Some(&pkg) = root_package_url_index(db, r).get(url) {
return Some(pkg);
}
}
}
for root in db.library_roots().roots(db) {
if let Some(&pkg) = root_package_index(db, *root).get(name) {
return Some(pkg);
stale_package_url_index(db).get(url).copied()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i would be very happy to see a pair of functions like

  • package_by_url_query() / stale_package_by_url_query()
  • live_package_by_url_query() / dead_package_by_url_query()

that feels less confusing to me in the long run, i feel like its a bit hard to tell exactly which set of roots are checked just from the function name. like, is the stale package set checked or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think moving the lookup to oak_scan mostly resolves the confusion.

Comment thread crates/oak_db/src/db.rs Outdated
@lionel- lionel- force-pushed the oak/salsa-11-scan-library branch from d50f230 to 76a82b0 Compare June 10, 2026 10:07
@lionel- lionel- merged commit bb1e1a8 into main Jun 10, 2026
17 checks passed
@lionel- lionel- deleted the oak/salsa-11-scan-library branch June 10, 2026 10:11
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 10, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants