Skip to content

R path#234

Open
weswc wants to merge 2 commits intomainfrom
r_path
Open

R path#234
weswc wants to merge 2 commits intomainfrom
r_path

Conversation

@weswc
Copy link
Copy Markdown
Member

@weswc weswc commented May 15, 2025

For build processes, it is often helpful to not have to specify an R version, instead just use the version found on the path. Additionally, R installed in non-standard locations has previously not been accessible to rv (except if it is added to the PATH). To address both of these issues is the addition of the r_path field, which if specified rv will look for R at that location (and that location only). This allows the r_version to be optional, while still requiring specificity around what R is to be used.

The rules for the two fields are as followed:

  1. If r_path and r_version is specified, the version found at r_path MUST hazy match the one specified in r_version
  2. If r_path is specified, but r_version is not, the version found at r_path will be the version of r used
  3. If r_version is specified, but r_path is not, we will attempt to find that R version on the system
  4. If neither are specified, fail

@weswc
Copy link
Copy Markdown
Member Author

weswc commented May 15, 2025

1. Both r_path and r_version specified

a. Compatible:

r_path = "/opt/R/4.4.1/bin/R"
[project]
name = "test"
r_version = "4.4"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
r-version: 4.4

b. Bad hazy match:

r_path = "/opt/R/4.4.1/bin/R"
[project]
name = "test"
r_version = "4.4.2"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
Failed to load config at `.`

Caused by:
    r_version (4.4.2) does not match version found at r_path (4.4.1)

c. Incompatible version:

r_path = "/opt/R/4.4.1/bin/R"
[project]
name = "test"
r_version = "4.3"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
Failed to load config at `.`

Caused by:
    r_version (4.3) does not match version found at r_path (4.4.1)

2. Only r_path specified

a. Valid path:

r_path = "/opt/R/4.3.2/bin/R"
[project]
name = "test"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
r-version: 4.3.2

b. Use path R version:

r_path = "R"
[project]
name = "test"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
r-version: 4.4.1

3. Only r_version specified

[project]
name = "test"
r_version = "4.3"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
$ rv info --r-version
r-version: 4.3

4. Nothing specified

[project]
name = "test"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
Failed to load config at `.`

Caused by:
    r_version or r_path must be set

Summary with --r-version

r_path = "R"
[project]
name = "test"
repositories = [ { alias = "CRAN", url = "https://cran.rstudio.com" } ]
dependencies = ["R6"]
$ rv summary --r-version=4.5
== System Information == 
OS: linux (x86_64)
R Version: 4.5

Num Workers for Sync: 6 (4 cpus available)
Cache Location: /cluster-data/user-homes/wes/.cache/rv

== Dependencies == 
Library: rv/library/4.5/x86_64/jammy
Installed: 0/1

Package Sources: 
  CRAN: 0/1 source packages

Installation Summary: 
  CRAN: 1/1 to download (1 to compile)

== Remote == 
CRAN (https://cran.rstudio.com): 0 binary packages, 22404 source packages

@weswc weswc requested review from Keats and dpastoor May 15, 2025 19:42
@Keats
Copy link
Copy Markdown
Collaborator

Keats commented May 15, 2025

I haven't reviewed the code but I'm not too keen on the idea. This makes projects not portable, eg if you have r_path set I would need to modify the rproject.toml to make it work on my machine and remember to not commit it. It will also not work for CI most likely

@weswc
Copy link
Copy Markdown
Member Author

weswc commented May 15, 2025

I will let @dpastoor also weigh in, but this is the solution we landed on together. The problem space is wanting to run CI's with a matrix of R versions (i.e. 4.4 and 4.5) without needing to change the config between sync's, but not just make r_version completely optional.

We considered using something like use_r_path: bool in the config that would allow r_version to be None, but ended up at this solution where to opt into using whatever is on the path, specify "R" in r_path, and then add the additional flexibility to specify non-standard R version locations, but acknowledging you're losing portability.

@Keats
Copy link
Copy Markdown
Collaborator

Keats commented May 15, 2025

That sounds like the flag --r-version could be added to sync then?

@dpastoor
Copy link
Copy Markdown
Member

there is absolutely a chance of not being portable - but also we need some ways to actually loosen the portability/precision. I think flags make a lot of sense, but also i don't want it to be only flags - its too easy to forget to use a flag.

If it becomes a situation where you'd need to use that flag every time you do an activity, it should be a config var, and this is a case of it can be used for one-off's but also could be used.

We also have the issue at the moment of we're only auto-discovering the 'well-known' areas that rstudio installs r versions, if a group installs it anywhere else on the system we have no way of telling rv where to find it without doing path fiddling in the specific terminal session. So we should absolutely have some path disclaimers, but i think its worth giving the option.

The most common path usage I'd see is literally r_path: "R" which would be portable, just loose to say "use whatever default version is exposed".

This would work for CI very well since CI installs the specific version of R requested in the matrix and puts it on the default path

@Keats
Copy link
Copy Markdown
Collaborator

Keats commented May 16, 2025

That does throw out the reproducible part of rv through the window though.
Every build with this option is basically a brand new build, you can't use the lockfile since the R path could be pointing to anything (essentially like use_lockfile = false) and it also won't work across environments.

If it becomes a situation where you'd need to use that flag every time you do an activity, it should be a config var, and this is a case of it can be used for one-off's but also could be used.

I would expect to not have a config option that breaks reproducibility/portability tbh. IMO if we really want something like that, the R version should be required in the rproject.toml still just so we keep the reproducibility aspect. Eg

[project]
r_version = "4.4"
# Where my R is installed locally
r_path = "/usr/bin/R"

And the validation stays the same as in this PR. This way at least it's kinda reproducible since the R version is explicit, otherwise rv will just give me random stuff depending on when I'm running the sync and what's in my /usr/bin

@dpastoor
Copy link
Copy Markdown
Member

dpastoor commented May 17, 2025

I don't think it throws reproducibility out the window. It does open the door to define a level of precision needed, of which now we've said "reproducible" = matching R plus package versions.

here is what we can say about reproducibility:

  1. compiling anything from src instead of binary with different env set can change reproducibility
  2. changing the underlying blas library R links to will impact reproducibility (we're hitting that now in prod in multiple cQT projects)
  3. Changing other components of how R is compiled can absolutely impact reproducibility within the same R version. Its less used these days but in HPC clusters will have multiple of the same version of R linking against different fortran/blas libraries. MKL+intell fortran vs openblas/gfortran type deal.

We can also say that for many projects getting the same package versions with a different version of R can often get consistent results.

Finally, I think we're opening the developer door a bit with rv around making it something where stricter reproducibility should be used for projects, and should generally be the "default state" of rv, but for the developer persona giving more fluidity. For example, setting use_lockfile = false and pointing to a moving target of latest is a very convenient way to consistently and purposefully keep pulling in the latest of everything.

I think we can and should discuss the specific implementation and nomenclature to make sure these options and the associated risks are well articulated.

To use an explicit example here for us - I would like us to get to the point that we can include rv setup for all our R packages - use https://github.com/A2-ai/dvs as an example. This should NOT need an explicit R version, as a developer should be able to come and develop dvs against different versions of R on purpose and be able to switch between them locally, or otherwise not have to find a specific version. To tie this to a cargo example, the cargo.toml/cargo.lock do not force a specific version of rustc, though it can be used if needed.

To the use of the r_path variable itself as the way to get there - jury is still out - I honestly don't love it, but i'm not sure what we should do.

r_path gives us a way to do it via shorthand of "R" to say "whatever is on that path" in a pseudo-portable way. It also at the same time gives a way for people to be super explicit if they want eg if they have a custom version of R at /opt/custom-r/path/to/bin/R it gives a way to make sure rv uses that too, so a "two-birds-one-stone" situation.

I would also be open-to, and wonder if we could have something like no_version: true or r_version = "path" or some way to convey that concept by overloading r_version (i don't like this personally), or a new flag to express you don't care.

@Keats
Copy link
Copy Markdown
Collaborator

Keats commented May 19, 2025

Agreed on the system changes impacting reproducibility, here I meant reproducibility as in I will pull the same set of deps on my machine everytime for a project.

For example, setting use_lockfile = false and pointing to a moving target of latest is a very convenient way to consistently and purposefully keep pulling in the latest of everything.

This is like doing "*" for versions in Cargo.toml which works, but no one uses it in practice because it breaks all the time. Maybe it's better in R.

To use an explicit example here for us - I would like us to get to the point that we can include rv setup for all our R packages - use https://github.com/A2-ai/dvs as an example. This should NOT need an explicit R version, as a developer should be able to come and develop dvs against different versions of R on purpose and be able to switch between them locally, or otherwise not have to find a specific version. To tie this to a cargo example, the cargo.toml/cargo.lock do not force a specific version of rustc, though it can be used if needed.

I agree, it's just that it probably shouldn't be in the config file. If it's in the file, a contributor will still need to remove it or change to their own hardcoded paths and then remember to not commit that part to the PR.

We could allow r_version = "*" which allows anything, sets use_lockfile = false and have a config flag --r_path=R

@dpastoor
Copy link
Copy Markdown
Member

I would be fine with

We could allow r_version = "*" which allows anything

though i wouldn't have that definitely force lockfile to be false. plenty of package combos should run fine across r versions, so that shouldn't mean throwing that out for sure. If the lockfile can't be resolved due to dep constraints for a different R version that is their issue to then solve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants