Evaluate token support based on the list of CMSSW releases requested during job submission#12463

Open
amaltaro wants to merge 8 commits into dmwm:master from amaltaro:fix-12228

@amaltaro
Contributor

@amaltaro amaltaro commented Nov 11, 2025

Fixes #12228 (partially)

Status

In development

Description

Evaluate whether a job needs to explicitly disable token credentials, based on the CMSSW version and on whether its XRootD version fully supports tokens. The evaluation relies on two mechanisms:

  1. CMSSW version comparison: any CMSSW version lower than CMSSW_10_6_47 is considered not token-ready;
  2. List of releases with known broken token support: releases greater than the minimum version above, but still shipping an XRootD version (< 5.7.2) that does not fully support tokens.

This pull request provides 3 artifacts:

  1. WMCore code changes for per-job evaluation of token support;
  2. JSON data with a list of CMSSW versions greater than 10_6_47 that still have a broken XRootD version, see cmssw_no_token_support.json;
  3. a utility script, build_cmssw_bad_token.py, to search CMSSW releases and build a new JSON file with the not-token-ready releases.

Assumption-1: CMSSW releases with an XRootD version greater than or equal to 5.7.2 have full support for tokens.
Assumption-2: any CMSSW release lower than CMSSW_10_6_47 is considered NOT token-ready (as the CMSSW_4_x series reports xrootd version 5.27.06-cms3 ...)
Assumption-3: any CMSSW release greater than or equal to CMSSW_10_6_47 and not present in cmssw_no_token_support.json is considered to have full token support.
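
As an illustration only, the two mechanisms and three assumptions above could be sketched as follows. The function names are hypothetical and not taken from the WMCore patch, and requiring every release of a multi-step job to be token-ready is an assumption made for this sketch.

```python
import re

# Cut-off named in the description: anything below it is not token-ready.
MIN_TOKEN_READY = (10, 6, 47)


def parse_cmssw(release):
    """Return (major, minor, patch) for names like CMSSW_12_0_0 or CMSSW_15_0_0_pre3."""
    match = re.match(r"^CMSSW_(\d+)_(\d+)_(\d+)", release)
    if match is None:
        raise ValueError(f"Unrecognised CMSSW release name: {release}")
    return tuple(int(part) for part in match.groups())


def is_token_ready(release, no_token_releases):
    """Apply the version cut-off first, then the list of known-broken releases."""
    if parse_cmssw(release) < MIN_TOKEN_READY:
        return False
    return release not in no_token_releases


def job_is_token_ready(releases, no_token_releases):
    """Assumption: a job is token-ready only if all of its releases are."""
    return all(is_token_ready(rel, no_token_releases) for rel in releases)
```

In this sketch, `no_token_releases` would be the set of names loaded from cmssw_no_token_support.json; how that file is structured is not shown in this thread, so the loading step is left out.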

Is it backward compatible (if not, which system it affects?)

Partially
With this patch, assuming that token-based submission is enabled for JobSubmitter (useOauthToken), the agent will start deciding whether it sends an x509 or a token to the job runtime.

Related PRs

None

External dependencies / deployment changes

None

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 110 comments to review
  • Pycodestyle check: succeeded
    • 44 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1082/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro changed the title from "Fix 12228" to "Evaluate token support based on the list of CMSSW releases requested during job submission" on Nov 11, 2025
@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 110 comments to review
  • Pycodestyle check: succeeded
    • 44 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1085/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor

@amaltaro
Don't forget about the XRootD version inconsistencies before CMSSW v5. There are a bunch of v4 flagged as token ready because of it.

@amaltaro
Contributor Author

Thank you for spotting this problem, @khurtado. Indeed the xrootd.txt file says:

CMSSW_4_1_8_patch14,slc5_amd64_gcc434,5.27.06-cms3,Group 1 (>= 5.7.2)

which is greater than the cut-off XRootD version, hence it gets tagged as token-ready even though it is not.

To be on the safe side, I will add one extra constraint to the script that builds the map, where any release below CMSSW_9_x will be tagged as NOT token-ready. I think you suggested this in one of our chats and I believe it is a good compromise.
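
A minimal sketch of that extra constraint, assuming the comma-separated row layout shown above (release, architecture, XRootD version, group label). The helper names are hypothetical and this is not the actual build script.

```python
import re

MIN_XROOTD = (5, 7, 2)   # XRootD cut-off from the PR description
MIN_CMSSW_MAJOR = 9      # releases below CMSSW_9_x are never trusted


def version_tuple(text):
    """Leading dotted integers of a version string, e.g. '5.27.06-cms3' -> (5, 27, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"No version number found in: {text}")
    return tuple(int(part) for part in match.group(0).split("."))


def row_is_token_ready(row):
    """Classify one row; old releases lose regardless of the reported XRootD."""
    release, _arch, xrootd, _group = row.strip().split(",", 3)
    major = int(release.split("_")[1])
    if major < MIN_CMSSW_MAJOR:
        return False
    return version_tuple(xrootd) >= MIN_XROOTD
```

With this guard, the CMSSW_4_1_8_patch14 row quoted above is classified as not token-ready even though its reported XRootD version compares greater than 5.7.2.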

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 91 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1086/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 9, 2025

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 4 warnings
    • 181 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1124/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1137/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

test this please

@amaltaro
Contributor Author

test this please

@amaltaro
Contributor Author

test this please

rename json file

Update mapfile by properly marking CMSSW_4_x as not token-ready

REMOVE-ME: text file based on CVMFS releases/scram_archs and script to generate json file

rename script to build map file

Consider releases before CMSSW_9_x as not token ready; fix output file name

Repurpose script to construct list of releases that do not support tokens

code refactoring to use CMSSW version comparison and simple list of not-token-ready
@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1202/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1203/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

I have run the following workflow in testbed-vocms0265:

amaltaro_SC_ProdPsi_Oct2025_Val_260130_225504_1175

and I can confirm that token-based stage out works fine. I see these lines in condor.out:

CMS token found, setting BEARER_TOKEN_FILE=/srv/.condor_creds/cms.use
  OAuthServicesNeeded = "cms"

And these relevant lines in wmagentJob.log:

Stage out to : T2_CH_CERN using: xrdcp
2026-01-31 01:16:58,055:INFO:StageOutImpl:Running the stage out with the available auth method (attempt 1)... 
env BEARER_TOKEN_FILE=$BEARER_TOKEN_FILE BEARER_TOKEN=$(cat ${BEARER_TOKEN_FILE:-/dev/null}) xrdcp --force --nopbar --cksum adler32:21ea0683  "/srv/job/WMTaskSpace/cmsRun1/FEVTDEBUGoutput.root"  "<PFN_NAMESPACE>/cms/store/unmerged/CMSSW_12_0_0/RelValPsi2SToJPsiPiPi/GEN-SIM/GenSimFull_SC_ProdPsi_Oct2025_Val_Alanv1-v11/00000/1202784f-3561-4133-a2d7-7d174a48654b.root" 
xrdcp exit code: 0

Note that this workflow uses CMSSW_12_0_0, which was supposed to have token disabled!

It turns out the JSON file had not been properly loaded by BasePlugin.py, so I am fixing that now. In any case, it was a good test to ensure the ability to stage data out with tokens (with a release that was not meant to work though).

@amaltaro
Contributor Author

amaltaro commented Feb 3, 2026

Another update from my tests.

I executed this workflow
amaltaro_SC_ProdPsi_Oct2025_Val_260203_185357_2633

with this patch properly applied - and added an extra log record in SimpleCondorPlugin, to confirm what is actually being set -

2026-01-31 14:08:25,454:140291797415616:INFO:SimpleCondorPlugin:use_oauth_services set to UNDEFINED

and ALL of the jobs failed with

    <a n="HoldReason"><s>Job credentials are not available</s></a>

I don't really understand this error, but I suspect condor is looking for an undefined token credential in condor credd.

I am now disabling that else statement, such that use_oauth_services isn't set at all, when a job is not meant to use tokens.

@khurtado
Contributor

khurtado commented Feb 3, 2026

Hi Alan! I see you still have the regular proxy in the code:

https://github.com/dmwm/WMCore/pull/12463/changes#diff-1e3bd71d0521a5c2de66304c8b40efaeb5e7c2f6ed453a5720dd7373c420f4ebL528

Did you check proxy authentication was disabled for the test?
Otherwise, an old XRootD version like 4.12.3 (from CMSSW 12.0.0) may just use that and ignore the token (since it won't know what those files or variables are).
The other possibility is that the XRootD build may include some legacy scitoken plugin that partially supported tokens (full support was declared in the 5.x versions and later).

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2026

Hi @khurtado

> Hi Alan! I see you still have the regular proxy in the code:
> Did you check proxy authentication was disabled for the test?

Based on the job snippet above, for the actual command executed, I do not see this explicit "set to empty" line in the command:
https://github.com/dmwm/WMCore/blob/183c9f6/src/python/WMCore/Storage/Backends/XRDCPImpl.py#L34

so, yes, it is possible that it actually used the x509 credential (the force stage out option is only triggered after stage out failures).

Thinking further about these stage out and runtime developments, we could perhaps force the runtime behavior at the job submission time, with a change that would look like:

            if self.useCMSToken and isJobTokenReady:
                ad['use_oauth_services'] = "cms"
            else:
                ad['My.x509userproxy'] = classad.quote(self.x509userproxy)

This would ensure that only one credential is sent to the worker node, which looks safer and makes it easier to manage the credentials delivered to the nodes/clients. On the other hand, the runtime code would have less flexibility and fewer fallback mechanisms. Please let me know if you have any thoughts on this @khurtado @anpicci and others.

In any case, I ran another test workflow:
amaltaro_SC_ProdPsi_Oct2025_Val_260203_203804_6324

which works well if I remove the following 2 lines from the SimpleCondorPlugin:

else:
                ad['use_oauth_services'] = undefined

otherwise - as in my previous post - jobs fail in condor due to lack of credentials.

Kenyi, didn't the last htcondor upgrade require a workaround in our code, in the shape of always having the same set of job classads, in the same order? If so, do you think that not having use_oauth_services defined at all could cause any problems? If it looks safe, then I will update this pull request by removing those 2 lines - which likely gives us the final development state for this PR.

@khurtado
Contributor

khurtado commented Feb 4, 2026

@amaltaro That suggestion seems fine to me, or simply commenting out the x509 at submission for the test.

With respect to the bug, yes, it could be problematic:
#12462

We may need to do some testing to double check, and report the issue when we set oauth_services to undefined otherwise.

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2026

> With respect to the bug, yes, it could be problematic:
> #12462
> We may need to do some testing to double check, and report the issue when we set oauth_services to undefined otherwise.

Okay, let me run a set of workflows to see if we can trigger any issue with job submission.
In any case, to me it feels like we triggered an htcondor bug with the setting of ad['use_oauth_services'] = "UNDEFINED". Shall we report it to the developers?

@dmwm-bot

dmwm-bot commented Feb 4, 2026

Jenkins results:

  • Python3 Unit tests: failed
    • 20 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 16 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1205/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Feb 4, 2026

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1206/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Mar 5, 2026

test this please

@amaltaro
Contributor Author

amaltaro commented Mar 6, 2026

test this please

@dmwm-bot

dmwm-bot commented Mar 6, 2026

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1226/artifact/artifacts/PullRequestReport.html

            else:
                ad['My.x509userproxy'] = classad.quote(self.x509userproxy)
                ad['use_oauth_services'] = ""
                ad['use_oauth_services'] = "$(item)" # HACK: don't reproduce it anywhere else!
Contributor

@amaltaro I know you are still working on this, but I just wanted to note that you would also need to add the classad in the submit object (createSubmitRequest):

sub = htcondor.Submit("""
    # …
    MY.OAuthServicesNeeded = "$(item)"
    """)

Contributor Author

Hi @khurtado , apologies for the belated follow up.
Reading the code and the htcondor thread again, you are right to point out that Jaime used MY.OAuthServicesNeeded in his tests. I am always confused by these classads; for instance, what is the difference between this one and use_oauth_services? Isn't the latter an htcondor macro, while MY.OAuthServicesNeeded is an application-custom ad?

Additionally, I see that in SimpleCondorPlugin we use My.<ad>, while in the list post he used MY.<ad>. Is there any difference there as well?

Regarding setting it in createSubmitRequest(): isn't that the common submit object used in a batch submission across any workflows/tasks being submitted? I understand that, if we set it at the submit object level, it will force all jobs being submitted with schedd.submit(sub, jobParams) to inherit the same credential configuration - which is not what we want, as there can be different workflows+tasks in the submission package.

Contributor

@khurtado khurtado Apr 1, 2026

@amaltaro Responses inline

> Hi @khurtado , apologies for the belated follow up. Reading the code and the htcondor thread again, you are right to point out that Jaime used MY.OAuthServicesNeeded in his tests. I am always confused by these classads; for instance, what is the difference between this one and use_oauth_services? Isn't the latter an htcondor macro, while MY.OAuthServicesNeeded is an application-custom ad?

Yes, use_oauth_services is a macro, part of the Job Description Language (JDL), which is a higher level than the ClassAd Language (they are similar but not the same); in the end it manipulates the job classads and creates OAuthServicesNeeded. With MY.OAuthServicesNeeded, the ClassAd Language is used to create the classad directly (Cole's first response in the condor support email explains this better). My understanding is that while MY.<ad> is supposed to be used for custom ads, here we are using it to directly manipulate a classad that we know is used for token authentication, for this trick.

> Additionally, I see that in SimpleCondorPlugin we use My.<ad>, while in the list post he used MY.<ad>. Is there any difference there as well?

Yes, this is confusing because the documentation is not very explicit. Basically, MY.<ad> and TARGET.<ad> are treated as attribute names, which are case-insensitive (see below). The htcondor documentation and python examples used My. in the past, as far as I remember, but newer documentation encourages MY as the convention (there is no technical difference; I guess using all capitals makes these attribute names easier to read).

https://htcondor.readthedocs.io/en/latest/man-pages/classads.html#attributes

> Regarding setting it in createSubmitRequest(): isn't that the common submit object used in a batch submission across any workflows/tasks being submitted? I understand that, if we set it at the submit object level, it will force all jobs being submitted with schedd.submit(sub, jobParams) to inherit the same credential configuration - which is not what we want, as there can be different workflows+tasks in the submission package.

This is where the trick is. Yes, we are setting it for every single job, but setting it to a variable, $(item). And when item is set to $(blank), it is similar to setting it to an empty string in the Job Description Language, which removes the classad from the specific job in a later processing step.

So, Cole proposed a workaround at the Job Description Language level, but we manipulate the classads via python at the ClassAd Level. Jaime's trick allows us to use Cole's workaround at the ClassAd level. None of this is standard, but it is supported and it seems to work for our specific case/need.

I tested interactively (with a simplified custom case) and it worked in the way we need.
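
To make the placeholder idea concrete, here is a toy model of it in plain Python. It does not call HTCondor at all: `expand` is only a stand-in for the submit-time macro expansion described above, and `build_job_params`, `SUBMIT_TEMPLATE` and the `token_ready` key are hypothetical names for this sketch.

```python
# Shared submit description: one placeholder for every job in the batch.
SUBMIT_TEMPLATE = {"MY.OAuthServicesNeeded": '"$(item)"'}


def build_job_params(jobs):
    """One dict per job; 'item' carries the per-job token setting."""
    params = []
    for job in jobs:
        if job["token_ready"]:
            params.append({"name": job["name"], "item": "cms"})
        else:
            # $(blank) expands to nothing, as described in the thread.
            params.append({"name": job["name"], "item": "$(blank)"})
    return params


def expand(template, job_params):
    """Toy stand-in for macro expansion (not HTCondor code)."""
    ad = {}
    for attr, value in template.items():
        value = value.replace("$(item)", job_params["item"])
        value = value.replace("$(blank)", "")
        if value.strip('"') == "":
            continue  # mimic the attribute being dropped for this job
        ad[attr] = value
    return ad
```

The point of the model is that the submit description stays identical for all jobs in the batch, while the per-job dictionary decides whether the attribute survives.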

Contributor Author

Thank you for elaborating on the differences between JDL and Classad Language (even though I am sure I will have the same question next week).

Onto the suggested solution, note that we need to decide whether we set MY.OAuthServicesNeeded to cms or to an empty string (with the trick above). What I fail to see is how we can submit jobs from diverse workflows/tasks if we have this logic in the createSubmitRequest() method?
Either ALL jobs submitted will use tokens, or none of them.

What I am saying is that to me it is wrong to set it at createSubmitRequest, as it is not a general job submission parameter and it is payload-dependent.

Contributor Author

Or with the following code snippet, it might be easier to understand. The common submit dictionary:

    def createSubmitRequest(self, jobList, cmsswMicroArchs=None):
        # using classad language
        sub['MY.OAuthServicesNeeded'] = "$(item)"   # HACK: don't reproduce it anywhere else!

is superseded by the actual job parameter dictionary:

    def getJobParameters(self, jobList, cmsswMicroArchs=None):
        # using JDL for tokens, classad language for x509
            if self.useCMSToken and isJobTokenReady:
                ad['use_oauth_services'] = "cms"
                ad['My.x509userproxy'] = ""
            else:
                ad['My.x509userproxy'] = classad.quote(self.x509userproxy)
                ad['use_oauth_services'] = ""

effectively bypassing any translations between JDL and classad language.
In other words, the submit object only has a placeholder for the actual job parameter.

Is it what you were explaining before? Is it your understanding of what Jaime suggested?

Contributor

@amaltaro Yes. It is a placeholder, and the point is to deal with token parameters at the classad language level, also using the $(blank) trick, since using "" directly doesn't work (as opposed to the JDL level).

@amaltaro amaltaro requested a review from khurtado April 1, 2026 15:56
@amaltaro
Contributor Author

amaltaro commented Apr 1, 2026

Kenyi, as per our discussion in #12463 (comment), I have implemented the relevant code changes to SimpleCondorPlugin. Can you please review it? If it looks good to you, I will then patch vocms0193 and run some tests. FYI @kersevan

@amaltaro
Contributor Author

amaltaro commented Apr 1, 2026

test this please

Contributor

@khurtado khurtado left a comment

@amaltaro I left a comment along the code. We may need to double check on the x509proxy as well.

                ad['use_oauth_services'] = "cms"
                ad['My.x509userproxy'] = ""
            else:
                ad['use_oauth_services'] = "$(blank)"
Contributor

@amaltaro I think we need to define item here instead of use_oauth_services.

E.g.:

if self.useCMSToken and isJobTokenReady:
                ad['item'] = "cms"
                ad['My.x509userproxy'] = ""
            else:
                ad['item'] = '$(blank)'

Could you remind me if we tested ad['My.x509userproxy'] = "" ?
We may need to apply the same trick for this one.
We can likely change the name item to something more meaningful as well (e.g.: tokenParameter?)

Contributor Author

If my memory does not fail me, we have already tested My.x509userproxy with an empty value and it works well. Given that the order of job submission matters for this setup, it's probably better to test it in a standalone setup.

@dmwm-bot

dmwm-bot commented Apr 1, 2026

Jenkins results:

  • Python3 Unit tests: failed
    • 11 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1234/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Apr 1, 2026

Thanks Kenyi. I made another update to this PR. Please have a look at your convenience.

@amaltaro
Contributor Author

amaltaro commented Apr 1, 2026

test this please

@khurtado
Contributor

khurtado commented Apr 1, 2026

> Thanks Kenyi. I made another update to this PR. Please have a look at your convenience.

Thanks Alan, it looks good to me.

@amaltaro
Contributor Author

amaltaro commented Apr 2, 2026

test this please

@dmwm-bot

dmwm-bot commented Apr 2, 2026

Jenkins results:

  • Python3 Unit tests: failed
    • 11 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1237/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Apr 2, 2026

test this please

@amaltaro
Contributor Author

test this please

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1297/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

I decided to resume this development and testing, and noticed that the condor credential was no longer available in the schedd:

cmst1@vocms0265:amaltaro $ condor_store_cred query-oauth -s cms
Account: <current> (cmst1)
CredType: oauth

No credential is stored.

Next, I followed the instructions provided by @khurtado over Mattermost, in summary:

# note that this device might change 
amaltaro@vocms0265:~ $ chmod o+w /dev/pts/0

amaltaro@vocms0265:~ $ cmst1
cmst1@vocms0265:amaltaro $ cd /data/cmst1/tokens
cmst1@vocms0265:tokens $ condor_submit submit.jdl
Submitting job(s)
Attempting to get tokens for cms
...
Attempting OIDC authentication with <vault url>
...
Complete the authentication at:
    https://cms-auth....
# I completed it via SSO
...
Operation succeeded and is waiting for processing.

and now I can list the token stored in the schedd again.

@amaltaro
Contributor Author

I managed to get two workflows with job submission in the same JobSubmitter polling cycle: one using a CMSSW 12x release (hence x509) and the second using 15x (hence OAuth), submitted in that order.

Inspecting one job for each, here are their relevant classads:

# using x509 - CMSSW_12_x
cmst1@vocms0265:amaltaro $ condor_q -l 3127.8 -af:h WMAgent_RequestName OAuthServicesNeeded use_x509userproxy x509userproxy x509UserProxyEmail x509UserProxyExpiration x509userproxysubject
WMAgent_RequestName = "amaltaro_SC_ProdPsi_12x_Oct2025_Val_260416_155957_2773"
ClusterId = 3127
ProcId = 8
OAuthServicesNeeded = undefined
use_x509userproxy = true
x509userproxy = "/data/certs/myproxy.pem"  # and the other x509 ads have been set

and

# using OAuth - CMSSW_15_x
cmst1@vocms0265:amaltaro $ condor_q -l 3127.15 -af:h WMAgent_RequestName OAuthServicesNeeded use_x509userproxy x509userproxy x509UserProxyEmail x509UserProxyExpiration x509userproxysubject
WMAgent_RequestName = "amaltaro_SC_ProdPsi_15x_Oct2025_Val_260416_160012_361"
ClusterId = 3127
ProcId = 15
OAuthServicesNeeded = cms
use_x509userproxy = true
x509userproxy = undefined  # despite this undefined, the other x509 ads have been set

We can observe that use_x509userproxy was set to true in both workflows, which makes me think that the same trick needs to be applied for x509 (as you have foreseen, @khurtado ).

@amaltaro
Contributor Author

test this please

@amaltaro
Contributor Author

test this please

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 182 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/1312/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

After disabling use_x509userproxy in the condor schedd configuration, I have run two more test workflows and I no longer see this job classad defined in the jobs.

Here is a job that was meant to use x509 credentials and keep token disabled:

cmst1@vocms0265:amaltaro $ condor_history -l 3148.4 -af:h WMAgent_RequestName OAuthServicesNeeded use_x509userproxy x509userproxy x509UserProxyEmail x509UserProxyExpiration x509userproxysubject CMS_JobType CMSSW_Versions
CMSSW_Versions = "CMSSW_12_0_0,CMSSW_12_0_0,CMSSW_12_0_0"
CMS_JobType = "Production"
OAuthServicesNeeded = undefined
WMAgent_RequestName = "amaltaro_SC_ProdPsi_12x_Oct2025_Val_260424_124257_2594"
x509UserProxyEmail = "xxx"
x509UserProxyExpiration = 1777640102
x509userproxy = "xxx"
x509userproxysubject = "/DC=ch/DC=cern/xxx"

while this job is meant to use tokens and keep x509 disabled.

cmst1@vocms0265:amaltaro $ condor_history -l 3148.10 -af:h WMAgent_RequestName OAuthServicesNeeded use_x509userproxy x509userproxy x509UserProxyEmail x509UserProxyExpiration x509userproxysubject CMS_JobType CMSSW_Versions
CMSSW_Versions = "CMSSW_15_0_0_pre3,CMSSW_15_0_0,CMSSW_15_0_0_pre3"
CMS_JobType = "Production"
OAuthServicesNeeded = cms
WMAgent_RequestName = "amaltaro_SC_ProdPsi_15x_Oct2025_Val_260424_124313_8331"
x509UserProxyEmail = "xxx"
x509UserProxyExpiration = 1777640102
x509userproxy = undefined
x509userproxysubject = "/DC=ch/DC=cernxxx"

Where we can make the following two observations:

  • use_x509userproxy is no longer found in any of these jobs (which belong to different workflows)
  • despite explicitly setting x509userproxy to undefined, all of the other x509 classads are still set with an actual valid content.

I need to check with @khurtado and other SI/HTCondor experts to see if there is any problem with this setup. I can confirm though that the workflow based on x509 has succeeded, without use_x509userproxy enabled.
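
For repeated checks of this kind, a small helper could summarise which credential attributes are actually set in a classad dump. This is only a sketch: the function names are hypothetical, and it assumes the plain "Attr = value" layout printed by condor_q -l and condor_history -l, with no trailing comments.

```python
X509_EXTRA = ("x509UserProxyEmail", "x509UserProxyExpiration", "x509userproxysubject")


def parse_classads(text):
    """Parse 'Attr = value' lines; attribute names are case-insensitive."""
    ads = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skips shell prompts and blank lines
        attr, value = line.split(" = ", 1)
        ads[attr.strip().lower()] = value.strip().strip('"')
    return ads


def credential_summary(ads):
    """Report which credential-related attributes carry a real value."""
    def is_set(attr):
        return ads.get(attr.lower(), "undefined") != "undefined"

    return {
        "token": is_set("OAuthServicesNeeded"),
        "x509_file": is_set("x509userproxy"),
        "x509_leftovers": sorted(a.lower() for a in X509_EXTRA if is_set(a)),
    }
```

Applied to the token job above, such a summary would report the token as set, the proxy file as unset, and the three remaining x509 attributes as leftovers, which matches the second observation.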

@khurtado
Contributor

@amaltaro Can you also check the environment variables in the wmagent job logs ?

Specifically, this variable:

X509_USER_PROXY

@amaltaro
Contributor Author

Log condor.out does not have any definition of X509_USER_PROXY for the workflow using CMSSW_15x, however wmagentJob.log starts the stage out attempt with x509 instead:

2026-04-24 14:11:02,320:INFO:XRDCPImpl:Stage out requested with tokens, but environment variable is not defined. Forcing it to use X509 authentication method instead.

and the reason seems to be that _CONDOR_CREDS isn't defined (from condor.out - isn't condor schedd supposed to be setting it?):

Variable _CONDOR_CREDS is not defined, condor auth/token credentials directory not found.

Checking the other workflow that uses an older CMSSW_12x release, I see X509_USER_PROXY defined and pointing to the proxy file, and the stage out behaved properly:

2026-04-24 14:34:31,620:INFO:XRDCPImpl:Stage out requested with tokens, but environment variable is not defined. Forcing it to use X509 authentication method instead.

So we need to figure out why _CONDOR_CREDS is not defined at job runtime.
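
The fallback seen in these logs can be modelled roughly as below. This is not the WMCore runtime code: the function names are hypothetical, and the `cms.use` file name under the credentials directory is taken from the earlier condor.out excerpt (BEARER_TOKEN_FILE=/srv/.condor_creds/cms.use).

```python
import os


def find_token_file(env, service="cms"):
    """Return the token file under $_CONDOR_CREDS, or None if unavailable."""
    creds_dir = env.get("_CONDOR_CREDS")
    if not creds_dir:
        return None  # the situation reported in condor.out above
    token_file = os.path.join(creds_dir, f"{service}.use")
    return token_file if os.path.isfile(token_file) else None


def choose_auth(env):
    """Prefer a token when one is reachable, otherwise fall back to x509."""
    token_file = env.get("BEARER_TOKEN_FILE") or find_token_file(env)
    if token_file:
        return "token", token_file
    if env.get("X509_USER_PROXY"):
        return "x509", env["X509_USER_PROXY"]
    return "none", None
```

In this model, a job without _CONDOR_CREDS can only use a token if BEARER_TOKEN_FILE is already set; otherwise it falls back to x509 when a proxy is available, which is consistent with both log excerpts above.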

@khurtado
Contributor

@amaltaro I think the issue with _CONDOR_CREDS should be reported to the HTCondor team. However, could you try one more test before that?

Change the order here:
https://github.com/dmwm/WMCore/pull/12463/changes#diff-1e3bd71d0521a5c2de66304c8b40efaeb5e7c2f6ed453a5720dd7373c420f4ebR753-R755

So that x509userproxy is set first, then OAuthServicesNeeded.

E.g.: in the condor_q -l output, the classads would be displayed in this order:

x509userproxy = <>
OAuthServicesNeeded = <>

Development

Successfully merging this pull request may close these issues.

Integration and validation of the Bastion service to manage service tokens in the agents