Evaluate token support based on the list of CMSSW releases requested during job submission#12463
amaltaro wants to merge 8 commits into dmwm:master
@amaltaro |
|
Thank you for spotting this problem, @khurtado. Indeed the xrootd.txt file says: which is greater than the cut-off XrootD version, hence the release is tagged as token-ready despite not being so. To be on the safe side, I will add one extra constraint to the script that builds the map, where any release below CMSSW_9_x will be tagged as NOT token-ready. I think you suggested this in one of our chats and I believe it is a good compromise. |
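A minimal sketch of why a cut-off on the XRootD version alone can mislabel old releases. It assumes versions are compared as tuples of integers taken from the leading dotted part of the string, which may differ from what the actual map-building script does. The function names are hypothetical. The sample string is the one quoted in the PR description for the CMSSW_4_x series.

```python
import re

XROOTD_CUTOFF = (5, 7, 2)  # cut-off version quoted in the PR description


def parse_xrootd_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def xrootd_looks_token_ready(text, cutoff=XROOTD_CUTOFF):
    """Compare the parsed version against the cut-off (hypothetical helper)."""
    return parse_xrootd_version(text) >= cutoff


# The legacy version string sorts above the cut-off, so a numeric comparison
# alone would tag the release as token-ready.
print(xrootd_looks_token_ready("5.27.06-cms3"))
print(xrootd_looks_token_ready("5.7.1"))
```

This is the case that motivates the extra floor on the CMSSW release itself.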
test this please |
rename json file
Update mapfile by properly marking CMSSW_4_x as not token-ready
REMOVE-ME: text file based on CVMFS releases/scram_archs and script to generate json file
rename script to build map file
Consider releases before CMSSW_9_x as not token ready; fix output file name
Repurpose script to construct list of releases that do not support tokens
code refactoring to use CMSSW version comparison and simple list of not-token-ready
I have run the following workflow in testbed-vocms0265: and I can confirm that token-based stage out works fine. I see these lines in condor.out: And these relevant lines in wmagentJob.log: Note that this workflow uses It turns out the JSON file had not been properly loaded by BasePlugin.py, so I am fixing that now. In any case, it was a good test to ensure the ability to stage data out with tokens (with a release that was not meant to work though). |
|
Another update from my tests. I executed this workflow with this patch properly applied - and added an extra log record in SimpleCondorPlugin, to ensure what is actually being set - and ALL of the jobs failed with I don't really understand this error, but I suspect condor is looking for an I am now disabling that |
Hi Alan! I see you still have the regular proxy in the code: Did you check proxy authentication was disabled for the test? |
|
Hi @khurtado
Based on the job snippet above, for the actual command executed, I do not see this explicit "set to empty" line in the command: so, yes, it is possible that it actually used the x509 credential (the force stage out option is only triggered after stage out failures). Thinking further about these stage out and runtime developments, we could perhaps force the runtime behavior at the job submission time, with a change that would look like: This would ensure that only one credential is sent to the worker node, which looks safer and easier to manage credentials being delivered to the nodes/clients. On the other hand, the runtime code has less flexibility and fallback mechanisms. Please let me know if you have any thoughts on this @khurtado @anpicci and others In any case, I ran another test workflow: which works well if I remove the following 2 lines from the SimpleCondorPlugin: otherwise - as in my previous post - jobs fail in condor due to lack of credentials. Kenyi, didn't the last |
Okay, let me run a set of workflows to see if we can trigger any issue with job submission. |
test this please |
else:
    ad['My.x509userproxy'] = classad.quote(self.x509userproxy)
    ad['use_oauth_services'] = ""
    ad['use_oauth_services'] = "$(item)"  # HACK: don't reproduce it anywhere else!
@amaltaro I know you are still working on this, but I just wanted to note that you would also need to add the classad in the submit object (createSubmitRequest):
sub = htcondor.Submit("""
# …
MY.OAuthServicesNeeded = "$(item)"
""")
Hi @khurtado , apologies for the belated follow up.
Reading the code and the htcondor thread again, you are right to point out that Jaime used MY.OAuthServicesNeeded in his tests. I am always confused by these classads. For instance, what is the difference between this one and use_oauth_services? Isn't the latter an htcondor macro, while MY.OAuthServicesNeeded is an application-custom ad?
Additionally, I see that in SimpleCondorPlugin we use My.<ad>, while in the list post he used MY.<ad>, is there any difference on this as well?
On what concerns setting it in createSubmitRequest(). Isn't that the common submit object used in a batch submission across any workflows/tasks being submitted? I understand that, if we set it at the submit object level, it will force all jobs being submitted with schedd.submit(sub, jobParams) to inherit the same credential configuration - which is not what we want, as there can be different workflows+tasks in the submission package.
@amaltaro Responses inline
Hi @khurtado , apologies for the belated follow up. Reading the code and the htcondor thread again, you are right to point out that Jaime used MY.OAuthServicesNeeded in his tests. I am always confused with these classads, for instance, what is the difference of this one and use_oauth_services? Isn't the latter an htcondor macro, while the MY.OAuthServicesNeeded an application-custom ad?
Yes, use_oauth_services is a macro, part of the Job Description Language (JDL), which sits at a higher level than the ClassAd Language (they are similar but not the same). In the end, the macro manipulates the job classads and creates OauthServicesNeeded. With MY.OauthServicesNeeded, the ClassAd Language is used to create the classad directly (Cole's first response in the condor support email explains this better). My understanding is that, while MY.<ad> is supposed to be used for custom ads, here we use it to directly manipulate a classad that we know is used for token authentication.
Additionally, I see that in SimpleCondorPlugin we use My.<ad>, while in the list post he used MY.<ad>, is there any difference on this as well?
Yes, this is confusing because the documentation is not very explicit. Basically, MY.<ad> and TARGET.<ad> are treated as attribute names, which are case-insensitive (see below). As far as I remember, the htcondor documentation and python examples used My. in the past, but the newer documentation encourages MY as the convention. There is no technical difference, though I guess using all capitals makes these attribute names easier to read.
https://htcondor.readthedocs.io/en/latest/man-pages/classads.html#attributes
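One practical consequence, sketched below under the assumption that job parameters are collected in a plain Python dict before being handed to HTCondor: Python dict keys are case-sensitive while the attribute names are not, so mixing My. and MY. spellings could set the same attribute twice without any error on the Python side. The helper name find_case_collisions is hypothetical and not part of WMCore or the htcondor bindings.

```python
from collections import defaultdict


def find_case_collisions(ad):
    """Group dict keys that differ only by letter case (hypothetical lint helper)."""
    groups = defaultdict(list)
    for key in ad:
        groups[key.lower()].append(key)
    # Keep only the names that appear under more than one spelling.
    return {name: keys for name, keys in groups.items() if len(keys) > 1}


job_ad = {
    "My.x509userproxy": '"/path/to/proxy"',  # placeholder value for illustration
    "MY.x509userproxy": "",
    "use_oauth_services": "cms",
}
print(find_case_collisions(job_ad))
```

Sticking to a single spelling throughout the plugin avoids the question altogether.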
On what concerns setting it in createSubmitRequest(). Isn't that the common submit object used in a batch submission across any workflows/tasks being submitted? I understand that, if we set it at the submit object level, it will force all jobs being submitted with schedd.submit(sub, jobParams) to inherit the same credential configuration - which is not what we want, as there can be different workflows+tasks in the submission package.
This is where the trick is. Yes, we are setting it for every single job, but setting it to a variable, $(item). When item is set to '$(blank)', it is similar to setting it to an empty string in the Job Description Language, which removes the classad from that specific job in a later processing step.
So, Cole proposed a workaround at the Job Description Language level, but we manipulate the classads via python at the ClassAd Level. Jaime's trick allows us to use Cole's workaround at the ClassAd level. None of this is standard, but it is supported and it seems to work for our specific case/need.
I tested interactively (with a simplified custom case) and it worked in the way we need.
Thank you for elaborating on the differences between JDL and Classad Language (even though I am sure I will have the same question next week).
Onto the suggested solution, note that we need to decide whether we set MY.OAuthServicesNeeded to cms or to an empty string (with the trick above). What I fail to see is how we can submit jobs from diverse workflows/tasks if we have this logic in the createSubmitRequest() method.
Either ALL jobs submitted will use tokens, or none of them.
What I am saying is that to me it is wrong to set it at createSubmitRequest, as it is not a general job submission parameter and it is payload-dependent.
Or with the following code snippet, it might be easier to understand. The common submit dictionary:
    def createSubmitRequest(self, jobList, cmsswMicroArchs=None):
        # using classad language
        sub['MY.OAuthServicesNeeded'] = "$(item)"  # HACK: don't reproduce it anywhere else!
is overridden by the actual job parameter dictionary:
    def getJobParameters(self, jobList, cmsswMicroArchs=None):
        # using JDL for tokens, classad language for x509
        if self.useCMSToken and isJobTokenReady:
            ad['use_oauth_services'] = "cms"
            ad['My.x509userproxy'] = ""
        else:
            ad['My.x509userproxy'] = classad.quote(self.x509userproxy)
            ad['use_oauth_services'] = ""
effectively bypassing any translations between JDL and classad language.
In other words, the submit object only has a placeholder for the actual job parameter.
Is that what you were explaining before? Is it your understanding of what Jaime suggested?
@amaltaro Yes. It is a place holder and the point is to deal with token parameters at the classad language level, using also the $(blank) trick since using "" directly doesn't work (as opposed to the JDL level)
|
Kenyi, as per our discussion in #12463 (comment) , I have implemented the relevant code changes to SimpleCondorPlugin. Can you please review it? If it looks good to you, I will then patch vocms0193 and run some tests FYI @kersevan |
|
test this please |
    ad['use_oauth_services'] = "cms"
    ad['My.x509userproxy'] = ""
else:
    ad['use_oauth_services'] = "$(blank)"
@amaltaro I think we need to define item here instead of use_oauth_services.
E.g.:
    if self.useCMSToken and isJobTokenReady:
        ad['item'] = "cms"
        ad['My.x509userproxy'] = ""
    else:
        ad['item'] = '$(blank)'
Could you remind me if we tested ad['My.x509userproxy'] = ""?
We may need to apply the same trick for this one.
We can likely change the name item to something more meaningful as well (e.g.: tokenParameter?)
If my memory does not fail me, we have already tested My.x509userproxy with an empty value and it works well. Given that the order of job submission matters for this setup, it's probably better to test it in a standalone setup.
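The placeholder approach discussed in this thread can be sketched without the htcondor bindings as follows. This is an illustration under stated assumptions, not the code in SimpleCondorPlugin: the function job_credential_params, the constant SUBMIT_TEMPLATE and the local quote helper (a stand-in for classad.quote) are hypothetical, and the variable keeps the name item used above.

```python
def quote(value):
    """Stand-in for classad.quote: wrap a string in double quotes, escaping as needed."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


# Common submit description shared by every job in the batch. It only holds a
# placeholder; the real value comes from each job's parameters.
SUBMIT_TEMPLATE = {
    "MY.OAuthServicesNeeded": "$(item)",
}


def job_credential_params(use_cms_token, job_token_ready, x509_proxy):
    """Pick exactly one credential type for a single job (hypothetical helper)."""
    if use_cms_token and job_token_ready:
        # Token job: request the cms token and leave the proxy empty.
        return {"item": "cms", "My.x509userproxy": ""}
    # x509 job: $(blank) is the trick from the thread to blank the token attribute.
    return {"item": "$(blank)", "My.x509userproxy": quote(x509_proxy)}


# Two jobs from different workflows in the same submission package.
job_params = [
    job_credential_params(True, False, "/path/to/proxy"),  # placeholder proxy path
    job_credential_params(True, True, "/path/to/proxy"),
]
for params in job_params:
    print(params)
```

The design point is that the credential choice stays in the per-job parameters, so jobs from different workflows and tasks can share one submit object.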
Thanks Kenyi. I made another update to this PR. Please have a look at your convenience. |
|
test this please |
Thanks Alan, it looks good to me. |
|
test this please |
I decided to resume this development and testing, and noticed that the condor credential was no longer available in the schedd: Next, I followed the instructions provided by @khurtado over mattermost, in summary: and now I can list the token stored in the schedd again. |
|
I managed to get two workflows with job submission in the same JobSubmitter polling cycle. One using CMSSW release 12x (hence x509) and the second using 15x (hence OAuth), in this respective order. Inspecting one job for each, here are their relevant classads: and We can observe that |
|
test this please |
After disabling Here is a job that was meant to use x509 credentials and keep token disabled: while this job is meant to use tokens and keep x509 disabled. Where we can make the following two observations:
I need to check with @khurtado and other SI/HTCondor experts to see if there is any problem on this setup. I can confirm though that the workflow based on x509 has succeeded, without |
|
@amaltaro Can you also check the environment variables in the wmagent job logs ? Specifically, this variable: |
|
Log and the reason seems to be that Checking the other workflow that uses an older CMSSW_12x release, I see So we need to figure out why _CONDOR_CREDS is not defined at job runtime. |
|
@amaltaro I think the issue with Change the order here: So that x509userproxy set first, then oauthServicesNeeded. E.g.: In the |
Fixes #12228 (partially)
Status
In development
Description
Evaluate whether a job needs to explicitly disable token credentials or not, based on the CMSSW version and whether its XRootD version has full support for tokens. It relies on two mechanisms for this evaluation:
any CMSSW release smaller than CMSSW_10_6_47 is considered not token ready;
a list of CMSSW releases with an XRootD version (< 5.7.2) that does not have full support for tokens.
This pull request provides 3 artifacts:
cmssw_no_token_support.json
build_cmssw_bad_token.py to search CMSSW releases and build a new JSON file with the not-token-ready releases
Assumption-1: CMSSW releases with an XRootD version greater than or equal to 5.7.2 have full support for tokens.
Assumption-2: any CMSSW release smaller than CMSSW_10_6_47 is considered NOT token-ready (as the CMSSW_4_x series reports xrootd version 5.27.06-cms3 ...)
Assumption-3: any CMSSW release greater than or equal to CMSSW_10_6_47 and not present in cmssw_no_token_support.json is considered to have full token support.
Is it backward compatible (if not, which system it affects?)
Partially
With this patch, assuming that token-based submission is enabled for JobSubmitter (useOauthToken), the agent will start deciding whether it sends an x509 or a token to the job runtime.
Related PRs
None
External dependencies / deployment changes
None
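For reference, the evaluation stated in Assumption-2 and Assumption-3 can be sketched as below. This is a rough illustration, not the WMCore implementation: it assumes cmssw_no_token_support.json boils down to a flat collection of release names, that only the three leading numeric fields of a release name matter for ordering, and the function names and the sample deny-list entry are hypothetical.

```python
import re

MIN_TOKEN_RELEASE = "CMSSW_10_6_47"  # floor taken from Assumption-2


def cmssw_version_key(release):
    """Turn a name like 'CMSSW_15_0_2_patch1' into a comparable tuple (15, 0, 2)."""
    match = re.match(r"CMSSW_(\d+)_(\d+)_(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised CMSSW release name: {release!r}")
    return tuple(int(field) for field in match.groups())


def is_token_ready(release, no_token_releases, floor=MIN_TOKEN_RELEASE):
    """Apply the release floor first, then the list of not-token-ready releases."""
    if cmssw_version_key(release) < cmssw_version_key(floor):
        return False
    return release not in no_token_releases


# Hypothetical deny-list content, standing in for the JSON file.
no_token = {"CMSSW_12_4_0"}
for name in ("CMSSW_9_4_0", "CMSSW_12_4_0", "CMSSW_15_0_2"):
    print(name, is_token_ready(name, no_token))
```

A job requesting several releases would presumably need every one of them to pass this check before tokens are used, but that policy is not spelled out here.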