Add error handling for job submission and micro-architecture fetching #12420

Open
hassan11196 wants to merge 1 commit into dmwm:master from hassan11196:jobsubmitter-fix
Conversation

@hassan11196
Member

Fixes #12419

Status

not-tested

Description

Handle the exception thrown by the TagCollector defaultMicroArchVersionNumberByRelease method, which is called from SimpleCondorPlugin, when communication with cmssdt.cern.ch fails.
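
A minimal sketch of the intended behavior, not the actual WMCore code: `FakeTagCollector` and `fetchMicroArchs` are hypothetical names introduced here for illustration, and the stand-in simply raises to mimic an unreachable cmssdt.cern.ch. Only the method name `defaultMicroArchVersionNumberByRelease` comes from this PR.

```python
import logging


class FakeTagCollector:
    """Hypothetical stand-in for TagCollector that always fails,
    mimicking a connection problem with cmssdt.cern.ch."""

    def defaultMicroArchVersionNumberByRelease(self):
        raise ConnectionError("Failed to connect to cmssdt.cern.ch port 443")


def fetchMicroArchs(tc):
    """Return the release-to-microarchitecture data, or None if the
    lookup raises, so the caller's thread keeps running."""
    try:
        return tc.defaultMicroArchVersionNumberByRelease()
    except Exception as ex:
        # Log the full traceback, then let the caller decide what to do
        logging.exception("Failed to fetch micro-architecture data: %s", ex)
        return None


result = fetchMicroArchs(FakeTagCollector())
```

Returning `None` here is an assumption of the sketch; the caller still has to decide how to treat the jobs that needed this information.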

Is it backward compatible (if not, which system it affects?)

Yes

Related PRs

N/A

External dependencies / deployment changes

No

@hassan11196
Member Author

Patched the draining agent vocms0252; the JobSubmitter thread no longer dies because of the exception:

2025-07-31 16:15:19,881:140231827257024:INFO:JobSubmitterPoller:Have 1000 jobs to submit.
2025-07-31 16:15:19,881:140231827257024:INFO:JobSubmitterPoller:Done assigning site locations.
2025-07-31 16:17:31,937:140231827257024:ERROR:SimpleCondorPlugin:Failed to create submit request for 200 jobs
2025-07-31 16:17:31,937:140231827257024:ERROR:SimpleCondorPlugin:(28, 'Failed to connect to cmssdt.cern.ch port 443 after 131668 ms: Could not connect to server')
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 167, in submit
    (sub, jobParams) = self.createSubmitRequest(jobsReady)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 707, in createSubmitRequest
    jobParameters = self.getJobParameters(jobList)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 513, in getJobParameters
    rel_microarchs = self.tc.defaultMicroArchVersionNumberByRelease()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 125, in defaultMicroArchVersionNumberByRelease
    for row in self.data():
               ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 81, in data
    data = self._getResult()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/TagCollector/TagCollector.py", line 66, in _getResult
    f = self.refreshCache(cFile, callname, args, encoder=encoder, decoder=decodeBytesToUnicode,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 218, in refreshCache
    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType, binary=binary)
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Service.py", line 318, in getData
    data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/WMCore/Services/pycurl_manager.py", line 338, in request
    curl.perform()
pycurl.error: (28, 'Failed to connect to cmssdt.cern.ch port 443 after 131668 ms: Could not connect to server')
2025-07-31 16:17:31,940:140231827257024:ERROR:SimpleCondorPlugin:Moving on the the next batch of jobs and/or cycle....

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 51 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/916/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 57 tests no longer failing
    • 9 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 51 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/917/artifact/artifacts/PullRequestReport.html

anpicci requested a review from amaltaro on October 7, 2025 13:58
Contributor

amaltaro left a comment

Thank you for creating this patch, @hassan11196.
I left an inline comment on the code that must be addressed before moving forward.

In addition, I feel like we should do one of the following:

  • drop this in favor of #12449 (which will actually use the TagCollector cache)
  • merge these changes into #12449
  • or, vice versa, merge the changes in #12449 into this one

logging.error("Failed to create submit request for %d jobs", len(jobsReady))
logging.exception(str(ex))
logging.error("Moving on the the next batch of jobs and/or cycle....")
return successfulJobs, failedJobs
Contributor

Given that these two variables can be empty at this point (or may not yet be tracking the jobs still to be submitted), I think this will cause issues upstream.
The best approach would be to iterate over all the remaining jobs and add them to the failedJobs variable before returning it, similar to what is done in the exception block below.
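
As a rough sketch of that suggestion: `submitBatch` and its `createSubmitRequest` argument are simplified, hypothetical stand-ins rather than the real SimpleCondorPlugin method, and any per-job bookkeeping done by the existing exception block is not reproduced here.

```python
import logging


def submitBatch(jobsReady, createSubmitRequest):
    """Hypothetical, simplified version of the submit step.

    If building the submit request fails, every job in the batch is
    recorded as failed instead of being silently dropped.
    """
    successfulJobs = []
    failedJobs = []
    try:
        sub, jobParams = createSubmitRequest(jobsReady)
    except Exception as ex:
        logging.error("Failed to create submit request for %d jobs", len(jobsReady))
        logging.exception(str(ex))
        # Track each job that could not be submitted so the caller sees them
        for job in jobsReady:
            failedJobs.append(job)
        return successfulJobs, failedJobs

    # Assumption for the sketch: the request is submitted elsewhere and
    # all jobs in the batch are then treated as successful.
    successfulJobs.extend(jobsReady)
    return successfulJobs, failedJobs
```

With this shape, a failing request yields an empty `successfulJobs` and a `failedJobs` list containing the whole batch, so upstream code can account for every job.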

Development

Successfully merging this pull request may close these issues.

JobSubmitter crashing due to issues with cmssdt.cern.ch

3 participants