Skip to content

Code refactoring to reduce memory footprint#12385

Open
vkuznet wants to merge 4 commits intodmwm:masterfrom
vkuznet:fix-issue-12200
Open

Code refactoring to reduce memory footprint#12385
vkuznet wants to merge 4 commits intodmwm:masterfrom
vkuznet:fix-issue-12200

Conversation

@vkuznet
Copy link
Copy Markdown
Contributor

@vkuznet vkuznet commented Jun 10, 2025

Fixes #12200

Status

not-tested

Description

Provide code refactoring to reduce memory footprint

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 6 warnings
    • 47 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/762/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 6 warnings
    • 62 comments to review
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/763/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 6 warnings
    • 62 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/764/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 1 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 57 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/765/artifact/artifacts/PullRequestReport.html

Copy link
Copy Markdown
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Copy link
Copy Markdown
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, the code itself looks okay to me. However, the improvements to memory footprint will be negligible. In the current state of the system, it improves memory consumption by 0.01% only, as it still loads all requests matching a given request status from:

reqStatus = ['announced', 'aborted-completed', 'rejected']

The unit test probably needs updating, hence my request for changes here.

def testRunning(self):
result = self.msRuleCleaner._execute(self.reqRecords)
self.assertEqual(result, (3, 2, 0, 0))
def testMemoryLeak(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably didn't want to persist this memory leak investigation as a unit test (?)

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

Valentin, the code itself looks okay to me. However, the improvements to memory footprint will be negligible. In the current state of the system, it improves memory consumption by 0.01% only, as it still loads all requests matching a given request status from:

reqStatus = ['announced', 'aborted-completed', 'rejected']

The unit test probably needs updating, hence my request for changes here.

Alan, how to you come up to your conclusion and in particular the number you quote? My understanding that memory now is reused due to this loop:

            for status in reqStatus:
                req = self.getRequestRecords(status)
                wflow = MSRuleCleanerWflow(req)

Yes, we loop over all request statuses, but within the loop we re-use req and wflow objects with respect to previous implementation where we load all requests into list and pass it around.

I would appreciate if you provide your view on where memory is spent that it would be easier to understand your request.

@amaltaro
Copy link
Copy Markdown
Contributor

@vkuznet the previous _execute() method was already reusing the req and wflow object, so I see no gain with the proposed code changes in that perspective.

The only gain is that now we load requests for a single status, instead of loading requests for all eligible statuses (3 in total).
Nonetheless, as mentioned in my previous comment, out of those 3 statuses, most of the time 1 of them correspond to +99% of the requests to be handled, hence minimizing the gains.

In order to really solve the memory footprint, we would have to control how many requests we can fetch from ReqMgr2/CouchDB, as implemented in getRequestRecords(). CouchDB itself does not provide data streaming.

There are a few ways that we could handle this, but none of them look simple and/or robust to me. For instance:
a) we could use the offset parameter to fetch data in slices, but given that this requires multiple HTTP requests, it would become tricky to properly slice the workflows while ensuring that no workflows are left out.
b) we could first discover the workflow names, then retrieve each one of them (or in bulk of XX workflows)
c) we could use WMStats and only fetch the relevant information.

All of them have tradeoffs though. Right now, I suspect b) would be a good commitment.

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

Alan, thank you for providing more details. The current code used two nested for loops, in one it captured all requests in a dict (requestRecords.update(self.getRequestRecords(status)) ) and in second for loop (within _execute) it used its values to construct wflow . I eliminated one dictionary requestRecords which we updated and pass around, and it eliminates nested for loop. As each dictionary allocates memory exponentially with nested structures I don't know if it accounts to 0.01% as you claimed. My question is how did you measure it?

That said, from your details you claim that majority of memory allocation comes from getRequestRecords(). If all requests objects (dictionaries) are allocated by this API then indeed we may not gain much while we create another dictionary out of them and pass it around as memory allocation maybe reused in this case.

I'll have a look into options you propose, but please clarify further how to discover the workflow names as you propose.

@amaltaro
Copy link
Copy Markdown
Contributor

@vkuznet I only did a ratio of max(workflows in (a,b,c)) / sum(workflows in (a,b,c)

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

Alan, in order to proceed I need more information on the following topics:

  • which CouchDB database ReqMgr2 data service is using to fetch workflows with a specific status, e.g. /reqmgr2/requests?status=new this call goes where?
  • the issue is that above reqmgr2 call fetches everything, i.e. full document and compose them in a dictionary which we know will blow out memory in a client when it parses it, therefore
  • I'm looking for two HTTP calls to CouchDB:
    • to fetch ONLY workflow names and no details, so the question is which CouchDB and appropriate view to use to get workflows in a specific state.
    • once we know particular CouchDB doc id which HTTP cal to use to fetch info which is provided by getRequestRecords(status)

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

While awaiting for Alan's response, I made deep dive into WM code to answer my questions. Here is what I found:

  • reqmgr2 queries (via cache class) /wmstatsserver/data service which is data-service in front of CouchDB
  • the CoucDB database is called reqmgr_workload_cache, e.g.
scurl https://xxx.cern.ch/couchdb/reqmgr_workload_cache/_all_docs

will return all workflows names as a keys, such that we can get full document via:

  • /couchdb/reqmgr_workload_cache/wflow-name end-point
scurl https://xxx.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_task_PPD-Run3Winter24GS-00015__v1_T_250506_143256_3616
  • each returned document is a dictionary which contains RequestStatus.
  • in testbed we have almost 30K workflows, in production this number is 1.2M
  • then, I looked all existing couchdb views:
scurl -s "https://xxx.cern.ch/couchdb/reqmgr_workload_cache/_all_docs?startkey=\"_design/\"&endkey=\"_design0\""

which brings me different erlang maps, among them the bystatus one

  • Using bystatus map for a specific state, e.g. running-open we can fetch only desired information like workflows (request) names, e.g.
scurl -s "https://xxx.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?key=%22running-open%22&include_docs=false" | jq .
{
  "total_rows": 29435,
  "offset": 29426,
  "rows": [
    {
      "id": "cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-Backfill-00006__v1_T_250508_114352_1105",
      "key": "running-open",
      "value": 1746714763
    },
    {
      "id": "cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-Backfill-00006__v1_T_250521_193936_648",
      "key": "running-open",
      "value": 1748011460
    },
    {
      "id": "ireid_TC_Backfill_IDR_CMS_Multi_250616_130703_1083",
      "key": "running-open",
      "value": 1750083258
    }
  ]
}

This output is quite small and will not lead to memory blowup. We can use this view to get list of all workflows in a specific state and then proceed with their documents.

@amaltaro , can you agree on such approach? What I propose is the following:

  • use directly CouchDB call with bystatus view which gives us list of workflows in a specific state
  • loop over provided workflows to perform code logic

I anticipate that memory footprint will be small in this proposal. To confirm that currently we have in production instance:

# announced state
scurl -s "https://xxx.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?key=%22announced%22&include_docs=false" | jq '.rows[].id' | wc -l
9552

# and similar call in rejected (1) and aborted-completed (2)

So, at most we need to load 10K strings in a code (not dictionaries) and them loop over them to load individual workflows document.

@amaltaro
Copy link
Copy Markdown
Contributor

@vkuznet it looks like my answer is no longer needed. And I am glad you figured these things out.

I do want to make a couple of remarks though:

  1. please do not query CouchDB directly. We should use it only if other services (ReqMgr2, WMStats) don't provide a REST API for that. CouchDB direct calls are harder to track, so we better go through central services.
  2. if you read the method I mentioned above, you will see it being used in this format:
        result = self.reqmgr2.getRequestByStatus([reqStatus], detail=True)

where detail is the answer you are looking for. Please explore the self.reqmgr2 object to make yourself familiar with other methods available (I am almost sure there is one that you can provide a list of workflows that you want to retrieve).

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

Here is how much memory will be used using the approach I provided:

import requests
import tracemalloc
import json
import os

COUCHDB_URL = "https://xxx.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus"
STATE = "announced"  # The key to filter by
PARAMS = {
    "key": json.dumps(STATE),  # JSON-encode the key value
    "limit": 10000  # Optional: limit result size
}
CLIENT_CERT = os.getenv("HOME")+"/.globus/usercert.pem"
CLIENT_KEY = os.getenv("HOME")+"/.globus/userkey.pem"
CA_CERT = "ca_cert.pem"

tracemalloc.start()

# request logic to fetch all workflows names within given state
response = requests.get(COUCHDB_URL, params=PARAMS, cert=(CLIENT_CERT, CLIENT_KEY),verify=CA_CERT )
response.raise_for_status()
data = response.json()
ids = [row["id"] for row in data.get("rows", [])]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Fetched {len(ids)} document IDs.")
print(f"Current memory usage: {current / 1024:.2f} KB")
print(f"Peak memory usage: {peak / 1024:.2f} KB")

and the output is this:

Fetched 9553 document IDs.
Current memory usage: 5685.05 KB
Peak memory usage: 6740.40 KB

So, for all 10K request in announced state we'll use approximately 1KB of RAM

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

@vkuznet it looks like my answer is no longer needed. And I am glad you figured these things out.

I do want to make a couple of remarks though:

  1. please do not query CouchDB directly. We should use it only if other services (ReqMgr2, WMStats) don't provide a REST API for that. CouchDB direct calls are harder to track, so we better go through central services.
  2. if you read the method I mentioned above, you will see it being used in this format:
        result = self.reqmgr2.getRequestByStatus([reqStatus], detail=True)

where detail is the answer you are looking for. Please explore the self.reqmgr2 object to make yourself familiar with other methods available (I am almost sure there is one that you can provide a list of workflows that you want to retrieve).

This is how I found all details :) Said that, currently we do not expose detail parameter in reqmgr2/WMStats APIs and making changes will require changing their APIs

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

@vkuznet it looks like my answer is no longer needed. And I am glad you figured these things out.
I do want to make a couple of remarks though:

  1. please do not query CouchDB directly. We should use it only if other services (ReqMgr2, WMStats) don't provide a REST API for that. CouchDB direct calls are harder to track, so we better go through central services.
  2. if you read the method I mentioned above, you will see it being used in this format:
        result = self.reqmgr2.getRequestByStatus([reqStatus], detail=True)

where detail is the answer you are looking for. Please explore the self.reqmgr2 object to make yourself familiar with other methods available (I am almost sure there is one that you can provide a list of workflows that you want to retrieve).

This is how I found all details :) Said that, currently we do not expose detail parameter in reqmgr2/WMStats APIs and making changes will require changing their APIs

@amaltaro , if you insists of using reqmgr/WMStats please specify which REST end-point we can use. What we have is Web end-point, e.g. https://xxx.cern.ch/reqmgr2/requests?status=new and it returns HTML, even if I ask for json, e.g. scurl -H "Accept: application/json" "https://xxx.cern.ch/reqmgr2/requests?status=new". So, it does use the self.reqmgr2 object and get data view self.reqmgr2.getRequestByStatus but it neither provides REST end-point or accepts detail. Once I know which REST end-point to use with detail=false then of course I'll be happy to use it. So far I didn't find that such exist.

@amaltaro
Copy link
Copy Markdown
Contributor

I think you can use getRequestByStatus() from:

def getRequestByStatus(self, statusList, detail=True):

followed by getRequestByNames() to retrieve the actual documents:

def getRequestByNames(self, names):

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 16, 2025

Alan, I know these methods but there are few objections of not using them in this ticket context:

  • this is not REST end-point, e.g. I need to call self.reqmgr2 object which by itself fetches results via cache object, and it is unclear how cache works, does it take into account HTTP parameters or not. In other words we need to make call to /requests?status=<status>&detail=false. The question here how the call will be perform and stored.
  • Since the reqmgr2 object is bound to cache the results will be stored first to the cache and then fetched. If cache does not distinguish between details=true and details=false there is no point of using it since results will take full dictionaries instead of returned only workflow names.
  • The results of getRequestByStatus is a dictionary, and not a list, which will have different memory impact, again details depends on how results stored and retrieved from the internal cache object.

My point is that here we are trying to reduce memory footprint but the memory usage is based on implementation of self.reqmgr2 object itself. And, I fear that its usage may not lead to desired outcome. Instead, the method I proposed via direct call to CouchDB has well measured memory footprint, only 1KB, see #12385 (comment) What is impact and parts of using self.reqmgr2 is not clear and cannot be easily measure because this object is shared across many APIs and bound to the cache.

Therefore, before changing code to use or not self.reqmgr2 object with its cache I need to know how it will affect the memory footprint. Please provide your input on this, and better we need to concrete (stand-alone) measurement that its usage does not contribute to memory footprint of the service.

@anpicci
Copy link
Copy Markdown
Contributor

anpicci commented Jun 17, 2025

@vkuznet @amaltaro , thank you for this discussion:

I have some comments:

  1. Alan, is there a motivation not to make a direct call to CouchDB? "Direct calls are harder to track": do we have the need to track them (sorry for the naive question)? Or is there also a risk for DDoS?

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: failed
    • 51 new failures
    • 1 tests deleted
    • 1 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 57 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/778/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Copy Markdown
Contributor

Alan, is there a motivation not to make a direct call to CouchDB? "Direct calls are harder to track": do we have the need to track them (sorry for the naive question)? Or is there also a risk for DDoS?

The motivation is that we cannot track which user is making requests to CouchDB, we do not have the usual kube eagle monitoring as well.

Besides, as I already mentioned multiple times in my replies above, we have ReqMgr2 client and REST APIs that we can use for that. There is no need to reinvent anything.

@vkuznet
Copy link
Copy Markdown
Contributor Author

vkuznet commented Jun 17, 2025

I committed the code which relies on reqmgr2 APIs, but I also pointed out that these APIs involves caching and asked several questions how this cache is handled. I also pointed out that within context of memory allocation we may delegated it to reqmgr2 APIs where cache objects resides and I don't know how it will play out in a global scope.

That said, I suggest that you preliminary review my changes and decide if this is correct approach, it uses chunks of requests ids and process them sequentially. For instance for 10K requests, with 100 chunks, and assuming 1sec time to fetch the docs we will need to spend 100sec per iteration. This is due to sequential processing. The larger chunks will lead to larger memory footprints, but smaller chunks to larger overall time. The only way out of it is to introduce the parallel processing. But this is a second round of improvements if we agree on proposed approach with reqmgr2 APIs.

@vkuznet vkuznet requested review from amaltaro and todor-ivanov June 18, 2025 16:25
Copy link
Copy Markdown
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vkuznet , The code looks good to me

@dmwm-bot
Copy link
Copy Markdown

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 1 tests added
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 58 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/791/artifact/artifacts/PullRequestReport.html

Copy link
Copy Markdown
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just getting this review out of the way.
As discussed a couple of weeks ago, we decided to put this development on hold/waiting and we will probably not resume it this quarter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce microservices memory footprint

5 participants