Skip to content

wiregrid_encoder: Fix potential race condition#1043

Merged
BrianJKoopman merged 10 commits intomainfrom
mhasself/wg-enc
Apr 28, 2026
Merged

wiregrid_encoder: Fix potential race condition#1043
BrianJKoopman merged 10 commits intomainfrom
mhasself/wg-enc

Conversation

@mhasself
Copy link
Copy Markdown
Member

Description

Creates a new container structure for each bundle of data published to feeds

Prior to this change, the dicts were re-used for multiple calls to publish_to_feed. Since the acq process runs in a thread, the data will be consumed in the reactor, asynchronous to the thread, and thus is vulnerable to corruption.

This is a general problem with publishing data from threads -- it should always be assembled into a unique container instance, published, and then never modified again (other than to delete it or replace it in the thread namespace). Although ocs/ocs_feed tries to "copy" the message and pass the copy to the reactor, it's not a deepcopy so that's actually pretty pointless for our standard Block way of publishing data to a feed.

Motivation and Context

Memory leaks have been observed for this agent, and are associated with rapid logging of message:

2026-04-23T07:28:33+0000 while calling from thread
2026-04-23T07:28:34+0000 Recieved IRIG timeout packet.
2026-04-23T07:28:34+0000 while calling from thread
2026-04-23T07:28:34+0000 while calling from thread
2026-04-23T07:28:34+0000 while calling from thread
2026-04-23T07:28:34+0000 while calling from thread
2026-04-23T07:28:35+0000 while calling from thread
2026-04-23T07:28:35+0000 while calling from thread		
2026-04-23T07:28:36+0000 while calling from thread

I theorize that these messages come from the feed data publishing, which runs in the reactor but initiated from the acq thread, and are ultimately due to inconsistent data getting into a Block. Specifically vectors of different lengths, in the block, are not checked on "append" (when the acq process calls publish) but are checked and will raise an error just prior to encoding and publishing to crossbar. That latter thing happens in the reactor.

Note that this:

        def A():
            assert 1 == 2
        def B():
            reactor.callFromThread(A)
        d = threads.deferToThread(B)

called from inside an OCS Agent reactor, will produce the exact "while calling from thread" error message that we are seeing, strongly suggesting it's originating from the reactor -> thread -> reactor arrangement, which is only really used in the ocs_agent.publish code.

How Has This Been Tested?

This has not been tested! This is draft pending discussion with agent devs and possibly testing at site.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@mhasself
Copy link
Copy Markdown
Member Author

Let me know of any concerns on this, or who to ask ... @sadachi5 @ykyohei

sadachi5 and others added 2 commits April 24, 2026 09:24
All of the data containers are initialized at the beginning of the while loop of take_data. I moved the place from @mhasself 's modification.
@sadachi5
Copy link
Copy Markdown
Contributor

@mhasself Thanks for your investigation. I just moved the data container initialization location to the beginning of the while loop (and I also added initialization of all of the data containers). I hope it is the same thing you want to modify.

@sadachi5
Copy link
Copy Markdown
Contributor

I just tested this modification using SAT-LF wiregrid in Japan. It can work as usual.

image

Copy link
Copy Markdown
Contributor

@ykyohei ykyohei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, are we sure that this worked after the latest changes..?
I see some syntax errors.

Comment thread socs/agents/wiregrid_encoder/agent.py Outdated
Comment thread socs/agents/wiregrid_encoder/agent.py Outdated
@sadachi5
Copy link
Copy Markdown
Contributor

Thanks @ykyohei . I missed configuring the socs in my system. I'm testing again after fixing the syntax errors.

@sadachi5
Copy link
Copy Markdown
Contributor

I tested again. That was fine.

@mhasself
Copy link
Copy Markdown
Member Author

Thanks for investigating further!

I think that moving the structure init to the top of the loop has a bad side effect -- the session.data will almost always be empty.

You could move the updates to enc_field_dict and irig_field_dict into the if len(self.parser. ... ): blocks to avoid that and keep a useful session.data.

@sadachi5
Copy link
Copy Markdown
Contributor

sadachi5 commented Apr 24, 2026

Thanks for investigating further!

I think that moving the structure init to the top of the loop has a bad side effect -- the session.data will almost always be empty.

You could move the updates to enc_field_dict and irig_field_dict into the if len(self.parser. ... ): blocks to avoid that and keep a useful session.data.

@mhasself Thank you for your comments.

I'm not familiar with the session.data.
What is the purpose? If the enc_field_dict is not updated to be empty, the old data is stored in the session.data. Is it correct behaviour?

@sadachi5 sadachi5 closed this Apr 24, 2026
@sadachi5 sadachi5 reopened this Apr 24, 2026
@mhasself
Copy link
Copy Markdown
Member Author

I'm not familiar with the session.data. What is the purpose? If the enc_field_dict is not updated to be empty, the old data is stored in the session.data. Is it correct behaviour?

The session.data is a block of information exposed by a task or process, for the consumption (by polling) of external entities. The three main uses of session.data are:

  1. To support ocs-web in presenting useful information in a panel (wiregrid doesn't use this, but e.g. ACU does).
  2. To support other agents in learning what this agent is doing, what the status of the hardware is, etc. (It's easier to consume than feed information. This is how HWPSupervisor gets information from other agents.)
  3. To help a developer check the health / status of an agent by inspecting some of the data coming through it. I use this a lot, even when (1) and (2) aren't in use!

Here's the session.data view on SATP2 right now (if you click on the acq "Status" in ocs-web, then click the Data tab, you can see it):
image

So I was using that to see, in this case, that no IRIG packets were coming in, but lots of encoder packets were coming in as expected.

I might misunderstand the timing of the agent, in practice (because I don't know how often self.parser.grab_and_parse_data returns data, and how much data is in there when it does). But strictly analyzing the logic of the code, here is what happens to enc_rdata:

  1. enc_rdata is cleared at the top of each loop iteration
  2. enc_rdata is only updated when the enc_rdata is published -- i.e. when new data has been received and enough data / time has passed that it triggers the publishing block.
  3. enc_rdata is copied into enc_field_dict on every loop iteration.
  4. enc_field_dict is copied into session data on every loop iteration.

So if (2) happens less often than (1+3+4), which I suspect it does, then sometimes the session.data will contain the empty data structure. That's going to make it a noisy and annoying data source for anyone to look at.

It might seem like session.data isn't that useful, for this Agent ... but right now we're debugging a memory leak issue, and diagnostic information like this is actually very helpful to someone like me, who does not have access to a full working development system! So I'd prefer that we not remove any features ... if anything, I'd like to see more logging / information in session.data.

Thanks and sorry for such a long message :)

@sadachi5
Copy link
Copy Markdown
Contributor

@mhasself, thank you for your explanation.
I’ve revised it again based on your suggestion.
Additionally, I moved the updates of irig_field_dict and encoder_field_dict inside the if-block so that they are only updated when the data changes, rather than on every iteration.

Could you check it again?

@mhasself
Copy link
Copy Markdown
Member Author

Looks good! I'll mark this ready for review and @BrianJKoopman can make the final merge.

@mhasself mhasself marked this pull request as ready for review April 27, 2026 14:17
@mhasself mhasself requested a review from BrianJKoopman April 27, 2026 16:38
Copy link
Copy Markdown
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Good find!

@BrianJKoopman BrianJKoopman merged commit 1248ff0 into main Apr 28, 2026
15 checks passed
@BrianJKoopman BrianJKoopman deleted the mhasself/wg-enc branch April 28, 2026 14:22
@BrianJKoopman BrianJKoopman changed the title wiregrid_encoder: fix potential race condition wiregrid_encoder: Fix potential race condition Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants