wiregrid_encoder: Fix potential race condition #1043
Conversation
All of the data containers are now initialized at the beginning of the while loop in take_data. I moved the initialization from where @mhasself's modification had placed it.
@mhasself Thanks for your investigation. I just moved the data container initialization to the beginning of the while loop (and I also added initialization for all of the data containers). I hope this matches the change you had in mind.
ykyohei
left a comment
wait, are we sure that this worked after the latest changes..?
I see some syntax errors.
Thanks @ykyohei. I had not configured socs correctly on my system. I'm testing again after fixing the syntax errors.
I tested again, and it worked fine.
Thanks for investigating further! I think that moving the structure init to the top of the loop has a bad side effect -- the session.data will almost always be empty. You could move the updates to
@mhasself Thank you for your comments. I'm not familiar with session.data.
@mhasself, thank you for your explanation. Could you check it again?
Looks good! I'll mark this ready for review and @BrianJKoopman can make the final merge. |
BrianJKoopman
left a comment
Looks good to me. Good find!


Description
Creates a new container structure for each bundle of data published to feeds.
Prior to this change, the dicts were re-used for multiple calls to publish_to_feed. Because the acq process runs in a thread and the published data is consumed later in the reactor, asynchronously to that thread, a re-used dict can be modified before it is consumed, which leaves the data vulnerable to corruption.
This is a general problem with publishing data from threads -- it should always be assembled into a unique container instance, published, and then never modified again (other than to delete it or replace it in the thread namespace). Although ocs/ocs_feed tries to "copy" the message and pass the copy to the reactor, it's not a deepcopy so that's actually pretty pointless for our standard Block way of publishing data to a feed.
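A minimal sketch of the difference, using only the standard library. The `publish` function below is a hypothetical stand-in that queues a shallow copy for later consumption; it is not the actual ocs feed code, and the dict layout only loosely imitates a block of timestamps and data.

```python
import copy


def publish(queue, message):
    """Hypothetical stand-in for a feed publish call.

    Queues a shallow copy of the message; the consumer reads it later.
    """
    queue.append(copy.copy(message))


def acquire_reused(n_bundles):
    """Anti-pattern: one container is cleared and refilled for every bundle."""
    queue = []
    block = {"timestamps": [], "data": {"count": []}}
    for i in range(n_bundles):
        block["timestamps"].clear()
        block["data"]["count"].clear()
        block["timestamps"].append(float(i))
        block["data"]["count"].append(i)
        publish(queue, block)
    return queue


def acquire_fresh(n_bundles):
    """Recommended pattern: build a new container per bundle, then leave it alone."""
    queue = []
    for i in range(n_bundles):
        block = {"timestamps": [float(i)], "data": {"count": [i]}}
        publish(queue, block)
    return queue


reused = acquire_reused(3)
fresh = acquire_fresh(3)

# The shallow copies still share the inner lists, so every queued
# message from the reused container shows only the last bundle.
print([m["timestamps"] for m in reused])  # [[2.0], [2.0], [2.0]]
print([m["timestamps"] for m in fresh])   # [[0.0], [1.0], [2.0]]
```

In this sketch the consumer only runs after the loop finishes, which exaggerates the effect. In a real agent the outcome depends on timing between the thread and the reactor, which is what makes the problem intermittent.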
Motivation and Context
Memory leaks have been observed for this agent, and are associated with rapid logging of message:
I theorize that these messages come from the feed data publishing, which runs in the reactor but is initiated from the acq thread, and are ultimately due to inconsistent data getting into a Block. Specifically, vectors of different lengths in the block are not checked on "append" (when the acq process calls publish), but they are checked, and will raise an error, just prior to encoding and publishing to crossbar. That latter step happens in the reactor.
Note that this:
called from inside an OCS Agent reactor, will produce the exact "while calling from thread" error message that we are seeing, strongly suggesting it's originating from the reactor -> thread -> reactor arrangement, which is only really used in the ocs_agent.publish code.
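The deferred length check theorized above can be illustrated with a small sketch. `SketchBlock` is a hypothetical class written for this example, not the real ocs Block, and its method names are assumptions; it only imitates the described behavior of accepting data without validation on append and validating just before encoding.

```python
class SketchBlock:
    """Hypothetical stand-in for a feed block; not the real ocs class."""

    def __init__(self):
        self.timestamps = []
        self.data = {}

    def append(self, message):
        # No consistency check here, mirroring the behavior described above.
        self.timestamps.extend(message["timestamps"])
        for key, values in message["data"].items():
            self.data.setdefault(key, []).extend(values)

    def encode(self):
        # Lengths are only validated here, just before the data would be sent.
        for key, values in self.data.items():
            if len(values) != len(self.timestamps):
                raise ValueError(
                    f"{key}: {len(values)} samples for "
                    f"{len(self.timestamps)} timestamps"
                )
        return {
            "timestamps": list(self.timestamps),
            "data": {k: list(v) for k, v in self.data.items()},
        }


block = SketchBlock()
# A message caught mid-update: two timestamps but only one sample.
block.append({"timestamps": [0.0, 1.0], "data": {"count": [10]}})

try:
    block.encode()
    encode_error = None
except ValueError as err:
    # The failure surfaces at encode time, far from the code that built the message.
    encode_error = str(err)

print(encode_error)  # count: 1 samples for 2 timestamps
```

The point of the sketch is that the error is raised where the data is encoded rather than where it was assembled, which would be consistent with the failure appearing in the reactor instead of the acq thread.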
How Has This Been Tested?
This has not been tested! This is draft pending discussion with agent devs and possibly testing at site.