DEBT: Enforce thread-safety for ONNX opset schema API #5291
Signed-off-by: Yu Cong <congyc@amazon.com>
Thank you for proposing this! Basically looks good to me. It will still need approval from @onnx/sig-operators-approvers.
@q-ycong-p is it possible to test this lock manually in your runtime to confirm that it works as expected?
I wrote a small unit test and ran with Below is a snippet of the test case,
Thank you for testing it! May I understand what kind of issues you have bumped into? Like an error message?
Hi @jcwchen, I'm attaching the valgrind report below for reference. It should be reproducible with the snippet above.
@@ -1142,6 +1142,11 @@ class OpSchemaRegistry final : public ISchemaRegistry {
        bool fail_duplicate_schema = true) {
      ONNX_TRY {
        op_schema.Finalize();

        // Acquires lock to thread-guard schema map access
        auto* registry = OpSchemaRegistry::Instance();
I think every access to the map returned by OpSchemaRegistry::Instance() should be protected against concurrent access. The function GetMapWithoutEnsuringRegistration is called by another function that is not protected by the mutex. Maybe the data race comes from there. It might be safer to create a dedicated class that holds the schema map and protects all access to it, instead of hunting down every place the container is accessed and protecting each one.
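The dedicated-class idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `GuardedSchemaMap`, `FakeSchema`, and `InsertConcurrently` are hypothetical names invented here, not part of ONNX, and the real schema map is keyed differently. The point is that the container and its mutex live in one class, so no caller can reach the map without taking the lock.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real schema type; not the ONNX OpSchema.
struct FakeSchema {
  std::string name;
  int since_version;
};

// Owns the map and its mutex; callers never see the raw container.
class GuardedSchemaMap {
 public:
  // Returns false if the name is already registered.
  bool Insert(const FakeSchema& schema) {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.emplace(schema.name, schema).second;
  }

  // Returns a copy so that no reference into the map escapes the lock.
  std::optional<FakeSchema> Find(const std::string& name) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(name);
    if (it == map_.end()) return std::nullopt;
    return it->second;
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.size();
  }

  void Clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    map_.clear();
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, FakeSchema> map_;
};

// Inserts `per_thread` distinct schemas from each of `num_threads` threads
// and returns the final size of the map.
std::size_t InsertConcurrently(GuardedSchemaMap& schemas, int num_threads,
                               int per_thread) {
  assert(num_threads >= 0 && per_thread >= 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&schemas, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        schemas.Insert({"op_" + std::to_string(t) + "_" + std::to_string(i), 1});
      }
    });
  }
  for (auto& w : workers) w.join();
  return schemas.Size();
}
```

`Find` returns a copy rather than a pointer or reference. This is a deliberate choice in the sketch: a reference handed out under the lock could dangle once another thread deregisters the entry.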
Where is the function called not protected by the mutex? I assume that this is a dynamic race detection in the given unit-test, where only registration/deregistration are called in parallel, both of which seem to invoke the function under a mutex?
GetMapWithoutEnsuringRegistration is used in other places to access the registered schemas. If a schema is removed while another function is counting the number of schemas, the reader could observe an inconsistent state. It is unlikely, but it can happen in multithreaded scenarios. Every call to GetMapWithoutEnsuringRegistration should be protected with a lock.
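To make the count-versus-removal scenario concrete, here is a minimal sketch in which readers also take the lock. It assumes a reader/writer lock (`std::shared_mutex`, C++17), which is a choice made for this illustration and not what the PR implements; `SchemaStore` and `CountsAreConsistent` are hypothetical names. Because both bulk operations hold the lock exclusively, a concurrent count can only ever see an empty map or a fully registered one.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical store, not ONNX code: writers take the lock exclusively,
// readers share it, so a count never overlaps a bulk update.
class SchemaStore {
 public:
  // Registers `n` placeholder entries as one atomic step.
  void RegisterBatch(int n) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    for (int i = 0; i < n; ++i) {
      map_["op_" + std::to_string(i)] = i;
    }
  }

  // Removes every entry as one atomic step.
  void DeregisterAll() {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    map_.clear();
  }

  // Counts entries while holding the lock in shared mode.
  std::size_t Count() const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return map_.size();
  }

 private:
  mutable std::shared_mutex mutex_;
  std::unordered_map<std::string, int> map_;
};

// Runs one writer that alternates RegisterBatch/DeregisterAll against
// several counting readers. Returns true if every count a reader saw was
// either 0 or `batch`, i.e. no reader observed a half-updated map.
bool CountsAreConsistent(int batch, int rounds, int num_readers) {
  assert(batch > 0 && rounds > 0 && num_readers > 0);
  SchemaStore store;
  std::vector<char> ok(num_readers, 1);  // one slot per reader thread
  std::vector<std::thread> readers;
  for (int r = 0; r < num_readers; ++r) {
    readers.emplace_back([&store, &ok, r, batch, rounds] {
      for (int i = 0; i < rounds; ++i) {
        std::size_t n = store.Count();
        if (n != 0 && n != static_cast<std::size_t>(batch)) ok[r] = 0;
      }
    });
  }
  std::thread writer([&store, batch, rounds] {
    for (int i = 0; i < rounds; ++i) {
      store.RegisterBatch(batch);
      store.DeregisterAll();
    }
  });
  writer.join();
  for (auto& t : readers) t.join();
  for (char c : ok) {
    if (!c) return false;
  }
  return true;
}
```

If `Count` skipped the lock, the same harness would contain a data race on the container, which is the kind of access a tool such as valgrind or ThreadSanitizer is designed to flag.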
It seems there is an issue reported by valgrind and let's figure out the root cause first before moving forward.
Codecov Report
All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main    #5291   +/-   ##
=======================================
  Coverage   56.68%   56.68%
=======================================
  Files         506      506
  Lines       30227    30227
  Branches     4565     4565
=======================================
  Hits        17133    17133
  Misses      12268    12268
  Partials      826      826
Description
Enforce thread-safe access to the static schema map in the code paths that register and deregister schemas.
Motivation and Context
This is a follow-up to the feature discussed in #5221.
This is required because the schema map is a mutable static object. A runtime can register and deregister schemas as needed using the ONNX API functions RegisterOnnxOperatorSetSchema and DeregisterOnnxOperatorSetSchema, which read from and write to this map. A runtime is often multi-threaded, so it is desirable to guard the schema map against concurrent access at the ONNX level.

This commit adds a lock to the critical sections in the OpSchemaRegistry::OpSchemaRegisterOnce constructor and in OpSchemaRegistry::OpSchemaDeregisterAll, both of which read from and write to the schema map. The mutex is a private member of the OpSchemaRegistry class and is accessed through the singleton, as OpSchemaRegistry::Instance()->schema_map_mutex_, to ensure that the same lock is used across all OpSchemaRegistry instances.

I'd welcome suggestions from the ONNX developers on the correctness of this approach. Thank you!
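The locking pattern described above can be sketched as follows. This is a simplified illustration, not the code in this PR: `MiniRegistry`, its nested `RegisterOnce` class, `RegisteredCount`, and `RegisterFromThreads` are hypothetical names, and the map here stores plain integers instead of schemas. It shows a private mutex member reached through `Instance()`, so every registration and deregistration contends on the same lock.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, simplified registry; the names echo the description but
// this is not the ONNX implementation.
class MiniRegistry {
 public:
  // Function-local static: initialized exactly once, thread-safely (C++11).
  static MiniRegistry* Instance() {
    static MiniRegistry instance;
    return &instance;
  }

  // Plays the role of a register-once constructor: the write to the static
  // map happens while the registry-wide mutex is held.
  class RegisterOnce {
   public:
    RegisterOnce(const std::string& name, int version) {
      MiniRegistry* registry = MiniRegistry::Instance();
      std::lock_guard<std::mutex> lock(registry->schema_map_mutex_);
      Map()[name] = version;
    }
  };

  // Clears the map under the same mutex used for registration.
  static void DeregisterAll() {
    MiniRegistry* registry = Instance();
    std::lock_guard<std::mutex> lock(registry->schema_map_mutex_);
    Map().clear();
  }

  // Reads the size under the lock as well.
  static std::size_t RegisteredCount() {
    MiniRegistry* registry = Instance();
    std::lock_guard<std::mutex> lock(registry->schema_map_mutex_);
    return Map().size();
  }

 private:
  MiniRegistry() = default;

  // The mutable static map that all threads share.
  static std::map<std::string, int>& Map() {
    static std::map<std::string, int> map;
    return map;
  }

  std::mutex schema_map_mutex_;
};

// Clears the registry, registers distinct names from several threads, and
// returns how many entries ended up in the map.
std::size_t RegisterFromThreads(int num_threads, int per_thread) {
  assert(num_threads >= 0 && per_thread >= 0);
  MiniRegistry::DeregisterAll();
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        MiniRegistry::RegisterOnce reg(
            "op_" + std::to_string(t) + "_" + std::to_string(i), 1);
      }
    });
  }
  for (auto& w : workers) w.join();
  return MiniRegistry::RegisteredCount();
}
```

In this sketch the size query also takes the lock. Whether every read path in the real registry needs the same treatment is the open question raised in the review comments.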