Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEBT: Enforce thread-safety for ONNX opset schema API #5291

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

q-ycong-p
Copy link
Contributor

Description

Enforce thread-safe access to the static schema map in the implementations that register and de-register schema.

Motivation and Context

This is a discussed feature following up to #5221.
This is required because the schema map is a mutable static object. Runtime can register and de-register schema as needed using the ONNX API RegisterOnnxOperatorSetSchema and DeregisterOnnxOperatorSetSchema which r/w to this map. A runtime is often multi-threaded. This means that it's desirable to implement thread guard on the schema map at ONNX level.

This commit adds lock to the critical sections in OpSchemaRegistry::OpSchemaRegisterOnce ctor and OpSchemaRegistry::OpSchemaDeregisterAll which r/w into the schema map. The mutex is a private class member in OpSchemaRegistry class, and is accessed through singleton manner OpSchemaRegistry::Instance()->schema_map_mutex_ to ensure the lock is unique across OpSchemaRegistry instances.

I'd like to take suggestion from ONNX devs on the correctness here. Thank you!

Signed-off-by: Yu Cong <congyc@amazon.com>
@q-ycong-p
Copy link
Contributor Author

Hi @jcwchen , my apologies to add you directly here - I was thinking since we've discussed this feature in #5221, perheps you could take a quick look whether this change makes sense? TY!

@justinchuby justinchuby requested a review from jcwchen June 12, 2023 18:38
@jcwchen jcwchen added the schema Issues related to operator schemas label Jun 13, 2023
jcwchen
jcwchen previously approved these changes Jun 13, 2023
Copy link
Member

@jcwchen jcwchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for proposing this! Basically looks good to me. It will still need approval from @onnx/sig-operators-approvers.

@q-ycong-p is it possible to test this lock in your runtime from your end (manually) somehow to ensure this worked as expected?

@q-ycong-p
Copy link
Contributor Author

q-ycong-p commented Jun 13, 2023

I wrote a small unit test and ran with valgrind --tool=helgrind, but there're some issues. Current commit doesn't seem to fully ensure thread safety.

Below is a snippet of the test case, valgrind --tool=helgrind ./test_multithreading catches some issue. Need to take a further look!

// test_multithreading.cpp
#include <pthread.h>
#include <random>
#include <iostream>

int GetRandomIntInRange(int min, int max) {
    std::random_device rd; // obtain a random number from hardware
    std::mt19937 gen(rd()); // seed the generator
    std::uniform_int_distribution<> distr(min, max); // define the range
    return distr(gen);
}
void* Reg(void* ) {
    int opset_ver = GetRandomIntInRange(0, 16); // use max ver=16
    using ONNX_NAMESPACE::RegisterOnnxOperatorSetSchema;
    RegisterOnnxOperatorSetSchema(opset_ver, false);
}
void* Dereg(void* ) {
    using ONNX_NAMESPACE::DeregisterOnnxOperatorSetSchema;
    DeregisterOnnxOperatorSetSchema();
}

int main(int argc, char *argv[]) {
    int num_t = 10;
    pthread_t t[num_t];

    std::cout << "[ DEBUG ] Start creating " << num_t << " threads." << std::endl;

    // Create threads that registers random opset version
    // and deregister schema registrations
    for (int i = 0; i < num_t; i = i + 2) {
        pthread_create(&t[i], NULL, Reg, NULL);
        pthread_create(&t[i + 1], NULL, Dereg, NULL);
    }
    
    std::cout << "[ DEBUG ] Done creating " << num_t << " threads. Waiting to join." << std::endl;
    
    // Join all threads
    for (int i = 0; i < num_t; i++) {
        pthread_join(t[i], NULL);
    }

    std::cout << "[ DEBUG ] All threads returned." << std::endl;
    std::cout << "[ SUCCESS ]" << std::endl;
}

@jcwchen
Copy link
Member

jcwchen commented Jun 15, 2023

Below is a snippet of the test case, valgrind --tool=helgrind ./test_multithreading catches some issue. Need to take a further look!

Thank you for testing it! May I understand what kind of issues you have bumped into? Like an error message?

@q-ycong-p
Copy link
Contributor Author

valgrind --tool=helgrind .

Hi @jcwchen , the valgrind helgrind tool reports a couple of possible data races: (1) On SetLoadedSchemaVersion which could be taken care by placing the same lock to guard w/r of OpSchemaRegistry::loaded_schema_version; (2) Another data race tracing into GetOpSchema - I don't have immediate idea here and would need a further look.

I'm attaching the valgrind report below for ref. It should be reproducible with the snippet above.
multithreading_test.valgrind.log

@@ -1142,6 +1142,11 @@ class OpSchemaRegistry final : public ISchemaRegistry {
bool fail_duplicate_schema = true) {
ONNX_TRY {
op_schema.Finalize();

// Acquires lock to thread-guard schema map access
auto* registry = OpSchemaRegistry::Instance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think every time the map returned by OpSchemaRegistry::Instance() is accessed should be protected against concurrent accesses. Function GetMapWithoutEnsuringRegistration is called by another function not protected by the mutex. Maybe the datarace comes from it. Maybe it would be safer to create a specific class to hold the map storing the schemas and protect the accesses to this class instead of looking to every place this container is accessed and protect it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the function called not protected by the mutex? I assume that this is a dynamic race detection in the given unit-test, where only registration/deregistration are called in parallel, both of which seem to invoke the function under a mutex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetMapWithoutEnsuringRegistration is used in other places to access the registered schemas. If a schema is removed while another function is counting the number of schemas, it could lead to some unstable state. It is unlikely to happen but it could happen in a mutlithread scenarios. Every call to GetMapWithoutEnsuringRegistration should be protected with a lock.

@jcwchen jcwchen dismissed their stale review June 28, 2023 22:27

It seems there is an issue reported by valgrind and let's figure out the root cause first before moving forward.

Copy link

codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.68%. Comparing base (9140ca7) to head (f7986cb).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5291   +/-   ##
=======================================
  Coverage   56.68%   56.68%           
=======================================
  Files         506      506           
  Lines       30227    30227           
  Branches     4565     4565           
=======================================
  Hits        17133    17133           
  Misses      12268    12268           
  Partials      826      826           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema Issues related to operator schemas
Projects
Status: Review in progress
Development

Successfully merging this pull request may close these issues.

None yet

5 participants