
Fix #12948 (slow domain join) #13026

Open · damiendoligez wants to merge 4 commits into trunk
Conversation

@damiendoligez (Member) commented:

I think this is what @gadmm had in mind in #12399 (comment).

Fixes #12948.
Also fixes the first item of #12399.

int res;
caml_plat_assert_locked(cond->mutex);
res = pthread_cond_timedwait(&cond->cond, cond->mutex, deadline);
if (res != ETIMEDOUT) check_err("timedwait", res);

@gadmm (Contributor) commented Mar 13, 2024:

The documentation for pthread_cond_timedwait mentions spurious wakeups. In these cases one should probably restart pthread_cond_timedwait rather than raise an exception. (Edit: callers should take spurious wakeups into account.)
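
For reference, the canonical caller-side defense is to re-check the predicate in a loop after every wakeup. A minimal sketch with raw pthreads (the globals and the should_stop flag are hypothetical, not taken from this PR):

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool should_stop = false;   /* hypothetical predicate */

/* Wait until should_stop is set or the deadline passes; a spurious
   wakeup simply fails the predicate check and waits again. */
static int wait_until(const struct timespec *deadline)
{
  int rc = 0;
  pthread_mutex_lock(&mu);
  while (!should_stop && rc != ETIMEDOUT)
    rc = pthread_cond_timedwait(&cond, &mu, deadline);
  pthread_mutex_unlock(&mu);
  return rc;
}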

@gadmm (Contributor) left a comment:

Reading more closely this time around, I changed where I think one should deal with spurious wakeups. I have also added a comment about the choice of clock.

caml_plat_assert_locked(cond->mutex);
res = pthread_cond_timedwait(&cond->cond, cond->mutex, deadline);
if (res != ETIMEDOUT) check_err("timedwait", res);
return res;

@gadmm (Contributor):

Suggested change:
-     return res;
+     return res != ETIMEDOUT;

@damiendoligez (Member, Author) replied:

At this point, res is either 0 (signal or spurious wake-up) or ETIMEDOUT (deadline reached), so this is just adding a negation. I would rather keep this function as close as possible to pthread_cond_timedwait and leave the return value as-is.

@@ -114,8 +114,7 @@ typedef struct { pthread_cond_t cond; caml_plat_mutex* mutex; } caml_plat_cond;
#define CAML_PLAT_COND_INITIALIZER(m) { PTHREAD_COND_INITIALIZER, m }
void caml_plat_cond_init(caml_plat_cond*, caml_plat_mutex*);
void caml_plat_wait(caml_plat_cond*);
/* like caml_plat_wait, but if the deadline (the second parameter) passes
   without a signal, this function returns 1. */
int caml_plat_timedwait(caml_plat_cond*, const struct timespec *);

@gadmm (Contributor):

Suggested change:
-     int caml_plat_timedwait(caml_plat_cond*, const struct timespec *);
+     /* return 0 on timeout */
+     int caml_plat_timedwait(caml_plat_cond*, const struct timespec *);

} else {
  deadline.tv_sec = curtime.tv_sec;
}
(void) caml_plat_timedwait (&Tick_thread_control.cond, &deadline);

@gadmm (Contributor):

Correcting my previous comment, I would handle spurious wakeups here:

Suggested change:
-     (void) caml_plat_timedwait (&Tick_thread_control.cond, &deadline);
+     while (caml_plat_timedwait (&Tick_thread_control.cond, &deadline)) {
+       if (Tick_thread_control.state == Tick_stop) goto …;
+       /* In case of spurious wakeup, keep waiting */
+     }

caml_plat_lock (&Tick_thread_control.mu);
while (1) {
  if (Tick_thread_control.state == Tick_stop) break;
  gettimeofday (&curtime, NULL);

@gadmm (Contributor):

The clock for caml_plat_timedwait is set inside caml_plat_cond_init_aux in a platform-dependent manner. Please use e.g. clock_gettime to make sure the deadline is on the same clock.
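
For illustration, a minimal sketch of building a deadline on an explicitly chosen clock (this assumes the cond var was initialized for CLOCK_MONOTONIC via pthread_condattr_setclock; the helper name and interval are made up):

#include <time.h>

/* Compute "now + interval_ns" on CLOCK_MONOTONIC, the clock the cond
   var is assumed to have been initialized with. */
static void deadline_after(struct timespec *deadline, long interval_ns)
{
  clock_gettime(CLOCK_MONOTONIC, deadline);
  deadline->tv_nsec += interval_ns;
  if (deadline->tv_nsec >= 1000000000L) {
    deadline->tv_sec += deadline->tv_nsec / 1000000000L;
    deadline->tv_nsec %= 1000000000L;
  }
}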

@damiendoligez (Member, Author) replied:

Unfortunately, clock_gettime is not available on macOS, so I had to reproduce the platform-dependent part.
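
A hedged sketch of what such a platform split could look like (not the PR's actual code; it assumes pthread_cond_timedwait uses the default realtime clock on macOS and CLOCK_MONOTONIC, set via pthread_condattr_setclock, elsewhere):

#include <time.h>
#if defined(__APPLE__)
#include <sys/time.h>
#endif

/* Read the current time on whichever clock the cond var uses. */
static void current_time(struct timespec *now)
{
#if defined(__APPLE__)
  /* macOS: match the realtime clock that pthread_cond_timedwait
     uses by default. */
  struct timeval tv;
  gettimeofday(&tv, NULL);
  now->tv_sec = tv.tv_sec;
  now->tv_nsec = tv.tv_usec * 1000;
#else
  clock_gettime(CLOCK_MONOTONIC, now);
#endif
}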

caml_plat_unlock (&Tick_thread_control.mu);
caml_plat_signal (&Tick_thread_control.cond);

@gadmm (Contributor):

Suggested change:
-     caml_plat_unlock (&Tick_thread_control.mu);
-     caml_plat_signal (&Tick_thread_control.cond);
+     caml_plat_signal (&Tick_thread_control.cond);
+     caml_plat_unlock (&Tick_thread_control.mu);

The DEBUG_LOCK check seems to enforce that the mutex is locked when entering caml_plat_signal, but there is a PR currently under review which aims to remove that needless restriction.

However, according to some sources online, schedulers are programmed to better handle the case where pthread_cond_signal is called with the mutex locked. I am not an expert on these details, so feel free to ignore this change if you have a strong opinion about it; I am curious to know the rationale for the best choice.
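
For context, the two orderings being weighed, shown as alternatives rather than a sequence (a sketch reusing the PR's identifiers; which one wins depends on the pthread implementation):

/* Alternative A (the suggestion): signal while holding the mutex.
   Implementations with "wait morphing" move the woken thread straight
   onto the mutex queue, so it never wakes up only to block again. */
caml_plat_signal (&Tick_thread_control.cond);
caml_plat_unlock (&Tick_thread_control.mu);

/* Alternative B (the original): unlock first, then signal.  Without
   wait morphing, the woken thread can take the mutex immediately
   instead of contending with the signaler. */
caml_plat_unlock (&Tick_thread_control.mu);
caml_plat_signal (&Tick_thread_control.cond);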

@damiendoligez (Member, Author) replied:

I have no strong opinion and I am ready to believe that schedulers are optimized that way.

@damiendoligez (Member, Author) commented Apr 30, 2024:

Many thanks for the thorough review. I changed the code to implement your suggestions.

@damiendoligez (Member, Author) commented:

Running precheck at https://ci.inria.fr/ocaml/job/precheck/972/.

@gasche (Member) commented May 15, 2024:

This needs a rebase, and it would be useful to get an explicit approval from @gadmm. But the CI also reports a lot of failed tests on Windows:


    tests/backtrace/backtrace_systhreads.ml
    tests/backtrace/callstack.ml
    tests/c-api/test_c_thread_has_lock_systhread.ml
    tests/lib-channels/input_all.ml
    tests/lib-systhreads/boundscheck.ml
    tests/lib-systhreads/multicore_lifecycle.ml
    tests/lib-systhreads/test_c_thread_register.ml
    tests/lib-threads/backtrace_threads.ml
    tests/lib-threads/bank.ml
    tests/lib-threads/bufchan.ml
    tests/lib-threads/close.ml
    tests/lib-threads/fileio.ml
    tests/lib-threads/mutex_errors.ml
    tests/lib-threads/pr5325.ml
    tests/lib-threads/pr7638.ml
    tests/lib-threads/pr8857.ml
    tests/lib-threads/prodcons.ml
    tests/lib-threads/prodcons2.ml
    tests/lib-threads/sieve.ml
    tests/lib-threads/swapchan.ml
    tests/lib-threads/tls.ml
    tests/lib-threads/torture.ml
    tests/lib-threads/uncaught_exception_handler.ml
    tests/lib-unix/win-socketpair/test.ml
    tests/parallel/fib_threads.ml
    tests/parallel/multicore_systhreads.ml
    tests/parallel/test_c_thread_register.ml
    tests/regression/pr12948/test.ml
    tests/statmemprof/blocking_in_callback.ml
    tests/statmemprof/moved_while_blocking.ml
    tests/statmemprof/thread_exit_in_callback.ml

If the results are consistent after the rebase and CI rerun, this probably needs some investigation.

Successfully merging this pull request may close these issues:
- Domain.join is suspiciously slow when using systhreads (#12948)