
[ci] prefer CPython in Windows test environment and use safer approach for cleaning up network (fixes #5509) #5510

Merged
merged 33 commits into from Oct 7, 2022

Conversation


@jameslamb jameslamb commented Sep 27, 2022

Fixes #5507.

conda sometimes "downgrades" Python from a CPython build to a PyPy build in our Windows CI jobs. This has historically been because of dependencies introduced by either the python-graphviz or matplotlib conda packages.

This PR originally proposed preventing that situation by explicitly passing the --no-update-deps flag.

From the conda docs (link)

--freeze-installed, --no-update-deps
Do not update or change already-installed dependencies.

This PR now proposes instead explicitly installing python={version}[build=*cpython], to prevent environment solves in which conda switches to PyPy-based builds of Python.

I think this should be possible, based on https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/pkg-specs.html#package-match-specifications.
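
For illustration, a conda invocation with that kind of constraint could look roughly like the sketch below (wrapped in Python only to keep the example self-contained and runnable; the Python version and package list are placeholders, not the actual CI script):

import subprocess

python_version = "3.9"  # placeholder; the real CI scripts substitute their own version

subprocess.run(
    [
        "conda", "install", "--yes",
        f"python={python_version}[build=*cpython]",  # only accept CPython builds of Python
        "python-graphviz",
        "matplotlib",
    ],
    check=True,
)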

@jameslamb

This worked! (https://ci.appveyor.com/project/guolinke/lightgbm/builds/44896165/job/xvf4hw3hv45lchyf)

I'm pushing another commit re-enabling all the other CI jobs that I'd skipped to save time, but then I think we should adopt this fix. @shiyu1994 @StrikerRUS

@jameslamb jameslamb marked this pull request as ready for review September 27, 2022 13:41
@jameslamb jameslamb changed the title WIP: [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) Sep 27, 2022
This reverts commit 704aea4.
@jameslamb jameslamb changed the title [ci] avoid updating dependencies when installing plotting dependencies (fixes #5509) [ci] prefer CPython in Windows test environment (fixes #5509) Sep 27, 2022
@jmoralez

It seems the Linux jobs are getting stuck in this PR as well. I investigated a bit while working on #5505, but the environments seemed the same; it's strange that this started happening now. They also sometimes pass at random.

@jameslamb

😭 this is getting so complicated

It looks like the two Azure DevOps jobs that are getting stuck (they've been running for more than 50 minutes) are both Linux jobs, which means they use the image built from https://github.com/guolinke/lightgbm-ci-docker.

Upgrading that image is the next thing on my list, so that we can hopefully remove the pin on dask=2022.7.0: #5390 (comment).

I'm going to just try manually rebuilding the timed-out Linux jobs here when they fail, to at least make some forward progress. "We sometimes have to manually re-run builds" is a bad state to be in, but not as bad as "Appveyor is failing on every commit and nothing can be merged to master".

@jameslamb

I've tried rebuilding the timed-out CUDA and Azure jobs a few times today, hoping to get lucky and have the builds not hit the Dask timeouts...so far, I haven't been successful.

I'll be traveling for the next few days, and I'm not sure how much I'll be able to work on LightGBM during that time.

@shiyu1994 @jmoralez @StrikerRUS if you're able to find a workaround for the Dask issues, it's ok with me if you want to push such fixes directly to this branch, so we can have one PR that resolves the CI issues.

@shiyu1994

Thanks. Good to see that the Windows environment issue is resolved. Let's see whether we can fix the Dask issue together in this PR.

@shiyu1994

I can try to manually debug the Dask tests on our self-hosted CUDA CI agent. Hopefully we can find a solution soon.

@jameslamb

We could also try pinning dask and distributed to specific versions, to help isolate whether this new timeout issue is related to Dask itself or to other changes in the CI environment.

@shiyu1994

I think I've found the root cause of the Dask test timeouts.

The Network class in our C++ code is a static class, which is shared by all boosters in the same process.

However, in our test_dask.py a single LocalCluster is created for all the test cases, and it keeps the same worker subprocesses for the entire run of test_dask.py without reinitializing them.

That leads to the following issue. When booster A is created in test case 1, it still exists in the memory of the worker process after that test case finishes. When we move to test case 2, a new booster B is created. At that moment garbage collection happens to be triggered, booster A is recycled, and Booster.__del__ is called:

def __del__(self) -> None:
    try:
        if self.network:
            self.free_network()
    except AttributeError:
        pass
    try:
        if self.handle is not None:
            _safe_call(_LIB.LGBM_BoosterFree(self.handle))
    except AttributeError:
        pass

which will deallocate all the network connections by calling free_network!
def free_network(self) -> "Booster":
    """Free Booster's network.

    Returns
    -------
    self : Booster
        Booster with freed network.
    """
    _safe_call(_LIB.LGBM_NetworkFree())
    self.network = False
    return self

LightGBM/src/c_api.cpp, lines 2511 to 2515 in dc4794b:

int LGBM_NetworkFree() {
  API_BEGIN();
  Network::Dispose();
  API_END();
}

Then booster B (the one currently training) will continue as a single-process program, since num_machines_ has been reset to 1:
void Network::Dispose() {
  num_machines_ = 1;
  rank_ = 0;
  linkers_.reset(new Linkers());
  reduce_scatter_ext_fun_ = nullptr;
  allgather_ext_fun_ = nullptr;
}

However, since we are training in a distributed way, there is another worker process waiting for a response from the process running booster B, and that waiting process gets stuck forever because the response never arrives.

I've pushed a quick workaround to this branch, which enforces using a new LocalCluster for each test case. However, this seems to make the testing much slower because initializing the processes for a cluster in Dask seems to be slow.

Could anybody familiar with Dask provide a better solution? In short, we want to use new processes for distributed training in each test case.
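
For reference, a minimal sketch of what the per-test-case workaround could look like (this is just an illustration using pytest and distributed, not necessarily the exact commit pushed to this branch):

import pytest
from distributed import Client, LocalCluster


@pytest.fixture
def cluster():
    # Start a brand-new LocalCluster (and therefore brand-new worker processes)
    # for every test case, so that no stale Booster from a previous test can
    # free the shared Network while another booster is training.
    with LocalCluster(n_workers=2, threads_per_worker=1) as local_cluster:
        yield local_cluster


def test_distributed_training(cluster):
    with Client(cluster) as client:
        pass  # run a distributed LightGBM training job against `client` here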

@jameslamb

Wow, great investigation!!! I wonder if this is the cause of reports like #4771 or #4942.

I think creating a new Dask cluster every time is not a good solution...LightGBM's users wouldn't be happy with having to do that, since it would mean that, for example, the workflow of "initialize a Dask DataFrame, keep it in distributed memory with .persist(), then run multiple LightGBM training runs over it" would effectively not be supported.

Instead, could we have each distributed training process allocate its own Network and hold a pointer to it, and then have free_network() only free that specific Network? Or would that cause some problems?

@jmoralez

I agree with @jameslamb on this: users may want to run consecutive trainings in the same process when doing things like hyperparameter tuning. Also, this has been the way we run the Dask tests for more than a year (changed in #4159). @shiyu1994, can you think of a recent change that would cause this to start failing now?

@jmoralez

jmoralez commented Sep 30, 2022

I tested on macOS, and in one of the jobs there are many errors that we've previously seen on macOS: lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 54). I guess it has the same root cause, right? The network was freed during training, so the connection to one of the machines was lost, but macOS somehow detects this and raises an error instead of hanging?

Edit:
I also see this one in the logs: lightgbm.basic.LightGBMError: Please initialize the network interface first.

So I believe you're definitely right @shiyu1994.

@jameslamb

I'm going to try implementing a fix here where, at the end of training, the Python package explicitly calls free_network().

I still think the ideal solution is the one described in #5510 (comment), because the Dask-only fix I'm proposing has the following drawbacks:

  • doesn't help solve this issue for non-Dask settings (e.g. SynapseML or lightgbm-ray)
  • means that only one LightGBM training run can be happening in the cluster at a time

If my proposed fix seems to work, I think we should adopt it to unblock CI, and then I'll write up this problem in a separate issue that can be worked on later.
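
As a rough sketch, the idea looks something like this (the helper function below is hypothetical and only illustrates the shape of the fix; the real lightgbm.dask internals differ):

import lightgbm as lgb


def train_on_worker(params, X, y):
    # Hypothetical helper, for illustration only.
    model = lgb.LGBMRegressor(**params)
    model.fit(X, y)
    # Free the process-wide Network as soon as training finishes, instead of
    # leaving that to Booster.__del__ at an unpredictable garbage-collection time.
    model.booster_.free_network()
    return model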

@jameslamb

jameslamb commented Oct 6, 2022

Reducing the time_out for the test test_machines_should_be_used_if_provided() from 120 to 1 allowed that test to succeed in a reasonable time (a few seconds)!

Update: that time_out actually didn't matter.

(link to successful build)

So it seems like the only problematic test might be test_training_succeeds_even_if_some_workers_do_not_have_any_data().

Going to continue investigating that next.

@jameslamb

jameslamb commented Oct 7, 2022

I know it is difficult to follow all the debugging comments and commits here, but CI is passing and this PR is ready for review 🎉

After switching from _LIB.LGBM_NetworkFree() to Booster.free_network() in the Dask module (to avoid a second call to _LIB.LGBM_NetworkFree() at an unpredictable time when the Booster is garbage-collected), only one test is still causing timeout issues.
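
For context, a small sketch of why the wrapper call is safer, based on the __del__ and free_network() code quoted earlier in this thread (the helper name here is just for illustration):

import lightgbm as lgb


def shutdown_network(booster: lgb.Booster) -> None:
    # free_network() calls LGBM_NetworkFree() once and then sets booster.network
    # to False, so the `if self.network:` check in Booster.__del__ becomes a no-op
    # and the network is not freed a second time when the Booster is garbage-collected.
    booster.free_network()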

I think we should

  • merge this PR (which adds a pytest.skip() on that test) to unblock development
  • separately document the work to investigate and remove that pytest.skip()

@shiyu1994 @guolinke @jmoralez @StrikerRUS


@jmoralez jmoralez left a comment


Awesome job! And thanks @shiyu1994 for the insights.

@jameslamb

Thanks to both of you for your help! This was a really difficult one.

It would be interesting in the future to work on changing the strategy for how LGBM_NetworkFree() works, to try to enable multiple concurrent training runs on the same Dask cluster.

I'm going to merge this and start updating / merging some of the other approved PRs, starting with #5506.

@shiyu1994

Sorry for the late response. I just returned from our 1-week national holiday.

@jameslamb

No problem, welcome back! Please look at my comment in #5502 (comment) as soon as possible and respond there...I'm nervous that the R package might be in danger of being archived on CRAN.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Successfully merging this pull request may close these issues.

[ci] Appveyor builds failing