Stuck at learning_process.initialize() in DP Tutorial #3756

Open
SamuelGong opened this issue Mar 21, 2023 · 12 comments

@SamuelGong

Describe the bug
In the Colab notebook for DP, everything went well until I reached a code block that kept running for hours without producing any log output. Further debugging showed that the program never finished executing the line state = learning_process.initialize().

Environment:
The experiment was run from scratch using today's TFF release (0.51.0). No part of the notebook was modified.

Expected behavior
The line mentioned above should complete within an acceptable time, a few minutes at most.

@SamuelGong SamuelGong added the bug Something isn't working label Mar 21, 2023
@ZacharyGarrett ZacharyGarrett self-assigned this Mar 21, 2023
@ZacharyGarrett
Collaborator

Is this a duplicate of #3742?

Please try the new 0.52.0 release (https://github.com/tensorflow/federated/releases/tag/v0.52.0), which was published to PyPI yesterday (https://pypi.org/project/tensorflow-federated/0.52.0/).

@SamuelGong
Author

Sorry, 0.51.0 was a typo. I was in fact using 0.52.0 (which is why I emphasized that it was the version released today). Could you please look into this again? I have just gotten past #3742 but am now stuck on a new problem.

@zcharles8
Collaborator

To clarify: you can execute code, but the call to learning_process.initialize() hangs indefinitely? Do you have any estimate of how long it has run?

@SamuelGong
Author

Yes, that is the case. It ran for at least three to four hours before I lost patience. I tried three times; each run hung in the same place with no error message, so I could not provide more information.
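
When a call hangs silently like this, one way to collect more information is Python's standard faulthandler module, which can dump the stack of every Python thread once a timeout expires. The following is only a minimal sketch, not part of the tutorial: the helper name call_with_hang_trace, the timeout, and the log path are hypothetical, and the dump shows Python frames only, so it may not reveal what native code is waiting on.

```python
import faulthandler


def call_with_hang_trace(fn, timeout_s, log_path):
    """Call fn(); if it is still running after timeout_s seconds, write the
    stack traces of all Python threads to log_path. The call itself is not
    interrupted, so a real hang still has to be stopped by hand."""
    # A real file on disk is used because faulthandler needs a file
    # descriptor, which notebook output streams do not always provide.
    with open(log_path, "w") as log_file:
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
        try:
            return fn()
        finally:
            # Cancel the pending dump if fn() returned (or raised) in time.
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage in the notebook (learning_process comes from the tutorial):
# state = call_with_hang_trace(learning_process.initialize, 300, "hang_trace.log")
```

If the call returns before the timeout, the log file stays empty; otherwise its contents can be pasted into the issue.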

@zcharles8
Collaborator

@SamuelGong I think that if you remove the call to tff.backends.native.set_sync_local_cpp_execution_context it should run. This call should now be mainly unnecessary (as it is the default execution context) though it isn't clear why re-setting causes the hang. Can you see if that changes things?

@SamuelGong
Author

It works for me! Thank you very much.

@SamuelGong
Author

> @SamuelGong I think that if you remove the call to tff.backends.native.set_sync_local_cpp_execution_context it should run. This call should now be mainly unnecessary (as it is the default execution context) though it isn't clear why re-setting causes the hang. Can you see if that changes things?

Since I can now run the tutorial notebook on my local machine, I have access to the Jupyter notebook's log. Inspecting the log, I found that calling tff.backends.native.set_sync_local_cpp_execution_context() produces errors such as ERROR: Illegal value '3383.0' specified for flag 'max_concurrent_computation_calls'. It seems that max_concurrent_computation_calls is expected to be an integer, but the code in the tutorial does not ensure this. I am reopening this issue in case this bug has not been caught yet.
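
To illustrate the integer point: if the call is kept at all, the value could be checked and converted before it is passed on. This is a minimal sketch in plain Python. The helper as_positive_int is hypothetical, and the keyword name in the commented-out call is assumed from the flag name in the error message rather than taken from the TFF documentation.

```python
def as_positive_int(value, name="max_concurrent_computation_calls"):
    """Return value as an int, rejecting fractional or non-positive values."""
    if isinstance(value, bool):
        raise TypeError(f"{name} must be a number, got a bool")
    as_int = int(value)
    # 3383.0 passes this check, while 2.5 or the string "3" do not.
    if as_int != value:
        raise ValueError(f"{name} must be a whole number, got {value!r}")
    if as_int < 1:
        raise ValueError(f"{name} must be at least 1, got {value!r}")
    return as_int


# A float such as 3383.0 typically comes from true division (/) somewhere
# upstream; that origin is an assumption here, not something the log shows.
max_calls = as_positive_int(3383.0)

# Hypothetical call site, with the keyword name assumed from the flag name:
# tff.backends.native.set_sync_local_cpp_execution_context(
#     max_concurrent_computation_calls=max_calls)
```

Using floor division (//) where the value is computed would avoid producing a float in the first place.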

@SamuelGong SamuelGong reopened this Mar 29, 2023
@zcharles8
Collaborator

I think that tff.backends.native.set_sync_local_cpp_execution_context shouldn't be invoked in the tutorial at all, since it's now the default. As for the illegal value, this might be due to using Jupyter - I don't think we have any idea about whether it works with TFF or not (and would generally recommend colab instead).

@deepquantum88

deepquantum88 commented Apr 8, 2023

@SamuelGong I am stuck with the same hang when I execute state = learning_process.initialize(). There is no error message; execution simply hangs.

Even after I removed tff.backends.native.set_sync_local_cpp_execution_context, it still did not work. I am using TFF 0.52.0 and TF 2.11.0 on my local system.

Can you please help? How can this be solved?

@SamuelGong
Author

> @SamuelGong I am stuck with the same hang when I execute state = learning_process.initialize(). There is no error message; execution simply hangs.
>
> Even after I removed tff.backends.native.set_sync_local_cpp_execution_context, it still did not work. I am using TFF 0.52.0 and TF 2.11.0 on my local system.
>
> Can you please help? How can this be solved?

Hi. For me, removing that line solved it at the time. However, TFF changes rapidly between versions, so the same fix may no longer work. If it does not, you may need to ask the TFF team.

@niharikagupta2021

niharikagupta2021 commented Mar 27, 2024

I'm facing the same issue when I use TensorFlow Federated in Google Colab. When I try to run tff.federated_computation(lambda: 'Hello, World!')(), that call also hangs. The same happens with the .initialize() function when I try to start training my model using tff.learning.algorithms.build_weighted_fed_avg. Has anyone faced this issue recently?

@zcharles8
Collaborator

@niharikagupta2021 I would encourage you to open a separate GitHub issue for this. Please make sure to include the suggested details: things like version, operating system, etc. are critical to debugging this kind of problem.
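
As a convenience for gathering those details, the sketch below prints the Python version, the operating system, and the installed versions of the relevant packages using only the standard library. The function name environment_report is hypothetical, and the package list is an assumption (the PyPI names tensorflow-federated and tensorflow); extend it as needed.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("tensorflow-federated", "tensorflow")):
    """Collect version details that are useful to paste into a bug report."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this in the same notebook or interpreter where the hang occurs ensures the reported versions match the failing environment.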

@zcharles8 zcharles8 reopened this Mar 27, 2024