Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restarting training job doesn't visualize earlier log files #6782

Open
bhack opened this issue Mar 11, 2024 · 27 comments
Open

Restarting training job doesn't visualize earlier log files #6782

bhack opened this issue Mar 11, 2024 · 27 comments

Comments

@bhack
Copy link

bhack commented Mar 11, 2024

I have a training job on a spot instance and a TB log on GCS-CSI fuse.

As it is a spot instance when the job it is going to reastart it going to create a new log file.

If I don't restart the TB the curves are ok but when I restart the TB service it going to read only then last log file.

Is there a ways to workaround this?

@bhack bhack changed the title Restarting training job doesn't visualize the first log Restarting training job doesn't visualize earlier log files Mar 11, 2024
@groszewn
Copy link
Contributor

How are you running TensorBoard in this situation? Are the log files created in the same directory? Are all steps recomputed in the new file, or are only newer steps added?

@bhack
Copy link
Author

bhack commented Mar 11, 2024

Yes the log are created in the same directory. So I have the first log file and then a second log file that we the job restarting after the Google spot preemption.
If I don't restart the TB all it is ok but when I restart it I see only the 2nd log.
To avoid point overlapping in the curves I use purge_step using the last available checkpoint step.

@groszewn
Copy link
Contributor

Have you tried specifying --reload_multifile?

@bhack
Copy link
Author

bhack commented Mar 11, 2024

Yes I am using both --reload_multifile=true and --detect_file_replacement=true for the gcs csi fuse nature

@groszewn
Copy link
Contributor

Can you please provide the output of tensorboard --inspect --logdir <your log directory> as well as the result of diagnose_tensorboard.py as specified in the issue template?

@bhack
Copy link
Author

bhack commented Mar 12, 2024

tensorboard --inspect --logdir

With that command over the log directory where I have found the issue (there are 4 logs file there of the same run)

======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Traceback (most recent call last):
  File "/usr/local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/main.py", line 41, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 278, in main
    return runner(self.flags) or 0
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 291, in _run_serve_subcommand
    efi.inspect(flags.logdir, event_file, flags.tag)
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 451, in inspect
    inspection_units = get_inspection_units(logdir, event_file, tag)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 405, in get_inspection_units
    field_to_obs=get_field_to_observations_map(generator, tag),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 212, in get_field_to_observations_map
    for event in generator:
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /logs/my_job10/log/tensorboard/events.out.tfevents.1710242212.mymodel.74.0; No such file or directory

And diagnose_tensorboard.py

//diagnose_tensorboard.py:32: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
### Diagnostics

<details>
<summary>Diagnostics output</summary>
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=0, releaselevel='candidate', serial=1)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='tensorboard-2342343-4324', release='6.1.58+', version='#1 SMP PREEMPT_DYNAMIC Mon Jan 29 15:19:25 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.16.2
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'

--- check: tensorflow_python_version
2024-03-12 17:52:11.636867: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.16.1'
INFO: tensorflow.__git_version__: 'v2.16.1-0-g5bc9d26649c'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.11/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'tensorboard-5f94958cbc-w59k5'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=1590, st_dev=3145768, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710242647, st_mtime=1710242639, st_ctime=1710242639)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.11/dist-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
blinker==1.4
certifi==2024.2.2
charset-normalizer==3.3.2
cryptography==3.4.8
dbus-python==1.2.18
distro==1.7.0
dm-tree==0.1.8
flatbuffers==23.5.26
gast==0.5.4
google-pasta==0.2.0
grpcio==1.62.0
h5py==3.10.0
httplib2==0.20.2
idna==3.6
importlib-metadata==4.6.4
jeepney==0.7.1
keras==3.0.5
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libclang==16.0.6
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
ml-dtypes==0.3.2
more-itertools==8.10.0
namex==0.0.7
numpy==1.26.4
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.2
pip==24.0
protobuf==4.25.3
Pygments==2.17.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
python-apt==2.4.0+ubuntu3
requests==2.31.0
rich==13.7.0
SecretStorage==3.3.1
setuptools==69.1.1
six==1.16.0
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow-cpu==2.16.1
tensorflow-io-gcs-filesystem==0.36.0
termcolor==2.4.0
typing_extensions==4.9.0
urllib3==2.2.1
wadllib==1.3.6
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
zipp==1.0.0

</details>

@groszewn
Copy link
Contributor

The output of tensorboard --inspect --logdir looks like you actually passed it tensorboard --inspect --event_file <file> instead. There should not be an error, I would expect either No event files found within logdir <your logdir> or

Found event files in:
<logdir>/<run>/...
<logdir>/<another run>/...
...
These tags are in <logdir>...
...

followed by event statistics.

Can you please attempt a rerun of tensorboard --inspect --logdir <your log directory> pointed at your log directory and resend? Otherwise, I'm inclined to suspect that there has been some corruption of your log directory.

@bhack
Copy link
Author

bhack commented Mar 12, 2024

Ok this is on one of the runs where I found the issue (this one has 4 logs files in the run):

2024-03-12 19:30:25.911301: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Found event files in:
/logs/<my-run>/log/tensorboard

These tags are in /logs/<myrun>/log/tensorboard:
audio -
histograms -
images
  Image1
scalars
   LR
tensor -
======================================================================

Event statistics for /logs/<myrun>/log/tensorboard:
audio -
graph -
histograms -
images
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
scalars
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
sessionlog:checkpoint -
sessionlog:start
   outoforder_steps     []
   steps                [0, 5000, 12000, 12000]
sessionlog:stop -
tensor -
======================================================================

If you see instead scalars and images are starting from the last log (5k) after TB restart
immagine

P.s. I've renamed/removed few things for anonymization.

@groszewn
Copy link
Contributor

Would you mind running with --purge_orphaned_data=False set and see if that helps? It doesn't seem like your situation is the same, but this may help us to debug (doc link).

@bhack
Copy link
Author

bhack commented Mar 12, 2024

--purge_orphaned_data=False

It doesn't solve it seems to me that it is always restarting the scalar graphs and the images from the last log file only.
When I don't restart the service all it is ok on the web frontend.

@groszewn
Copy link
Contributor

Are you able to copy the log directory locally and confirm that things work as expect? Would like to rule out whether this is something internal to TB or isolated to just reading from GCS.

@bhack
Copy link
Author

bhack commented Mar 12, 2024

I've copied the same folder locally and it seems to work correctly.
What I am using on Google cloud is https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver

@groszewn
Copy link
Contributor

Do things work as expected when pointing to GCS directly instead of using the locally mounted path?

@bhack
Copy link
Author

bhack commented Mar 13, 2024

Same bucket passing gcs:// it seems to work correctly

@groszewn
Copy link
Contributor

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

1 similar comment
@groszewn
Copy link
Contributor

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

@bhack
Copy link
Author

bhack commented Mar 13, 2024

Ok I will try.
In the meantime are you relying, with TB internals, on some features that conflict with fuse limitation expecting that that path is a standard POSIX filesystem path?

See:
https://cloud.google.com/storage/docs/gcs-fuse?hl=en#differences-and-limitations

@groszewn
Copy link
Contributor

TB itself (when using the python filesystem readers) relies on TensorFlow's filesystem API via tf.io.gfile when TF is installed (and falls back to a stub implementation if TF is not available which does not have as much support), see #5286.

When using RustBoard for file reading (which is the default, you can try disabling with --load_fast=False to see if the python reader itself works in this situation), it looks like we use the Rust standard library path and walkdir (rustboard disk logdir source). I am not aware of any limitations of those implementations.

@bhack
Copy link
Author

bhack commented Mar 13, 2024

@songjiaxun Any opinion about this on the CSI gcsfuse POSIX limitations?

@bhack
Copy link
Author

bhack commented Mar 13, 2024

@groszewn I've isolated the bug. It is going to happen when I use --logdir_spec with multiple logs directory.

The first one in the list is always loaded correctly after the TB restart.
All the other runs are going to load only the last log file.
I've verified this removing one by one:

  • as the log it is going to become the first one in the list it is loaded completely the other ones just the last log.

Now the problem is that on the bucket it is common to have a lot of subfolders/runs with TB logs files and I cannot load all the logs just specifying the main root.
Other classical workarounds, like using softlink, are not available on GCS/fuse (but also on S3/fuse) as we cannot use softlinks in the buckets. So currently we don't have any solution to restrict a log selection.

@bhack
Copy link
Author

bhack commented Mar 13, 2024

Also for the nature of the GCS fuse we are constrained to use --detect_file_replacement=true that currently in TB it is incompatible with --load_fast=true and it will be very hard to rely on a single large parent log path without load_fast.

@groszewn
Copy link
Contributor

I do not believe the team currently has bandwidth to add --detect_file_replacement support to rustboard, but feel free to open another issue so that we can track +1s to help with prioritization.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks. We rely on external providers for most filesystem/blob storage systems (e.g. tf.io.gfile, fsspec, s3fs), so most specific needs such as this are only supported on a best-effort basis.

@bhack
Copy link
Author

bhack commented Mar 14, 2024

It is really going to be a problem to use symlinks as it is not supported in GCS fuse and also GCS fuse is going to write substituting file in the gcs bucket.

As we are trying to support an official Google cloud product with the TB can you investigate internally as it seems we don't have any other workaround.

@bhack
Copy link
Author

bhack commented Mar 15, 2024

Also this ticket it is still opened with this last comment asking for feedback:
#349 (comment)

Do we need to open a new one?

@groszewn
Copy link
Contributor

Yes, please open a new ticket with your feature request (which seems to be support for --detect_file_replacement in rustboard IIUC). As stated in the issue you linked, --detect_file_replacement is an experimental feature with best-effort support, but we can gauge interest via +1s on the feature request issue to help with prioritization.

@bhack
Copy link
Author

bhack commented Mar 15, 2024

Is TB still actively developed? Or is it mainly in maintenance mode? I a relatively low commit/PR activity.

@bhack
Copy link
Author

bhack commented Mar 15, 2024

In any case,
The ticket is here #6790.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks.

I still don't understand what filtering option we have on fs that don't support symlinks like Google gcs-csi-fuse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants