Restarting training job doesn't visualize earlier log files #6782

bhack · 2024-03-11T14:00:43Z

I have a training job on a spot instance and a TB log on GCS-CSI fuse.

As it is a spot instance when the job it is going to reastart it going to create a new log file.

If I don't restart the TB the curves are ok but when I restart the TB service it going to read only then last log file.

Is there a ways to workaround this?

groszewn · 2024-03-11T17:31:52Z

How are you running TensorBoard in this situation? Are the log files created in the same directory? Are all steps recomputed in the new file, or are only newer steps added?

bhack · 2024-03-11T18:00:17Z

Yes the log are created in the same directory. So I have the first log file and then a second log file that we the job restarting after the Google spot preemption.
If I don't restart the TB all it is ok but when I restart it I see only the 2nd log.
To avoid point overlapping in the curves I use purge_step using the last available checkpoint step.

groszewn · 2024-03-11T20:51:33Z

Have you tried specifying --reload_multifile?

bhack · 2024-03-11T21:04:56Z

Yes I am using both --reload_multifile=true and --detect_file_replacement=true for the gcs csi fuse nature

groszewn · 2024-03-12T15:12:14Z

Can you please provide the output of tensorboard --inspect --logdir <your log directory> as well as the result of diagnose_tensorboard.py as specified in the issue template?

bhack · 2024-03-12T17:55:54Z

tensorboard --inspect --logdir

With that command over the log directory where I have found the issue (there are 4 logs file there of the same run)

======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Traceback (most recent call last):
  File "/usr/local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/main.py", line 41, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 278, in main
    return runner(self.flags) or 0
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 291, in _run_serve_subcommand
    efi.inspect(flags.logdir, event_file, flags.tag)
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 451, in inspect
    inspection_units = get_inspection_units(logdir, event_file, tag)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 405, in get_inspection_units
    field_to_obs=get_field_to_observations_map(generator, tag),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 212, in get_field_to_observations_map
    for event in generator:
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
    for record in super().Load():
  File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
    yield next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /logs/my_job10/log/tensorboard/events.out.tfevents.1710242212.mymodel.74.0; No such file or directory

And diagnose_tensorboard.py

//diagnose_tensorboard.py:32: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
  import pipes
### Diagnostics

<details>
<summary>Diagnostics output</summary>

--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=11, micro=0, releaselevel='candidate', serial=1)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='tensorboard-2342343-4324', release='6.1.58+', version='#1 SMP PREEMPT_DYNAMIC Mon Jan 29 15:19:25 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.16.2
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.16.2'

--- check: tensorflow_python_version
2024-03-12 17:52:11.636867: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.16.1'
INFO: tensorflow.__git_version__: 'v2.16.1-0-g5bc9d26649c'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.11/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'tensorboard-5f94958cbc-w59k5'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=1590, st_dev=3145768, st_nlink=2, st_uid=0, st_gid=0, st_size=4096, st_atime=1710242647, st_mtime=1710242639, st_ctime=1710242639)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.11/dist-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
blinker==1.4
certifi==2024.2.2
charset-normalizer==3.3.2
cryptography==3.4.8
dbus-python==1.2.18
distro==1.7.0
dm-tree==0.1.8
flatbuffers==23.5.26
gast==0.5.4
google-pasta==0.2.0
grpcio==1.62.0
h5py==3.10.0
httplib2==0.20.2
idna==3.6
importlib-metadata==4.6.4
jeepney==0.7.1
keras==3.0.5
keyring==23.5.0
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
libclang==16.0.6
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
ml-dtypes==0.3.2
more-itertools==8.10.0
namex==0.0.7
numpy==1.26.4
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.2
pip==24.0
protobuf==4.25.3
Pygments==2.17.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
python-apt==2.4.0+ubuntu3
requests==2.31.0
rich==13.7.0
SecretStorage==3.3.1
setuptools==69.1.1
six==1.16.0
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow-cpu==2.16.1
tensorflow-io-gcs-filesystem==0.36.0
termcolor==2.4.0
typing_extensions==4.9.0
urllib3==2.2.1
wadllib==1.3.6
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
zipp==1.0.0

</details>

groszewn · 2024-03-12T18:59:52Z

The output of tensorboard --inspect --logdir looks like you actually passed it tensorboard --inspect --event_file <file> instead. There should not be an error, I would expect either No event files found within logdir <your logdir> or

Found event files in:
<logdir>/<run>/...
<logdir>/<another run>/...
...
These tags are in <logdir>...
...

followed by event statistics.

Can you please attempt a rerun of tensorboard --inspect --logdir <your log directory> pointed at your log directory and resend? Otherwise, I'm inclined to suspect that there has been some corruption of your log directory.

bhack · 2024-03-12T19:38:23Z

Ok this is on one of the runs where I found the issue (this one has 4 logs files in the run):

2024-03-12 19:30:25.911301: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Found event files in:
/logs/<my-run>/log/tensorboard

These tags are in /logs/<myrun>/log/tensorboard:
audio -
histograms -
images
  Image1
scalars
   LR
tensor -
======================================================================

Event statistics for /logs/<myrun>/log/tensorboard:
audio -
graph -
histograms -
images
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
scalars
   first_step           0
   last_step            21050
   max_step             21050
   min_step             0
   num_steps            422
   outoforder_steps     [(5600, 5000), (12450, 12000), (12200, 12000)]
sessionlog:checkpoint -
sessionlog:start
   outoforder_steps     []
   steps                [0, 5000, 12000, 12000]
sessionlog:stop -
tensor -
======================================================================

If you see instead scalars and images are starting from the last log (5k) after TB restart

P.s. I've renamed/removed few things for anonymization.

groszewn · 2024-03-12T20:04:36Z

Would you mind running with --purge_orphaned_data=False set and see if that helps? It doesn't seem like your situation is the same, but this may help us to debug (doc link).

bhack · 2024-03-12T20:12:28Z

--purge_orphaned_data=False

It doesn't solve it seems to me that it is always restarting the scalar graphs and the images from the last log file only.
When I don't restart the service all it is ok on the web frontend.

groszewn · 2024-03-12T21:26:19Z

Are you able to copy the log directory locally and confirm that things work as expect? Would like to rule out whether this is something internal to TB or isolated to just reading from GCS.

bhack · 2024-03-12T22:24:26Z

I've copied the same folder locally and it seems to work correctly.
What I am using on Google cloud is https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver

groszewn · 2024-03-13T00:49:49Z

Do things work as expected when pointing to GCS directly instead of using the locally mounted path?

bhack · 2024-03-13T01:22:12Z

Same bucket passing gcs:// it seems to work correctly

groszewn · 2024-03-13T14:05:43Z

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

groszewn · 2024-03-13T14:06:15Z

Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected.

One additional thought, TB has some special filesystem handling that is enabled when fsspec is installed. Can you trying installing fsspec to see if that helps?

bhack · 2024-03-13T14:11:51Z

Ok I will try.
In the meantime are you relying, with TB internals, on some features that conflict with fuse limitation expecting that that path is a standard POSIX filesystem path?

See:
https://cloud.google.com/storage/docs/gcs-fuse?hl=en#differences-and-limitations

groszewn · 2024-03-13T14:22:03Z

TB itself (when using the python filesystem readers) relies on TensorFlow's filesystem API via tf.io.gfile when TF is installed (and falls back to a stub implementation if TF is not available which does not have as much support), see #5286.

When using RustBoard for file reading (which is the default, you can try disabling with --load_fast=False to see if the python reader itself works in this situation), it looks like we use the Rust standard library path and walkdir (rustboard disk logdir source). I am not aware of any limitations of those implementations.

bhack · 2024-03-13T14:31:15Z

@songjiaxun Any opinion about this on the CSI gcsfuse POSIX limitations?

bhack · 2024-03-13T20:17:01Z

@groszewn I've isolated the bug. It is going to happen when I use --logdir_spec with multiple logs directory.

The first one in the list is always loaded correctly after the TB restart.
All the other runs are going to load only the last log file.
I've verified this removing one by one:

as the log it is going to become the first one in the list it is loaded completely the other ones just the last log.

Now the problem is that on the bucket it is common to have a lot of subfolders/runs with TB logs files and I cannot load all the logs just specifying the main root.
Other classical workarounds, like using softlink, are not available on GCS/fuse (but also on S3/fuse) as we cannot use softlinks in the buckets. So currently we don't have any solution to restrict a log selection.

bhack · 2024-03-13T20:29:27Z

Also for the nature of the GCS fuse we are constrained to use --detect_file_replacement=true that currently in TB it is incompatible with --load_fast=true and it will be very hard to rely on a single large parent log path without load_fast.

groszewn · 2024-03-14T19:59:08Z

I do not believe the team currently has bandwidth to add --detect_file_replacement support to rustboard, but feel free to open another issue so that we can track +1s to help with prioritization.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks. We rely on external providers for most filesystem/blob storage systems (e.g. tf.io.gfile, fsspec, s3fs), so most specific needs such as this are only supported on a best-effort basis.

bhack · 2024-03-14T21:09:21Z

It is really going to be a problem to use symlinks as it is not supported in GCS fuse and also GCS fuse is going to write substituting file in the gcs bucket.

As we are trying to support an official Google cloud product with the TB can you investigate internally as it seems we don't have any other workaround.

bhack · 2024-03-15T01:27:29Z

Also this ticket it is still opened with this last comment asking for feedback:
#349 (comment)

Do we need to open a new one?

groszewn · 2024-03-15T18:13:00Z

Yes, please open a new ticket with your feature request (which seems to be support for --detect_file_replacement in rustboard IIUC). As stated in the issue you linked, --detect_file_replacement is an experimental feature with best-effort support, but we can gauge interest via +1s on the feature request issue to help with prioritization.

bhack · 2024-03-15T18:33:49Z

Is TB still actively developed? Or is it mainly in maintenance mode? I a relatively low commit/PR activity.

bhack · 2024-03-15T19:17:52Z

In any case,
The ticket is here #6790.

Unfortunately usage of logdir_spec is discouraged in favor of using symlinks.

I still don't understand what filtering option we have on fs that don't support symlinks like Google gcs-csi-fuse

bhack changed the title ~~Restarting training job doesn't visualize the first log~~ Restarting training job doesn't visualize earlier log files Mar 11, 2024

bhack mentioned this issue Mar 15, 2024

Support --detect_file_replacement in rustboard #6790

Open

bhack mentioned this issue Mar 26, 2024

About using Tensorboard with gcs-fuse-csi volume GoogleCloudPlatform/gcs-fuse-csi-driver#203

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restarting training job doesn't visualize earlier log files #6782

Restarting training job doesn't visualize earlier log files #6782

bhack commented Mar 11, 2024

groszewn commented Mar 11, 2024

bhack commented Mar 11, 2024

groszewn commented Mar 11, 2024

bhack commented Mar 11, 2024 •

edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024 •

edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024 •

edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024

groszewn commented Mar 13, 2024

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024 •

edited

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024

bhack commented Mar 13, 2024

bhack commented Mar 13, 2024 •

edited

groszewn commented Mar 14, 2024

bhack commented Mar 14, 2024

bhack commented Mar 15, 2024

groszewn commented Mar 15, 2024

bhack commented Mar 15, 2024

bhack commented Mar 15, 2024 •

edited

Restarting training job doesn't visualize earlier log files #6782

Restarting training job doesn't visualize earlier log files #6782

Comments

bhack commented Mar 11, 2024

groszewn commented Mar 11, 2024

bhack commented Mar 11, 2024

groszewn commented Mar 11, 2024

bhack commented Mar 11, 2024 • edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024 • edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024 • edited

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024

groszewn commented Mar 12, 2024

bhack commented Mar 12, 2024

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024

groszewn commented Mar 13, 2024

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024 • edited

groszewn commented Mar 13, 2024

bhack commented Mar 13, 2024

bhack commented Mar 13, 2024

bhack commented Mar 13, 2024 • edited

groszewn commented Mar 14, 2024

bhack commented Mar 14, 2024

bhack commented Mar 15, 2024

groszewn commented Mar 15, 2024

bhack commented Mar 15, 2024

bhack commented Mar 15, 2024 • edited

bhack commented Mar 11, 2024 •

edited

bhack commented Mar 12, 2024 •

edited

bhack commented Mar 12, 2024 •

edited

bhack commented Mar 13, 2024 •

edited

bhack commented Mar 13, 2024 •

edited

bhack commented Mar 15, 2024 •

edited