Fails to show new data when event files are replaced (e.g. w/ rsync) #349

potiuk · 2017-08-12T07:00:21Z

Problem summary

We have a problem that tensorboard (1.2.1) does not show new run data when we switched to syncing the data from Google Cloud Storage to local directory.

It looks like when the tensorboard reads an event file from local directory - it will not notice that the event file was deleted and recreated (which is quite valid case when you are using gsutil rsync to sync the data from Google Cloud Storage). I have a short-term workaround (to be tested) but I think there is a need to fiix it permanently in tensorboard - it's quite a valid case to handle due to another bug - #158)

Some context and problem manifestation

Initially we stored data from our ML runs in Google Cloud Storage and pointed our Tensorboard to reead the the data directly from there (logdir=gs://BUCKET/FOLDER) . We run the Tensorboard in Kubernetes cluster and made it available to all our team members via oauth proxy. This is great as we can always see fresh data and any member of the team can access it there. We wanted the data to refresh quite frequently - every few minutes. This worked very well but we found that there is a cost downside to it.

Due to #158 we had a lot of GCS accesses generated by TB just scanning GCS (several millions of accesses/day). This was not only slow but also incurred relatively high GCS cost (20 USD/4 days!).

Then we switched to another approach - we sync the data to a local folder from gcs using gsutil -m rsync -d gs://BUCKET/FOLDER LOCAL_FOLDERand then tensorboard points to that locally downloaded folder instead of pointing to gs:// .

Additional information - we have the whole tree structure within that logdir directory so that we can see and compare all the different runs we've already produced.

Unfortunately after switching to rsync we seem to have big problem. The new runs (in new subfolders) are showing up amongst the others after the runs are executed, but they never show any data for those runs until tensorboard gets restarted. It looks like tensorboard picks up some initially synced partial results from the run - but not enough to show the data - and then it somehow blocks from reading any new data for that folder.

When I start an exact copy of Tensorboard at the same machine, it shows the data for those new runs properly - you can see the screenshot below showing the two tensorboards side-by-side. Tensorboard on the left is freshly started, tensorboard on the right has been run before the run has been started.

Result of investigation

I read in detail how gsutl rsync synchronizes the data and I guessed what might be wrong. I confirmed that by looking at the lsof output for the long-running tensorboard instance - and it becomes quite obvious (look at the "deleted" entries):

tensorboa 191 root   30r   REG       8,1  1085346   8881152 /tensorboard-data/dexnet2/convnet/20171108_kcjniy6q/logs/events.out.tfevents.1502483404.sylvester (deleted)
tensorboa 191 root   31r   REG       8,1  1148462   8880988 /tensorboard-data/dexnet2/convnet/20171108_b2ere44n/logs/events.out.tfevents.1502483298.arnold (deleted)
tensorboa 191 root   32r   REG       8,1   715534   8881201 /tensorboard-data/dexnet2/convnet/20171108_silfxox7/logs/events.out.tfevents.1502402857.sylvester
tensorboa 191 root   33r   REG       8,1   930378   8880956 /tensorboard-data/dexnet2/convnet/20171108_61rgid3z/logs/events.out.tfevents.1502403170.sylvester
tensorboa 191 root   34r   REG       8,1   934806   8881119 /tensorboard-data/dexnet2/convnet/20171108_k9en6v98/logs/events.out.tfevents.1502403287.sylvester
tensorboa 191 root   35r   REG       8,1  1744410   8881036 /tensorboard-data/dexnet2/convnet/20171108_g2unt5hi/logs/events.out.tfevents.1502444237.sylvester (deleted)
tensorboa 191 root   36r   REG       8,1  5844954   8881430 /tensorboard-data/dexnet2/resnet/logs_9bli9id2/logs/events.out.tfevents.1501736695.arnold

The gsutil rsync works in the way that it always deletes and recreates files if they change - it never appends changes to the files. They state it in their manual, and it comes from the fact that the GCS files are in fact objects not files, and are immutable, so you don't even have a possibilty to transfer it partially - it's always all-or-nothing.

On the other hand it's quite reasonable what tensorboard does: when it has logdir set to the local folder, it opens the events* file and expects that tensorflow-based machine learning will continue appending data to it rather than delete and recreate the file when new data appears. It's completely resonable assumption - and I guess it's quite different when it accesses GCS directly (for the same reasons gsutil rsync only synchronizes full files).

You could see in the lsof output that we have problem with 20171108_b2ere44n and 20171108_kcjniy6q - in both cases the file is already deleted, but tensorboard still has it open (which is quite normal behaviour in linux/ext4 - the file/inode will only be physically removed after last process keeps it open).

Potential solutions

I see several potential solutions to the problem, and it would be great if someone from tensorboard team can confirm whether the long term solution I proposed is good.

Short term workaround: I will setup double syncing gs -> gsutil rsync -> linux rsync --inplace -> logdir. I tested that I can make the standard linux rsync works in the way that it will not delete/recreate the files but it will replace the data in place (--inplace switch) - I also considered and tested behaviour of --append and --append-verify, but I think it's safer bet to do --inplace (append-verify will silently refuse syncin for file that does not match the content prefix in the old file). The process will sync the data first to some folder with gsutil rsync and then sync the data to final destination with the linux's rsync --inplace . The obvious downsides of this workaround is that it makes our process more complex and prone to error, and we have to keep two copies of our whole log directory in our kubernetes pod. Thats's quite bad long-term when we will have a lot more data to look at.
Proper fix: I think tensorboard should be smarter in reading local data and check if the file has been deleted since it was last read and close/reopen/reread the file when it was. You could use fstat for that (see the second answer here https://stackoverflow.com/questions/12690281/check-if-an-open-file-has-been-deleted-after-open-in-python) Can this be done in next version please? Or maybe there is already a possible solution for current tensorboard (1.2.1) ?

The text was updated successfully, but these errors were encountered:

jart · 2018-05-16T18:49:33Z

How is it possible for a GCS event log to be gsutil rsync'd and then change afterwards? Event log filenames have timestamps in them. Files don't exist on GCS until close(). One thing TensorBoard does is it checks if a run directory got deleted, in which case it'll remove it from memory. It doesn't check file modified time.

jart · 2018-05-16T19:18:17Z

I just spoke with @nfelt. It turns out TensorFlow does resumable uploads, so GCS event logs can be read before close(). The gsutil rsync behavior also probably differs from /usr/bin/rsync which IIRC is smarter about not unlinking inodes.

I assume you want gsutil rsync because --logdir=gs://... is slow? That makes this a duplicate of #158. We're going to make GCS go fast. Will that solve your problem? If not, let me know and I'll reopen.

nfelt · 2018-08-22T17:37:56Z

I'm re-opening this issue since we've gotten a few other reports and they don't always involve GCS - this applies to usage of regular rsync and other syncing programs as well if they update files via replacement rather than by appending. We may or may not fix this but it's fair to track as an open issue.

gokul-uf · 2018-10-09T17:01:57Z

I'm using 1.11.0 and I have this exact issue when using rsync to download data from a GPU cluster.
I currently restart tensorboard every time I use rsync to get the updates.

Seems one could use suggestions from here to watch for updates in folders from --logdir.

I'd be happy to contribute if there's a fix that the devs agree upon.

nfelt · 2018-10-10T18:58:26Z

@gokul-uf can you use rsync --inplace as a workaround?

Re: watching for updates, my understanding is that there is no single cross-platform way to do this, and it's impossible to do for remote filesystems that have no "filesystem notifications" API. We're making some changes to how we traverse logdirs right now so I wouldn't want to complicate things further, but down the road we might consider this approach as an optimization for the cases where it is supported.

gokul-uf · 2018-10-11T11:16:23Z

That should work for now, thanks!

patzm · 2018-11-27T10:51:59Z

a short script implementing the rsync --inplace workflow proposed by @potiuk.

#!/bin/bash

# TODO: allow these folders to be defined as environment variables
TB_LANDING_DIR="${HOME}/tb-landing"
TB_DIR="${HOME}/tb"

src="gs://$1"
tmp="${TB_LANDING_DIR}/$1"
dst="${TB_DIR}/$1"

echo "src: ${src}"
echo "landing on: ${tmp}"
echo "dst: ${dst}"

if [ ! -d "${tmp}" ]; then
    mkdir -p "${tmp}"
fi

if [ ! -d "${dst}" ]; then
    mkdir -p "${dst}"
fi

gsutil -m rsync -r ${src} ${tmp}
rsync --inplace -av ${tmp} ${dst}

# vim: set syntax=bash

b675987273 · 2020-04-29T03:29:34Z

The second solution looks better for all developer with different platforms. And is it diffcult to realize the event file's change?

b675987273 · 2020-04-30T10:22:40Z

Un, I finally found that the evnet generator never refresh itself. Genernators only send new events once eventfile created but not files deleted. I will try restarting the eventfile loaders.

b675987273 · 2020-05-08T03:00:53Z

 with self._generator_mutex:
            # for event in self._generator.Load():
            #     self._ProcessEvent(event)

            # try do fresh the cache
            self.tensors_by_tag = {}
            test_generator = _GeneratorFromPath(self.path, self.event_file_active_filter)
            for event in test_generator.Load():
                self._ProcessEvent(event)
            if self.tensors_by_tag :
                self.OutdatedData = self.tensors_by_tag 
        return self

I rewrite that reload code. It works but wastes a lof of time when rebuild the generator

pindinagesh · 2021-11-24T04:46:40Z

@potiuk

Could you please move this to closed status as the PR already merged.

Thanks.

nfelt · 2021-11-25T00:51:24Z

@pindinagesh I would prefer to keep this open for the reasons I gave in the comment above #349 (comment) (which was many months after the PR you mentioned was merged).

AkshayRoyal · 2021-12-10T09:37:41Z

I am also facing the same issue and for now using rsync --inplace as a workaround on my institute's HPC server but I think tensorboard should do a smart lookup in the log directories and check if the event files have been deleted. Please let me know whenever this is fixed

nfelt · 2022-03-04T01:14:41Z

Update: as of about a month ago, TensorBoard nightly releases have an experimental --detect_file_replacement flag. Enabling this flag will activate a special polling mode in our Python+TensorFlow-based file loader that will attempt to detect and handle the case when an event file is replaced entirely with a new version of the file containing additional data (rather than the new data being appended to the existing file, the normal behavior we expect). Implementation is in #5530 and #5529.

This should handle the rsync case described in this bug, but you will need to pip install tb-nightly instead of tensorboard (until our next release) and pass --detect_file_replacement=true.

Other caveats:

Currently, this flag only works with our slower Python data loader, so using it will disable our faster Rust-based data loader (and it is incompatible with passing --load_fast=true explicitly).
Currently, this flag only works when using TensorFlow tf.io.gfile for low-level file reading, so if you do not have TensorFlow installed (e.g. you're a PyTorch or JAX user), this flag will have no effect.
The implementation issues extra stat() calls to monitor the file size for changes (one call per file, per backend reload cycle), which does mean additional overhead in terms of I/O ops, but I expect it's a negligible difference unless you have very unusual constraints.

Feedback is welcome, but please note that this functionality is experimental and any support for it will continue to be best-effort.

banjiaojuhao · 2022-06-08T12:27:09Z

Why not using inotify to watch directory?

patzm · 2022-06-08T14:40:06Z

I think because that doesn't exist for object storages 🤷 (e.g. GCS).

potiuk mentioned this issue Aug 12, 2017

TensorBoard very slow on GCS #158

Closed

This was referenced Jan 22, 2018

Tensorboard not displaying scalars aws/sagemaker-python-sdk#26

Closed

[RDST-111] keep tensorboard directory clean during s3 sync hudl/sagemaker-python-sdk#3

Merged

jart closed this as completed May 16, 2018

nfelt mentioned this issue Aug 9, 2018

tensorboard does not load new data from events file which was synced from another server via rsync #1344

Closed

nfelt changed the title ~~Tensorboard does not show new data when event files are synced via gsutil rsync~~ Fails to show new data when event files are replaced (e.g. w/ rsync) Aug 22, 2018

nfelt reopened this Aug 22, 2018

nfelt mentioned this issue Aug 22, 2018

TensorBoard doesn't load new data when summaries are stored in GCS. #867

Closed

This was referenced Sep 9, 2019

Reading from SFTP (windows) #2622

Closed

NetworkWriter #2635

Closed

potiuk closed this as completed Nov 24, 2021

nfelt reopened this Nov 25, 2021

nfelt mentioned this issue Jan 25, 2022

backend: implement event file replacement detection #5529

Merged

This was referenced Feb 1, 2022

core: add --detect_file_replacement flag and plumbing #5530

Merged

core: add --detect_file_replacement flag and plumbing (redo) #5546

Merged

bhack mentioned this issue Mar 15, 2024

Restarting training job doesn't visualize earlier log files #6782

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fails to show new data when event files are replaced (e.g. w/ rsync) #349

Fails to show new data when event files are replaced (e.g. w/ rsync) #349

potiuk commented Aug 12, 2017 •

edited

Loading

jart commented May 16, 2018

jart commented May 16, 2018

nfelt commented Aug 22, 2018

gokul-uf commented Oct 9, 2018 •

edited

Loading

nfelt commented Oct 10, 2018

gokul-uf commented Oct 11, 2018

patzm commented Nov 27, 2018

b675987273 commented Apr 29, 2020

b675987273 commented Apr 30, 2020 •

edited

Loading

b675987273 commented May 8, 2020

pindinagesh commented Nov 24, 2021

nfelt commented Nov 25, 2021

AkshayRoyal commented Dec 10, 2021

nfelt commented Mar 4, 2022

banjiaojuhao commented Jun 8, 2022

patzm commented Jun 8, 2022

Fails to show new data when event files are replaced (e.g. w/ rsync) #349

Fails to show new data when event files are replaced (e.g. w/ rsync) #349

Comments

potiuk commented Aug 12, 2017 • edited Loading

Problem summary

Some context and problem manifestation

Result of investigation

Potential solutions

jart commented May 16, 2018

jart commented May 16, 2018

nfelt commented Aug 22, 2018

gokul-uf commented Oct 9, 2018 • edited Loading

nfelt commented Oct 10, 2018

gokul-uf commented Oct 11, 2018

patzm commented Nov 27, 2018

b675987273 commented Apr 29, 2020

b675987273 commented Apr 30, 2020 • edited Loading

b675987273 commented May 8, 2020

pindinagesh commented Nov 24, 2021

nfelt commented Nov 25, 2021

AkshayRoyal commented Dec 10, 2021

nfelt commented Mar 4, 2022

banjiaojuhao commented Jun 8, 2022

patzm commented Jun 8, 2022

potiuk commented Aug 12, 2017 •

edited

Loading

gokul-uf commented Oct 9, 2018 •

edited

Loading

b675987273 commented Apr 30, 2020 •

edited

Loading