Tensorboard 2.9.1 --logdir as aws s3 path #6349

Open
Krasner opened this issue Apr 28, 2023 · 6 comments

Comments

@Krasner

Krasner commented Apr 28, 2023

I am using TensorBoard 2.9.1. When I set --logdir to s3://<bucket>/<folder>, TensorBoard is not able to read the event files.

On my machine (an EC2 instance) I am able to reach that logdir via the AWS CLI (aws s3 ls s3://<bucket>/<folder>).
In Python I can also read the files in that folder using tensorflow_io:

import tensorflow as tf
import tensorflow_io as tfio  # imported for its side effect: makes the s3:// scheme available to tf.io

data = tf.io.read_file("s3://<bucket>/<folder>/<file>")

This is the TensorBoard command:

AWS_REGION=us-east-1 S3_REGION=us-east-1 S3_ENDPOINT=s3.us-east-1.amazonaws.com S3_USE_HTTPS=1 S3_VERIFY_SSL=0 AWS_LOG_LEVEL=1 CUDA_VISIBLE_DEVICES="" tensorboard --logdir s3://<bucket>/<folder> --host 0.0.0.0
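The same launch can be driven from Python by handing these variables explicitly to a child process, which makes it easy to confirm that TensorBoard actually receives them. This is a minimal sketch: tensorboard_command and S3_ENV are hypothetical names, and the values are simply the ones from the command above.

```python
import os

# Values copied from the command above; other regions or endpoints
# would need different settings.
S3_ENV = {
    "AWS_REGION": "us-east-1",
    "S3_REGION": "us-east-1",
    "S3_ENDPOINT": "s3.us-east-1.amazonaws.com",
    "S3_USE_HTTPS": "1",
    "S3_VERIFY_SSL": "0",
    "AWS_LOG_LEVEL": "1",
    "CUDA_VISIBLE_DEVICES": "",
}


def tensorboard_command(logdir, host="0.0.0.0", base_env=None):
    """Return (argv, env) for running the TensorBoard CLI in a child process."""
    # Start from the parent environment so PATH and any AWS credential
    # variables are inherited, then let the S3 settings override them.
    env = dict(os.environ if base_env is None else base_env)
    env.update(S3_ENV)
    argv = ["tensorboard", "--logdir", logdir, "--host", host]
    return argv, env
```

To run it, pass the pair to subprocess.run(argv, env=env, check=True).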

This is the error output:

2023-04-28 14:37:22.039640: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.039685: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.039714: W tensorflow/c/logging.cc:37] Token file must be specified to use STS AssumeRole web identity creds provider.
2023-04-28 14:37:22.039730: W tensorflow/c/logging.cc:37] Retry Strategy will use the default max attempts.
2023-04-28 14:37:22.068841: E tensorflow/c/logging.cc:40] HTTP response code: 404
Resolved remote host IP address:
Request ID:
Exception name:
Error message: No response body.
5 response headers:
content-type : application/xml
date : Fri, 28 Apr 2023 14:37:21 GMT
server : AmazonS3
x-amz-id-2 : Am5XM8hPcYQIbatGgTDYxOo0yxcPBkGFh5tg5tdM1bor4zc9Yzb1jkBZ0cd0rjaJ1XXJXoHk/tY=
x-amz-request-id : RNH62MB09RNQRT3H
2023-04-28 14:37:22.068889: W tensorflow/c/logging.cc:37] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2023-04-28 14:37:22.097549: E tensorflow/c/logging.cc:40] HTTP response code: 404

I would expect TensorBoard to use TensorFlow I/O's S3 filesystem (tensorflow_io/core/filesystems/s3/), but judging from the messages above, that does not seem to be happening.
Note that, as the diagnostics report shows, I am using tensorflow-io==0.26.0 and tensorflow-io-gcs-filesystem==0.26.0.

I also tried launching TensorBoard from a Python script, but I get the same problem:

import os
import time

import tensorflow as tf
import tensorflow_io as tfio  # imported for its side effect: makes the s3:// scheme available to tf.io
from tensorboard import program

os.environ["AWS_REGION"] = "us-east-1"
os.environ["S3_REGION"] = "us-east-1"
os.environ["S3_ENDPOINT"] = "s3.us-east-1.amazonaws.com"
os.environ["S3_USE_HTTPS"] = "1"
os.environ["S3_VERIFY_SSL"] = "0"
os.environ["AWS_LOG_LEVEL"] = "1"

tracking_address = "s3://<bucket>/<folder>"  # the path of your log directory

if __name__ == "__main__":
    tb = program.TensorBoard()
    # --bind_all listens on all interfaces, like --host 0.0.0.0 in the CLI command.
    tb.configure(argv=[None, "--logdir", tracking_address, "--bind_all"])
    url = tb.launch()
    print(f"TensorBoard listening on {url}")
    # launch() serves from a daemon thread, so keep the main thread alive.
    while True:
        time.sleep(60)

Environment information (required)

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='ip-xxx-xx-xx-xxx', release='5.15.0-1026-aws', version='#30~20.04.2-Ubuntu SMP Fri Nov 25 14:53:22 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.9.1
INFO: installed: tensorflow==2.9.2
INFO: installed: tensorflow-estimator==2.9.0
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.9.1'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.9.2'
INFO: tensorflow.__git_version__: 'v2.9.1-132-g18960c44ad3'

--- check: tensorboard_data_server_version
INFO: data server binary: '/home/ubuntu/.local/lib/python3.9/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/ubuntu/.local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'ip-xxx-xx-xx-xxx.ec2.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=144084, st_dev=66306, st_nlink=2, st_uid=1000, st_gid=1000, st_size=4096, st_atime=1682631005, st_mtime=1682691735, st_ctime=1682691735)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/ubuntu/.local/lib/python3.9/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.3.0
aiohttp==3.8.1
aiohttp-cors==0.7.0
aiosignal==1.3.1
alabaster==0.7.12
albumentations==1.2.0
alembic==1.10.3
antlr4-python3-runtime==4.9.3
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
astunparse==1.6.3
async-generator==1.10
async-timeout==4.0.2
attrs==21.4.0
Automat==0.8.0
autopage==0.5.1
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.11.1
black==22.12.0
bleach==5.0.1
blessed==1.20.0
blinker==1.4
bokeh==3.0.2
boto3==1.22.6
botocore==1.25.13
cachetools==5.2.0
certifi==2022.12.7
cffi==1.15.1
cfgv==3.3.1
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cliff==4.2.0
cloud-init==23.1.2
cloudpickle==2.0.0
cmaes==0.9.1
cmd2==2.4.3
colorama==0.4.3
colorful==0.5.5
colorlog==6.7.0
comm==0.1.2
command-not-found==0.3
commonmark==0.9.1
configobj==5.0.6
constantly==15.1.0
contourpy==1.0.6
conversions==0.0.2
cryptography==2.8
curio==1.6
cycler==0.11.0
Cython==0.29.32
dbus-python==1.2.16
debugpy==1.6.4
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
distinctipy==1.2.2
distlib==0.3.6
distro==1.4.0
distro-info===0.23ubuntu1
dm-tree==0.1.7
docker-pycreds==0.4.0
docrepr==0.2.0
docutils==0.17.1
ec2-hibinit-agent==1.0.0
edward2==0.0.2
einops==0.4.1
entrypoints==0.3
etils==0.9.0
exceptiongroup==1.0.4
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.8.2
flatbuffers==1.12
focal-loss==0.0.7
fonttools==4.38.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2022.11.0
gast==0.4.0
gin-config==0.5.0
gitdb==4.0.10
GitPython==3.1.29
google-api-core==2.11.0
google-api-python-client==2.69.0
google-auth==2.15.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.57.0
gpustat==1.1
greenlet==2.0.2
grpcio==1.43.0
gviz-api==1.10.0
h5py==3.7.0
hibagent==1.0.1
html2text==2020.1.16
httplib2==0.21.0
hydra-colorlog==1.1.0
hydra-core==1.2.0
hydra-joblib-launcher==1.2.0
hydra-optuna-sweeper==1.2.0
hydra-ray-launcher==1.2.0
hyperlink==19.0.0
identify==2.5.9
idna==3.4
imageio==2.22.4
imagesize==1.4.1
importlib-metadata==4.13.0
importlib-resources==5.10.1
incremental==16.10.1
iniconfig==1.1.1
ipykernel==6.19.2
ipympl==0.9.2
ipyparallel==8.4.1
ipython==8.7.0
ipython-genutils==0.2.0
ipywidgets==8.0.4
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.1
jsonpatch==1.22
jsonpointer==2.0
jsonschema==4.17.3
jupyter-events==0.5.0
jupyter_client==7.4.8
jupyter_core==5.1.0
jupyter_server==2.0.5
jupyter_server_terminals==0.4.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.5
kaggle==1.5.12
keras==2.9.0
Keras-Preprocessing==1.1.2
keyring==18.0.1
kiwisolver==1.4.4
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
libclang==14.0.6
llvmlite==0.39.1
lxml==4.9.1
Mako==1.2.4
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline==0.1.6
mistune==2.0.4
more-itertools==4.2.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==0.4.3
nbclassic==0.4.8
nbclient==0.7.2
nbconvert==7.2.7
nbformat==5.7.1
nest-asyncio==1.5.6
netifaces==0.10.4
networkx==2.8.8
nmslib==2.1.1
nodeenv==1.7.0
notebook==6.5.2
notebook_shim==0.2.2
numba==0.56.4
numpy==1.21.5
nvidia-ml-py==11.525.112
oauth2client==4.1.3
oauthlib==3.2.2
omegaconf==2.2.2
opencensus==0.11.2
opencensus-context==0.1.3
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
opt-einsum==3.3.0
optuna==2.10.1
outcome==1.2.0
packaging==22.0
pandas==1.4.3
pandocfilters==1.5.0
parso==0.8.3
pathos==0.3.0
pathspec==0.10.3
pathtools==0.1.2
pbr==5.11.1
pexpect==4.6.0
pickle5==0.0.11
pickleshare==0.7.5
Pillow==9.3.0
pip==23.1.2
platformdirs==2.6.0
pluggy==1.0.0
portalocker==2.6.0
pox==0.3.2
ppft==1.7.6.6
pre-commit==2.20.0
prettytable==3.7.0
prometheus-client==0.13.1
promise==2.3
prompt-toolkit==3.0.36
protobuf==3.19.6
protobuf3-to-dict==0.1.5
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.6.1
pycocotools==2.0.6
pycparser==2.21
pydash==5.1.2
Pygments==2.13.0
PyGObject==3.36.0
pygwalker==0.1.4
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pynndescent==0.5.8
PyOpenGL==3.1.6
pyOpenSSL==19.0.0
pyparsing==3.0.9
pyperclip==1.8.2
PyQt5==5.14.1
PyQt6==6.4.0
PyQt6-Qt6==6.4.1
PyQt6-sip==13.4.0
pyqtgraph==0.13.1
pyrsistent==0.15.5
pyserial==3.4
pytest==6.2.5
pytest-asyncio==0.20.3
python-apt==2.0.0+ubuntu0.20.4.8
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
python-json-logger==2.0.4
python-slugify==7.0.0
python-version==0.0.2
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
pyzmq==24.0.1
qtconsole==5.4.0
QtPy==2.3.0
qudida==0.0.4
ray==1.12.0
recommonmark==0.7.1
regex==2022.10.31
requests==2.28.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==12.6.0
rsa==4.9
s3fs==0.4.2
s3transfer==0.5.2
sacrebleu==2.3.1
sagemaker==2.109.0
scikit-image==0.18.3
scikit-learn==1.1.1
scipy==1.7.3
seaborn==0.12.1
SecretStorage==2.3.1
Send2Trash==1.8.0
sentencepiece==0.1.97
sentry-sdk==1.11.1
seqeval==1.2.2
service-identity==18.1.0
setproctitle==1.3.2
setuptools==61.2.0
shortuuid==1.0.11
simplejson==3.16.0
sip==4.19.21
six==1.16.0
smart-open==6.3.0
smdebug-rulesconfig==1.0.1
smmap==5.0.0
sniffio==1.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
sos==4.4
soupsieve==2.3.2.post1
Sphinx==5.3.0
sphinx-markdown-builder==0.5.5
sphinx-rtd-theme==1.1.1
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
SQLAlchemy==2.0.9
ssh-import-id==5.10
stack-data==0.6.2
stevedore==5.0.0
systemd-python==234
tabulate==0.9.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.11.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.2
tensorflow-addons==0.17.1
tensorflow-datasets==4.7.0
tensorflow-decision-forests==0.2.7
tensorflow-estimator==2.9.0
tensorflow-hub==0.12.0
tensorflow-io==0.26.0
tensorflow-io-gcs-filesystem==0.26.0
tensorflow-metadata==1.12.0
tensorflow-model-optimization==0.7.3
tensorflow-probability==0.17.0
tensorflow-similarity==0.16.8
tensorflow-text==2.9.0
termcolor==2.1.1
terminado==0.17.1
testpath==0.6.0
text-unidecode==1.3
tf-models-official==2.9.2
tf-slim==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.2
tqdm==4.64.1
traitlets==5.7.0
trio==0.22.0
Twisted==18.9.0
typeguard==2.13.3
typing_extensions==4.4.0
ubuntu-advantage-tools==27.12
ufw==0.36
umap-learn==0.5.3
unattended-upgrades==0.1
unify==0.5
untokenize==0.1.1
uri-template==1.2.0
uritemplate==4.1.1
urllib3==1.26.13
validators==0.20.0
virtualenv==20.17.1
vit-keras==0.1.0
wadllib==1.3.3
wandb==0.12.18
wcwidth==0.2.5
webcolors==1.12
webencodings==0.5.1
websocket-client==1.4.2
Werkzeug==2.2.2
wheel==0.38.4
widgetsnbextension==4.0.5
wrapt==1.14.1
wurlitzer==3.0.3
xyzservices==2022.9.0
yapf==0.32.0
yarl==1.8.2
zipp==3.11.0
zope.interface==4.7.1

@Krasner
Author

Krasner commented Apr 28, 2023

As I expected, the problem is that tensorflow_io is not being used. I propose the following changes:

  1. Add imports in backend/event_processing/io_wrapper.py:
import tensorflow as tf
import tensorflow_io as tfio  # registers the s3:// filesystem with tf.io
import s3fs

I import s3fs because tf.io.gfile.glob is very slow when recursing through an AWS S3 path.

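  2. Walk through the S3 path with s3fs:
def S3ListRecursivelyViaWalking(top):
    """Yields (dir_path, file_paths) pairs for every directory under `top`."""
    s3 = s3fs.S3FileSystem()
    # s3fs returns paths without the "s3://" prefix, so add it back.
    for dir_path, _, filenames in s3.walk(top, topdown=True, refresh=True):
        yield (
            "s3://" + dir_path,
            (os.path.join("s3://" + dir_path, filename) for filename in filenames),
        )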
  3. Use the method above to index the S3 path in GetLogdirSubdirectories:
if io_util.IsCloudPath(path):
    # Glob-ing for files can be significantly faster than recursively
    # walking through directories for some file systems.
    logger.info(
        "GetLogdirSubdirectories: Starting to list directories via glob-ing."
    )
    if io_util.IsS3Path(path):
        traversal_method = S3ListRecursivelyViaWalking
    else:
        traversal_method = ListRecursivelyViaGlobbing
  4. Add an IsS3Path function in util/io_util.py:
def IsS3Path(path):
    """Returns True if `path` uses the s3:// scheme."""
    return path.startswith("s3://")
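Because S3 has no real directories, a recursive walk amounts to listing keys and grouping them by the prefix before their last "/". The sketch below models that grouping, plus the later "keep only directories that contain an events file" filter, on an in-memory list of keys, so it runs without AWS access. All function names are hypothetical. is_events_file assumes TensorBoard's check is a "tfevents" substring test on the basename. In practice the keys would come from a real listing call, for example a paginated boto3 list_objects_v2.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/prefix' into ('bucket', 'prefix')."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def group_keys_by_directory(bucket, keys):
    """Yield (dir_url, [file_urls]) pairs, mimicking a recursive walk.

    A key's "directory" is simply the part of the key before its last '/'.
    """
    groups = defaultdict(list)
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        groups[posixpath.dirname(key)].append(f"s3://{bucket}/{key}")
    for dirname in sorted(groups):
        dir_url = f"s3://{bucket}/{dirname}" if dirname else f"s3://{bucket}"
        yield dir_url, groups[dirname]


def is_events_file(path):
    # Assumed stand-in for TensorBoard's events-file check.
    return "tfevents" in posixpath.basename(path)


def logdir_subdirectories(bucket, keys):
    """Return the directory URLs that contain at least one events file."""
    return [
        dir_url
        for dir_url, files in group_keys_by_directory(bucket, keys)
        if any(is_events_file(f) for f in files)
    ]
```

For example, with keys such as logs/run1/events.out.tfevents.1.host and logs/readme.md, only s3://bucket/logs/run1 is reported, because logs itself holds no events file.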

Thoughts?

@yatbear
Member

yatbear commented Apr 28, 2023

Hi @Krasner,

We added S3 support in #5491 (available since TensorBoard v2.6). If S3 directory parsing failed because tensorflow-io was not found, the error message would be something like Error: Unsupported filename scheme S3... (e.g. #5480), prompting you to install TF I/O. The diagnostics output shows that the TF I/O dependency is present in your environment, so I'm not sure this is an issue with identifying and parsing S3 files.
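The distinction is that a missing S3 plugin fails when the path is dispatched, before any network request is sent. The log above, by contrast, contains an HTTP 404 answered by AmazonS3. A toy model of scheme-based dispatch makes the difference concrete. The names are hypothetical and this is not TensorFlow's internal API. Calling register_filesystem stands in for the side effect of importing tensorflow_io.

```python
from urllib.parse import urlparse

# Process-wide registry mapping a URL scheme to a reader function.
_FILESYSTEMS = {}


def register_filesystem(scheme, reader):
    _FILESYSTEMS[scheme] = reader


def read_file(path):
    # Paths without a scheme are treated as local files.
    scheme = urlparse(path).scheme or "file"
    reader = _FILESYSTEMS.get(scheme)
    if reader is None:
        # Loosely modeled on the "Unsupported filename scheme" error.
        raise NotImplementedError(f"Unsupported filename scheme '{scheme}'")
    return reader(path)


# Only local files are supported until a plugin registers more schemes.
register_filesystem("file", lambda p: f"local:{p}")
```

Before anything registers "s3", read_file("s3://bucket/key") raises immediately. Afterwards, the call reaches the S3 reader, and any failure comes from S3 itself.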

The messages "Error message: No response body." and "If the signature check failed. This could be because of a time skew. Attempting to adjust the signer." look like an S3 permission or configuration issue. I'm not familiar with AWS; is it possible to adjust AWS_LOG_LEVEL (or some other argument) to get more information about the failure?
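One quick way to test the time-skew hypothesis is to compare the Date header that S3 returned with the local timestamp of the same log line. This is a minimal sketch using only the standard library. clock_skew_seconds is a hypothetical helper. It assumes the TensorFlow log timestamps are in UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(server_date_header, local_time=None):
    """Return local time minus the server's Date header, in seconds.

    A positive result means the local clock is ahead of the server.
    local_time must be timezone-aware; it defaults to the current UTC time.
    """
    if local_time is None:
        local_time = datetime.now(timezone.utc)
    if local_time.tzinfo is None:
        raise ValueError("local_time must be timezone-aware")
    server_time = parsedate_to_datetime(server_date_header)
    return (local_time - server_time).total_seconds()
```

The log above has Date: Fri, 28 Apr 2023 14:37:21 GMT and a local timestamp of 14:37:22.068841. Under the UTC assumption, that gives a skew of about one second. The header only has one-second resolution, so this is approximate.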

@Krasner
Author

Krasner commented Apr 28, 2023

@yatbear I don't think it's a permission issue. As I noted above, I can access AWS S3 from my EC2 instance, and if I import tensorflow_io in my script, I can also access the S3 files with tf.io.gfile. Without the explicit import of this library, however, tf.io.gfile fails.

Interestingly, after applying the fixes above, the error messages still appear with AWS_LOG_LEVEL=1, but TensorBoard is able to access the event files on S3.

Additionally, as I mentioned, tf.io.gfile is very slow compared to s3fs for accessing S3 files.

@yatbear
Member

yatbear commented Apr 28, 2023

@Krasner, thanks for the clarification and the proposed solutions above! I just saw this open issue in the tensorflow-io repo, tensorflow/io#1731, which suggests the problem lies there. A temporary workaround mentioned in tensorflow/io#1731 (comment) is to pin the tensorflow-io dependency to 0.27.0; could you try this? In the meantime, I will investigate a bit more before adding s3fs as a new dependency.
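When trying the pin (for example with pip install tensorflow-io==0.27.0), it helps to confirm which versions are installed in the environment TensorBoard runs in. This is a minimal sketch built on importlib.metadata from the standard library. installed_versions and pin_mismatches are hypothetical helpers. The pin value is the one from the linked comment.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def pin_mismatches(versions, pins):
    """Return {name: (installed, wanted)} for every pin that is not met."""
    return {
        name: (versions.get(name), wanted)
        for name, wanted in pins.items()
        if versions.get(name) != wanted
    }
```

With pins = {"tensorflow-io": "0.27.0"}, pin_mismatches(installed_versions(pins), pins) returns an empty dict once the pin is satisfied.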

@yatbear
Copy link
Member

yatbear commented May 15, 2023

I saw this recent S3-related fix, tensorflow/io#1790, but it is not included in the latest tensorflow-io pip release (https://pypi.org/project/tensorflow-io/#history), and their nightly build is also stale. I left a comment under that PR.

@ngohoanganh96

I get the same error when using tensorboard --logdir s3://zenml-minio-store/logs/...
I used the following versions:
tensorflow=2.8.0, tensorboard=2.8.0, tensorflow-io=0.24.0
I have also tried updating to tensorflow=2.12.0, tensorboard=2.12.3, tensorflow-io=0.33.0, but it doesn't work.
