
Boto3 incompatible with python zip import #1770

Open
ADH-LukeBollam opened this issue Nov 14, 2018 · 22 comments
Labels
feature-request This issue requests a feature. needs-review p2 This is a standard priority issue

Comments

@ADH-LukeBollam

One of Python's useful features is its ability to load modules from a .zip archive (PEP here), allowing you to package up multiple dependencies into a single file.

Boto3 breaks when imported from a .zip, throwing:

  File "C:\code sandbox\boto.zip\boto3\session.py", line 263, in client
  File "C:\code sandbox\boto.zip\botocore\session.py", line 799, in create_client
  File "C:\code sandbox\boto.zip\botocore\session.py", line 668, in _get_internal_component
  File "C:\code sandbox\boto.zip\botocore\session.py", line 870, in get_component
  File "C:\code sandbox\boto.zip\botocore\session.py", line 150, in create_default_resolver
  File "C:\code sandbox\boto.zip\botocore\loaders.py", line 132, in _wrapper
  File "C:\code sandbox\boto.zip\botocore\loaders.py", line 424, in load_data
botocore.exceptions.DataNotFoundError: Unable to load data for: endpoints

How to Reproduce:

  1. Create a .zip containing boto3 and botocore.
  2. Create a .py file in the same directory as the zip (access keys removed for obvious reasons):

import sys
sys.path.insert(0, 'boto.zip')
import boto3

s3 = boto3.client('s3', aws_access_key_id='access_key', aws_secret_access_key='secret_key')

  3. Run the script.

Tested on Python 3.6.7
boto3 1.9.39
botocore 1.12.39

@joguSD
Contributor

joguSD commented Nov 20, 2018

Confirmed. Our data loaders can't handle being run from a zip.
Specifically, we try to search for the data in the following directory:

'.../botocore.zip/botocore/data'

Which fails our isdir check and is thus skipped.

Marking this as a feature request.
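The failure mode described above is easy to demonstrate in isolation. A minimal sketch (the mypkg package and its data file are invented for the demo) showing that a path pointing inside a zip archive fails os.path.isdir, even though zipimport happily imports from the same archive:

```python
import os
import sys
import tempfile
import zipfile

# Root cause in miniature: a path that points *inside* a zip archive is
# not a real directory, so os.path.isdir returns False, even though
# Python's zipimport machinery can import packages from the archive.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mypkg/__init__.py", "")
    zf.writestr("mypkg/data/endpoints.json", "{}")

data_dir = os.path.join(zip_path, "mypkg", "data")
print(os.path.isdir(data_dir))   # False: this is the check that skips the path

sys.path.insert(0, zip_path)
import mypkg                     # the import itself works fine
print(mypkg.__file__)            # ends with deps.zip/mypkg/__init__.py
```

Any fix therefore has to read the data through the package's loader rather than via filesystem paths.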

@joguSD joguSD added the feature-request This issue requests a feature. label Nov 20, 2018
@ADH-LukeBollam
Author

What are the odds of getting this implemented? It's preventing us from distributing boto3, which makes it very hard to provide a package that depends on it in PySpark.

@gliptak
Contributor

gliptak commented Apr 3, 2019

https://stackoverflow.com/a/22646702 has a snippet processing a zip.

@krish5989

krish5989 commented Oct 4, 2019

Is this issue resolved? I'm also stuck loading boto3 from a .zip file when using the --py-files option in spark2-submit. I'd appreciate any help to overcome this situation.

@philboltt

pytz has a similar issue reading timezone data in the zoneinfo folder from a packaged directory. To get around this it uses pkg_resources.resource_stream from setuptools - https://github.com/stub42/pytz/blob/7b1a844c8ecf2996142ac0eb32201b676e9dcb9a/src/pytz/__init__.py#L101

https://setuptools.readthedocs.io/en/latest/pkg_resources.html

It adds setuptools as a dependency when distributing as a zip, but at least it works. Would be great to have a fix for this. Workarounds are needlessly ugly.

@gliptak
Contributor

gliptak commented Feb 7, 2020

Is #1008 also a duplicate?

@gliptak
Contributor

gliptak commented Feb 17, 2020

I submitted PR boto/botocore#1969
Could a committer review?

philipkimmey added a commit to roverdotcom/botocore that referenced this issue May 13, 2020
Using pkg_resources allows for loading
modules from a zip.

Original author: Gábor Lipták <gliptak@gmail.com>

See:

- boto/boto3#1770
- boto#1969
philipkimmey added a commit to roverdotcom/botocore that referenced this issue May 14, 2020
philipkimmey added a commit to roverdotcom/botocore that referenced this issue May 14, 2020
philipkimmey added a commit to roverdotcom/botocore that referenced this issue May 14, 2020
@shadowdsp

@gliptak Hello, I'm encountering this problem now; can we reopen boto/botocore#1969 and fix it?

@gliptak
Contributor

gliptak commented Aug 17, 2020

@shadowdsp we need a committer's help on that repo to move forward

@wolfch-elsevier

Hi @gliptak - I tried your PR as a patch to botocore, zipped up the patched boto3/botocore as s3cip_deps.zip, and submitted it to Amazon EMR (PySpark) via:

 spark-submit --deploy-mode cluster --py-files s3://data-nonprod/emr_demo/s3cip_deps.zip s3://data-nonprod/emr_demo/s3cip.py

and got:

  File "./s3cip_deps.zip/botocore/loaders.py", line 421, in load_data
    for possible_path in self._potential_locations(name):
  File "./s3cip_deps.zip/botocore/loaders.py", line 436, in _potential_locations
    path = pkg_resources.resource_filename(path1, 'data')
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1226, in resource_filename
    self, resource_name
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1722, in get_resource_filename
    "resource_filename() only supported for .egg, not .zip"
NotImplementedError: resource_filename() only supported for .egg, not .zip
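As the traceback suggests, this failure appears inherent to pkg_resources on plain zips: it can stream a resource out of the archive, but resource_filename() has to hand back a real path on disk, which it only supports for .egg archives (they get an extraction cache). A self-contained sketch of the contrast (the pkgx package and its data file are invented):

```python
import os
import sys
import tempfile
import zipfile

import pkg_resources

# Build a plain (non-egg) zip containing a package with bundled data,
# then compare pkg_resources' stream vs. filename access on it.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("pkgx/__init__.py", "")
    zf.writestr("pkgx/data/model.json", "{}")
sys.path.insert(0, zip_path)

# Streaming goes through the zip loader and works:
print(pkg_resources.resource_stream("pkgx", "data/model.json").read())

# Asking for a filesystem path cannot work for a plain zip:
try:
    pkg_resources.resource_filename("pkgx", "data")
except NotImplementedError as err:
    print(err)
```

So a loader patch built on resource_filename() can't cover the zip case; it would need the stream-based API (or importlib.resources) instead.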

@gliptak
Contributor

gliptak commented Mar 11, 2021

@wolfch-elsevier you might try removing that folder from the Python search path, or this has a pointer:

https://stackoverflow.com/questions/25872134/cxfreeze-error-resource-filename-only-supported-for-egg-not-zipp

@wolfch-elsevier

@wolfch-elsevier you might try removing that folder from the Python search path, or this has a pointer:

https://stackoverflow.com/questions/25872134/cxfreeze-error-resource-filename-only-supported-for-egg-not-zipp

Hi, thanks, but no: the directory /botocore.zip/botocore/data is an integral part of the botocore package. I thought your PR was a workaround for this?

@fmxleg

fmxleg commented Mar 15, 2021

@wolfch-elsevier I need to fix the same problem. Have you found any solution for that?

@wolfch-elsevier

wolfch-elsevier commented Mar 15, 2021

@fmxleg Unfortunately, no. I'm really surprised that Amazon hasn't come up with a working example to help promote their EMR ("cloudized" Apache Spark). Their example creates an RDD from a CSV in S3, and that works (so it probably uses the Scala or Java AWS SDK for access). However, I need to read XML files, which are documents, not series of records.

I tried this solution, which uses the spark.yarn.dist.archives config property. It's supposed to unzip the archive when it's pushed to the worker nodes:
https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work

It didn't work for me.

@dsonavane-rgare

dsonavane-rgare commented Apr 23, 2021

Found a workaround for this. You can pass Spark conf args to have Spark unzip the dependencies and include them on the path, something like this:

	--conf spark.yarn.dist.archives=s3://<bucket+path>/sparkapp.zip#deps \
	--conf spark.yarn.appMasterEnv.PYTHONPATH=deps \
	--conf spark.executorEnv.PYTHONPATH=deps \

Worked with EMR 6.2.0 and Python 3.7.9

jonathanindig added a commit to polynote/polynote that referenced this issue Jun 24, 2021
Some python modules should not be distributed via PySpark for various
reasons. For example, Boto doesn't work when distributed as a zip file
(which is how the PySpark distribution works), see
boto/boto3#1770 for details.

This PR adds a set of modules excluded by default, as well as
configurability of this feature. It can be turned off entirely, or a
different list of exclusions can be provided.
@kojiromike

I have made a new PR, boto/botocore#2437, to attempt to resubmit boto/botocore#1969

@nickolashkraus

This is also an issue for SaltStack modules:

Python 2.3 and higher allows developers to directly import Zip archives containing Python code.

Source

Salt execution modules are imported using zipimporter:

mod = zipimporter(fpath).load_module(name)

If one creates a Zip archive containing botocore, the following error occurs when attempting to execute the module:

boto3.exceptions.ResourceNotExistsError: The 'dynamodb' resource does not exist.

This is due to the fact that the botocore loader (botocore/loaders.py) checks the path botocore/data/ for model files. If the path to botocore is a Zip archive, this check fails and botocore fails to load the models (EC2, S3, DynamoDB, etc.).

This renders SaltStack modules distributed as Zip modules using botocore useless.

@alete89

alete89 commented Dec 2, 2021

Hi! I'm experiencing this error trying to use boto3 in modules within a zipped dependencies file on EMR. I think this is worth a fix.

@kojiromike

The fix is in boto/botocore#2437, but someone from AWS will have to review, approve and merge it.

@alete89

alete89 commented Dec 2, 2021

Found a workaround for this. You can pass Spark conf args to have Spark unzip the dependencies and include them on the path, something like this:

	--conf spark.yarn.dist.archives=s3://<bucket+path>/sparkapp.zip#deps \
	--conf spark.yarn.appMasterEnv.PYTHONPATH=deps \
	--conf spark.executorEnv.PYTHONPATH=deps \

Worked with EMR 6.2.0 and Python 3.7.9

Hey @dsonavane-rgare I'm trying this without success. Can you elaborate a bit more?
This is how I was sending my file and deps (this throws boto3 not found because one of my zipped files uses boto3):

spark-submit --py-files s3://<bucket>/code/spark/dependencies.zip s3://<bucket>/code/spark/job.py args

This is what I've tried now, based on your example:

spark-submit --conf spark.yarn.dist.archives=s3://<bucket>/code/spark/dependencies.zip#deps --conf spark.yarn.appMasterEnv.PYTHONPATH=deps --conf spark.executorEnv.PYTHONPATH=deps s3://<bucket>/code/spark/job.py args

and this as well:

spark-submit --py-files s3://<bucket>/code/spark/dependencies.zip --conf spark.yarn.dist.archives=s3://<bucket>/code/spark/dependencies.zip#deps --conf spark.yarn.appMasterEnv.PYTHONPATH=deps --conf spark.executorEnv.PYTHONPATH=deps s3://<bucket>/code/spark/job.py 2021-12-01

Thanks

@aBurmeseDev aBurmeseDev added the p2 This is a standard priority issue label Nov 10, 2022
@MatheusAnciloto

Does anyone have a workaround for this?

@kojiromike

Does anyone have a workaround for this?

The only thing that ever worked for me was to run on systems with boto3 already installed and on the PYTHONPATH.
