
Are we looking for S3 integration where the input file could be anything (xml/csv) and the output would be a dict #17

Closed
tasneem-hyder opened this issue Jun 11, 2020 · 25 comments

@tasneem-hyder

Input: files at an S3 location (any file format supported by benedict).
All operations should be supported thereafter.

@tasneem-hyder tasneem-hyder added the enhancement New feature or request label Jun 11, 2020
@fabiocaccamo
Owner

fabiocaccamo commented Jun 11, 2020

I don't understand what you need to do that isn't already possible.

What do you mean? Could you show me an example of what you would like to do?

@tasneem-hyder
Author

Suppose we have to use the method below:
https://github.com/fabiocaccamo/python-benedict#from_xml

This method takes a URL as its first input. What if the URL points to a file hosted on Amazon S3 or an S3-compatible storage location?
Expectation: the method should be able to read the file directly from the S3 location if we provide the necessary configuration (AWS_KEY, AWS_SECRET, BUCKET, KEY, etc.).

I am checking whether this kind of feature already exists or is being worked on.

@fabiocaccamo
Owner

Ok, it's clear now.
This is not currently planned, and it's the first time I have received this request.
It's an interesting feature; it should not be hard to implement using boto: https://stackoverflow.com/questions/4993439/how-can-i-access-s3-files-in-python-using-urls#5010733

If you do it, feel free to submit a PR.

@tasneem-hyder
Author

Yes, I am using boto3 to download the file first, then using python-benedict to process it. I will check your repo's internal implementation for the integration.
Thanks.

@fabiocaccamo
Owner

I think you just need to change the read_url method here: benedict/dicts/io/io_util.py

@fabiocaccamo
Owner

@uknow2009 could you show me your boto3 implementation to download the file?

@tasneem-hyder
Author

I'm using boto3 as below to download the file:

import boto3
from boto3.s3.transfer import TransferConfig

S3_HOST = "s3host"
ACCESS_KEY = "accesskey"
ACCESS_SECRET = "accesssecret"
bucket = "bucket"
obj_key = "objectkey"
download_file_path = "download_file_path"

config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                        multipart_chunksize=1024 * 25, use_threads=True)
try:
    s3_conn = boto3.resource('s3', endpoint_url=S3_HOST,
                             aws_access_key_id=ACCESS_KEY,
                             aws_secret_access_key=ACCESS_SECRET)
    s3_conn.meta.client.download_file(bucket, obj_key, download_file_path, Config=config)
    print("Download successful")
except Exception as e:
    print("Unable to download file:", str(e))

@tasneem-hyder
Author

^ @fabiocaccamo

@fabiocaccamo
Owner

@uknow2009 I think that wrapping all this code in a single method that requires 8 arguments would not be a good idea... what do you think?

@tasneem-hyder
Author

@fabiocaccamo Yes, passing all the arguments doesn't seem good. What could be done instead is reading most of the settings from our own S3 config file and passing only a limited set of arguments (host, bucket, and object key) via the method parameters. The download file path can default to the process root directory or a tmp location, since it is only an intermediate step before the file is consumed by the from_xml method.
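The split described above (bulk settings in a config object, only a few values per call, tmp dir as the intermediate download location) could be sketched like this. Every name here is hypothetical and illustrative only; none of it is part of benedict's API:

```python
# Sketch of the proposal: credentials live in a config object, the read
# method takes only host/bucket/key, and downloads land in a tmp location.
import os
import tempfile
from dataclasses import dataclass


@dataclass
class S3Config:
    """Hypothetical container for the settings read from an S3 config file."""
    access_key: str
    access_secret: str


def s3_download_path(obj_key: str) -> str:
    # Intermediate download location: a tmp dir, as suggested above.
    return os.path.join(tempfile.gettempdir(), os.path.basename(obj_key))


# A hypothetical from_s3_xml(host, bucket, obj_key, config=S3Config(...))
# would download to s3_download_path(obj_key) and then call from_xml on it.
print(s3_download_path("data/report.xml"))
```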

@fabiocaccamo
Owner

@uknow2009 this library could be the solution: https://github.com/dask/s3fs

@tasneem-hyder
Author

@fabiocaccamo yes, per the documentation s3fs looks more promising for integration with benedict. It already provides ways to pass the arguments and credentials to be used.

@hudgeon

hudgeon commented Dec 4, 2021

@tasneem-hyder We do a lot of this with benedict and s3fs (Thanks @fabiocaccamo BTW). Below is a code snippet that shows how we do it.

We thought about opening a pull request but haven't because boto3 is a big library that would make benedict a much heavier package. But we're happy to do so if this would be useful.

from benedict import benedict
import s3fs

s3_xml_path = 's3://sample_bucket/sample_cxml_po.xml'
s3 = s3fs.S3FileSystem(anon=False, default_cache_type='none')
xml_string = s3.cat(s3_xml_path).decode('utf-8')  # raw XML read straight from S3
d = benedict.from_xml(xml_string)                 # parse into a benedict dict
d

@hudgeon

hudgeon commented Dec 4, 2021

And, for completeness, here is some code that writes XML to S3:

from benedict import benedict
import s3fs

d = benedict({'xml_root': {'a': 1, 'b': 2}})
s3_xml_path = 's3://sample_bucket/sample_cxml_po.xml'
s3 = s3fs.S3FileSystem(anon=False, default_cache_type='none')
with s3.open(s3_xml_path, 'wb') as f:
    f.write(d.to_xml().encode('utf-8'))

@fabiocaccamo
Owner

fabiocaccamo commented Dec 4, 2021

@hudgeon thanks for pointing out your implementation; it looks pretty easy to add s3 support using boto3.

I agree that boto3 is a big dependency, but we could make it optional, so that it is available only if the library is installed with the optional requirement: pip install python-benedict[s3].

What do you think about this solution?

As the library is getting bigger, this approach could also be extended to the different I/O formats; this way it would be possible to keep the library as lightweight as possible (#18).
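The optional-requirement idea boils down to a guarded import: the heavy dependency is only imported if the user installed the extra. A minimal sketch of that pattern (generic Python, not benedict's actual code, and assuming an extra named s3):

```python
# Sketch of the optional-dependency pattern: the heavy library is imported
# only if it was installed, e.g. via `pip install python-benedict[s3]`.
try:
    import s3fs  # would be pulled in by the hypothetical [s3] extra
    S3_SUPPORT = True
except ImportError:
    S3_SUPPORT = False


def require_s3_support():
    """Raise a helpful error when an S3 path is used without the extra."""
    if not S3_SUPPORT:
        raise ImportError(
            "S3 support is not installed; run: pip install python-benedict[s3]"
        )
```

Callers of the S3 I/O path would invoke require_s3_support() first, so users without the extra get an actionable error instead of a bare ImportError deep inside the library.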

@hudgeon

hudgeon commented Dec 4, 2021

@fabiocaccamo I reckon that'd be pretty useful to a lot of people.

Are you thinking that if the first argument of the I/O method starts with s3:// then it uses s3fs as the 'filesystem', perhaps with read_s3_file and write_s3_file functions in here: https://github.com/fabiocaccamo/python-benedict/blob/eb950c57c0a17c58dfc247fd85a6e3baffb77e5f/benedict/dicts/io/io_util.py?

@fabiocaccamo
Owner

fabiocaccamo commented Dec 4, 2021

That's exactly what I would do!

If the filepath argument in read_file and write_file starts with 's3://' then call the s3 IO method:

  • read_s3_file
  • write_s3_file
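That dispatch could look roughly like this. This is a sketch under the assumptions above, not benedict's actual io_util.py; the helper names mirror the ones proposed in the thread, but the bodies are illustrative only:

```python
# Sketch of the 's3://' dispatch described above.
def is_s3_filepath(filepath: str) -> bool:
    # The scheme prefix is the whole dispatch criterion.
    return filepath.startswith("s3://")


def read_s3_file(filepath: str) -> str:
    import s3fs  # deferred import keeps the dependency optional
    s3 = s3fs.S3FileSystem(anon=False)
    return s3.cat(filepath).decode("utf-8")


def read_file(filepath: str) -> str:
    if is_s3_filepath(filepath):
        return read_s3_file(filepath)
    with open(filepath, "r") as f:
        return f.read()


print(is_s3_filepath("s3://bucket/data.xml"))  # True
print(is_s3_filepath("/tmp/data.xml"))         # False
```

write_file would mirror this with a write_s3_file branch using s3.open(filepath, 'wb'), as in the snippet earlier in the thread.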

@hudgeon

hudgeon commented Dec 5, 2021

:) Nice that we're on the same page. How would you like to proceed with this work and how can we assist?

@hudgeon

hudgeon commented Dec 5, 2021

@fabiocaccamo FYI, I've just taken a look at s3fs and it actually doesn't use boto3. It uses https://github.com/aio-libs/aiobotocore and https://github.com/boto/botocore.

@fabiocaccamo
Owner

@hudgeon could you submit a PR including tests?

@hudgeon

hudgeon commented Dec 7, 2021

@fabiocaccamo No worries. We should be able to get to it this weekend.

@fabiocaccamo fabiocaccamo added the help wanted Extra attention is needed label Dec 23, 2021
@fabiocaccamo
Owner

@hudgeon I implemented s3fs I/O operations in the s3 branch; could you add some tests here?
(tests just for the methods that involve the s3 storage are enough)

@hudgeon

hudgeon commented Jan 21, 2022

@fabiocaccamo We've also implemented it and have been using it from our private pip repository for several weeks now, but we didn't push it to your branch because we haven't written tests for it. :)

In order to do the tests properly, we reckon we'll need to use localstack (https://github.com/localstack/localstack), which we haven't used before. Currently our S3 tests use an S3 bucket we own. Can you think of a testing approach that wouldn't involve localstack?

@fabiocaccamo
Owner

@hudgeon I have never tested against an S3 bucket; adding localstack just for the tests would probably be overkill in this case.

Do you think we can avoid localstack by doing tests with a public bucket?

Also checking how s3fs tests s3 could be helpful:
https://github.com/fsspec/s3fs/blob/main/s3fs/tests/test_s3fs.py
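One way to avoid both localstack and a real bucket is to inject an in-memory fake in place of the S3 filesystem object. This is only a sketch of that idea (the class and function names are illustrative, not the approach benedict's test suite actually uses):

```python
# Sketch: test the S3 read path without any network by substituting an
# in-memory fake for s3fs.S3FileSystem.
class FakeS3FileSystem:
    """Minimal in-memory stand-in for s3fs.S3FileSystem (cat() only)."""

    def __init__(self, *args, **kwargs):
        # Pretend this object is stored in the bucket.
        self.store = {"s3://bucket/doc.xml": b"<root><a>1</a></root>"}

    def cat(self, path):
        return self.store[path]


def read_s3_text(path, fs_factory=FakeS3FileSystem):
    # In production fs_factory would be s3fs.S3FileSystem.
    fs = fs_factory(anon=False)
    return fs.cat(path).decode("utf-8")


print(read_s3_text("s3://bucket/doc.xml"))  # <root><a>1</a></root>
```

The trade-off is that a fake only exercises the library's own logic, not real S3 semantics (auth, consistency, errors), which is what localstack or a real bucket would add.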

@fabiocaccamo fabiocaccamo removed the help wanted Extra attention is needed label May 18, 2022
@fabiocaccamo
Owner

@tasneem-hyder @hudgeon s3 support has been added in version 0.27.0.

https://github.com/fabiocaccamo/python-benedict#io-methods
