
Are we looking for S3 integration where the input file could be anything (xml/csv) and the output would be a dict #17

Closed
tasneem-hyder opened this issue Jun 11, 2020 · 25 comments

@tasneem-hyder

Input: files at an S3 location (any file format supported by benedict).
All operations should be supported thereafter.

@tasneem-hyder tasneem-hyder added the enhancement New feature or request label Jun 11, 2020
@fabiocaccamo
Owner

fabiocaccamo commented Jun 11, 2020

I don't understand what you need to do that isn't already possible.

What do you mean? Could you show me an example of what you would like to do?

@tasneem-hyder
Author

Suppose we have to use the method below:
https://github.com/fabiocaccamo/python-benedict#from_xml

This method takes a URL as its first input. What if the URL points to a file hosted on Amazon S3 or an S3-compatible storage location?
Expectation: the method should be able to read the file directly from the S3 location if we provide the necessary configuration (AWS_KEY, AWS_SECRET, BUCKET, KEY, etc.).

I am checking whether this kind of feature already exists or is being worked on.

@fabiocaccamo
Owner

Ok, it's clear now.
This is not currently planned, and it's the first time I have received this request.
It's an interesting feature; it should not be hard to implement using boto: https://stackoverflow.com/questions/4993439/how-can-i-access-s3-files-in-python-using-urls#5010733

If you do it, feel free to submit a PR.

@tasneem-hyder
Author

Yes, I am using boto3 to download the file first, then using python-benedict to process it. I will check your repo's internal implementation for the integration.
Thanks.

@fabiocaccamo
Owner

I think you just need to change the read_url method here: benedict/dicts/io/io_util.py

@fabiocaccamo
Owner

@uknow2009 could you show me your boto3 implementation to download the file?

@tasneem-hyder
Author

I'm using boto3 as below to download the file:

import boto3
from boto3.s3.transfer import TransferConfig

S3_HOST = "s3host"
ACCESS_KEY = "accesskey"
ACCESS_SECRET = "accesssecret"
bucket = "bucket"
obj_key = "objectkey"
download_file_path = "download_file_path"

config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                        multipart_chunksize=1024 * 25, use_threads=True)
try:
    s3_conn = boto3.resource('s3', endpoint_url=S3_HOST,
                             aws_access_key_id=ACCESS_KEY,
                             aws_secret_access_key=ACCESS_SECRET)
    s3_conn.meta.client.download_file(bucket, obj_key, download_file_path, Config=config)
    print("Download successful")
except Exception as e:
    print("Unable to download file:", str(e))

@tasneem-hyder
Author

^ @fabiocaccamo

@fabiocaccamo
Owner

@uknow2009 I think that wrapping all this code in a single method that requires 8 arguments would not be a good idea... what do you think?

@tasneem-hyder
Author

@fabiocaccamo Yes, passing all the arguments doesn't seem good. What could be done instead is reading most of the settings from our own S3 config file and passing only a limited set of arguments (host, bucket, and object key) via the method parameters. The download file path can default to the process root directory or a tmp location, since it is only an intermediate step before the file is consumed by the from_xml method.
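The split described above (bulk settings in a config object, only a few values per call, tmp dir as the intermediate download location) could be sketched like this. Every name here is hypothetical and illustrative only; none of it is part of benedict's API:

```python
# Sketch of the proposal: credentials live in a config object, the read
# method takes only host/bucket/key, and downloads land in a tmp location.
import os
import tempfile
from dataclasses import dataclass


@dataclass
class S3Config:
    """Hypothetical container for the settings read from an S3 config file."""
    access_key: str
    access_secret: str


def s3_download_path(obj_key: str) -> str:
    # Intermediate download location: a tmp dir, as suggested above.
    return os.path.join(tempfile.gettempdir(), os.path.basename(obj_key))


# A hypothetical from_s3_xml(host, bucket, obj_key, config=S3Config(...))
# would download to s3_download_path(obj_key) and then call from_xml on it.
print(s3_download_path("data/report.xml"))
```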

@fabiocaccamo
Owner

@uknow2009 this library could be the solution: https://github.com/dask/s3fs

@tasneem-hyder
Author

@fabiocaccamo yes, per the documentation s3fs looks more promising for integration with benedict. It already provides ways to pass the arguments and credentials to be used.

@hudgeon

hudgeon commented Dec 4, 2021

@tasneem-hyder We do a lot of this with benedict and s3fs (Thanks @fabiocaccamo BTW). Below is a code snippet that shows how we do it.

We thought about opening a pull request but haven't because boto3 is a big library that would make benedict a much heavier package. But we're happy to do so if this would be useful.

from benedict import benedict
import s3fs

s3_xml_path = 's3://sample_bucket/sample_cxml_po.xml'
s3 = s3fs.S3FileSystem(anon=False, default_cache_type='none')
xml_string = s3.cat(s3_xml_path).decode('utf-8')  # raw XML read straight from S3
d = benedict.from_xml(xml_string)                 # parse into a benedict dict
d

@hudgeon

hudgeon commented Dec 4, 2021

And, for completeness, here is some code that writes XML to S3:

from benedict import benedict
import s3fs

d = benedict({'xml_root': {'a': 1, 'b': 2}})
s3_xml_path = 's3://sample_bucket/sample_cxml_po.xml'
s3 = s3fs.S3FileSystem(anon=False, default_cache_type='none')
with s3.open(s3_xml_path, 'wb') as f:
    f.write(d.to_xml().encode('utf-8'))

@fabiocaccamo
Owner

fabiocaccamo commented Dec 4, 2021

@hudgeon thanks for pointing out your implementation; it looks pretty easy to add s3 support using boto3.

I agree that boto3 is a big dependency, but we could make it optional, so that it is available only if the library is installed with the optional requirement: pip install python-benedict[s3].

What do you think about this solution?

As the library is getting bigger, this approach could also be extended to the different I/O formats; this way it would be possible to keep the library as lightweight as possible (#18).
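The optional-requirement idea boils down to a guarded import: the heavy dependency is only imported if the user installed the extra. A minimal sketch of that pattern (generic Python, not benedict's actual code, and assuming an extra named s3):

```python
# Sketch of the optional-dependency pattern: the heavy library is imported
# only if it was installed, e.g. via `pip install python-benedict[s3]`.
try:
    import s3fs  # would be pulled in by the hypothetical [s3] extra
    S3_SUPPORT = True
except ImportError:
    S3_SUPPORT = False


def require_s3_support():
    """Raise a helpful error when an S3 path is used without the extra."""
    if not S3_SUPPORT:
        raise ImportError(
            "S3 support is not installed; run: pip install python-benedict[s3]"
        )
```

Callers of the S3 I/O path would invoke require_s3_support() first, so users without the extra get an actionable error instead of a bare ImportError deep inside the library.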

@hudgeon

hudgeon commented Dec 4, 2021

@fabiocaccamo I reckon that'd be pretty useful to a lot of people.

Are you thinking that if the first argument of the I/O method starts with s3:// then it uses s3fs as the 'filesystem', perhaps with read_s3_file and write_s3_file functions in here: https://github.com/fabiocaccamo/python-benedict/blob/eb950c57c0a17c58dfc247fd85a6e3baffb77e5f/benedict/dicts/io/io_util.py?

@fabiocaccamo
Owner

fabiocaccamo commented Dec 4, 2021

That's exactly what I would do!

If the filepath argument in read_file and write_file starts with 's3://' then call the s3 IO method:

  • read_s3_file
  • write_s3_file
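That dispatch could look roughly like this. This is a sketch under the assumptions above, not benedict's actual io_util.py; the helper names mirror the ones proposed in the thread, but the bodies are illustrative only:

```python
# Sketch of the 's3://' dispatch described above.
def is_s3_filepath(filepath: str) -> bool:
    # The scheme prefix is the whole dispatch criterion.
    return filepath.startswith("s3://")


def read_s3_file(filepath: str) -> str:
    import s3fs  # deferred import keeps the dependency optional
    s3 = s3fs.S3FileSystem(anon=False)
    return s3.cat(filepath).decode("utf-8")


def read_file(filepath: str) -> str:
    if is_s3_filepath(filepath):
        return read_s3_file(filepath)
    with open(filepath, "r") as f:
        return f.read()


print(is_s3_filepath("s3://bucket/data.xml"))  # True
print(is_s3_filepath("/tmp/data.xml"))         # False
```

write_file would mirror this with a write_s3_file branch using s3.open(filepath, 'wb'), as in the snippet earlier in the thread.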

@hudgeon

hudgeon commented Dec 5, 2021

:) Nice that we're on the same page. How would you like to proceed with this work and how can we assist?

@hudgeon

hudgeon commented Dec 5, 2021

@fabiocaccamo FYI, I've just taken a look at s3fs and it actually doesn't use boto3. It uses https://github.com/aio-libs/aiobotocore and https://github.com/boto/botocore.

@fabiocaccamo
Owner

@hudgeon could you submit a PR including tests?

@hudgeon

hudgeon commented Dec 7, 2021

@fabiocaccamo No worries. We should be able to get to it this weekend.

@fabiocaccamo fabiocaccamo added the help wanted Extra attention is needed label Dec 23, 2021
@fabiocaccamo
Owner

@hudgeon I implemented s3fs I/O operations in the s3 branch; could you add some tests here?
(tests just for the methods that involve the s3 storage are enough)

@hudgeon

hudgeon commented Jan 21, 2022

@fabiocaccamo We've also implemented it and have been using it from our private pip repository for several weeks now, but we didn't push it to your branch because we haven't written tests for it. :)

In order to do the tests properly, we reckon we'll need to use localstack (https://github.com/localstack/localstack), which we haven't used before. Currently our S3 tests use an S3 bucket we own. Can you think of a testing approach that wouldn't involve localstack?

@fabiocaccamo
Owner

@hudgeon I have never tested against an S3 bucket; adding localstack just for the tests would probably be overkill in this case.

Do you think we can avoid localstack by doing tests with a public bucket?

Also checking how s3fs tests s3 could be helpful:
https://github.com/fsspec/s3fs/blob/main/s3fs/tests/test_s3fs.py
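One way to avoid both localstack and a real bucket is to inject an in-memory fake in place of the S3 filesystem object. This is only a sketch of that idea (the class and function names are illustrative, not the approach benedict's test suite actually uses):

```python
# Sketch: test the S3 read path without any network by substituting an
# in-memory fake for s3fs.S3FileSystem.
class FakeS3FileSystem:
    """Minimal in-memory stand-in for s3fs.S3FileSystem (cat() only)."""

    def __init__(self, *args, **kwargs):
        # Pretend this object is stored in the bucket.
        self.store = {"s3://bucket/doc.xml": b"<root><a>1</a></root>"}

    def cat(self, path):
        return self.store[path]


def read_s3_text(path, fs_factory=FakeS3FileSystem):
    # In production fs_factory would be s3fs.S3FileSystem.
    fs = fs_factory(anon=False)
    return fs.cat(path).decode("utf-8")


print(read_s3_text("s3://bucket/doc.xml"))  # <root><a>1</a></root>
```

The trade-off is that a fake only exercises the library's own logic, not real S3 semantics (auth, consistency, errors), which is what localstack or a real bucket would add.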

@fabiocaccamo fabiocaccamo removed the help wanted Extra attention is needed label May 18, 2022
@fabiocaccamo
Owner

@tasneem-hyder @hudgeon s3 support has been added in version 0.27.0.

https://github.com/fabiocaccamo/python-benedict#io-methods
