Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3_client.copy() drops metadata for small files #246

Open
cariaso opened this issue Jul 26, 2022 · 10 comments
Open

s3_client.copy() drops metadata for small files #246

cariaso opened this issue Jul 26, 2022 · 10 comments

Comments

@cariaso
Copy link

cariaso commented Jul 26, 2022

Describe the bug

s3_client.copy() stores metadata for files large enough to trigger multipart handling, but smaller files trigger a different code path, and the metadata is lost.

Expected Behavior

metadata should be copied regardless of the file size

Current Behavior

metadata is missing

Reproduction Steps

    import boto3                                                                                                                                                                         
    import time                                                                                                                                                                          
    s3Res = boto3.resource("s3")                                                                                                                                                         
    s3Client = boto3.client("s3")                                                                                                                                                        
                                                                                                                                                                                         
    srcBucket = 'yellow7-nf-runs'                                                                                                                                                        
    srcKey = 'results/theoriginal'                                                                                                                                                       
                                                                                                                                                                                         
    destBucket = srcBucket                                                                                                                                                               
    destKey = 'results/thecopy'                                                                                                                                                          
                                                                                                                                                                                         
    metaKey = 'somekeyname'                                                                                                                                                              
    metaVal = str(time.time())                                                                                                                                                           
                                                                                                                                                                                         
    # config.multipart_threshold from 
    # https://github.com/boto/s3transfer/blob/6bae40b6306b2071c06c854608a14ba60e478d74/s3transfer/__init__.py#L117
    multipart_threshold_size = 8388608                                                                                                                                                   
                                                                                                                                                                                         
    for size in [                                                                                                                                                                        
            multipart_threshold_size + 1, # slightly bigger, works                                                                                                                       
            multipart_threshold_size,     # slightly bigger, works                                                                                                                       
            multipart_threshold_size - 1, # slightly smaller, fails                                                                                                                      
            multipart_threshold_size - 2, # slightly smaller, fails                                                                                                                      
    ]:                                                                                                                                                                                   
        print(f"populating {size} bytes into s3://{srcBucket}/{srcKey}")                                                                                                                 
        s3Res.Object(srcBucket, srcKey).put(Body=FakeFile(size))                                                                                                                         
                                                                                                                                                                                         
        s3Client.copy(                                                                                                                                                                   
            CopySource={"Bucket": srcBucket, "Key": srcKey},                                                                                                                             
            Bucket=destBucket,                                                                                                                                                           
            Key=destKey,                                                                                                                                                                 
            ExtraArgs={"Metadata": {metaKey: metaVal}},                                                                                                                                  
        )                                                                                                                                                                                
                                                                                                                                                                                         
        for i in range(5):                                                                                                                                                               
            gotval = s3Res.Object(destBucket, destKey).metadata.get(metaKey)                                                                                                             
            if gotval is None:                                                                                                                                                           
                print("metadata is still empty")                                                                                                                                         
                time.sleep(1)                                                                                                                                                            
            else:                                                                                                                                                                        
                print("metadata was found", gotval)                                                                                                                                      
                break                                                                                                                                                                    

I'm using this to emulate variable sized files, but you can replace it with a static file.

class FakeFile():                                                                                                                                                                        
    '''A file-like object that generates a fixed byte stream, so that uploads                                                                                                            
    can be verified without having a large file on disk.''' 
    # https://waymoot.org/home/python_string/
    # https://stackoverflow.com/questions/43463180/efficient-way-to-generate-content-for-a-fake-file-like-object-in-python                                                                                                                             
    def __init__(self, length = 10 * 1024 * 1024): # 10 megabytes                                                                                                                        
        self.idx = 0                                                                                                                                                                     
        self.length = length                                                                                                                                                             
    def read(self, size=None):                                                                                                                                                           
        r = b''                                                                                                                                                                          
        max = self.length - self.idx                                                                                                                                                     
        if size == None or size > max:                                                                                                                                                   
            size = max                                                                                                                                                                   
        r = b"".join([b'%x' % (c%16) for c in range(self.idx, self.idx+size)])                                                                                                           
        self.idx += size                                                                                                                                                                 
        return r                                                                                                                                                                         
    #def readLine(self, size=None): # - read one entire line from the file.                                                                                                              
    #    return self.read()                                                                                                                                                              
    #def close(): # - close the file.                                                                                                                                                    
    #    pass                                                                                                                                                                            
    def seek(self, i, dir=None):                                                                                                                                                         
        import os                                                                                                                                                                        
        if dir is None:                                                                                                                                                                  
            dir=os.SEEK_SET                                                                                                                                                              
        if dir==os.SEEK_CUR:                                                                                                                                                             
            self.idx += i                                                                                                                                                                
        elif dir==os.SEEK_SET:                                                                                                                                                           
            self.idx = i                                                                                                                                                                 
        elif dir==os.SEEK_END:                                                                                                                                                           
            self.idx = self.length - i                                                                                                                                                   
        else:                                                                                                                                                                            
            raise ValueError()                                                                                                                                                           
                                                                                                                                                                                         
    def tell(self):                                                                                                                                                                      
        return self.idx                                                                                                                                                                  

Possible Solution

the size determination is made, at

def _submit(

and results in either calling _submit_copy_request or _submit_multipart_request

        if transfer_future.meta.size < config.multipart_threshold:                                                                                                                                                            
            self._submit_copy_request(                                                                                                                                                                                        
                client, config, osutil, request_executor, transfer_future                                                                                                                                                     
            )                                                                                                                                                                                                                 
        else:                                                                                                                                                                                                                 
            self._submit_multipart_request(                                                                                                                                                                                   
                client, config, osutil, request_executor, transfer_future                                                                                                                                                     
            )                                                                                                                                                                                                                 

Additional Information/Context

No response

SDK version used

boto3 1.24.37

Environment details (OS name and version, etc.)

Ubuntu 22.04 LTS

@cariaso cariaso added the bug label Jul 26, 2022
@cariaso
Copy link
Author

cariaso commented Jul 26, 2022

adding a TransferConfig of multipart_threshold=1 will work around this, in all except the pathological 0 byte file scenario.

from boto3.s3.transfer import TransferConfig                                                                                                                   
threshold = 1                                                                                                                                      
config = TransferConfig(multipart_threshold=threshold)
s3Client.copy(                                                                                                                                                                   
            CopySource={"Bucket": srcBucket, "Key": srcKey},                                                                                                                             
            Bucket=destBucket,                                                                                                                                                           
            Key=destKey,                                                                                                                                                                 
            ExtraArgs={"Metadata": {metaKey: metaVal}},       
            Config=config,                                                                                                                           
        )                                                                                             

@Kaelten
Copy link

Kaelten commented Jul 26, 2022

adding a TransferConfig of multipart_threshold=1 will work around this, in all except the pathological 0 byte file scenario.

from boto3.s3.transfer import TransferConfig                                                                                                                   
threshold = 1                                                                                                                                      
config = TransferConfig(multipart_threshold=threshold)
s3Client.copy(                                                                                                                                                                   
            CopySource={"Bucket": srcBucket, "Key": srcKey},                                                                                                                             
            Bucket=destBucket,                                                                                                                                                           
            Key=destKey,                                                                                                                                                                 
            ExtraArgs={"Metadata": {metaKey: metaVal}},       
            Config=config,                                                                                                                           
        )                                                                                             

Thanks for the tip, I'll try that promptly, Any insight into what's going on?

@cariaso
Copy link
Author

cariaso commented Jul 26, 2022

I chased it all the way down to a function called emit and looked at the full request. It looked like amazon's servers just didn't do what the api asked them to do.

@Kaelten
Copy link

Kaelten commented Jul 26, 2022

I've been spending hours digging through libraries trying to get to a similar point, thanks much!

@Kaelten
Copy link

Kaelten commented Jul 26, 2022

I chased it all the way down to a function called emit and looked at the full request. It looked like amazon's servers just didn't do what the api asked them to do.

I think I figured it out.

Once I added that setting I received an exception related to a missing permission on a kms key needed to encrypt the object at rest.

Now that I've fixed the permission, I no longer receive the EOF issue, even without the transfer config. This makes me think it was somehow masking the underlying error.

@tim-finnigan
Copy link

@cariaso just to clarify is the metadata lost when using the default multipart_threshold value? If there is an issue with s3transfer then we would consider transferring this issue here.

@cariaso
Copy link
Author

cariaso commented Aug 3, 2022 via email

@tim-finnigan tim-finnigan self-assigned this Aug 4, 2022
@tim-finnigan
Copy link

Thanks @cariaso for following up. Have you also tried using the copy_from method to copy the metadata over as described here?

@tim-finnigan
Copy link

A similar issue was opened recently and another user confirmed that copy_from works for copying metadata: boto/boto3#3378 (comment)

I also tested the copy_object command and it copies the metadata by default. There may be slight performance implications/trade-offs between these three copy commands. But I think this issue specifically relates to s3transfer and so I'm going to move it to that repository for further review and clarification from the team.

@tim-finnigan tim-finnigan transferred this issue from boto/boto3 Aug 5, 2022
@tim-finnigan tim-finnigan removed their assignment Aug 5, 2022
@tim-finnigan
Copy link

For more background on copy, copy_from and copy_object see the discussion in boto/boto3#3051. We are currently tracking that issue in boto3 for improvements to the documentation related to the distinctions between these commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants