
Use s3 for temporary storage #177

Closed · wants to merge 1 commit from the 145-use-s3-for-temp-storage branch
Conversation


@slifty commented May 19, 2023

This PR replaces the use of local storage for temporary upload progress with S3.

This change removes the risk of local file storage filling up.

Eventually the SDK and Permanent backend will support multipart upload, at which point we will be able to remove the S3 dependencies / direct integration with S3.

Resolves #145
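
For a sense of the approach, here is a minimal sketch using the AWS SDK v3's `@aws-sdk/lib-storage` helper to stream data to S3 as it arrives. The key and the data are placeholders, not the PR's actual code, and this assumes bytes arrive in order (which, as discussed below, SFTP does not guarantee):

```ts
import { PassThrough } from 'stream';
import { S3Client } from '@aws-sdk/client-s3';
import { Upload } from '@aws-sdk/lib-storage';

// Hypothetical sketch: pipe upload data to S3 instead of the local disk.
// The env variable names match this PR; the key is a placeholder.
const client = new S3Client({
  region: process.env.TEMPORARY_FILE_S3_BUCKET_REGION,
});
const body = new PassThrough();

const upload = new Upload({
  client,
  params: {
    Bucket: process.env.TEMPORARY_FILE_S3_BUCKET,
    Key: 'temporary/example-upload-id',
    Body: body,
  },
});

// SFTP write handlers push bytes into the stream as they arrive...
body.write(Buffer.from('chunk of uploaded data'));
body.end();

// ...and the Upload helper handles the multipart mechanics internally.
await upload.done();
```

Under the hood, the lib-storage helper splits the stream into multipart parts, which is essentially the behavior this PR has to manage by hand until the Permanent API supports multipart uploads.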


@Fenn-CS left a comment


Looks good!

@cecilia-donnelly

Tagged @liam-lloyd since he has been looking at multipart upload too.

src/classes/TemporaryFile.ts (review thread resolved)
src/classes/TemporaryFile.ts (outdated review thread, resolved)
@slifty force-pushed the 145-use-s3-for-temp-storage branch 4 times, most recently from 249938b to a6ae32c on May 25, 2023 at 17:15
We were using the local file system for storing files temporarily as
they were being uploaded to the SFTP service.  We knew this was not
a long term solution for several reasons, including the security risk of
a denial of service by filling up the disk space.

We're moving to the long term solution of using S3 directly for that
upload.  We can't use the existing upload-service and permanent API for
this because, at least for now, it requires us to know the size of the
file AND does not allow us to upload over multiple parts.

For this reason we're integrating with S3 directly.

Issue #145 A user could create a DOS by uploading files that are too large
@slifty force-pushed the 145-use-s3-for-temp-storage branch from a6ae32c to 2f0b78e on May 25, 2023 at 17:19
@slifty requested review from @Fenn-CS and @liam-lloyd on May 25, 2023 at 17:20
Comment on lines +16 to +28
```ts
if (process.env.TEMPORARY_FILE_S3_BUCKET === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_KEY_ID === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_KEY_ID must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_SECRET === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_SECRET must be populated in order to upload to s3.');
}
if (process.env.TEMPORARY_FILE_S3_BUCKET_REGION === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET_REGION must be populated in order to upload to s3.');
}
```


This is exactly the same check as in the TemporaryFile class. Can you wrap this check in a function and call it in both places?

Not a hill I would die on though.
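
Something like this hypothetical helper would do it (the function name and import path are invented for illustration, not the PR's actual code):

```ts
// Hypothetical shared helper; names and module path are illustrative.
import { SystemConfigurationError } from './errors'; // path assumed

const REQUIRED_S3_VARIABLES = [
  'TEMPORARY_FILE_S3_BUCKET',
  'AWS_ACCESS_KEY_ID',
  'AWS_ACCESS_SECRET',
  'TEMPORARY_FILE_S3_BUCKET_REGION',
] as const;

// Throw if any S3-related environment variable is missing; callable from
// both places that currently duplicate the inline checks.
export const assertS3Configuration = (): void => {
  REQUIRED_S3_VARIABLES.forEach((name) => {
    if (process.env[name] === undefined) {
      throw new SystemConfigurationError(
        `${name} must be populated in order to upload to s3.`,
      );
    }
  });
};
```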

@slifty commented Jun 1, 2023

I was about to write this in the issue but realized the PR is probably the best place for it at this point -- in testing this out locally I realized that the "local buffer" implementation doesn't quite work with the way SFTP sends data (and, in fact, it wouldn't have worked even without the local buffer, and possibly not even in the original implementation!).

The SFTP protocol's write request includes an offset:

    `offset' is the offset (in bytes) from the beginning of the file where
    to start writing, and `data' is the data to be written.

    The write will extend the file if writing beyond the end of the file.
    It is legal to write way beyond the end of the file; the semantics are
    to write zeroes from the end of the file to the specified offset and
    then the data. On most operating systems, such writes do not allocate
    disk space but instead leave "holes" in the file.

Which is to say, writes are done in parallel AND are not guaranteed to be sequential. I'm currently working to figure out a good way to handle this.
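
To make the offset semantics concrete: with a plain local file this is easy to honor, because Node's positional writes handle out-of-order data and leave sparse holes exactly as the spec describes. A sketch (the handler name is hypothetical, not this PR's code):

```ts
import fs from 'fs';

// Sketch only: honoring the SFTP offset against a local temporary file.
// fs.write takes an explicit position, so out-of-order writes just leave
// sparse "holes", exactly as the spec describes. An S3 object store gives
// us no equivalent primitive, which is the crux of the problem here.
function handleSftpWrite(
  fd: number,
  offset: number,
  data: Buffer,
  done: (err: NodeJS.ErrnoException | null) => void,
): void {
  // Write `data` starting at byte `offset`, regardless of arrival order.
  fs.write(fd, data, 0, data.length, offset, done);
}
```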

@slifty commented Jun 1, 2023

It looks like the SFTP protocol (v3) does not provide a way to signal to clients that writes must be sequential.

I think this means that, well, we need to support non-sequential writes and know how to honor the offset.

The trouble here is we now have some conflicting constraints:

  1. We want a finite cap on the amount of local data that can be stored per connection.
  2. S3 multipart uploads require every part except the last to be at least 5MB.

If data is being written sequentially we have no issue: just append to a single local temporary file and when it gets large enough, upload it.

Since data is NOT being written sequentially, we have no way of ensuring that we will get a 5MB chunk.

For example, one could imagine an SFTP client that breaks a 100MB file into twenty separate 5MB segments and sends parts of each segment round-robin. That would functionally result in storing all 100MB in local temporary storage before writing anything to S3!

It does appear that rclone writes sequentially, by the way; it's just that we need to appreciate that this is not guaranteed.

I'm going to reflect on alternate approaches given these revelations. One approach could be to use S3 for even the "temporary" storage, as a series of individual small files that are indexed by their offsets and ultimately get coalesced. I am concerned that this would lead to massive overhead in creating so many tiny S3 objects (and I am also unsure whether S3 charges per object, which I could see making this tactic very expensive).

Another approach could be to simply implement our current plan, knowing that clients generally write sequentially. We could also add limits to the allowed local storage (though those limits might be higher than 5MB). This could represent a "sweet spot" that prevents the DoS risk, but it could also lead to confusing upload failures for users whose SFTP client has an obnoxious write-fragmentation algorithm.
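
To sketch what that second approach might look like (all names hypothetical; `uploadId` would come from a prior CreateMultipartUploadCommand call):

```ts
import { S3Client, UploadPartCommand } from '@aws-sdk/client-s3';

const PART_SIZE = 5 * 1024 * 1024; // S3 minimum for every part but the last
const MAX_BUFFERED_BYTES = 16 * 1024 * 1024; // hypothetical per-connection cap

// Hypothetical sketch of the "sweet spot" approach described above: accept
// writes at the next sequential offset, flush 5MB parts to S3, and fail
// loudly rather than buffer unboundedly when a client fragments writes.
class SequentialPartUploader {
  private buffered = Buffer.alloc(0);
  private nextOffset = 0;
  private partNumber = 1;

  constructor(
    private readonly client: S3Client,
    private readonly bucket: string,
    private readonly key: string,
    private readonly uploadId: string, // from CreateMultipartUploadCommand
  ) {}

  async write(offset: number, data: Buffer): Promise<void> {
    if (offset !== this.nextOffset) {
      // A fuller version might park out-of-order chunks (counted against
      // MAX_BUFFERED_BYTES); this sketch simply refuses them.
      throw new Error(`Non-sequential write at ${offset}; expected ${this.nextOffset}`);
    }
    this.buffered = Buffer.concat([this.buffered, data]);
    this.nextOffset += data.length;
    if (this.buffered.length > MAX_BUFFERED_BYTES) {
      // Guards the local buffer against a single enormous write.
      throw new Error('Per-connection buffer cap exceeded');
    }
    while (this.buffered.length >= PART_SIZE) {
      await this.flushPart(this.buffered.subarray(0, PART_SIZE));
      this.buffered = this.buffered.subarray(PART_SIZE);
    }
  }

  private async flushPart(body: Buffer): Promise<void> {
    await this.client.send(new UploadPartCommand({
      Bucket: this.bucket,
      Key: this.key,
      UploadId: this.uploadId,
      PartNumber: this.partNumber,
      Body: body,
    }));
    this.partNumber += 1;
  }
}
```

A fuller version would track the returned ETags and finish with CompleteMultipartUploadCommand once the final (possibly sub-5MB) part is flushed.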

@slifty marked this pull request as draft on June 1, 2023 at 20:54
@slifty commented Jun 1, 2023

Marking this as draft for now, since there have been enough odd curveballs that it warrants stepping back and making sure this is still the approach we want to take.

I also just spoke with @kfogel and I'm going to at least update the current (full tmp storage) implementation to properly account for offset.

@cecilia-donnelly removed their request for review on June 5, 2023 at 21:28
@slifty closed this on Aug 21, 2023