From 6e7ec0cbd5577ab7effe647db652c2e60dce7b68 Mon Sep 17 00:00:00 2001
From: Martin Thoma
Date: Tue, 20 Dec 2022 23:39:59 +0100
Subject: [PATCH] DOC: How to read PDFs from S3

---
 docs/user/streaming-data.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/docs/user/streaming-data.md b/docs/user/streaming-data.md
index 3cfa5c315..78f960039 100644
--- a/docs/user/streaming-data.md
+++ b/docs/user/streaming-data.md
@@ -53,3 +53,24 @@ with BytesIO() as bytes_stream:
         Body=bytes_stream, RequestRoute=request_route, RequestToken=request_token
     )
 ```
+
+## Reading PDFs directly from cloud services
+
+One option is to first download the file and then pass the local file path to `PdfReader`.
+Another option is to read the object into an in-memory byte stream and pass that stream to `PdfReader`.
+
+For AWS S3 it works like this:
+
+```python
+from io import BytesIO
+
+import boto3
+from PyPDF2 import PdfReader
+
+
+s3 = boto3.client("s3")
+obj = s3.get_object(Bucket="my-bucket", Key="my/doc.pdf")
+reader = PdfReader(BytesIO(obj["Body"].read()))  # reads the whole object into memory
+```
+
+It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769)).
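
The patch shows code only for the byte-stream option. The first option it mentions (download the file, then pass the local path to `PdfReader`) could look roughly like the sketch below. The helper name `download_pdf_to_temp` is hypothetical and not part of the patch; it assumes `s3_client` is a boto3 S3 client, whose `download_file(Bucket, Key, Filename)` method writes the object to a local path.

```python
import os
import tempfile


def download_pdf_to_temp(s3_client, bucket: str, key: str) -> str:
    """Download an S3 object to a temporary .pdf file and return its path.

    Hypothetical helper for illustration; `s3_client` is assumed to be a
    boto3 S3 client (anything exposing a compatible `download_file` works).
    """
    # mkstemp creates the file and returns an open OS-level descriptor plus the path.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # close our descriptor; the client opens the path itself
    s3_client.download_file(bucket, key, path)
    return path
```

With a real client this would be used as `PdfReader(download_pdf_to_temp(boto3.client("s3"), "my-bucket", "my/doc.pdf"))`. The caller is responsible for deleting the temporary file afterwards. Compared with the byte-stream variant, this avoids holding the whole PDF in memory at the cost of a disk write.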