Copy data as fast as possible with a job in the cloud (BigQuery and Cloud SQL) #803
-
Hi, I'm trying to copy data between two data sources; if you're curious, between BigQuery and Cloud SQL, but that's not really relevant here, and the question should be transposable to other types of database. I can dump data from my first data source (BigQuery) into files, either json or csv. I can then read from these files (either in batch or streaming), and I want to write this data into a postgres database (Cloud SQL) as quickly and efficiently as possible. I'm using a job (a Cloud Run job) to read the data from the files. It looks like this:

with open("data/super_data.csv", "r") as f:
    with global_pool.connection() as conn:
        with conn.cursor() as cur:
            with cur.copy("COPY my_table FROM STDIN") as copy:
                while data := f.read(1024):
                    copy.write(data)

I have a few questions regarding this approach:
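(`global_pool` isn't defined in the snippet; presumably it's a psycopg_pool.ConnectionPool created once at job startup, roughly along these lines; the connection string and pool sizes here are placeholders, not from the original post:)

import os

from psycopg_pool import ConnectionPool

# Hypothetical setup: the DSN comes from the environment; in a Cloud Run
# job it would typically point at the Cloud SQL instance (for example
# through a unix socket mount or the Cloud SQL connector).
global_pool = ConnectionPool(os.environ["DATABASE_URL"], min_size=1, max_size=4)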
-
That's pretty much a perfect answer, thank you very much. The psql equivalent is below, so I imagine that postgres knows how to parse json. Note that postgres requires newline-delimited json:
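(Presumably that command was something along the lines of `\copy my_table from 'data/super_data.json'`, against a table with a single json or jsonb column. A psycopg sketch of the same idea; the table, column and file names are illustrative:)

with global_pool.connection() as conn:
    with conn.cursor() as cur:
        # One jsonb column: each newline-delimited json document becomes
        # one row, and Postgres itself parses it on input. (Caveat: COPY's
        # text format interprets backslashes, so json containing escape
        # sequences may need extra care.)
        with cur.copy("COPY my_table (data) FROM STDIN") as copy:
            with open("data/super_data.json") as f:
                for line in f:
                    copy.write(line)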
I imagine this is still valid:

with cur.copy("COPY my_table FROM STDIN") as copy:
    while data := f.read(1024):
        copy.write(data)

But I need to figure out how to read the json file (it's not as easy as just reading it in one go). Just for your curiosity, it's not that I prefer json over csv; it's that people very often use nested fields in BigQuery, and those can't easily be exported to csv.
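(One possibility, assuming the file is already newline-delimited json headed for a single jsonb column as above: keep the exact same block-by-block loop, since COPY only needs the raw bytes. A sketch reusing the names from the earlier snippets:)

# Binary mode avoids a decode/encode round trip; a larger block size
# means fewer calls per file.
with open("data/super_data.json", "rb") as f:
    with global_pool.connection() as conn:
        with conn.cursor() as cur:
            with cur.copy("COPY my_table (data) FROM STDIN") as copy:
                while data := f.read(65536):
                    copy.write(data)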
-
Using write() instead of write_row() is definitely a better approach: you read block by block and write block by block, with no Python parsing.
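(For contrast, the row-by-row variant this advises against would look roughly like this, assuming the single-jsonb-column table from the sketches above: each line is parsed in Python and re-serialized by the driver.)

import json

from psycopg.types.json import Jsonb

with global_pool.connection() as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY my_table (data) FROM STDIN") as copy:
            with open("data/super_data.json") as f:
                for line in f:
                    # Parsed and re-dumped in Python: correct, but slower
                    # than streaming the raw bytes straight through.
                    copy.write_row((Jsonb(json.loads(line)),))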