
advice: best bulk upsert method that still allows to track # of affected rows? #755

Open
ale-dd opened this issue May 6, 2021 · 8 comments


@ale-dd

ale-dd commented May 6, 2021

I've been relying on the new implementation of executemany() to perform bulk upserts, but it has the shortcoming that it does not let me determine the number of affected rows by parsing the statusmsg.

The number of rows effectively upserted can easily be less than the number of rows I attempt to upsert, because I qualify my ON CONFLICT clause with a further WHERE clause specifying that the update should only happen when the new and excluded tuples are distinct.

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
) VALUES ($1, $2)
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;

(regular PostgreSQL would allow a much terser syntax, but this is the only syntax accepted by CockroachDB)

Suppose that, at times, knowing the exact number of effectively upserted rows is more crucial than bulk performance, yet I would prefer not to go to the extreme of upserting one row at a time. What would be the best compromise?

Should I rely on a temporary table and then upsert into the physical table from that temporary table?

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
) SELECT
    id,
    other_column
FROM "__temp_table_name"
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;
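With the temp-table route, the merge collapses into a single INSERT ... SELECT, so asyncpg's `Connection.execute()` returns one status string (e.g. `INSERT 0 42`) whose trailing token is the affected-row count. A minimal parsing helper, as a sketch (the function name and error handling are my own):

```python
def affected_rows(status: str) -> int:
    """Extract the row count from a PostgreSQL command status tag.

    The server reports e.g. 'INSERT 0 42', 'UPDATE 7', or 'DELETE 3';
    the affected-row count is always the last token.
    """
    parts = status.split()
    if parts and parts[-1].isdigit():
        return int(parts[-1])
    raise ValueError(f"no row count in status tag: {status!r}")
```

One would then call something like `affected_rows(await conn.execute(upsert_sql))` after loading the temporary table.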

Should I instead use a transaction with several individual upserts of values once again provided by the client?

Are there other approaches I should explore?

@elprans
Member

elprans commented May 6, 2021

Should I rely on a temporary table and then upserting into the physical tables from that temporary table?

This would be the best alternative approach.

@ale-dd
Author

ale-dd commented May 7, 2021

@fantix does issuing two statements back to back at the same psql prompt somewhat emulate what happens on the wire with the new executemany()?

test_db=> INSERT INTO "table_name" AS __destination_row ( id, other_column ) SELECT id, other_column FROM "table_name" WHERE (id BETWEEN 1 AND 200) ON CONFLICT (id) DO UPDATE SET id = excluded.id, other_column = excluded.other_column; INSERT INTO "table_name" AS __destination_row ( id, other_column ) SELECT id, other_column FROM "table_name" WHERE (id BETWEEN 201 AND 500) ON CONFLICT (id) DO UPDATE SET id = excluded.id, other_column = excluded.other_column;
INSERT 0 200
INSERT 0 300
test_db=>

if so, do you think it would be possible to expose these status messages?

@pauldraper
Contributor

A simpler method than a temp table is an array parameter.

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
)
SELECT * FROM unnest($1::int[], $2::text[]) AS t(id, other_column)
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;

IDK how it compares to a temp table, but I use this approach frequently and find the performance to be quite good.
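Since asyncpg maps Python lists to PostgreSQL arrays, the row-oriented data only needs transposing into one list per column before a single `execute()` call against the query above. A sketch (the sample `rows` data and the `UPSERT_SQL` name are illustrative):

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Transpose row tuples into one list per column for unnest($1, $2).
ids, other_columns = (list(col) for col in zip(*rows))

# With an open asyncpg connection one would then run, e.g.:
#     status = await conn.execute(UPSERT_SQL, ids, other_columns)
# and 'status' would end with the number of rows actually upserted.
```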

@pauldraper
Contributor

pauldraper commented May 13, 2021

does issuing two statements back to back at the same psql prompt somewhat emulate what happens on the wire with the new executemany()?

I think psql parses the SQL commands (at least, tokenizes them) and sends them separately.

Regardless, the simple query protocol (which asyncpg is using to send multiple statements without parsing them) does expose multiple result sets. https://www.postgresql.org/docs/13/protocol-flow.html#id-1.10.5.7.4

@elprans
Member

elprans commented May 13, 2021

Exposing the results of CommandComplete in executemany() context seems rather complicated.

Why not use another temp table to store the number of affected rows?

@ale-dd
Author

ale-dd commented May 17, 2021

@elprans: what exactly do you mean by "store the number of affected rows"? How can that be achieved?

@elprans
Member

elprans commented May 17, 2021

Something like:

CREATE TEMP TABLE merge_ops(rowcount int);

Then,

await conn.executemany(
    '''
    WITH
        ins AS (<your-insert-statement> RETURNING *)
    INSERT INTO merge_ops(rowcount) (SELECT count(*) FROM ins)
    ''',
    args,  # the per-row argument tuples
)

Then,

number_inserted = await conn.fetchval('SELECT sum(rowcount) FROM merge_ops')

@matthewhegarty

Just wanted to link the other related thread. You mention that you could allow executemany to return the statement results; is this on the roadmap?
