
advice: best bulk upsert method that still allows to track # of affected rows? #755

Open
ale-dd opened this issue May 6, 2021 · 8 comments


@ale-dd

ale-dd commented May 6, 2021

I've been relying on the new implementation of executemany() to perform bulk upserts, but it has the shortcoming that it does not let me determine the number of affected rows by parsing the statusmsg.

The number of rows effectively upserted can easily be less than the number of rows I attempt to upsert, because I qualify my ON CONFLICT clause with a further WHERE clause specifying that the update should only happen when the new and excluded tuples are distinct.

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
) VALUES ($1, $2)
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;

(regular PostgreSQL would allow a much terser syntax, but this is the only syntax accepted by CockroachDB)

Suppose that, at times, knowing the exact number of effectively upserted rows is more crucial than bulk performance, yet I would prefer not to go to the extreme of upserting one row at a time. What would be the best compromise?

Should I rely on a temporary table and then upsert into the physical table from that temporary table?

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
) SELECT
    id,
    other_column
FROM "__temp_table_name"
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;
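With the temp-table route, the merge collapses into a single INSERT ... SELECT, so asyncpg's `Connection.execute()` returns one status string (e.g. `INSERT 0 42`) whose trailing token is the affected-row count. A minimal parsing helper, as a sketch (the function name and error handling are my own):

```python
def affected_rows(status: str) -> int:
    """Extract the row count from a PostgreSQL command status tag.

    The server reports e.g. 'INSERT 0 42', 'UPDATE 7', or 'DELETE 3';
    the affected-row count is always the last token.
    """
    parts = status.split()
    if parts and parts[-1].isdigit():
        return int(parts[-1])
    raise ValueError(f"no row count in status tag: {status!r}")
```

One would then call something like `affected_rows(await conn.execute(upsert_sql))` after loading the temporary table.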

Should I instead use a transaction with several individual upserts of values once again provided by the client?

Are there other approaches I should explore?

@elprans
Member

elprans commented May 6, 2021

Should I rely on a temporary table and then upserting into the physical tables from that temporary table?

This would be the best alternative approach.

@ale-dd
Author

ale-dd commented May 7, 2021

@fantix does issuing two statements back to back at the same psql prompt somewhat emulate what happens on the wire with the new executemany()?

test_db=> INSERT INTO "table_name" AS __destination_row ( id, other_column ) SELECT id, other_column FROM "table_name" WHERE (id BETWEEN 1 AND 200) ON CONFLICT (id) DO UPDATE SET id = excluded.id, other_column = excluded.other_column; INSERT INTO "table_name" AS __destination_row ( id, other_column ) SELECT id, other_column FROM "table_name" WHERE (id BETWEEN 201 AND 500) ON CONFLICT (id) DO UPDATE SET id = excluded.id, other_column = excluded.other_column;
INSERT 0 200
INSERT 0 300
test_db=>

if so, do you think it would be possible to expose these status messages?

@pauldraper
Contributor

A simpler method than a temp table is an array parameter.

INSERT INTO "table_name" AS __destination_row (
    id,
    other_column
)
SELECT * FROM unnest($1::int[], $2::text[]) AS t(id, other_column)
ON CONFLICT (id)
DO UPDATE SET
    id = excluded.id,
    other_column = excluded.other_column
WHERE
    (__destination_row.id IS DISTINCT FROM excluded.id)
 OR
    (__destination_row.other_column IS DISTINCT FROM excluded.other_column)
;

IDK how it compares to a temp table, but I use this approach frequently and find the performance to be quite good.
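Since asyncpg maps Python lists to PostgreSQL arrays, the row-oriented data only needs transposing into one list per column before a single `execute()` call against the query above. A sketch (the sample `rows` data and the `UPSERT_SQL` name are illustrative):

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Transpose row tuples into one list per column for unnest($1, $2).
ids, other_columns = (list(col) for col in zip(*rows))

# With an open asyncpg connection one would then run, e.g.:
#     status = await conn.execute(UPSERT_SQL, ids, other_columns)
# and 'status' would end with the number of rows actually upserted.
```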

@pauldraper
Contributor

pauldraper commented May 13, 2021

does issuing two statements back to back at the same psql prompt somewhat emulate what happens on the wire with the new executemany()?

I think psql parses the SQL commands (at least, tokenizes them) and sends them separately.

Regardless, the simple query protocol (which asyncpg is using to send multiple statements without parsing them) does expose multiple result sets. https://www.postgresql.org/docs/13/protocol-flow.html#id-1.10.5.7.4

@elprans
Member

elprans commented May 13, 2021

Exposing the results of CommandComplete in executemany() context seems rather complicated.

Why not use another temp table to store the number of affected rows?

@ale-dd
Author

ale-dd commented May 17, 2021

@elprans: what exactly do you mean by "store the number of affected rows"? How can that be achieved?

@elprans
Member

elprans commented May 17, 2021

Something like:

CREATE TEMP TABLE merge_ops(rowcount int);

Then,

await conn.executemany(
    '''
    WITH
        ins AS (<your-insert-statement> RETURNING *)
    INSERT INTO merge_ops(rowcount) (SELECT count(*) FROM ins)
    ''',
    args,  # the per-row argument tuples
)

Then,

number_inserted = await conn.fetchval('SELECT sum(rowcount) FROM merge_ops')

@matthewhegarty

Just wanted to link the other related thread. You mention that you could allow executemany to return the statement results; is this on the roadmap?
