Implement cursor.setinputsizes()
#163
Comments
I agree this method has a terrible name, such that almost nobody takes advantage of it. For the moment our psycopg3 DBAPI should work fine with our new approach of running the casts in the SQL ourselves, and it's also anticipated this will be easier to integrate into the "fast executemany" feature, which I am planning to implement outside of psycopg2 as a generic feature for any SQLAlchemy dialect, but most particularly the PostgreSQL ones, where "insert many values + returning" is sorely needed (users are complaining all the time about asyncpg being "slow" compared to psycopg2).
We are planning to implement a fast executemany in #145, based on the pipeline/batch mode, where "fast" means optimised at the network level, not only at the SQL level (we would send a bind-exec-bind-exec-...-sync sequence, instead of the bind-exec-sync-bind-exec-sync... sequence we currently do). That's what asyncpg currently does, I'm told, and it is the obvious thing to do if you write a network-level protocol, but it was unavailable before Postgres 14 for drivers wrapping the libpq.

Currently, executemany() discards the return values, but actually we could make them available using the nextset() mechanism. Mmm, sounds kinda obvious, I should have thought about it before. However, if a user doesn't know about it, they may accidentally use a lot of memory if they execute a long executemany with a RETURNING clause. An "efficient executemany-returning" might be built on the more complex pipeline mode (#116).
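The difference between the two message sequencings described above can be illustrated with a toy trace (this is only a schematic of the libpq extended-protocol message flow, not real driver code; `trace` is an invented helper for illustration):

```python
# Toy trace of extended-protocol messages for an executemany of n statements.
# Not real driver code: it just contrasts the two sequencings described above.

def trace(n, pipeline):
    if pipeline:
        # bind-exec-bind-exec-...-sync: one synchronisation point at the end,
        # so the client doesn't wait for the server between statements.
        return ["Bind", "Execute"] * n + ["Sync"]
    # bind-exec-sync-bind-exec-sync...: a round-trip barrier after each one.
    return ["Bind", "Execute", "Sync"] * n

print(trace(2, pipeline=True))
print(trace(2, pipeline=False))
```

The number of `Sync` barriers (each implying a wait for the server) is what makes the pipelined form "fast" at the network level.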
I'm not familiar with what asyncpg does, but their "insertmany" performance without RETURNING is basically the same as psycopg2 with fast execution; since they don't have RETURNING (or they do and I haven't figured it out), we can't use executemany() when we need the PKs, so we get terrible performance in that case. See sqlalchemy/sqlalchemy#7352. For us, we need the INSERT + many VALUES + RETURNING, which you did implement for the psycopg2 helpers, and we use those; but I think for us it will be easier to manage if we just implement the batching on our end, so that we can log it transparently and have it behave the same on all the different DBAPIs. I guess nextset() is another approach, but yes, that seems to get into yet another area of weird APIs; from my end I'm mostly looking to have simpler code for all these different drivers.
You are herding cats with your project, man 😄 You might have seen that the fast executemany helpers in psycopg2 are pretty trivial functions composing a query on the client side and sending it in one go (or in fewer batches) to the server. Even if psycopg 3 normally does server-side binding, you can still bind easily client-side. The trivial way to transform the query client side loses the type information; to be more useful, the types should be retained and converted into casts, and the transformation machinery comes to help there. I have a brain dump for you, but that's better delivered in #101, so I'll add it later there. Two details that you might not necessarily know about postgres, but that can come in handy when you design your feature:
```python
>>> import psycopg
>>> cnn = psycopg.connect()  # connection setup was elided in the original
>>> cnn.execute("create table test (id serial primary key, data text)")
>>> cur = cnn.execute("""
... insert into test (data) values ('hello') returning id;
... insert into test (data) values ('world') returning id
... """)
>>> cur.fetchall()
[(1,)]
>>> cur.nextset()
True
>>> cur.fetchall()
[(2,)]
>>> cur.nextset()  # returns None: no more result sets
```

Here's your ids-from-executemany. psycopg2, not very smartly, discards all the previous results and only returns the last one. As a consequence we could only add
Yeah, this is why I just want to vendor that and generalize it. I think the same approach is valid for MySQL / SQL Server and Oracle too. We already had to change the SQL compiler to support generating that "fragment" of the INSERT statement.
OK will have a look, I think we already have our own way of doing the casts and stuff but I'll check out what this has.
that's fine, I probably don't even want to use that here. We have "batch executemany" using the psycopg2 fast helpers but I don't think the "stmt; stmt; stmt" thing is that compelling and I want to take it out for SQLAlchemy 2.0. mostly the INSERT is where we need the help.
Yeah, I think I knew that nextset() did this. Here is a concern I have though: psycopg3 is using prepared statements. Is it caching those prepared statements, and if so, is there a way to control that? I would not want it caching enormous INSERT statements that have an arbitrary number of VALUES clauses and won't be used again.
A statement, by default, is prepared after it has been executed 5 times. The threshold can be tweaked or disabled by setting `prepare_threshold`; if it is set to `None`, preparation is disabled altogether. However, if you know for sure that your query is a one-off, you can pass `prepare=False` to execute(). So I guess you can be proactive in asking not to prepare the query, but even if you are not, it won't cause unbounded client memory or server resource usage. Details at https://www.psycopg.org/psycopg3/docs/advanced/prepare.html
That's certainly interesting; yes, I will likely send prepare=False for this use case. Doing the "prepared after 5 times it is executed" thing means you still have to put every statement in a cache somewhere, which I would assume is also LRU. Which means if I execute a certain statement 5 times, but only after the driver has forgotten about that statement, it still would not prepare it, is that right? Doesn't matter on this end, just wondering how you went about implementing that :)
Yes, if you execute it 4 times, then execute enough other queries, eventually these 4 will be forgotten and the query won't be prepared until it is seen another 5 times.

At the moment it is implemented as an ordered dict, whose keys are (statement, arg types) and whose values are either the number of times the query has been seen, or the name of the prepared statement, if it passed the prepare threshold. Whenever a query is looked up, it's brought to the top of the dict.

Whenever we query, we also check whether the cache has grown to more than prepare_max. If so, we pop the item from the bottom of the dict. If the value is a number, it means it's an old query, seen one or more times, that we never prepared, and we just forget about it. If the value is a name, then the query had been prepared, so we also execute a DEALLOCATE to release it from the server. That means we keep track of 100 queries at most, but only a subset of them are actually prepared.

Because of some work we are doing to accommodate the pipeline mode, we might split the cache in two LRU dicts: one for tallying, one with the prepared names, in which case the number of prepared statements might hit 100 more easily, but never more than that.
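The tallying-dict mechanism described above can be sketched roughly as follows. This is a simplified illustration, not psycopg's actual code: the class name, the statement-name scheme, and the `deallocated` list (standing in for the real `DEALLOCATE` command) are all invented for the example.

```python
from collections import OrderedDict
from itertools import count

PREPARE_THRESHOLD = 5   # prepare a query once it has been seen this many times
PREPARE_MAX = 100       # track at most this many distinct queries

class PrepareCache:
    """Toy sketch of the ordered-dict prepare cache described above."""

    def __init__(self):
        self._cache = OrderedDict()  # key -> int tally OR str prepared name
        self._names = count(1)
        self.deallocated = []        # names the real code would DEALLOCATE

    def maybe_prepare(self, key):
        # pop + re-insert moves the key to the top (most recently used)
        value = self._cache.pop(key, 0)
        if not isinstance(value, str):
            value += 1
            if value >= PREPARE_THRESHOLD:
                value = f"_stmt_{next(self._names)}"  # now prepared
        self._cache[key] = value
        # evict from the bottom (least recently used) past the max size
        while len(self._cache) > PREPARE_MAX:
            _, old = self._cache.popitem(last=False)
            if isinstance(old, str):
                # it had been prepared: release it server-side too
                self.deallocated.append(old)
        return value if isinstance(value, str) else None
```

This reproduces the behaviour discussed: a query seen 4 times and then forgotten starts tallying from scratch, and evicting a prepared entry also triggers the server-side deallocation.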
Stemming from a conversation about implementing the SQLAlchemy dialect for Psycopg 3: See #157.
I originally overlooked this DBAPI method as I was confused by its name and thought it was only about selecting memory buffer size. Here's the spec:

However, it might be useful to specify types too, in the same way `Copy.set_types()` does. The way it would be used would also be similar: if the types are preset, then they are converted into an array of dumpers and the arguments' types/values are not looked at at all. If some value is not compatible with its dumper, eventually some `dump()` will explode in the user's face.

The DBAPI is, as usual, based on wide simplification. If it's a type object in the sense of e.g. `NUMBER`, that helps in no way to decide what number to pass to the server (it may be any of int2, int4, int8, float4, float8, numeric). Specifying a number for the size also gets in the way: if they were numbers, it would be nice if they were oids.

So, all in all, maybe it would be better to implement a `cursor.set_types()` function instead, behaving as the `Copy` equivalent, and, if anything, implement `setinputsizes()` on top of it (but I wouldn't know what to do if a `psycopg.NUMBER` is passed...). Passing names or oids is the moral equivalent of passing type objects, except that we have a customizable mapping in the cursor context.

Involving @zzzeek as potential best beneficiary of the feature.
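A rough sketch of what a `set_types()`-style cursor API could look like, per the idea above. Everything here is an illustrative assumption: the dumper classes, the name-to-dumper mapping, and `SketchCursor` are invented stand-ins for psycopg's real adaptation machinery, not its API.

```python
# Hypothetical sketch: preset types are turned into an array of dumpers,
# and argument values are then fed to those dumpers without inspection.

class IntDumper:
    oid = 23  # int4
    def dump(self, obj):
        return str(int(obj)).encode()

class TextDumper:
    oid = 25  # text
    def dump(self, obj):
        return str(obj).encode()

DUMPERS_BY_NAME = {"int4": IntDumper, "text": TextDumper}  # invented mapping

class SketchCursor:
    def __init__(self):
        self._dumpers = None

    def set_types(self, type_names):
        # Preset the dumpers once; later calls won't inspect the values.
        self._dumpers = [DUMPERS_BY_NAME[n]() for n in type_names]

    def dump_params(self, params):
        if self._dumpers is None:
            raise NotImplementedError("per-value dumper lookup not sketched")
        # If a value is not compatible with its dumper, dump() "explodes" here.
        return [d.dump(p) for d, p in zip(self._dumpers, params)]
```

Usage: after `cur.set_types(["int4", "text"])`, dumping `[42, "hello"]` succeeds, while dumping a non-numeric first value raises from `int()`, which is the "explode on the user's face" behaviour the comment describes.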