Implement cursor.setinputsizes() #163

dvarrazzo opened this issue Nov 23, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@dvarrazzo
Member

Stemming from a conversation about implementing the SQLAlchemy dialect for Psycopg 3: See #157.

I originally overlooked this DBAPI method as I was confused by its name and thought it was only about selecting memory buffer size. Here's the spec:

.setinputsizes(sizes)

This can be used before a call to .execute*() to predefine memory areas for the operation's parameters.

sizes is specified as a sequence — one item for each input parameter. The item should be a Type Object that corresponds to the input that will be used, or it should be an integer specifying the maximum length of a string parameter. If the item is None, then no predefined memory area will be reserved for that column (this is useful to avoid predefined areas for large inputs).

This method would be used before the .execute*() method is invoked.

Implementations are free to have this method do nothing and users are free to not use it.
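
To make the calling convention concrete, here is a minimal, driver-agnostic sketch of how DBAPI code uses it; psycopg does not implement setinputsizes(), and the connection, table and type object below are purely illustrative.

```python
# Sketch of the DBAPI calling convention only: psycopg does not implement
# setinputsizes(), and `conn`, the table name and the NUMBER type object
# are illustrative, not real psycopg API.
cur = conn.cursor()

# One item per input parameter: a type object, a maximum string length,
# or None to reserve no predefined area for that parameter.
cur.setinputsizes([NUMBER, 50, None])
cur.execute(
    "insert into items (qty, name, note) values (%s, %s, %s)",
    (10, "widget", "large value: no area reserved"),
)
```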

However, it might be useful to specify types too, in the same way Copy.set_types() does.

The way it would be used would also be similar: if the types are preset, they are converted into an array of dumpers and the arguments' types/values are not looked at at all. If some value is not compatible with its dumper, eventually some dump() will blow up in the user's face.

The DBAPI is, as usual, based on a broad simplification. If the item is a type object in the sense of e.g. NUMBER, that helps in no way to decide what number type to send to the server (it may be any of int2, int4, int8, float4, float8, numeric). Specifying a number for the size also gets in the way: if numbers were accepted, it would be nicer for them to be OIDs.

So, all in all, maybe it would be better to implement a cursor.set_types() function instead, behaving like the Copy equivalent, and, if anything, implement setinputsizes() on top of it (but I wouldn't know what to do if a psycopg.NUMBER is passed...). Passing names or OIDs is the moral equivalent of passing type objects, except that we have a customizable mapping in the cursor context.
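
To make the proposal concrete, here is a rough sketch of how such a hypothetical cursor.set_types() could look, next to the existing Copy.set_types() it would mirror (the cursor-side method does not exist; table and types are illustrative):

```python
# Existing API: Copy.set_types() accepts type names (or OIDs) and fixes the
# dumpers up front, regardless of the Python values passed later.
with cur.copy("copy test (id, data) from stdin") as copy:
    copy.set_types(["int8", "text"])
    copy.write_row((42, "hello"))

# Hypothetical cursor equivalent, as proposed above (not implemented):
# the preset types are turned into an array of dumpers and the argument
# values are not inspected; an incompatible value would fail in dump().
cur.set_types(["int8", "text"])
cur.execute("insert into test (id, data) values (%s, %s)", (42, "hello"))
```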

Involving @zzzeek as potentially the main beneficiary of the feature.

@dvarrazzo dvarrazzo added the enhancement New feature or request label Nov 23, 2021
@dvarrazzo dvarrazzo added this to the 3.1 milestone Nov 23, 2021
@zzzeek

zzzeek commented Nov 23, 2021

I agree this method has a terrible name such that almost nobody takes advantage of it.

For the moment our psycopg3 DBAPI should work fine with our new approach of running the casts in the SQL ourselves. It's also anticipated this will be easier to integrate into the "fast executemany" feature, which I am planning to implement outside of psycopg2 as a generic feature for any SQLAlchemy dialect, but most particularly the PostgreSQL ones, where "insert many values + returning" is sorely needed (users are complaining all the time that asyncpg is "slow" compared to psycopg2).

@dvarrazzo
Member Author

We are planning to implement a fast executemany in #145 based on the pipeline/batch mode, where "fast" means optimised at the network level, not only at the SQL level (we would send a bind-exec-bind-exec-...-sync sequence, instead of bind-exec-sync-bind-exec-sync-... as we currently do). That's what asyncpg currently does, I'm told, and it is the obvious thing to do if you write the network-level protocol, but it was unavailable before Postgres 14 for drivers wrapping the libpq.

Currently, executemany() discards the return values, but actually we could make them available using the nextset() mechanism. Mmm, sounds kinda obvious, I should have thought about it before. However, if a user doesn't know about it, they may accidentally use a lot of memory if they run a long executemany() with a SELECT function_with_side_effect(), so it might be useful to have a way to disable it.

An "efficient executemany-returning" might be built on the more complex pipeline mode (#116).

@zzzeek

zzzeek commented Nov 24, 2021

I'm not familiar with what asyncpg does, but their "insertmany" performance without RETURNING is basically the same as psycopg2 w/ fast execution; but since they don't have RETURNING (or they do and I haven't figured it out), we can't use "executemany()" when we need the PKs, so we get terrible performance in that case. See sqlalchemy/sqlalchemy#7352.

For us, we need the INSERT + many VALUES + RETURNING, which you did implement in the psycopg2 helpers and we use those; but I think it will be easier to manage if we just implement the batching on our end, that way we can log it transparently and have it behave the same on all the different DBAPIs. I guess nextset() is another approach, but yes, that seems to get into yet another area of weird APIs; from my end I'm mostly looking to have simpler code for all these different drivers.

@dvarrazzo
Member Author

You are herding cats with your project, man 😄

You might have seen that the fast executemany helpers in psycopg2 are pretty trivial functions, composing a query on the client side and sending it in one go (or in fewer batches) to the server.

Even if psycopg 3 normally does server-side binding, you can still bind easily client-side, using the psycopg.sql objects. An example, albeit too simplistic, is discussed in #153. A few extra details are in #101, because I think that eventually a client-side binding cursor might come in handy.

The trivial way to transform the query client-side loses the type information. To be more useful, the types should be retained and converted to casts; the transformation machinery can help with that. I have a brain dump for you, but it's better delivered in #101, so I'll add it there later.
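
As a rough sketch of the client-side composition mentioned above, using the psycopg.sql objects (table and column names are illustrative, and this trivial version indeed loses the type information, adding no casts):

```python
from psycopg import sql

# Compose a multi-row INSERT entirely on the client; no server-side binding.
rows = [(1, "hello"), (2, "world")]
query = sql.SQL("insert into test (id, data) values {values}").format(
    values=sql.SQL(", ").join(
        sql.SQL("({}, {})").format(sql.Literal(i), sql.Literal(d))
        for i, d in rows
    )
)
cur.execute(query)  # a plain, parameter-less statement reaches the server
```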

Two details that you might not necessarily know about Postgres, but which can come in handy when you design your feature:

  • extended query protocol does not support multiple statements. So, in psycopg 3 you can't do execute("insert into x values (%s); insert into y values (%s)", (10, 20))
  • the simple query protocol (aka a query with no parameters) supports multiple statements (that's why it works in psycopg2: client-side binding). But, unlike psycopg2, psycopg 3 will return all the results, via nextset(). This is overlooked in the docs, which should probably be extended, but it's quite powerful:
```python
>>> import psycopg
>>> cnn = psycopg.connect()
>>> cnn.execute("create table test (id serial primary key, data text)")
>>> cur = cnn.execute("""
... insert into test (data) values ('hello') returning id;
... insert into test (data) values ('world') returning id
... """)

>>> cur.fetchall()
[(1,)]
>>> cur.nextset()
True
>>> cur.fetchall()
[(2,)]
>>> cur.nextset()  # returns None: no more results
```

Here's your ids-from-executemany.

psycopg2, not very smartly, discards all the previous results and only returns the last one. As a consequence we could only add fetch=True to execute_values(), but not to execute_batch(); if similar functions were to exist in psycopg 3, they could both support fetching. But I think this is pretty easy stuff to put together in SQLAlchemy, so you are not tied to the progress of our interface and you can organise your codebase to share code across different adapters.
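
For reference, this is the existing psycopg2 helper mentioned above, with the fetch flag that execute_values() supports and execute_batch() cannot:

```python
from psycopg2.extras import execute_values

# fetch=True collects the RETURNING rows of every batch; execute_batch()
# has no such flag because only the last result would survive.
ids = execute_values(
    cur,
    "insert into test (data) values %s returning id",
    [("hello",), ("world",)],
    fetch=True,
)
```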

@zzzeek

zzzeek commented Nov 24, 2021

> You are herding cats with your project, man 😄

> You might have seen that the fast executemany helpers in psycopg2 are pretty trivial functions, composing a query on the client side and sending it in one go (or in fewer batches) to the server.

Yeah, this is why I just want to vendor that and generalize it. I think the same approach is valid for MySQL / SQL Server and Oracle too. We already had to change the SQL compiler to support generating that "fragment" of the INSERT statement.

> Even if psycopg 3 normally does server-side binding, you can still bind easily client-side, using the psycopg.sql objects. An example, albeit too simplistic, is discussed in #153. A few extra details are in #101, because I think that eventually a client-side binding cursor might come in handy.

> The trivial way to transform the query client-side loses the type information. To be more useful, the types should be retained and converted to casts; the transformation machinery can help with that. I have a brain dump for you, but it's better delivered in #101, so I'll add it there later.

OK, will have a look. I think we already have our own way of doing the casts and such, but I'll check out what this has.

> Two details that you might not necessarily know about Postgres, but which can come in handy when you design your feature:

> * extended query protocol _does not support multiple statements_. So, in psycopg 3 [you can't do](https://www.psycopg.org/psycopg3/docs/basic/from_pg2.html#multiple-statements-in-the-same-query) `execute("insert into x values (%s); insert into y values (%s)", (10, 20))`

That's fine; I probably don't even want to use that here. We have a "batch executemany" using the psycopg2 fast helpers, but I don't think the "stmt; stmt; stmt" thing is that compelling and I want to take it out for SQLAlchemy 2.0. Mostly the INSERT is where we need the help.

> * the simple query protocol (aka a query with no parameters) _supports multiple statements_ (that's why it works in psycopg2: client-side binding). But, unlike psycopg2, psycopg 3 will _return all the results_, via `nextset()`. This is overlooked in the docs, which should probably be extended, but it's quite powerful:

Yeah, I think I knew that nextset() did this.

Here is a concern I have though: psycopg3 is using prepared statements. Is it caching those prepared statements, and if so, is there a way to control that? I would not want it caching enormous INSERT statements that have an arbitrary number of VALUES clauses and won't be used again.

@dvarrazzo
Member Author

> Here is a concern I have though: psycopg3 is using prepared statements. Is it caching those prepared statements, and if so, is there a way to control that? I would not want it caching enormous INSERT statements that have an arbitrary number of VALUES clauses and won't be used again.

A statement, by default, is prepared after it has been executed 5 times. The threshold can be tweaked, and it can be disabled altogether by setting Connection.prepare_threshold to None.

If prepare_threshold is not None, the query is kept in memory to keep a tally of the times it has been executed. However, it's kept in an LRU cache, so if it's not executed anymore while other queries are, eventually it gets evicted, because the cache keeps at most prepared_max (default: 100) statements.

However, if you know for sure that your query is a one-off, you can use execute(..., prepare=False) to avoid it being prepared, regardless of the connection settings.

So I guess you can be proactive in asking to not prepare the query, but even if you are not, it won't cause unbounded client memory or server resource usage.

Details at https://www.psycopg.org/psycopg3/docs/advanced/prepare.html
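
In code, the knobs described above look like this (the query and parameters are placeholders):

```python
import psycopg

conn = psycopg.connect()

# Connection-wide settings:
conn.prepare_threshold = None   # never prepare automatically (default: 5)
conn.prepared_max = 100         # cap on statements tracked/prepared (default)

# Per-query opt-out, regardless of the connection settings; handy for huge
# one-off INSERT ... VALUES statements that will never be seen again.
# `one_off_query` and `params` are placeholders for your own statement.
conn.execute(one_off_query, params, prepare=False)
```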

@zzzeek

zzzeek commented Nov 24, 2021

That's certainly interesting; yes, I will likely send prepare=False for this use case. Doing the "prepared after it is executed 5 times" thing means you still have to put every statement in a cache somewhere, which I would assume is also LRU. Which means if I execute a certain statement 5 times, but spread out enough that the driver has forgotten about that statement in between, it still would not prepare it, is that right? Doesn't matter on this end, just wondering how you went about implementing that :)

@dvarrazzo
Member Author

Yes: if you execute it 4 times, then execute enough other queries, eventually that count of 4 will be forgotten and the query won't be prepared until it is seen another 5 times.

The matter is implemented in the PrepareManager, an internal object of which every connection has an instance.

At the moment there is an ordered dict, whose keys are (statement, arg types) and the values are either the number of times the query has been seen, or the name of the prepared query, if it passes the prepare threshold. Whenever a query is looked up, it's brought to the top of the dict.

Whenever we query, we also check whether the cache has grown beyond prepared_max. If so, we pop the item from the bottom of the dict. If the value is a number, it means it's an old query, seen one or more times, that we never prepared, and we just forget about it. If the value is a name, then the query had been prepared, so we also execute a DEALLOCATE to release it from the server.

That means that we keep track of 100 queries at most, but only a subset of them are actually prepared.

Because of some work that we are doing to accommodate the pipeline mode, we might split the cache into two LRU dicts: one for tallying, one with the prepared names, in which case the number of prepared statements might hit 100 more easily, but never more than that.
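
For illustration only, a much simplified sketch of the bookkeeping described above; this is not the actual PrepareManager code and the names are made up:

```python
from collections import OrderedDict

class PrepareCache:
    """LRU of (query, param types) -> seen-count or prepared-statement name."""

    def __init__(self, prepare_threshold=5, prepared_max=100):
        self.prepare_threshold = prepare_threshold
        self.prepared_max = prepared_max
        self._cache = OrderedDict()
        self._nprepared = 0

    def lookup(self, conn, key):
        """Return the prepared name to use for `key`, or None to not prepare."""
        value = self._cache.get(key, 0)
        if not isinstance(value, str):
            value += 1
            if value >= self.prepare_threshold:
                self._nprepared += 1
                value = f"_pg3_{self._nprepared}"   # would PREPARE under this name
        self._cache[key] = value                    # (re)insert the entry...
        self._cache.move_to_end(key)                # ...and bring it to the top
        self._evict(conn)
        return value if isinstance(value, str) else None

    def _evict(self, conn):
        # Drop entries from the cold end; if one had been prepared, release it.
        while len(self._cache) > self.prepared_max:
            _, old = self._cache.popitem(last=False)
            if isinstance(old, str):
                conn.execute(f"DEALLOCATE {old}")
```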

@dvarrazzo dvarrazzo removed this from the 3.1 milestone May 15, 2022