Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build parameters in Composables #339

Open
ryanhiebert opened this issue Jul 18, 2022 · 11 comments · May be fixed by #441
Open

Build parameters in Composables #339

ryanhiebert opened this issue Jul 18, 2022 · 11 comments · May be fixed by #441
Labels
enhancement New feature or request

Comments

@ryanhiebert
Copy link

SQL and Composable are great for composing queries. Coming from SQLAlchemy, I made an incorrect assumption that it would allow me to build up the query, including parameters at the same time. It doesn't, but I'm not understanding any technical limitation that would make that a bad idea. Am I missing a good reason to avoid that approach, or some technical limits that make that more difficult than it appears to me?

Here's a rough example, ignorant of the details of what would need to happen, of the kind of thing I'm talking about:

filters = {'a': 1, 'b': 2}
query = (
    sql.SQL("SELECT * FROM {tbl} WHERE {filters}")
    .format(
        tbl=sql.Identifier('mytable'),
        filters=sql.SQL(' AND ').join(
            sql.SQL("{col} = %s", (value,)).format(col=sql.Identifier(row))
            for row, value in filters.items()
        )
    )
)
cursor.execute(query)  # OR
q, params = query.compile()

There's sure to be some complexity regarding when to use positional and keyword % parameters. And more complexity with composing the parameters, especially if they are positional parameters and we need to figure out where the % parameters are compared to the {} type formatting as they are composed. This seems like a really nice way to build dynamic queries, at least from my admittedly limited perspective.

@dvarrazzo
Copy link
Member

Looking at your proposal, it seems that most Composable might take an argument being the value or values to format. You show it in SQL but surely it should be in Composite and in Placeholder. too Not really needed in Identifier or Literal, but it makes sense to have it there too: if you have something like sql.SQL("{} = {}).format(Identifier("my_field"), Placeholder()), I would rather attach the value to the identifier than to the sql or the placeholder, even if it's the placeholder that would really be replaced.

The added argument can be a sequence of values or a mapping of values. As a special case might be a single value, to be interpreted as (value,)... mind the strings, which are sequences but should be considered singletons. A similar check is implemented in psycopg._queries._validate_and_reorder_params(). But then what is its name? sql.SQL("x = %s, value=10) works, sql.SQL("x = %s and y = %s", value=(10, 20)) not so much, as much as values=10 wouldn't. Should we accept either name? Or a positional-only argument (feels like it would paint us in a corner making future extensions to the constructors harder).

Then comes the difficult part: when you create a Composable (e.g. using SQL.format()) you have to accumulate the parameters. In which order? What if they are dicts? With repetitions? Maybe, trivially concatenating sequences as they are found, is just ok. With mappings, later valuer would overwrite former ones.

Would we be able to tell apart the following? It doesn't seem easy because the %s currently have no special value in sql.SQL.

sql.SQL("where x = %s and {}", value=10).format(sql.SQL("y = %s", value=20))
sql.SQL("where {} and x = %s", value=10).format(sql.SQL("y = %s", value=20))

By the look of it, it seems that the parameters should "bubble up", aggregated, to the root object of the tree, so that execute() can find them on its argument. query.compile() doesn't make a lot of sense for me.

This is some real(ish) code I have which would benefit of this feature:

    async def _fetch(
        self,
        conn: AsyncConnection[Any],
        *,
        id: UUID | None = None,
        group_id: UUID | None = None,
        category: UUID | None = None,
        deleted: bool = False,
    ) -> list[Res]:
        conds: list[sql.Composable] = []
        args: dict[str, Any] = {}
        if id is not None:
            conds.append(sql.SQL("r.id = %(id)s"))
            args["id"] = id
        if group_id is not None:
            conds.append(sql.SQL("r.group_id = %(group_id)s"))
            args["group_id"] = group_id
        if category is not None:
            conds.append(sql.SQL("r.category = %(category)s"))
            args["category"] = category
        if not deleted:
            conds.append(sql.SQL("r.deleted_at is null"))
        if not conds:
            conds.append(sql.SQL("true"))

        objs_sql = sql.SQL(
            """
select
    id, name, description, ...
from {res_table} r
where {filters}
"""
        ).format(
            res_table=self.res_table_name,
            filters=sql.SQL(" and ").join(conds),
        )

        async with conn.cursor(row_factory=class_row(self.res_class)) as cur:
            await cur.execute(objs_sql, args)
            recs = await cur.fetchall()

        return recs

Using your proposal, this code wouldn't need a separate args variable anymore, but the values would be accumulated from the conds items, to the SQL(" and "), up to the root object objs_sql.

I don't like very much the loss of separation that cur.execute(objs_sql) would have, with the parameter buried in the query. But maybe it's ok to just be explicit and expose a values object on the Composable, allowing cur.execute(objs_sql, objs_sql.values).

The changes to the code could be:

--- tmp-old.py	2022-07-19 00:01:29.143346787 +0100
+++ tmp-new.py	2022-07-19 00:03:43.588569435 +0100
@@ -8,16 +8,12 @@
         deleted: bool = False,
     ) -> list[Res]:
         conds: list[sql.Composable] = []
-        args: dict[str, Any] = {}
         if id is not None:
-            conds.append(sql.SQL("r.id = %(id)s"))
-            args["id"] = id
+            conds.append(sql.SQL("r.id = %s", value=id))  # value= optional?
         if group_id is not None:
-            conds.append(sql.SQL("r.group_id = %(group_id)s"))
-            args["group_id"] = group_id
+            conds.append(sql.SQL("r.group_id = %s", group_id))
         if category is not None:
-            conds.append(sql.SQL("r.category = %(category)s"))
-            args["category"] = category
+            conds.append(sql.SQL("r.category = %s", category))
         if not deleted:
             conds.append(sql.SQL("r.deleted_at is null"))
         if not conds:
@@ -36,7 +32,7 @@
         )
 
         async with conn.cursor(row_factory=class_row(self.res_class)) as cur:
-            await cur.execute(objs_sql, args)
+            await cur.execute(objs_sql, objs_sql.values)
             recs = await cur.fetchall()
 
         return recs

So, uhm... It's not an earth-shattering feature, but I think I'd just be an old grumpy fart if I just said "I don't like that! I can do it with an extra var!"... I am inclined to give this idea a chance, unless the implementation is not so awfully messy to suggest it's the wrong direction. The case of {} and %s mixed in SQL, aka finding the right order to reconstruct the array, seems the biggest problem.

Would you like to give a try to implement it?

@ryanhiebert
Copy link
Author

It seems to me that the response you gave is the best one I could have hoped for! Thank you for the work you put into responding to my idea. Here are some thoughts in response, in no particular order.

I would rather attach the value to the identifier than to the sql or the placeholder, even if it's the placeholder that would really be replaced.

I have the opposite preference, and precisely for the reason you mention. If we wanted to make a more ergonomic spelling, we could have another class that was a shorthand for constructing the identifier to placeholder mapping, but I think that sounds tricky, as you'd likely want to specify the operator as well. I think your suggestion of putting it on the identifier would lead to a messy code smell in the implementation.

The added argument can be a sequence of values or a mapping of values. As a special case might be a single value

These are constructor arguments, and I'd expect subclasses to decide what their optimal function signature would be. A Placeholder would take a single value, while an SQL would take an iterable or a mapping.

As a related aside, it's odd to me that there is an __init__ method defined for the Composable ABC. It doesn't seem to me that its really part of the interface at all: subclasses should decide what the constructor signature should be.

[... tricky edge cases]

Mixing sequences and mappings

cursor.execute only uses a single argument, if we find a sequence params attempting to be combined with mapping params, we know that's going to be a problem. We should throw an error as soon as we see it.

combining overlapping mappings

Non-overlapping mappings can and should be combined. I see three possible options for how we can handle overlapping keys in mapping params.

  1. Raise an error.
  2. Validate that they are the same.
  3. Pick one, perhaps arbitrarily.

Its not clear to me which option is preferable, but I think I'm inclined toward raising an error. While its quite common to wish to re-use a keyword parameter, I think that validation of sameness is error prone, and picking one without the choice coming from the user seems like a bad design decision.

combining interspersed sequences

As you've identified, this is particularly tricky. A first pass solution is to disallow complex combinations to use sequence params. This would significantly reduce the applicability of this feature, but potentially could get us a proof of concept more quickly.

But I do think that this is a solvable problem. The % placeholder syntax is well-defined and I think we could scan for those reliably. I expect that we can likely do the same thing for the .format() expansion as well, and when we have those together we should be able to figure out how to merge sequences.

detecting mismatch arguments to placeholders

If we got to the place that we were parsing fragments as described above, it could allow us to notice if the parameters passed in for any particular fragment didn't match. Passing extra sequence or mapping params seems like it has no obvious semantic meaning, so it would be helpful if it would notice that and raise a helpful error for the developer to get early.

Missing params for placeholders

If an incomplete number of params is given, such as a smaller sequence or a mapping that's missing named keys, I can see potential utility for that. Even if we would want that, we might prefer to still disallow it for the first proof-of-concept implementation.

But it might be helpful to consider what utility that might have, and what semantics we'd expect if it were useful. I think that it would mean "I'd like to pass the missing parameters in later myself". If this was a worthwhile use-case, it might suggest that the mechanism for retrieving the gathered parameters after the query is built should be a function rather than a property, so that missing params from fragments can be filled in the order they are present, even if other later fragments have additional params.

If that were what we wanted, then if the function was called without additional params, it could give placeholders for positional arguments that needed to be filled, with some new constant with a helpful repr. I think the ugliest part might be if we'd end up needing to have a check for that new constant to make sure that threw an error if passed into cursor.execute().

So, uhm... It's not an earth-shattering feature [...] Would you like to give a try to implement it?

I'll be careful about making promises I haven't always been able to keep, but it at least sounds interesting to me, and I'd like to! You're right that this isn't earth-shattering, and it doesn't fit as well as the similar feature from SQLAlchemy does. SQLAlchemy's position is different, because it's in a place to completely control how all the parameters are referenced or named, so it doesn't have to worry about what to do in the edge-cases we've identified above.

@dvarrazzo dvarrazzo added the enhancement New feature or request label Sep 5, 2022
@ryanhiebert ryanhiebert linked a pull request Nov 21, 2022 that will close this issue
@madsruben
Copy link

madsruben commented Jun 2, 2023

Here's a rough example, ignorant of the details of what would need to happen, of the kind of thing I'm talking about:

filters = {'a': 1, 'b': 2}
query = (
    sql.SQL("SELECT * FROM {tbl} WHERE {filters}")
    .format(
        tbl=sql.Identifier('mytable'),
        filters=sql.SQL(' AND ').join(
            sql.SQL("{col} = %s", (value,)).format(col=sql.Identifier(row))
            for row, value in filters.items()
        )
    )
)
cursor.execute(query)  # OR
q, params = query.compile()

Would this (real) example be what you are looking for?

filters = {'a': 1, 'b': 2}
query = (
    sql.SQL("SELECT * FROM {tbl} WHERE {filters}")
    .format(
        tbl=sql.Identifier('mytable'),
        filters=sql.SQL(' AND ').join([sql.Identifier(key) + sql.SQL('=') + sql.Literal(value) for key, value in filters.items()])
        )
    )

Apologies if there are additional complexities here that I don't realize.

@dpikt
Copy link

dpikt commented Jul 21, 2023

@madsruben I believe your example would be vulnerable to SQL injection.

Also, I want to give a big 👍 to this enhancement idea overall. Binding parameters with queries makes it a lot easier to compose queries, for example you can write reusable functions for subqueries:

def active_users_subquery(last_active_date):
	return sql.SQL("SELECT * FROM USERS WHERE last_active > {last_active}")
                   .format(last_active=Parameter(last_active))

# re-use elsewhere
active_users = active_users_subquery(last_active)
query = SQL("""
     SELECT * FROM COMMENTS 
     INNER JOIN ({users}) as active_users 
    ON COMMENTS.user_uuid = USERS.user_uuid
""").format(users=active_users)

cursor.execute(query)

This example is trivial and could be accomplished with bound parameters, but complex queries with several optionally provided parameters would be difficult to compose without this feature.

@dpikt
Copy link

dpikt commented Jul 21, 2023

This is also how https://github.com/Drizin/DapperQueryBuilder works, although that's able to use string interpolation directly because it's c#.

@madsruben
Copy link

@madsruben I believe your example would be vulnerable to SQL injection.

Hey @dpikt. Would you care to comment further on why you believe this code would be vulnerable?
Legit question. I was under the impression that sql.Identifier() would correctly escape an identifier exactly to avoid injection, and the same with sql.Literal() to correctly escape a literal value such as a string or a number from user input.

I've also tested this thoroughly without finding any issues, so please enlighten us as to why you believe it to be different. Thanks.

@dpikt
Copy link

dpikt commented Jul 21, 2023

@madsruben you're completely right - I looked at the docs again and I was misunderstanding what sql.Literal() did.

With that being the case, the reason to use Parameter() over Literal() would be to take advantage of server-side parameter binding.

@madsruben
Copy link

@dpikt thank you for the update.
I actually do what you proposed, having functions that return sql.Composables based parameters, but of course built client side.

I agree that this feature with server-side parameter binding would be awesome.

@ryanhiebert
Copy link
Author

@madsruben @dpikt would either of you be interested in reviewing the pull request I made, and see what you think? The tests aren't passing and quite out of date at the moment, but it may give you a sense of what I'm going for. #441

@dpikt
Copy link

dpikt commented Jul 25, 2023

@ryanhiebert I'd be happy give it a look, although I'd like to check with the maintainer first that this is still a feature he's interested in adding (it seems like it's been lingering a while).

@ryanhiebert
Copy link
Author

He never got around to looking at it, and he’s not as keen on it as we are, but until he says something different I think it reasonable to think he still sees merit in it generally. I’m sure he’s trying to manage various priorities and suspect he may not be super responsive to us on this, and that seems okay to me. If we come up with an implementation that pleases us, I think that will likely make his job easier when he goes to look at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants