-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(go/adbc/driver/snowflake): handle quotes properly #1738
Conversation
I'm fixing CI so we can provide a wheel for testing. |
I'm not sure this is the way to go. One, we're basically opening things up to SQL injection attacks if we don't escape/quote input. Two, it breaks the idea that what you pass in is the actual table name, not the concrete syntax. |
@lidavidm Ultimately this comes down to the question of who is responsible for ensuring we avoid sql injection vs ease of use. Because we provide this whole bulk ingestion API via options to define the table name etc, is ADBC itself (and therefore every driver) responsible for ensuring there's no sql injection? Or is the application that is using ADBC responsible for ensuring there's no injection in the value they pass for the option of the table name? Since ADBC itself allows arbitrary SQL strings it feels like it should be the application's responsibility for ensuring there's no sql injection rather than ADBC (and therefore every driver). There's also the case that blindly adding quotes like we were doing isn't sufficient to prevent sql injection anyways and only served to create failure cases for the ingestion rather than potentially reasonable results (see here). One possible solution could be for us to try to use the snowflake |
I feel like the injection thing is a red herring. Mainly, I think this depends on whether or not the API in question is defined as accepting a table name or a table identifier. If the former, then I would expect it to take the desired table name and do whatever is necessary to ensure a table with that name is created, including quoting if necessary. If the latter, then I would expect the identifier to already be quoted properly by the user of the API. This distinction matters, because if the API allows me to create a table for a non-default schema (or in another database) then I have to be able to supply an identifier. If the separator is the canonical period, then I can't supply a table name containing a period without quoting it myself because anything else would be ambiguous. |
There are separate parameters for schema/catalog, so we're already down the path of name, not syntactical identifier |
@jduo since you have lots of JDBC/ODBC background, mind chiming in? I would argue that the parameters here make the most sense as names and not syntactical identifiers, both because we have separate parameters for catalog/schema/table and by analogy to JDBC/ODBC catalog APIs, e.g. getTables says
which to my reading means that it is case sensitive, etc. That said SQLTables appears to work both ways
|
I think this is purely because we were not actually applying the right quoting rules, and just assuming the Go quoting rules applied here (they do not) |
IMO the right thing to do is to let the driver tweak he input that works with the underlying database's client protocol library. Which means that the application shouldn't care about quoting identifier inputs. How to handle identifiers is an implementation detail (which could be anything including wrapping the input a double quotes and putting it in a query to a system view, using bound parameters, running a stored procedures, running an RPC call...). This is far too complicated for the application layer to know about. |
Also if quotes are supplied as part of the input, it could very well mean that the caller intended the object being searched for to have embedded quotes in the name. The behavior for this scenario should be the same for every driver. |
This is the sqlglot reference.. There is a default dialect and 20 other flavors including snowflake, postgres, etc.. https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/dialect.py def normalize_identifier(self, expression: E) -> E:
|
After talking with @lidavidm and based on the responses from everyone here, I'm gonna update this PR to have the following behavior:
If an option is desired to allow turning off this auto-quoting to allow a user to provide a literal, we can add that in a follow-up PR. Does this work for everyone? |
@davlee1972 we're not aiming to be a SQL dialect with ADBC, so I don't think that sqlglot is a good comparison here. we're more similar to ODBC and JDBC, which have behavior as quoted by @lidavidm's comments. |
I'm not asking ADBC to be a SQL dialect, but you might want to incorporate the dialects into the CI builds so they find the right table name.. Here's some samples:
Dialect results..
|
We expect a catalog, schema, and table name, and as others have chimed in above this is typical of database APIs. If Snowflake's quoting rules mean that the SQL identifier It happened to work before because we were just concatenating SQL strings internally, but this does not work for all drivers and all users - for instance, Flight SQL now has an ingestion API that works on names and not SQL identifiers. And, not all users are using sqlglot to escape names for them, nor would we expect or require them to do so; hence why these APIs all work on names. I'd be supportive of a Snowflake driver-specific option to disable this at the user's own risk, but for the purposes of this PR let's fix the quoting first. |
I filed #1766 for an option to disable the quoting behavior |
Can we change 1766 from disable quoting to something like allow identifiers.. This would cover both quoting and case sensitivity |
@davlee1972 do you mean just changing what it is called? Or are you saying that it should functionally do more than just disable the quoting? |
Just the name.. Removing quotes essentially treats the value as an identifier who's resolution to name (no quotes and casing) must be handled by the underlying product.. If a product doesn't support lower case identifiers then passing in dummy without quotes would bomb which is a user generated error when using ADBC.. I also have some reservations using quotes for some non-ADBC supported products.. Sybase comes to mind.. It doesn't support quotes at all.. 1> select * from "fund_data" where 1 = 0 Msg 156, Level 15, State 2: This does work in Sybase, but.. "Both Adaptive Server Enterprise and Sybase IQ provide a quoted_identifier option that allows the interpretation of delimited strings to be changed. By default, the quoted_identifier option is set to OFF in Adaptive Server Enterprise, and to ON in Sybase IQ." 1> set quoted_identifier on |
Sure that makes perfect sense to me. I'd be fine with naming it as such. |
Sorry for the back and forth.. I think it should called "quoted_identifier" = True by default.. This is a common term used across database products.. A quoted identifier begins and ends with double quotation marks ("). https://www.postgresql.org/docs/current/sql-syntax-lexical.html https://docs.oracle.com/cd/B10500_01/server.920/a96540/sql_elements9a.htm https://docs.snowflake.com/en/sql-reference/identifiers-syntax |
let's shift the discussion to #1766 so that we don't lose track of it |
fixes #1721