Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(go/adbc/driver/snowflake): handle quotes properly #1738

Merged
merged 8 commits into from
Apr 25, 2024

Conversation

zeroshade
Copy link
Member

fixes #1721

@zeroshade zeroshade requested a review from lidavidm as a code owner April 19, 2024 17:00
@github-actions github-actions bot added this to the ADBC Libraries 1.0.0 milestone Apr 19, 2024
@lidavidm
Copy link
Member

I'm fixing CI so we can provide a wheel for testing.

@lidavidm
Copy link
Member

I'm not sure this is the way to go. One, we're basically opening things up to SQL injection attacks if we don't escape/quote input. Two, it breaks the idea that what you pass in is the actual table name, not the concrete syntax.

@zeroshade
Copy link
Member Author

@lidavidm Ultimately this comes down to the question of who is responsible for ensuring we avoid sql injection vs ease of use.

Because we provide this whole bulk ingestion API via options to define the table name etc, is ADBC itself (and therefore every driver) responsible for ensuring there's no sql injection? Or is the application that is using ADBC responsible for ensuring there's no injection in the value they pass for the option of the table name? Since ADBC itself allows arbitrary SQL strings it feels like it should be the application's responsibility for ensuring there's no sql injection rather than ADBC (and therefore every driver).

There's also the case that blindly adding quotes like we were doing isn't sufficient to prevent sql injection anyways and only served to create failure cases for the ingestion rather than potentially reasonable results (see here).

One possible solution could be for us to try to use the snowflake IDENTIFIER function https://docs.snowflake.com/en/sql-reference/identifier-literal which might solve the issue while preventing injection? it's still not perfect though unless we also check for un-escaped single quotes in the passed in string and forcibly escape them etc.

@CurtHagenlocher
Copy link
Contributor

I feel like the injection thing is a red herring. Mainly, I think this depends on whether or not the API in question is defined as accepting a table name or a table identifier. If the former, then I would expect it to take the desired table name and do whatever is necessary to ensure a table with that name is created, including quoting if necessary. If the latter, then I would expect the identifier to already be quoted properly by the user of the API.

This distinction matters, because if the API allows me to create a table for a non-default schema (or in another database) then I have to be able to supply an identifier. If the separator is the canonical period, then I can't supply a table name containing a period without quoting it myself because anything else would be ambiguous.

@lidavidm
Copy link
Member

if the API allows me to create a table for a non-default schema (or in another database) then I have to be able to supply an identifier

There are separate parameters for schema/catalog, so we're already down the path of name, not syntactical identifier

@lidavidm
Copy link
Member

@jduo since you have lots of JDBC/ODBC background, mind chiming in?

I would argue that the parameters here make the most sense as names and not syntactical identifiers, both because we have separate parameters for catalog/schema/table and by analogy to JDBC/ODBC catalog APIs, e.g. getTables says

tableNamePattern - a table name pattern; must match the table name as it is stored in the database

which to my reading means that it is case sensitive, etc.

That said SQLTables appears to work both ways

If the SQL_ATTR_METADATA_ID statement attribute is set to SQL_TRUE, TableName is treated as an identifier and its case is not significant. If it is SQL_FALSE, TableName is a pattern value argument; it is treated literally, and its case is significant.

@lidavidm
Copy link
Member

There's also the case that blindly adding quotes like we were doing isn't sufficient to prevent sql injection anyways and only served to create failure cases for the ingestion rather than potentially reasonable results (see here).

I think this is purely because we were not actually applying the right quoting rules, and just assuming the Go quoting rules applied here (they do not)

@jduo
Copy link
Member

jduo commented Apr 23, 2024

IMO the right thing to do is to let the driver tweak he input that works with the underlying database's client protocol library.

Which means that the application shouldn't care about quoting identifier inputs. How to handle identifiers is an implementation detail (which could be anything including wrapping the input a double quotes and putting it in a query to a system view, using bound parameters, running a stored procedures, running an RPC call...). This is far too complicated for the application layer to know about.

@jduo
Copy link
Member

jduo commented Apr 23, 2024

Also if quotes are supplied as part of the input, it could very well mean that the caller intended the object being searched for to have embedded quotes in the name. The behavior for this scenario should be the same for every driver.

@davlee1972
Copy link

This is the sqlglot reference.. There is a default dialect and 20 other flavors including snowflake, postgres, etc..

https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/dialect.py

def normalize_identifier(self, expression: E) -> E:
"""
Transforms an identifier in a way that resembles how it'd be resolved by this dialect.

    For example, an identifier like `FoO` would be resolved as `foo` in Postgres, because it
    lowercases all unquoted identifiers. On the other hand, Snowflake uppercases them, so
    it would resolve it as `FOO`. If it was quoted, it'd need to be treated as case-sensitive,
    and so any normalization would be prohibited in order to avoid "breaking" the identifier.

    There are also dialects like Spark, which are case-insensitive even when quotes are
    present, and dialects like MySQL, whose resolution rules match those employed by the
    underlying operating system, for example they may always be case-sensitive in Linux.

    Finally, the normalization behavior of some engines can even be controlled through flags,
    like in Redshift's case, where users can explicitly set enable_case_sensitive_identifier.

    SQLGlot aims to understand and handle all of these different behaviors gracefully, so
    that it can analyze queries in the optimizer and successfully capture their semantics.
    """

@zeroshade
Copy link
Member Author

After talking with @lidavidm and based on the responses from everyone here, I'm gonna update this PR to have the following behavior:

  • We will still wrap ingesting table names in quotes, following the example of JDBC
  • I'll fix the behavior so it will properly escape quotes for Snowflake's dialect. We'll first do a pass and replace " with "" before wrapping the input with quotes, this should prevent things from erroring like in the examples in the linked issue.

If an option is desired to allow turning off this auto-quoting to allow a user to provide a literal, we can add that in a follow-up PR.

Does this work for everyone?

@zeroshade
Copy link
Member Author

@davlee1972 we're not aiming to be a SQL dialect with ADBC, so I don't think that sqlglot is a good comparison here. we're more similar to ODBC and JDBC, which have behavior as quoted by @lidavidm's comments.

@davlee1972
Copy link

davlee1972 commented Apr 23, 2024

I'm not asking ADBC to be a SQL dialect, but you might want to incorporate the dialects into the CI builds so they find the right table name..

Here's some samples:

from sqlglot import exp
from sqlglot.optimizer import optimize

table_names = {
    'lower_case | unquoted': exp.table_("dummy", quoted=False),
    'lower_case | quoted': exp.table_("dummy", quoted=True),
    'upper_case | unquoted': exp.table_("DUMMY", quoted=False),
    'upper_case | quoted': exp.table_("DUMMY", quoted=True),
    'mixed_case | unquoted': exp.table_("duMMy", quoted=False),
    'mixed_case | quoted': exp.table_("duMMy", quoted=True)
}

dialects = ['snowflake', 'postgres', 'mysql', 'tsql', 'oracle']

for dialect in dialects:
    for k, v in table_names.items():
        print("dialect: " + dialect + ", case + quoted: " + k)
        print("in: " + str(v.this))
        print("out: " + optimize(v, dialect=dialect).name)
        print("")

Dialect results..

dialect: snowflake, case + quoted: lower_case | unquoted
in: dummy
out: DUMMY

dialect: snowflake, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: snowflake, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: snowflake, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: snowflake, case + quoted: mixed_case | unquoted
in: duMMy
out: DUMMY

dialect: snowflake, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: postgres, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: postgres, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: postgres, case + quoted: upper_case | unquoted
in: DUMMY
out: dummy

dialect: postgres, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: postgres, case + quoted: mixed_case | unquoted
in: duMMy
out: dummy

dialect: postgres, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: mysql, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: mysql, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: mysql, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: mysql, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: mysql, case + quoted: mixed_case | unquoted
in: duMMy
out: duMMy

dialect: mysql, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: tsql, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: tsql, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: tsql, case + quoted: upper_case | unquoted
in: DUMMY
out: dummy

dialect: tsql, case + quoted: upper_case | quoted
in: "DUMMY"
out: dummy

dialect: tsql, case + quoted: mixed_case | unquoted
in: duMMy
out: dummy

dialect: tsql, case + quoted: mixed_case | quoted
in: "duMMy"
out: dummy

dialect: oracle, case + quoted: lower_case | unquoted
in: dummy
out: DUMMY

dialect: oracle, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: oracle, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: oracle, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: oracle, case + quoted: mixed_case | unquoted
in: duMMy
out: DUMMY

dialect: oracle, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

@lidavidm
Copy link
Member

We expect a catalog, schema, and table name, and as others have chimed in above this is typical of database APIs. If Snowflake's quoting rules mean that the SQL identifier dummy results in the name DUMMY, then we expect the latter.

It happened to work before because we were just concatenating SQL strings internally, but this does not work for all drivers and all users - for instance, Flight SQL now has an ingestion API that works on names and not SQL identifiers. And, not all users are using sqlglot to escape names for them, nor would we expect or require them to do so; hence why these APIs all work on names. I'd be supportive of a Snowflake driver-specific option to disable this at the user's own risk, but for the purposes of this PR let's fix the quoting first.

@lidavidm
Copy link
Member

I filed #1766 for an option to disable the quoting behavior

@davlee1972
Copy link

Can we change 1766 from disable quoting to something like allow identifiers.. This would cover both quoting and case sensitivity

@zeroshade
Copy link
Member Author

@davlee1972 do you mean just changing what it is called? Or are you saying that it should functionally do more than just disable the quoting?

@davlee1972
Copy link

davlee1972 commented Apr 25, 2024

Just the name.. Removing quotes essentially treats the value as an identifier who's resolution to name (no quotes and casing) must be handled by the underlying product.. If a product doesn't support lower case identifiers then passing in dummy without quotes would bomb which is a user generated error when using ADBC..

I also have some reservations using quotes for some non-ADBC supported products..

Sybase comes to mind.. It doesn't support quotes at all..

1> select * from "fund_data" where 1 = 0
2> go

Msg 156, Level 15, State 2:
Server '??????", Line 1:
Incorrect syntax near the keyword 'where'.

This does work in Sybase, but..

"Both Adaptive Server Enterprise and Sybase IQ provide a quoted_identifier option that allows the interpretation of delimited strings to be changed. By default, the quoted_identifier option is set to OFF in Adaptive Server Enterprise, and to ON in Sybase IQ."

1> set quoted_identifier on
2> go
1>
2> select * from "fund_data" where 1=0
3> go
cusip parent_cusip fund_type

@zeroshade
Copy link
Member Author

Sure that makes perfect sense to me. I'd be fine with naming it as such.

@zeroshade zeroshade merged commit 8bd0e9b into apache:main Apr 25, 2024
37 of 39 checks passed
@zeroshade zeroshade deleted the snowflake-quotes branch April 25, 2024 20:12
@davlee1972
Copy link

davlee1972 commented Apr 25, 2024

Sorry for the back and forth.. I think it should called "quoted_identifier" = True by default..

This is a common term used across database products..

A quoted identifier begins and ends with double quotation marks (").

https://www.postgresql.org/docs/current/sql-syntax-lexical.html
There is a second kind of identifier: the delimited identifier or quoted identifier. It is formed by enclosing an arbitrary sequence of characters in double-quotes (").

https://docs.oracle.com/cd/B10500_01/server.920/a96540/sql_elements9a.htm
Every database object has a name. In a SQL statement, you represent the name of an object with a quoted identifier or a nonquoted identifier.

https://docs.snowflake.com/en/sql-reference/identifiers-syntax
To work around this limitation, Snowflake provides the QUOTED_IDENTIFIERS_IGNORE_CASE session

@zeroshade
Copy link
Member Author

let's shift the discussion to #1766 so that we don't lose track of it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Snowflake 0.11 driver incorrectly wrapping table names in double quotes
5 participants