fix(go/adbc/driver/snowflake): handle quotes properly #1738

zeroshade · 2024-04-19T17:00:14Z

lidavidm · 2024-04-22T02:03:00Z

I'm fixing CI so we can provide a wheel for testing.

lidavidm · 2024-04-22T02:10:25Z

I'm not sure this is the way to go. One, we're basically opening things up to SQL injection attacks if we don't escape/quote input. Two, it breaks the idea that what you pass in is the actual table name, not the concrete syntax.

zeroshade · 2024-04-22T16:16:48Z

@lidavidm Ultimately this comes down to the question of who is responsible for ensuring we avoid sql injection vs ease of use.

Because we provide this whole bulk ingestion API via options to define the table name etc, is ADBC itself (and therefore every driver) responsible for ensuring there's no sql injection? Or is the application that is using ADBC responsible for ensuring there's no injection in the value they pass for the option of the table name? Since ADBC itself allows arbitrary SQL strings it feels like it should be the application's responsibility for ensuring there's no sql injection rather than ADBC (and therefore every driver).

There's also the case that blindly adding quotes like we were doing isn't sufficient to prevent sql injection anyways and only served to create failure cases for the ingestion rather than potentially reasonable results (see here).

One possible solution could be for us to try to use the snowflake IDENTIFIER function https://docs.snowflake.com/en/sql-reference/identifier-literal which might solve the issue while preventing injection? it's still not perfect though unless we also check for un-escaped single quotes in the passed in string and forcibly escape them etc.

CurtHagenlocher · 2024-04-22T17:31:58Z

I feel like the injection thing is a red herring. Mainly, I think this depends on whether or not the API in question is defined as accepting a table name or a table identifier. If the former, then I would expect it to take the desired table name and do whatever is necessary to ensure a table with that name is created, including quoting if necessary. If the latter, then I would expect the identifier to already be quoted properly by the user of the API.

This distinction matters, because if the API allows me to create a table for a non-default schema (or in another database) then I have to be able to supply an identifier. If the separator is the canonical period, then I can't supply a table name containing a period without quoting it myself because anything else would be ambiguous.

lidavidm · 2024-04-22T23:23:13Z

if the API allows me to create a table for a non-default schema (or in another database) then I have to be able to supply an identifier

There are separate parameters for schema/catalog, so we're already down the path of name, not syntactical identifier

lidavidm · 2024-04-23T01:42:47Z

@jduo since you have lots of JDBC/ODBC background, mind chiming in?

I would argue that the parameters here make the most sense as names and not syntactical identifiers, both because we have separate parameters for catalog/schema/table and by analogy to JDBC/ODBC catalog APIs, e.g. getTables says

tableNamePattern - a table name pattern; must match the table name as it is stored in the database

which to my reading means that it is case sensitive, etc.

That said SQLTables appears to work both ways

If the SQL_ATTR_METADATA_ID statement attribute is set to SQL_TRUE, TableName is treated as an identifier and its case is not significant. If it is SQL_FALSE, TableName is a pattern value argument; it is treated literally, and its case is significant.

lidavidm · 2024-04-23T01:43:30Z

There's also the case that blindly adding quotes like we were doing isn't sufficient to prevent sql injection anyways and only served to create failure cases for the ingestion rather than potentially reasonable results (see here).

I think this is purely because we were not actually applying the right quoting rules, and just assuming the Go quoting rules applied here (they do not)

jduo · 2024-04-23T01:55:17Z

IMO the right thing to do is to let the driver tweak he input that works with the underlying database's client protocol library.

Which means that the application shouldn't care about quoting identifier inputs. How to handle identifiers is an implementation detail (which could be anything including wrapping the input a double quotes and putting it in a query to a system view, using bound parameters, running a stored procedures, running an RPC call...). This is far too complicated for the application layer to know about.

jduo · 2024-04-23T01:59:36Z

Also if quotes are supplied as part of the input, it could very well mean that the caller intended the object being searched for to have embedded quotes in the name. The behavior for this scenario should be the same for every driver.

davlee1972 · 2024-04-23T13:39:07Z

This is the sqlglot reference.. There is a default dialect and 20 other flavors including snowflake, postgres, etc..

https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/dialect.py

def normalize_identifier(self, expression: E) -> E:
"""
Transforms an identifier in a way that resembles how it'd be resolved by this dialect.

    For example, an identifier like `FoO` would be resolved as `foo` in Postgres, because it
    lowercases all unquoted identifiers. On the other hand, Snowflake uppercases them, so
    it would resolve it as `FOO`. If it was quoted, it'd need to be treated as case-sensitive,
    and so any normalization would be prohibited in order to avoid "breaking" the identifier.

    There are also dialects like Spark, which are case-insensitive even when quotes are
    present, and dialects like MySQL, whose resolution rules match those employed by the
    underlying operating system, for example they may always be case-sensitive in Linux.

    Finally, the normalization behavior of some engines can even be controlled through flags,
    like in Redshift's case, where users can explicitly set enable_case_sensitive_identifier.

    SQLGlot aims to understand and handle all of these different behaviors gracefully, so
    that it can analyze queries in the optimizer and successfully capture their semantics.
    """

zeroshade · 2024-04-23T15:22:46Z

After talking with @lidavidm and based on the responses from everyone here, I'm gonna update this PR to have the following behavior:

We will still wrap ingesting table names in quotes, following the example of JDBC
I'll fix the behavior so it will properly escape quotes for Snowflake's dialect. We'll first do a pass and replace " with "" before wrapping the input with quotes, this should prevent things from erroring like in the examples in the linked issue.

If an option is desired to allow turning off this auto-quoting to allow a user to provide a literal, we can add that in a follow-up PR.

Does this work for everyone?

zeroshade · 2024-04-23T15:24:26Z

@davlee1972 we're not aiming to be a SQL dialect with ADBC, so I don't think that sqlglot is a good comparison here. we're more similar to ODBC and JDBC, which have behavior as quoted by @lidavidm's comments.

davlee1972 · 2024-04-23T16:35:07Z

I'm not asking ADBC to be a SQL dialect, but you might want to incorporate the dialects into the CI builds so they find the right table name..

Here's some samples:

from sqlglot import exp
from sqlglot.optimizer import optimize

table_names = {
    'lower_case | unquoted': exp.table_("dummy", quoted=False),
    'lower_case | quoted': exp.table_("dummy", quoted=True),
    'upper_case | unquoted': exp.table_("DUMMY", quoted=False),
    'upper_case | quoted': exp.table_("DUMMY", quoted=True),
    'mixed_case | unquoted': exp.table_("duMMy", quoted=False),
    'mixed_case | quoted': exp.table_("duMMy", quoted=True)
}

dialects = ['snowflake', 'postgres', 'mysql', 'tsql', 'oracle']

for dialect in dialects:
    for k, v in table_names.items():
        print("dialect: " + dialect + ", case + quoted: " + k)
        print("in: " + str(v.this))
        print("out: " + optimize(v, dialect=dialect).name)
        print("")

Dialect results..

dialect: snowflake, case + quoted: lower_case | unquoted
in: dummy
out: DUMMY

dialect: snowflake, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: snowflake, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: snowflake, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: snowflake, case + quoted: mixed_case | unquoted
in: duMMy
out: DUMMY

dialect: snowflake, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: postgres, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: postgres, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: postgres, case + quoted: upper_case | unquoted
in: DUMMY
out: dummy

dialect: postgres, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: postgres, case + quoted: mixed_case | unquoted
in: duMMy
out: dummy

dialect: postgres, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: mysql, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: mysql, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: mysql, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: mysql, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: mysql, case + quoted: mixed_case | unquoted
in: duMMy
out: duMMy

dialect: mysql, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

dialect: tsql, case + quoted: lower_case | unquoted
in: dummy
out: dummy

dialect: tsql, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: tsql, case + quoted: upper_case | unquoted
in: DUMMY
out: dummy

dialect: tsql, case + quoted: upper_case | quoted
in: "DUMMY"
out: dummy

dialect: tsql, case + quoted: mixed_case | unquoted
in: duMMy
out: dummy

dialect: tsql, case + quoted: mixed_case | quoted
in: "duMMy"
out: dummy

dialect: oracle, case + quoted: lower_case | unquoted
in: dummy
out: DUMMY

dialect: oracle, case + quoted: lower_case | quoted
in: "dummy"
out: dummy

dialect: oracle, case + quoted: upper_case | unquoted
in: DUMMY
out: DUMMY

dialect: oracle, case + quoted: upper_case | quoted
in: "DUMMY"
out: DUMMY

dialect: oracle, case + quoted: mixed_case | unquoted
in: duMMy
out: DUMMY

dialect: oracle, case + quoted: mixed_case | quoted
in: "duMMy"
out: duMMy

lidavidm · 2024-04-24T00:19:37Z

We expect a catalog, schema, and table name, and as others have chimed in above this is typical of database APIs. If Snowflake's quoting rules mean that the SQL identifier dummy results in the name DUMMY, then we expect the latter.

It happened to work before because we were just concatenating SQL strings internally, but this does not work for all drivers and all users - for instance, Flight SQL now has an ingestion API that works on names and not SQL identifiers. And, not all users are using sqlglot to escape names for them, nor would we expect or require them to do so; hence why these APIs all work on names. I'd be supportive of a Snowflake driver-specific option to disable this at the user's own risk, but for the purposes of this PR let's fix the quoting first.

lidavidm · 2024-04-25T07:37:34Z

I filed #1766 for an option to disable the quoting behavior

davlee1972 · 2024-04-25T14:44:59Z

Can we change 1766 from disable quoting to something like allow identifiers.. This would cover both quoting and case sensitivity

zeroshade · 2024-04-25T14:48:38Z

@davlee1972 do you mean just changing what it is called? Or are you saying that it should functionally do more than just disable the quoting?

davlee1972 · 2024-04-25T19:16:18Z

Just the name.. Removing quotes essentially treats the value as an identifier who's resolution to name (no quotes and casing) must be handled by the underlying product.. If a product doesn't support lower case identifiers then passing in dummy without quotes would bomb which is a user generated error when using ADBC..

I also have some reservations using quotes for some non-ADBC supported products..

Sybase comes to mind.. It doesn't support quotes at all..

1> select * from "fund_data" where 1 = 0
2> go

Msg 156, Level 15, State 2:
Server '??????", Line 1:
Incorrect syntax near the keyword 'where'.

This does work in Sybase, but..

"Both Adaptive Server Enterprise and Sybase IQ provide a quoted_identifier option that allows the interpretation of delimited strings to be changed. By default, the quoted_identifier option is set to OFF in Adaptive Server Enterprise, and to ON in Sybase IQ."

1> set quoted_identifier on
2> go
1>
2> select * from "fund_data" where 1=0
3> go
cusip parent_cusip fund_type

zeroshade · 2024-04-25T20:11:47Z

Sure that makes perfect sense to me. I'd be fine with naming it as such.

davlee1972 · 2024-04-25T20:22:33Z

Sorry for the back and forth.. I think it should called "quoted_identifier" = True by default..

This is a common term used across database products..

A quoted identifier begins and ends with double quotation marks (").

https://www.postgresql.org/docs/current/sql-syntax-lexical.html
There is a second kind of identifier: the delimited identifier or quoted identifier. It is formed by enclosing an arbitrary sequence of characters in double-quotes (").

https://docs.oracle.com/cd/B10500_01/server.920/a96540/sql_elements9a.htm
Every database object has a name. In a SQL statement, you represent the name of an object with a quoted identifier or a nonquoted identifier.

https://docs.snowflake.com/en/sql-reference/identifiers-syntax
To work around this limitation, Snowflake provides the QUOTED_IDENTIFIERS_IGNORE_CASE session

zeroshade · 2024-04-25T20:24:27Z

let's shift the discussion to #1766 so that we don't lose track of it

fixes apache#1721

fix(go/adbc/driver/snowflake): handle quotes properly

c83711a

zeroshade requested a review from lidavidm as a code owner April 19, 2024 17:00

github-actions bot added this to the ADBC Libraries 1.0.0 milestone Apr 19, 2024

zeroshade added 5 commits April 19, 2024 13:42

remove excess quotes from postgres tests

300fb9c

more quotes updates

8c5dec8

missed a spot

ae2b298

consistency is annoying

8ef65f8

pre-commit

b82cbe3

zeroshade added 2 commits April 24, 2024 16:38

fix-up quotes based on feedback

42e830b

trailing whitespace gets me again!

e2c3223

lidavidm approved these changes Apr 25, 2024

View reviewed changes

lidavidm mentioned this pull request Apr 25, 2024

go/adbc/driver/snowflake: add option for unquoted table name #1766

Open

zeroshade merged commit 8bd0e9b into apache:main Apr 25, 2024
37 of 39 checks passed

zeroshade deleted the snowflake-quotes branch April 25, 2024 20:12

lidavidm mentioned this pull request Apr 29, 2024

go/adbc/driver/snowflake: quote/escape names in ingestion #1064

Closed

cocoa-xu pushed a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024

fix(go/adbc/driver/snowflake): handle quotes properly (apache#1738)

2855937

fixes apache#1721

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(go/adbc/driver/snowflake): handle quotes properly #1738

fix(go/adbc/driver/snowflake): handle quotes properly #1738

zeroshade commented Apr 19, 2024

lidavidm commented Apr 22, 2024

lidavidm commented Apr 22, 2024

zeroshade commented Apr 22, 2024

CurtHagenlocher commented Apr 22, 2024

lidavidm commented Apr 22, 2024

lidavidm commented Apr 23, 2024

lidavidm commented Apr 23, 2024

jduo commented Apr 23, 2024

jduo commented Apr 23, 2024

davlee1972 commented Apr 23, 2024

zeroshade commented Apr 23, 2024

zeroshade commented Apr 23, 2024

davlee1972 commented Apr 23, 2024 •

edited

Loading

lidavidm commented Apr 24, 2024

lidavidm commented Apr 25, 2024

davlee1972 commented Apr 25, 2024

zeroshade commented Apr 25, 2024

davlee1972 commented Apr 25, 2024 •

edited

Loading

zeroshade commented Apr 25, 2024

davlee1972 commented Apr 25, 2024 •

edited

Loading

zeroshade commented Apr 25, 2024

fix(go/adbc/driver/snowflake): handle quotes properly #1738

fix(go/adbc/driver/snowflake): handle quotes properly #1738

Conversation

zeroshade commented Apr 19, 2024

lidavidm commented Apr 22, 2024

lidavidm commented Apr 22, 2024

zeroshade commented Apr 22, 2024

CurtHagenlocher commented Apr 22, 2024

lidavidm commented Apr 22, 2024

lidavidm commented Apr 23, 2024

lidavidm commented Apr 23, 2024

jduo commented Apr 23, 2024

jduo commented Apr 23, 2024

davlee1972 commented Apr 23, 2024

zeroshade commented Apr 23, 2024

zeroshade commented Apr 23, 2024

davlee1972 commented Apr 23, 2024 • edited Loading

lidavidm commented Apr 24, 2024

lidavidm commented Apr 25, 2024

davlee1972 commented Apr 25, 2024

zeroshade commented Apr 25, 2024

davlee1972 commented Apr 25, 2024 • edited Loading

zeroshade commented Apr 25, 2024

davlee1972 commented Apr 25, 2024 • edited Loading

zeroshade commented Apr 25, 2024

davlee1972 commented Apr 23, 2024 •

edited

Loading

davlee1972 commented Apr 25, 2024 •

edited

Loading

davlee1972 commented Apr 25, 2024 •

edited

Loading