Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(go/adbc/driver): driverbase implementation for connection #1590

Merged
merged 26 commits into from
Mar 19, 2024

Conversation

joellubi
Copy link
Member

@joellubi joellubi commented Mar 4, 2024

Implementation of Connection driver base, along with a refactor of Driver and Database bases.

The bases have been refactored in the following way:

  • The *Impl interface (e.g. DatabaseImpl) now explicitly implements the corresponding adbc interface (e.g. adbc.Database).
  • We now check to guarantee the DatabaseImplBase implements the entire DatabaseImpl interface with stub methods or default implementations.
  • A new interface has been added (e.g. driverbase.Database) which contains all methods the output of driverbase constructor NewDatabase() should be. This helps document and guarantee the "extra" behavior provided by using the driverbase. This interface should be internal to the library.
  • By embedding DatabaseImpl in the database struct (and similarly for the other bases) it automatically inherits implementations coming from the DatabaseImpl. This way we don't need to write out all the implementations a second time, hence the deletes.
  • The Connection base uses a builder for its constructor to register any helper methods (see discussion in comments). The Driver and Database bases use simple function constructors because they don't have any helpers to register. This felt simpler but I can make those into trivial builders as well if we prefer to have consistency between them.

A new DriverInfo type has been introduced to help consolidate the collection and validation of metadata for GetInfo().

There are more small changes such as refactors of the flightsql and snowflake drivers to make use of the added functionality, as well as a new set of tests for the driverbase. Please let me know if anything else could use clarification.

Resolves #1105.

@github-actions github-actions bot added this to the ADBC Libraries 0.11.0 milestone Mar 4, 2024
@lidavidm
Copy link
Member

lidavidm commented Mar 4, 2024

See #1572

@joellubi joellubi force-pushed the driverbase-cnxn-stmt branch from 1cf64ac to 40be93d Compare March 4, 2024 19:34
@joellubi
Copy link
Member Author

joellubi commented Mar 4, 2024

See #1572

@lidavidm I'm just seeing this. Looks like we did some things similarly in our implementations and could likely combine them.

One question I had while going through this: Right now the (Driver|Database|Connection|Statement)ImplBase types partially implement the corresponding interfaces. Most methods are explicitly required to be implemented by the driver developer to satisfy the interface at all. Should the bases instead fully implement the interface (with methods that return NotImplemented errors) as a starting point like the current FlightSQL BaseServer does?

There are benefits I can think of to either approach. Implementing the interface fully makes it easier to get started with a driver that at least compiles with the functionality the author cares about. This has a slight downside that it might make it a bit harder to see what's missing from an implementation to make it complete, at least in terms of static analysis.

What are your thoughts on this?

@lidavidm
Copy link
Member

lidavidm commented Mar 4, 2024

Fully implementing was my intent, Golang is just a bit annoying in making that hard to check

@joellubi
Copy link
Member Author

joellubi commented Mar 4, 2024

Fully implementing was my intent, Golang is just a bit annoying in making that hard to check

@lidavidm That's good to know, I'll give that a try because I think it could help simplify the bases even further. I'm still moving things around, but what do you think about the approach so far? I'd say the most material change here is the consolidation of the GetInfo logic for reusability. Also I think there are a few interesting options for how the logger can be propagated down from Database->Connection->Statement but I just went with a naive implementation for now.

@zeroshade
Copy link
Member

I think driverbase should be under an internal package so that we aren't exposing it to consumers directly

@lidavidm
Copy link
Member

lidavidm commented Mar 5, 2024

How do we want to proceed here?

@joellubi
Copy link
Member Author

joellubi commented Mar 5, 2024

@lidavidm

I'm in the process of putting things back together after having torn them apart a bit in my PR so far. I think I now better understand the goals of the driverbase and how it is meant to be used. Some parts like the Driver and Database should go back to looking more like they did previously, with some additions. Though perhaps it would be helpful to confirm my current understanding. Using Connection as an example:

  • The ConnectionImpl interface describes the set of methods that that a driver author must provide implementations for when using the "driverbase SDK", so to speak.
    • Some of these methods MAY be identical to methods required for adbc.Connection in which case we likely just proxy those calls directly when actually using the resulting connection instance.
    • Some of the methods are different, for example simplified helper methods. In this case the author implements them and gets the more complicated behavior (e.g. GetObjects) "for free" on the resulting connection instance.
  • By using the provided constructors and embedding a ConnectionImplBase, the author actually has a "working" connection that compiles without having to implement any methods since ConnectionImplBase fully implements ConnectionImpl, though mostly with stubs.
  • Then the driver author can incrementally provide implementations for methods they care about, overriding the "base" implementations. The actual connection instance that's used by users is still the one produced by the NewConnection constructor, and it either:
    • Fully provides its own implementation for a method
    • Fully delegates to the driver author's provided impl
    • Or some combination where additional business logic is applied to the driver author's implementation of a method(s) to produce the result of an adbc.Connection method.

Is that an accurate summary?

@joellubi
Copy link
Member Author

joellubi commented Mar 5, 2024

think driverbase should be under an internal package so that we aren't exposing it to consumers directly

@zeroshade @lidavidm Do we ever plan to expose this as a library to external users? I think if we're using it as a means to more efficiently produce our own drivers then internal makes sense, but we also talked about wanting to encourage vendors to write their own ADBC drivers over time. This might be useful in that case, if they're not contributing directly to this repo.

@lidavidm
Copy link
Member

lidavidm commented Mar 5, 2024

Yes, pretty much. I expect Statement to be the hard one since I'd like the 'base' implementation to track all the state (do we have parameters? are we prepared? are we in the middle of an incremental query? etc.) and simplify the driver implementor's life as much as possible.

@lidavidm
Copy link
Member

lidavidm commented Mar 5, 2024

think driverbase should be under an internal package so that we aren't exposing it to consumers directly

@zeroshade @lidavidm Do we ever plan to expose this as a library to external users? I think if we're using it as a means to more efficiently produce our own drivers then internal makes sense, but we also talked about wanting to encourage vendors to write their own ADBC drivers over time. This might be useful in that case, if they're not contributing directly to this repo.

IMO, we can keep it as internal to start with, and open it up once there's demand.

@joellubi
Copy link
Member Author

joellubi commented Mar 5, 2024

Yes, pretty much. I expect Statement to be the hard one since I'd like the 'base' implementation to track all the state (do we have parameters? are we prepared? are we in the middle of an incremental query? etc.) and simplify the driver implementor's life as much as possible.

Ok great. So if it's alright with you, here's one way we could proceed: I'll pull the statement base out of this PR and save it for later, to help focus the scope. I'm in the process of pulling part of your connection implementation into this PR. I can then reach out again once there's something more fully baked to discuss/review and we can see if it's heading in the right direction.

@lidavidm
Copy link
Member

lidavidm commented Mar 5, 2024

Ok, sounds good. I haven't gotten to look at the code here yet but we can proceed here and feel free to steal anything from the other PR. (As you can see I was trying to simplify how GetObjects/GetInfo work)

@zeroshade
Copy link
Member

IMO, we can keep it as internal to start with, and open it up once there's demand.

I agree here, I'd rather keep it internal until it's requested

@joellubi
Copy link
Member Author

joellubi commented Mar 8, 2024

@lidavidm @zeroshade This isn't complete yet, but I've made some changes to they way the Connection gets built and wanted to get your thoughts on this direction.

I ended up breaking the previous large interface into several smaller ones, so that implementors may choose whichever ones are appropriate to use. With a single interface, there was a downside that the use of driverbase was "all or nothing". An example of this would be if the GetObjects logic for a driver doesn't fit well with the GetObjectsCatalogs/GetObjectsDbSchemas/GetObjectsTables combo. It may be easier to just write a custom GetObjects method from scratch but then it's not possible to use any of the other helpers.

By breaking up the interfaces and electing to only use the ones that are relevant, it allows implementors to mix/match custom methods with those provided by the driverbase. I've used a builder pattern to specify the helpers being used which makes it quite easy to discover which helpers are available to implement. Open to feedback on this approach. The way this looks in practice:

For driver_test, all of the helpers are used to demonstrate the behavior:

func (db *databaseImpl) Open(ctx context.Context) (adbc.Connection, error) {
	cnxn := &connectionImpl{ConnectionImplBase: driverbase.NewConnectionImplBase(&db.DatabaseImplBase), db: db}
	bldr := driverbase.NewConnectionBuilder(cnxn)
	return bldr.
		WithAutocommitSetter(cnxn).
		WithCurrentNamespacer(cnxn).
		WithTableTypeLister(cnxn).
		WithDriverInfoPreparer(cnxn).
		WithDbObjectsEnumerator(cnxn).
		Connection(), nil
}

For the flightsql_driver, CurrentNamespacer is not specified because sessions are not supported (yet) and TableTypeLister just isn't helpful because the data is already coming in as Arrow records, so there's not much sense in converting back/forth to a string list.

return driverbase.NewConnectionBuilder(conn).
		WithDriverInfoPreparer(conn).
		WithAutocommitSetter(conn).
		WithDbObjectsEnumerator(conn).
		Connection(), nil

For the snowflake driver, DriverInfoPreparer is not used because it doesn't send any driver info beyond the default build info, and a custom GetObjects method is used for this driver so we skip using DbObjectsEnumerator.

return driverbase.NewConnectionBuilder(conn).
		WithAutocommitSetter(conn).
		WithCurrentNamespacer(conn).
		WithTableTypeLister(conn).
		Connection(), nil

A few things to note about this approach: Each method on the builder enforces that those methods are implemented at compile time, which is nice. In these examples the connectionImpl itself implements each interface, which is nice because we can use the common connections state in each one, but with multiple interfaces it's now possible to start splitting some of these out into their own structs if they can manage their own state. This may be helpful for factoring the drivers into smaller components as they grow larger.

The names for the interfaces are kind of awkward, open to any suggestions. What do you think of this approach?

@lidavidm
Copy link
Member

lidavidm commented Mar 9, 2024

I'll try to take a look over the weekend or next week - just want to give you a heads up so you don't feel you're being ignored

@lidavidm
Copy link
Member

lidavidm commented Mar 9, 2024

Ok, just scanning through things - I think this approach is reasonable. The naming could go either way to me, it's clear enough

@joellubi
Copy link
Member Author

Thanks @lidavidm, I'll proceed with this approach then. I think that making this module internal and testing the pattern with a few more of our own driver implementations will help us fix anything we may be overlooking now as well.

@joellubi joellubi force-pushed the driverbase-cnxn-stmt branch from b20b71e to 281ba36 Compare March 15, 2024 16:19
@joellubi
Copy link
Member Author

@lidavidm @zeroshade I think I've just about wrapped this up. The driverbase package is now internal, and I've rebased on top of the recent session options changes and refactored the flightsql current catalog/schema code to use the helper methods. Let me know what you think.

There is one remaining issue that I have yet to resolve, if you have any ideas what the issue might be it would be very helpful: There is an integration test that's been failing (Integration / FlightSQL Integration Tests (Dremio and SQLite)). It seems to be related to these changes but I haven't yet figured out how. I'm seeing errors like this for GetInfo from sqlite:

Expected: has substring "db_name"
  Actual: "n\a\0\0\0\0\0"

and

Expected: has substring "sqlite 3"
  Actual: "n\a\0\0\0\0\0\0"

although the "actual" values have varied between runs. Any thoughts on what this might be?

@joellubi joellubi marked this pull request as ready for review March 15, 2024 18:44
@joellubi joellubi requested a review from zeroshade as a code owner March 15, 2024 18:44
@joellubi joellubi changed the title refactor(go/adbc/driver): connection and statement bases refactor(go/adbc/driver): driverbase implementation for connection Mar 15, 2024
@lidavidm
Copy link
Member

If the value is varying I'd expect some sort of use-after-free, or it's pointing into the wrong part/a garbage part of a buffer

@zeroshade
Copy link
Member

@joellubi I dug into this a bit and found something:

If you follow into PrepareDriverInfo and go down to line 579, we grab the value using array.String.Value. In the debugger, we see that v gets the correct value and we pass it to RegisterInfoCode. After the next call to rdr.Next, the record we pulled the value from is released. When stepping through with gdb, the memory location where the string was held gets filled with garbage upon that release. My theory is ASAN is blowing it up...

See, when we create the string from the underlying buffer we do an unsafe cast: a.values = *(*string)(unsafe.Pointer(&b)) and then the Value method takes a slice from that string. When we release the array, which releases the buffer, I'm guessing ASAN is filling the bytes with garbage since it's not tracking things properly due to our unsafe casting here.

If we're willing to drop Go 1.19 support in Arrow (which is safe now that Go 1.22 is out...) we can switch it to use unsafe.String instead which might be safer and solve this issue. Alternately, instead of needing a change in Arrow, we could just make sure that PrepareDriverInfo or RegisterInfoCode forces a copy of the string bytes and stores the copy in the map rather than referring to the same bytes. Since the string value for these info codes should be pretty small, it shouldn't be a problem, that should potentially fix the issue, though I haven't had the time to test that as a fix yet. Figured I'd report what I found though

@joellubi joellubi force-pushed the driverbase-cnxn-stmt branch from 326c1f7 to 20915b1 Compare March 17, 2024 17:59
@zeroshade
Copy link
Member

Huzzah! Looks like that fixed it

@joellubi
Copy link
Member Author

Thanks @zeroshade! Great timing, I was just looking into this too. Much appreciated. I've gotta say, I learned a bit about Go strings here. I understood them to be "immutable", but of course the devil is in the details with what that actually means.

@joellubi
Copy link
Member Author

With that out of the way, I'd say this is ready for review now.

@lidavidm
Copy link
Member

@zeroshade we've run into that before, there's a place somewhere in the current codebase with a comment about making an explicit copy

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

@zeroshade any comments?

@zeroshade zeroshade merged commit 3022428 into apache:main Mar 19, 2024
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

go: extend driver framework to cover connections
3 participants