Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] make Flight SQL server lazily consume streams #7248

Open
niebayes opened this issue Mar 7, 2025 · 2 comments
Open

[Discussion] make Flight SQL server lazily consume streams #7248

niebayes opened this issue Mar 7, 2025 · 2 comments
Labels
question Further information is requested

Comments

@niebayes
Copy link
Contributor

niebayes commented Mar 7, 2025

We are building a database using the Arrow Flight SQL protocol, and we recently observed an interesting phenomenon. When we create a stream on the server side, it is immediately polled by the server, rather than waiting for the client to start polling it.

After a lengthy debugging process, we discovered that this behavior is triggered by Tonic. The Arrow Flight SQL protocol is based on gRPC, and in the arrow-rs project, we use Tonic to build the gRPC service. When we create a cross-process stream through Tonic, Tonic actually creates a unidirectional stream. At the same time, Tonic spawns a task to continuously poll this stream as soon as it is created, even if the client has not started polling it yet.

We believe this behavior does not align with the design of the Arrow Flight SQL protocol. Typically, a query goes through two phases: get_flight_info and do_get_statement . In the first phase, we create the stream, and in the second phase, we retrieve the stream from the server and consume it on the client side. We expect the stream to start being consumed only in the second phase, but in reality, this is not the case.

On the other hand, our database implements distributed query based on DataFusion. The frontend sends the execution plan to other nodes, the backend executes it and returns a stream along with the statistics. The frontend then optimizes the plan. For example, a count(*) query would be optimized to directly return the num_rows stored in the statistics. As a result, the backend-side stream can be simply discarded. However, tonic would eagerly consume the stream without waiting for the client's poll, which is actually not necessary.

We hope that arrow-rs can provide a way to lazily consume the stream created on the server side, meaning it only starts being consumed when the client polls it.

@niebayes niebayes added the question Further information is requested label Mar 7, 2025
@niebayes niebayes changed the title [Discussion] make an Flight SQL server lazily consume streams [Discussion] make Flight SQL server lazily consume streams Mar 7, 2025
@niebayes
Copy link
Contributor Author

niebayes commented Mar 7, 2025

We have tried to put a bounded channel on the server side to disable eager polling to some extent. But this is definitely not a graceful and general solution.

Specifically:
[Before] server stream -> tonic -> client
[After] server stream -> bounded channel -> tonic -> client

Since the channel is bounded, the server side cannot poll the stream till exhausted, which emulates a lazy stream.

@tustvold
Copy link
Contributor

tustvold commented Mar 7, 2025

Could you perhaps provide a code reproducer, I'm not sure I follow what you're describing.

FWIW DF physical execution is eager and will immediately start doing work once the streams are created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants