You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We are building a database using the Arrow Flight SQL protocol, and we recently observed an interesting phenomenon. When we create a stream on the server side, it is immediately polled by the server, rather than waiting for the client to start polling it.
After a lengthy debugging process, we discovered that this behavior is triggered by Tonic. The Arrow Flight SQL protocol is based on gRPC, and in the arrow-rs project, we use Tonic to build the gRPC service. When we create a cross-process stream through Tonic, Tonic actually creates a unidirectional stream. At the same time, Tonic spawns a task to continuously poll this stream as soon as it is created, even if the client has not started polling it yet.
We believe this behavior does not align with the design of the Arrow Flight SQL protocol. Typically, a query goes through two phases: get_flight_info and do_get_statement . In the first phase, we create the stream, and in the second phase, we retrieve the stream from the server and consume it on the client side. We expect the stream to start being consumed only in the second phase, but in reality, this is not the case.
On the other hand, our database implements distributed query based on DataFusion. The frontend sends the execution plan to other nodes, the backend executes it and returns a stream along with the statistics. The frontend then optimizes the plan. For example, a count(*) query would be optimized to directly return the num_rows stored in the statistics. As a result, the backend-side stream can be simply discarded. However, tonic would eagerly consume the stream without waiting for the client's poll, which is actually not necessary.
We hope that arrow-rs can provide a way to lazily consume the stream created on the server side, meaning it only starts being consumed when the client polls it.
The text was updated successfully, but these errors were encountered:
niebayes
changed the title
[Discussion] make an Flight SQL server lazily consume streams
[Discussion] make Flight SQL server lazily consume streams
Mar 7, 2025
We have tried to put a bounded channel on the server side to disable eager polling to some extent. But this is definitely not a graceful and general solution.
Specifically:
[Before] server stream -> tonic -> client
[After] server stream -> bounded channel -> tonic -> client
Since the channel is bounded, the server side cannot poll the stream till exhausted, which emulates a lazy stream.
We are building a database using the Arrow Flight SQL protocol, and we recently observed an interesting phenomenon. When we create a stream on the server side, it is immediately polled by the server, rather than waiting for the client to start polling it.
After a lengthy debugging process, we discovered that this behavior is triggered by Tonic. The Arrow Flight SQL protocol is based on gRPC, and in the arrow-rs project, we use Tonic to build the gRPC service. When we create a cross-process stream through Tonic, Tonic actually creates a unidirectional stream. At the same time, Tonic spawns a task to continuously poll this stream as soon as it is created, even if the client has not started polling it yet.
We believe this behavior does not align with the design of the Arrow Flight SQL protocol. Typically, a query goes through two phases:
get_flight_info
anddo_get_statement
. In the first phase, we create the stream, and in the second phase, we retrieve the stream from the server and consume it on the client side. We expect the stream to start being consumed only in the second phase, but in reality, this is not the case.On the other hand, our database implements distributed query based on DataFusion. The frontend sends the execution plan to other nodes, the backend executes it and returns a stream along with the statistics. The frontend then optimizes the plan. For example, a
count(*)
query would be optimized to directly return thenum_rows
stored in the statistics. As a result, the backend-side stream can be simply discarded. However, tonic would eagerly consume the stream without waiting for the client's poll, which is actually not necessary.We hope that arrow-rs can provide a way to lazily consume the stream created on the server side, meaning it only starts being consumed when the client polls it.
The text was updated successfully, but these errors were encountered: