
Commit 1bf58f4

duanmeng authored and facebook-github-bot committed

docs: Add query tracing blog (facebookincubator#11865)

Summary: Add a query tracing blog to the Velox website.

Original design doc: https://docs.google.com/document/d/1crIIeVz4tWKYQnBoHoxrv2i-4zAML9HSYLps8h5SDrc/edit?tab=t.0#heading=h.y6j2ojtr7hm9

Pull Request resolved: facebookincubator#11865
Reviewed By: amitkdutta, tanjialiang
Differential Revision: D67494177
Pulled By: xiaoxmeng
fbshipit-source-id: 123d58797ef1e38aad4284e0ccc5ad12548b3740
1 parent 3810d26 commit 1bf58f4

File tree

4 files changed: +373 −25 lines


velox/docs/develop/debugging/tracing.rst

+198 −25
@@ -59,30 +59,135 @@ There are three types of writers: `TaskTraceMetadataWriter`, `OperatorTraceInput
and `OperatorTraceSplitWriter`. They are used in the prod or shadow environment to record
the real execution data.

**TaskTraceMetadataWriter**

The `TaskTraceMetadataWriter` records the query metadata during task creation, serializes it,
and saves it into a file in JSON format. There are two types of metadata:

1. **Query Configurations and Connector Properties**: These are user-specified per query and can
   be serialized as JSON map objects (key-value pairs).
2. **Task Plan Fragment** (also known as the plan node tree): This can be serialized as a JSON
   object, a feature already supported in Velox (see `#4614 <https://github.com/facebookincubator/velox/issues/4614>`_,
   `#4301 <https://github.com/facebookincubator/velox/issues/4301>`_, and `#4398 <https://github.com/facebookincubator/velox/issues/4398>`_).

The metadata is saved as a single JSON object string in the metadata file. It would look similar
to the following simplified, pretty-printed JSON string (with some content removed for brevity):

.. code-block:: JSON

   {
     "planNode": {
       "nullAware": false,
       "outputType": {...},
       "leftKeys": [...],
       "rightKeys": [...],
       "joinType": "INNER",
       "sources": [
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "0",
           "name": "TableScanNode"
         },
         {
           "outputType": {...},
           "tableHandle": {...},
           "assignments": [...],
           "id": "1",
           "name": "TableScanNode"
         }
       ],
       "id": "2",
       "name": "HashJoinNode"
     },
     "connectorProperties": {...},
     "queryConfig": {"query_trace_node_ids": "2", ...}
   }

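Because both maps are plain key-value JSON, the metadata is easy to inspect outside of Velox.
For example, a minimal sketch using `folly::parseJson` (the `metadataJson` string holding the
raw file contents is an assumed input):

.. code-block:: c++

   #include <folly/json.h>

   // Minimal sketch: inspect the traced metadata outside of Velox.
   // `metadataJson` (the raw file contents) is an assumed input.
   folly::dynamic meta = folly::parseJson(metadataJson);
   // Key-value maps, exactly as the user specified them per query.
   const folly::dynamic& queryConfig = meta["queryConfig"];
   const folly::dynamic& connectorProperties = meta["connectorProperties"];
   // The traced plan node ID(s), e.g. "2" in the example above.
   const std::string traceNodeIds = queryConfig["query_trace_node_ids"].asString();
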
**OperatorTraceInputWriter**

The `OperatorTraceInputWriter` records the input vectors from the target operator. It uses a Presto
serializer to serialize each vector batch and flushes immediately to ensure that replay is possible
even if a crash occurs during execution.

It is created during the target operator's initialization and writes data in the `Operator::addInput`
method during execution. It finishes when the target operator is closed. However, it can finish early
if the recorded data size exceeds the limit specified by the user.
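
The write-until-limit behavior can be sketched as follows (a simplified standalone sketch; the
real `OperatorTraceInputWriter` API may differ, and all names here are illustrative):

.. code-block:: c++

   #include <cstdint>

   // Simplified standalone sketch of the write-until-limit behavior; the
   // real OperatorTraceInputWriter API may differ.
   class InputTraceWriterSketch {
    public:
     explicit InputTraceWriterSketch(uint64_t maxBytes) : maxBytes_(maxBytes) {}

     // Called from Operator::addInput for each input batch. Returns false
     // once tracing has finished early due to the user-specified size limit.
     bool write(uint64_t batchBytes) {
       if (finished_) {
         return false;
       }
       // In Velox, the batch is serialized with the Presto serializer and
       // flushed immediately so a crash cannot lose recorded input.
       tracedBytes_ += batchBytes;
       if (tracedBytes_ > maxBytes_) {
         finish(); // Finish early: limit exceeded.
       }
       return !finished_;
     }

     // Called when the target operator is closed.
     void finish() {
       finished_ = true;
     }

    private:
     const uint64_t maxBytes_;
     uint64_t tracedBytes_{0};
     bool finished_{false};
   };
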
**OperatorTraceSplitWriter**

The `OperatorTraceSplitWriter` captures the input splits from the target `TableScan` operator. It
serializes each split and immediately flushes it to ensure that replay is possible even if a crash
occurs during execution.

Each split is serialized as follows:

.. code-block:: c++

   | length : uint32_t | split : JSON string | crc32 : uint32_t |
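
A self-contained sketch of this framing (illustrative only; Velox's actual writer, checksum
variant, and endianness handling may differ):

.. code-block:: c++

   #include <cstdint>
   #include <cstring>
   #include <string>
   #include <vector>

   // CRC-32 (reflected, polynomial 0xEDB88320), bitwise form for brevity.
   uint32_t crc32(const uint8_t* data, size_t len) {
     uint32_t crc = 0xFFFFFFFFu;
     for (size_t i = 0; i < len; ++i) {
       crc ^= data[i];
       for (int b = 0; b < 8; ++b) {
         crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
       }
     }
     return crc ^ 0xFFFFFFFFu;
   }

   // Frames one split in the layout described above:
   // | length : uint32_t | split : JSON string | crc32 : uint32_t |
   std::vector<uint8_t> frameSplit(const std::string& splitJson) {
     std::vector<uint8_t> out(
         sizeof(uint32_t) + splitJson.size() + sizeof(uint32_t));
     const uint32_t length = static_cast<uint32_t>(splitJson.size());
     const uint32_t checksum = crc32(
         reinterpret_cast<const uint8_t*>(splitJson.data()), splitJson.size());
     std::memcpy(out.data(), &length, sizeof(length));
     std::memcpy(out.data() + sizeof(length), splitJson.data(), splitJson.size());
     std::memcpy(
         out.data() + sizeof(length) + splitJson.size(), &checksum, sizeof(checksum));
     return out;
   }
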

Storage Location
^^^^^^^^^^^^^^^^

It is recommended to store traced data in a remote storage system to ensure its preservation and
accessibility even if the computation clusters are reconfigured or encounter issues. This also
helps prevent nodes in the cluster from failing due to local disk exhaustion.

Users should start by creating a root directory. Writers will then create subdirectories within
this root directory to organize the traced data. A well-designed directory structure will keep
the data organized and accessible for replay and analysis.

**Metadata Location**

The `TaskTraceMetadataWriter` is set up during task creation, so it creates a trace directory
named `$rootDir/$queryId/$taskId`.

**Input Data and Split Location**

The task generates drivers and operators, each identified by a set of IDs. Each driver is
assigned a pipeline ID and a driver ID. Pipeline IDs are sequential numbers starting from zero,
and driver IDs are also sequential numbers starting from zero but are scoped to a specific pipeline,
ensuring uniqueness within that pipeline. Additionally, each operator within a driver is assigned a
sequential operator ID, starting from zero and unique within the driver.

The node ID consolidates the tracing for the same traced plan node. The pipeline ID isolates the
tracing data between operators created from the same plan node (e.g., HashProbe and HashBuild from
the HashJoinNode). The driver ID isolates the tracing data of peer operators in the same pipeline
from different drivers.

Correspondingly, to keep the tracing data organized and isolated, the `OperatorTraceInputWriter`
and `OperatorTraceSplitWriter` are set up during operator initialization and create a data or split
tracing directory in `$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId`, as sketched below.
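
For illustration, the resulting directory for one operator's trace data can be composed as
follows (a hypothetical helper; only the layout itself comes from the text above):

.. code-block:: c++

   #include <cstdint>
   #include <string>

   // Hypothetical helper composing the documented trace directory layout:
   // $rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId
   std::string operatorTraceDir(
       const std::string& rootDir,
       const std::string& queryId,
       const std::string& taskId,
       const std::string& nodeId,
       uint32_t pipelineId,
       uint32_t driverId) {
     return rootDir + "/" + queryId + "/" + taskId + "/" + nodeId + "/" +
         std::to_string(pipelineId) + "/" + std::to_string(driverId);
   }
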

The following is a typical `HashJoinNode` traced metadata and data storage directory structure:

.. code-block:: SHELL

    trace ----------------------------------------------------> rootDir
    └── query-1 --------------------------------------------> query ID
        └── task-1 ---------------------------------------> task ID
            ├── 2 ------------------------------------> node ID
            │   ├── 0 ----------------------> pipeline ID (probe)
            │   │   ├── 0 ------------------------> driver ID (0)
            │   │   │   ├── op_input_trace.data
            │   │   │   └── op_trace_summary.json
            │   │   └── 1 ------------------------> driver ID (1)
            │   │       ├── op_input_trace.data
            │   │       └── op_trace_summary.json
            │   └── 1 ----------------------> pipeline ID (build)
            │       ├── 0 ------------------------> driver ID (0)
            │       │   ├── op_input_trace.data
            │       │   └── op_trace_summary.json
            │       └── 1 ------------------------> driver ID (1)
            │           ├── op_input_trace.data
            │           └── op_trace_summary.json
            └── task_trace_meta.json ------------------> query metadata

Memory Management
^^^^^^^^^^^^^^^^^

A new leaf system memory pool named `tracePool` is added for tracing memory usage. It is
exposed through `memory::MemoryManager::getInstance()->tracePool()`.

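For example (a minimal sketch; only the `tracePool()` accessor comes from the text above, the
allocation calls are standard `MemoryPool` usage):

.. code-block:: c++

   // Minimal sketch: account tracing allocations against the dedicated pool.
   auto* pool = facebook::velox::memory::MemoryManager::getInstance()->tracePool();
   void* buffer = pool->allocate(4096); // charged to tracePool
   // ... serialize trace data into the buffer ...
   pool->free(buffer, 4096);
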
Query Trace Readers
^^^^^^^^^^^^^^^^^^^
@@ -91,20 +196,93 @@ Three types of readers correspond to the query trace writers: `TaskTraceMetadata
`OperatorTraceInputReader`, and `OperatorTraceSplitReader`. The replayers typically use
them in the local environment, which will be described in detail in the Query Trace Replayer section.

**TaskTraceMetadataReader**

The `TaskTraceMetadataReader` can load the query metadata JSON file and extract the query
configurations, connector properties, and a plan fragment. The replayer uses these to build
a replay task.

**OperatorTraceInputReader**

The `OperatorTraceInputReader` reads and deserializes the input vectors in a tracing data file.
It is created and used by an `OperatorTraceScan` operator, which is described in detail in
the **Trace Scan** section.

**OperatorTraceSplitReader**

The `OperatorTraceSplitReader` reads and deserializes the input splits in tracing split info files,
and produces a list of `exec::Split` for the query replay.

Trace Scan
^^^^^^^^^^

As outlined in the **How Tracing Works** section, replaying a non-leaf operator requires a
specialized source operator. This operator is responsible for reading the data recorded during the
tracing phase and integrates with Velox's `LocalPlanner` through a customized plan node and
operator translator.

**TraceScanNode**

We introduce a customized `TraceScanNode` to replay a non-leaf operator. This node acts as
the source node and creates a specialized scan operator, known as `OperatorTraceScan`, one
per driver during the replay. The `TraceScanNode` contains the trace directory of the
designated trace node, its associated pipeline ID, and a driver ID list passed in by users
during the replay so that the `OperatorTraceScan` can locate the right trace input data or
split directory.

**OperatorTraceScan**

As described in the **Storage Location** section, a plan node may be split into multiple pipelines,
and each pipeline can be divided into multiple operators. Each operator corresponds to a driver, which
is a thread of execution. There may be multiple tracing data files for a single plan node, one file
per driver.

To identify the correct input data file associated with a specific `OperatorTraceScan` operator, it
leverages the trace node directory, pipeline ID, and driver ID list supplied by the `TraceScanNode`.

During the replay process, it uses its own driver ID as an index to extract the replay driver ID from
the driver ID list in the `TraceScanNode`. Along with the trace node directory and pipeline ID from
the `TraceScanNode`, it locates its corresponding input data file.

Correspondingly, an `OperatorTraceScan` operator uses a trace data file in
`$rootDir/$queryId/$taskId/$nodeId/$pipelineId/$driverId` to create an
`OperatorTraceInputReader`. The `OperatorTraceScan::getOutput` method returns the vectors read by
its `OperatorTraceInputReader` in the same order as they were originally processed in the
production execution. This ensures that the replay maintains the same data flow as the original
production execution.

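A sketch of this lookup (hypothetical names; only the directory layout and the indexing scheme
come from the text above):

.. code-block:: c++

   #include <cstdint>
   #include <string>
   #include <vector>

   // Hypothetical sketch: locate this scan's input data directory by mapping
   // its own driver ID through the user-supplied driver ID list.
   std::string traceInputDir(
       const std::string& traceNodeDir, // $rootDir/$queryId/$taskId/$nodeId
       uint32_t pipelineId,
       const std::vector<uint32_t>& replayDriverIds,
       uint32_t myDriverId) {
     // The operator's own driver ID is an index into the traced driver ID list.
     const uint32_t tracedDriverId = replayDriverIds.at(myDriverId);
     return traceNodeDir + "/" + std::to_string(pipelineId) + "/" +
         std::to_string(tracedDriverId);
   }
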
Query Trace Replayer
^^^^^^^^^^^^^^^^^^^^

The query trace replayer is typically used in the local environment and works as follows:

1. Use `TaskTraceMetadataReader` to load the traced query configurations, connector properties,
   and plan fragment.
2. Extract the target plan node from the plan fragment using the specified plan node ID.
3. Use the target plan node from step 2 to create a replay plan node, and create a replay plan
   using `exec::test::PlanBuilder`.
4. If the target plan node is a `TableScanNode`:

   - Add the replay plan node to the replay plan as the source node.
   - Get all the traced splits using `OperatorTraceSplitReader`.
   - Use the splits as inputs for task replaying.

5. For a non-leaf operator, add a `TraceScanNode` as the source node to the replay plan and
   then add the replay plan node.
6. Use `exec::test::AssertQueryBuilder` to add the sink node, apply the query
   configurations (with tracing disabled) and connector properties, and execute the replay plan.

The `OperatorReplayBase` provides the core functionality required for replaying an operator.
It handles the retrieval of metadata, creation of the replay plan, and execution of the plan.
Concrete operator replayers, such as `HashJoinReplayer` and `AggregationReplayer`, extend this
base class and override the `createPlanNode` method to create the specific plan node.
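
Putting the steps together, the plan construction inside a replayer might look roughly like the
following (a hedged sketch: `traceScan`, the `addNode` usage, and the exact argument shapes are
assumptions, not confirmed Velox API):

.. code-block:: c++

   #include "velox/exec/tests/utils/AssertQueryBuilder.h"
   #include "velox/exec/tests/utils/PlanBuilder.h"

   using namespace facebook::velox;

   // Hedged sketch of steps 1-6 above; helper names and signatures are
   // assumptions, not confirmed Velox API.
   RowVectorPtr replayOnce(
       const std::string& nodeTraceDir,
       uint32_t pipelineId,
       const std::vector<uint32_t>& driverIds,
       const RowTypePtr& inputType,
       const std::unordered_map<std::string, std::string>& queryConfigs,
       memory::MemoryPool* pool) {
     auto plan =
         exec::test::PlanBuilder()
             // Step 5: a TraceScanNode as the source, pointing at the traced
             // data (assumed helper).
             .traceScan(nodeTraceDir, pipelineId, driverIds, inputType)
             // Step 3: re-create the target plan node on top of the source;
             // concrete replayers supply this via createPlanNode.
             .addNode([&](const std::string& id, const core::PlanNodePtr& source) {
               return createReplayNode(id, source); // hypothetical
             })
             .planNode();

     // Step 6: apply the traced query configs (tracing disabled) and run.
     return exec::test::AssertQueryBuilder(plan)
         .configs(queryConfigs)
         .copyResults(pool);
   }
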

Query Trace Tool Usage
----------------------

Enable tracing using the configurations described in https://facebookincubator.github.io/velox/configs.html#tracing.
After the traced query finishes, its metadata and the input data for the target tasks and operators
are all saved in the directory specified by `query_trace_dir`.

@@ -188,8 +366,3 @@ Here is a full list of supported command line arguments.
* ``--shuffle_serialization_format``: Specify the shuffle serialization format.
* ``--table_writer_output_dir``: Specify the output directory of TableWriter.
* ``--hiveConnectorExecutorHwMultiplier``: Hardware multiplier for hive connector.
