`pub fn with_projection(self, mask: ProjectionMask) -> Self` can read data from the provided column indexes, but the entire parquet file still has to be supplied as input. FYI, arrow2 can read the data for a single column on its own:

```rust
let page_meta_data = PageMetaData {
    column_start: meta.offset,
    num_values: meta.num_values as i64,
    compression: Self::to_parquet_compression(compression)?,
    descriptor: column_descriptor.descriptor.clone(),
};
let pages = PageReader::new_with_page_meta(
    chunk,
    page_meta_data,
    Arc::new(|_, _| true),
    vec![],
    usize::MAX,
);
```

```rust
/// A fallible [`Iterator`] of [`CompressedDataPage`]. This iterator reads pages back
/// to back until all pages have been consumed.
/// The pages from this iterator always have [`None`] [`crate::page::CompressedDataPage::selected_rows()`] since
/// filter pushdown is not supported without a
/// pre-computed [page index](https://github.com/apache/parquet-format/blob/master/PageIndex.md).
pub struct PageReader<R: Read> {
}
```

`PageReader` can consume all pages.
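For context, here is a minimal sketch of how the projection API from the question is typically used with arrow-rs's sync reader. The file name and the leaf column indexes are hypothetical; only the projected columns are decoded, but the reader is handed the whole file:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file name; the builder reads the footer and metadata first.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Select leaf columns 0 and 2 by index (indexes are placeholders).
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 2]);

    // Only the projected columns are decoded into record batches.
    let reader = builder.with_projection(mask).build()?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```

`ProjectionMask::leaves` selects by leaf column index in the parquet schema, while `ProjectionMask::roots` selects by top-level field, which matters for nested schemas.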
The readers automatically perform IO pushdown: they will only fetch the byte ranges needed. This includes column projection, and extends through to row group and page pruning, late materialization, etc. The readers aim to be batteries included; you shouldn't need to worry about pages, column chunks, etc., it will just do the right thing.
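To illustrate the pushdown described above, here is a sketch using the async reader, assuming the parquet crate's `async` feature plus `tokio` and `futures` as dependencies; the file name and column index are placeholders:

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use parquet::arrow::ProjectionMask;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tokio::fs::File implements AsyncFileReader, so the builder can issue
    // targeted range reads instead of loading the whole file into memory.
    let file = tokio::fs::File::open("data.parquet").await?;
    let builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    // Project a single leaf column; only its byte ranges are fetched.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    let stream = builder.with_projection(mask).build()?;
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```

Because the builder is handed an `AsyncFileReader` rather than a fully materialized buffer, it can issue range requests for just the footer, the metadata, and the projected column chunks, which is the IO pushdown behavior described above.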