Expose arrow_json::writer::encoder::make_encoder_impl() for re-use in custom encoders #5733
Comments
I think the encoder details are the kind of thing that would be good to keep as a private implementation detail. Perhaps we might provide a way to indicate that a given StringArray should be encoded as-is, without escaping?
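To make the suggestion concrete, here is a minimal, self-contained sketch (toy types, not arrow-json's actual API; `Encoder`, `EscapedStringEncoder`, and `RawJsonEncoder` are hypothetical names) contrasting the default behaviour — escaping the string — with an opt-in "raw" mode that embeds pre-validated JSON verbatim:

```rust
// Toy model of a per-column JSON encoder. The real arrow-json Encoder trait
// differs; this only illustrates the escaped vs. as-is distinction.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

/// Default behaviour: treat the value as a JSON string and escape it.
/// (Simplified: real escaping also handles control characters.)
struct EscapedStringEncoder<'a> {
    values: &'a [&'a str],
}

impl Encoder for EscapedStringEncoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        out.push(b'"');
        for b in self.values[idx].bytes() {
            match b {
                b'"' => out.extend_from_slice(b"\\\""),
                b'\\' => out.extend_from_slice(b"\\\\"),
                _ => out.push(b),
            }
        }
        out.push(b'"');
    }
}

/// Hypothetical opt-in: the column already holds valid JSON, so copy it as-is.
struct RawJsonEncoder<'a> {
    values: &'a [&'a str],
}

impl Encoder for RawJsonEncoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        out.extend_from_slice(self.values[idx].as_bytes());
    }
}

fn main() {
    let col = [r#"{"a":1}"#];
    let mut escaped = Vec::new();
    EscapedStringEncoder { values: &col }.encode(0, &mut escaped);
    let mut raw = Vec::new();
    RawJsonEncoder { values: &col }.encode(0, &mut raw);
    println!("{}", String::from_utf8(escaped).unwrap()); // "{\"a\":1}"
    println!("{}", String::from_utf8(raw).unwrap());     // {"a":1}
}
```

The same payload comes out either as an escaped JSON string or as an embedded JSON object, which is exactly the choice such a flag would expose.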
We also needed something like this for this exact use case: we have string columns that contain JSON. I know these APIs can get gross; I had to deal with something similar for pgpq, because if you have an Arrow UTF8 column you have to choose whether to make it Postgres TEXT or JSONB.
I have the same use case. I am using https://github.com/datafusion-contrib/datafusion-functions-json and the arrow_json writer. Would adding support for encoding unions to arrow_json be sufficient? I imagine it would be an untagged union, just flattening away the union itself. Since the json functions repo produces a sparse union, that would work well for this case. We could either error for dense unions, as we do today, or agree on a way to flatten the union by adding explicit nulls. Thoughts? I could put up a PR for sparse unions if we agree it's a good solution.
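The untagged flattening described above can be sketched with toy types (hypothetical names, not arrow code): each row of a sparse union selects one child by `type_id`, and the JSON output is just that child's value, with no wrapper object naming the variant.

```rust
use std::collections::BTreeMap;

// Toy sketch: `children` maps a union type_id to that child's pre-encoded
// JSON values, one slot per row (sparse layout: every child covers all rows,
// only the selected slot is meaningful).
fn encode_union_row(
    type_ids: &[i8],
    children: &BTreeMap<i8, Vec<String>>,
    idx: usize,
    out: &mut String,
) {
    let tid = type_ids[idx];
    // The union itself is flattened away: emit only the selected child's JSON.
    out.push_str(&children[&tid][idx]);
}

fn main() {
    // Row 0 selects an Int64-like child, row 1 a Utf8-like child.
    let type_ids = [0i8, 1i8];
    let mut children = BTreeMap::new();
    children.insert(0i8, vec!["1".to_string(), "0".to_string()]);
    children.insert(1i8, vec!["\"\"".to_string(), "\"a\"".to_string()]);

    let mut out = String::new();
    encode_union_row(&type_ids, &children, 0, &mut out);
    out.push(',');
    encode_union_row(&type_ids, &children, 1, &mut out);
    println!("[{}]", out); // [1,"a"]
}
```

The output mixes value types per row, which is what an untagged union implies; a dense union would additionally need the per-row value offsets resolved before indexing the child.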
I forgot about this issue and recently opened apache/datafusion#14297. I'm happy either for that to get across the line (it also has other use cases and would make it easy to add a union encoder) or for a PR implementing support for unions directly. I think both make sense.
OK, if adding both makes sense, I'll try adding the union encoder directly to the arrow_json implementation and see how that works out; there might be some ugly edge cases I haven't considered yet. As for the DataFusion PR, I don't quite see how it enables adding a union encoder — could you elaborate?
Sorry, wrong link: #7015
OK, so it turns out that the json functions don't use the union as I was expecting: they still encode objects directly as strings, so encoding the union directly does not solve my use case. Here is my WIP diff of changes:

```diff
diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index ed430fe6a..876c499ab 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -24,7 +24,7 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
+use std::{collections::BTreeMap, io::Write};

 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
@@ -138,6 +138,26 @@ fn make_encoder_impl<'a>(
             };
             (Box::new(encoder) as _, array.nulls().cloned())
         }
+        DataType::Union(fields, _) => {
+            let array = array.as_union();
+            let encoders = fields
+                .iter()
+                .map(|(type_id, field)| {
+                    let (encoder, nulls) = make_encoder_impl(array.child(type_id), options)?;
+                    Ok((
+                        type_id,
+                        FieldEncoder {
+                            field: field.clone(),
+                            encoder,
+                            nulls,
+                        },
+                    ))
+                })
+                .collect::<Result<BTreeMap<_, _>, ArrowError>>()?;
+
+            let encoder = UnionArrayEncoder { array, encoders };
+            (Box::new(encoder) as _, array.nulls().cloned())
+        }
         DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => {
             let options = FormatOptions::new().with_display_error(true);
             let formatter = ArrayFormatter::try_new(array, &options)?;
@@ -210,6 +230,20 @@ impl Encoder for StructArrayEncoder<'_> {
     }
 }

+struct UnionArrayEncoder<'a> {
+    array: &'a UnionArray,
+    encoders: BTreeMap<i8, FieldEncoder<'a>>,
+}
+
+impl Encoder for UnionArrayEncoder<'_> {
+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        let type_id = self.array.type_id(idx);
+        if let Some(f) = self.encoders.get_mut(&type_id) {
+            f.encoder.encode(idx, out);
+        }
+    }
+}
+
 trait PrimitiveEncode: ArrowNativeType {
     type Buffer;
```

It's a pretty simple change if someone wants to take it further. Given #7015, however, its usefulness is likely diminished.
In our project we have a set of JSON serialization rules not covered by arrow_json. For example, there are binary columns that already contain valid JSON which just needs to be embedded "as is" into the output stream. We also have cases where we'd like to serialize some StructArrays in a non-canonical way without any prior transformation. A public make_encoder_impl() would allow us to re-use the existing arrow encoders in our serializers.
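The composition pattern being requested can be sketched with toy stand-ins (none of these names are arrow-json's real API; `make_default_encoder` stands in for a public `make_encoder_impl()`): a custom encoder handles the special columns, and the library-provided factory is reused for everything else.

```rust
// Toy per-value encoder, standing in for arrow-json's internal Encoder trait.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

struct IntEncoder<'a>(&'a [i64]);
impl Encoder for IntEncoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        out.extend_from_slice(self.0[idx].to_string().as_bytes());
    }
}

/// Stand-in for a public make_encoder_impl(): builds the default encoder
/// for an ordinary column.
fn make_default_encoder<'a>(ints: &'a [i64]) -> Box<dyn Encoder + 'a> {
    Box::new(IntEncoder(ints))
}

/// Custom rule: the column already holds valid JSON bytes, embed them verbatim.
struct RawJson<'a>(&'a [&'a str]);
impl Encoder for RawJson<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        out.extend_from_slice(self.0[idx].as_bytes());
    }
}

fn main() {
    let ints = [42i64];
    let raws = [r#"{"k":true}"#];
    // The custom serializer mixes library-built and hand-written encoders.
    let mut default_enc = make_default_encoder(&ints);
    let mut custom_enc = RawJson(&raws);

    let mut out = Vec::new();
    out.extend_from_slice(b"{\"n\":");
    default_enc.encode(0, &mut out);
    out.extend_from_slice(b",\"payload\":");
    custom_enc.encode(0, &mut out);
    out.push(b'}');
    println!("{}", String::from_utf8(out).unwrap()); // {"n":42,"payload":{"k":true}}
}
```

The point is the seam: once the factory is public, a downstream serializer only has to implement encoders for its special cases and can delegate every other column to the library.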