
Expose arrow_json::writer::encoder::make_encoder_impl() for re-use in custom encoders #5733

Open · eldargab opened this issue May 7, 2024 · 7 comments
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@eldargab commented May 7, 2024

In our project we have a set of JSON serialization rules that aren't covered by arrow_json.

For example, there are binary columns that already contain valid JSON which just needs to be embedded as-is into the output stream. We also have cases where we'd like to serialize some StructArrays in a non-canonical way without any prior transformation.

A public make_encoder_impl() would allow us to re-use the existing arrow encoders in our own serializers.
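
For illustration, here is a minimal sketch of the kind of custom encoder we have in mind, assuming the private Encoder trait as it appears in arrow-json/src/writer/encoder.rs (RawJsonEncoder and its passthrough behavior are hypothetical):

use arrow_array::BinaryArray;

// Mirror of the private Encoder trait in arrow-json/src/writer/encoder.rs.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

// Hypothetical passthrough encoder for a binary column whose values are
// already valid JSON documents: bytes are copied verbatim, unescaped.
struct RawJsonEncoder<'a> {
    array: &'a BinaryArray,
}

impl Encoder for RawJsonEncoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // value(idx) yields the raw &[u8] for row idx; null handling is
        // left to the caller, as with the built-in encoders.
        out.extend_from_slice(self.array.value(idx));
    }
}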

eldargab added the enhancement label on May 7, 2024
@tustvold (Contributor) commented May 7, 2024

I think the encoder internals are the kind of thing that would be good to keep as a private implementation detail.

Perhaps we might provide a way to indicate that a given StringArray should be encoded as-is, without escaping?
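
For concreteness, a hedged sketch of how such an opt-in might surface, assuming it were added to arrow-json's EncoderOptions (the raw_json_columns field below is hypothetical and does not exist upstream):

// Sketch only: EncoderOptions exists in arrow-json, but the
// raw_json_columns field is hypothetical.
#[derive(Debug, Clone, Default)]
pub struct EncoderOptions {
    // Existing option: write nulls explicitly instead of omitting keys.
    pub explicit_nulls: bool,
    // Hypothetical: names of string columns whose values are already
    // valid JSON and should be written verbatim, without escaping.
    pub raw_json_columns: Vec<String>,
}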

@adriangb (Contributor) commented Sep 26, 2024

We also needed something like this for exactly this use case: we have string columns that are JSON.
But we also have union columns that are JSON (they come from https://github.com/datafusion-contrib/datafusion-functions-json if you do json_column->'abc'), which we would also like to turn into JSON objects.
It seems to me that would require a more flexible hook where, for each column, we can choose an encoder (the default one or a custom one), as in the sketch below.
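
Something like the following is the shape of hook I have in mind (a sketch; EncoderFactory and its signature are hypothetical, and Encoder mirrors the private trait in arrow-json):

use arrow_array::Array;
use arrow_schema::FieldRef;

// Mirror of the private Encoder trait in arrow-json/src/writer/encoder.rs.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

// Hypothetical hook, consulted once per column: return Some(encoder) to
// override serialization for that column, or None to fall back to the
// built-in encoder for its data type.
trait EncoderFactory {
    fn make_encoder<'a>(
        &self,
        field: &FieldRef,
        array: &'a dyn Array,
    ) -> Option<Box<dyn Encoder + 'a>>;
}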

I know these APIs can get gross; I had to deal with something similar for pgpq, because if you have an Arrow UTF8 column you have to choose whether to make it Postgres TEXT or JSONB.

@nathanielc (Contributor)

I have the same use case: I am using https://github.com/datafusion-contrib/datafusion-functions-json with the arrow_json writer. Would adding union encoding support to arrow_json be sufficient?

I imagine it would be treated as an untagged union, flattening away the union itself, so each row serializes as just the value of the selected child. Since the json functions repo produces a Sparse union, that would work well for this case. We could either error for Dense unions, as we do today, or agree on a way to flatten them, adding explicit nulls.

Thoughts? I could put up a PR for sparse unions if we agree it's a good solution.

@adriangb (Contributor)

I forgot about this issue and recently opened apache/datafusion#14297. I'm happy either for that to get across the line (it also has other use cases and would make it easy to add a union encoder) or for a PR implementing support for unions directly. I think both make sense.

@nathanielc (Contributor)

OK, if adding both makes sense, I'll try adding the union encoder directly to the arrow_json implementation and see how that works out; there might be some ugly edge cases I haven't considered yet.

As for the datafusion PR, I don't quite see how it enables adding a union encoder. Could you elaborate?

@adriangb (Contributor)

Sorry, wrong link: #7015

@nathanielc (Contributor)

OK, so it turns out that the json functions don't use the union as I was expecting: they still encode objects directly as strings, so encoding the union directly does not solve my use case. Here is my WIP diff of the changes:

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index ed430fe6a..876c499ab 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -24,7 +24,7 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
+use std::{collections::BTreeMap, io::Write};

 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
@@ -138,6 +138,28 @@ fn make_encoder_impl<'a>(
             };
             (Box::new(encoder) as _, array.nulls().cloned())
         }
+        // Sparse unions only: child arrays are row-aligned, so the row
+        // index can be passed straight through to the child encoder.
+        DataType::Union(fields, _) => {
+            let array = array.as_union();
+            let encoders = fields
+                .iter()
+                .map(|(type_id, field)| {
+                    let (encoder, nulls) = make_encoder_impl(array.child(type_id), options)?;
+                    Ok((
+                        type_id,
+                        FieldEncoder {
+                            field: field.clone(),
+                            encoder,
+                            nulls,
+                        },
+                    ))
+                })
+                .collect::<Result<BTreeMap<_, _>, ArrowError>>()?;
+
+            let encoder = UnionArrayEncoder { array, encoders };
+            (Box::new(encoder) as _, array.nulls().cloned())
+        }
         DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => {
             let options = FormatOptions::new().with_display_error(true);
             let formatter = ArrayFormatter::try_new(array, &options)?;
@@ -210,6 +232,24 @@ impl Encoder for StructArrayEncoder<'_> {
     }
 }

+struct UnionArrayEncoder<'a> {
+    array: &'a UnionArray,
+    encoders: BTreeMap<i8, FieldEncoder<'a>>,
+}
+
+impl Encoder for UnionArrayEncoder<'_> {
+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        // Assumes a sparse union: children are row-aligned, so `idx`
+        // indexes directly into the selected child array.
+        let type_id = self.array.type_id(idx);
+        // A type_id with no registered encoder writes nothing for the
+        // row; erroring here may be preferable.
+        if let Some(f) = self.encoders.get_mut(&type_id) {
+            f.encoder.encode(idx, out);
+        }
+    }
+}
+
 trait PrimitiveEncode: ArrowNativeType {
     type Buffer;

It's a pretty simple change if someone wants to take it further. Given #7015, however, its usefulness is likely diminished.
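
To make the intended behavior concrete, here is a hedged usage sketch (assuming a recent arrow-rs where UnionArray::try_new takes UnionFields, a ScalarBuffer of type ids, optional offsets, and the child arrays; field names are illustrative):

use std::sync::Arc;

use arrow_array::{ArrayRef, Float64Array, Int32Array, UnionArray};
use arrow_buffer::ScalarBuffer;
use arrow_schema::{DataType, Field, UnionFields};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sparse union over Int32 ("a") and Float64 ("b"); in a sparse
    // union every child has one slot per row.
    let fields = UnionFields::new(
        vec![0, 1],
        vec![
            Field::new("a", DataType::Int32, false),
            Field::new("b", DataType::Float64, false),
        ],
    );
    let children: Vec<ArrayRef> = vec![
        Arc::new(Int32Array::from(vec![1, 0, 3])),
        Arc::new(Float64Array::from(vec![0.0, 2.5, 0.0])),
    ];
    // Rows 0 and 2 select child "a"; row 1 selects child "b".
    let type_ids = ScalarBuffer::from(vec![0i8, 1, 0]);
    // Passing None for offsets yields a sparse union.
    let union = UnionArray::try_new(fields, type_ids, None, children)?;
    assert_eq!(union.type_id(1), 1);
    // With the union arm above, the column would serialize untagged:
    // row 0 -> 1, row 1 -> 2.5, row 2 -> 3
    Ok(())
}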
