
Expose arrow_json::writer::encoder::make_encoder_impl() for re-use in custom encoders #5733

Open · eldargab opened this issue May 7, 2024 · 7 comments
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@eldargab commented May 7, 2024

In our project we have a set of JSON serialization rules that aren't covered by arrow_json.

For example, there are binary columns that already contain valid JSON which just needs to be embedded as-is into the output stream. We also have cases where we'd like to serialize some StructArrays in a non-canonical way without any prior transformation.

A public make_encoder_impl() would allow us to re-use the existing arrow encoders in our own serializers.
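
For illustration, here is a minimal sketch of the kind of custom encoder we have in mind, assuming the private Encoder trait as it appears in arrow-json/src/writer/encoder.rs (RawJsonEncoder and its passthrough behavior are hypothetical):

use arrow_array::BinaryArray;

// Mirror of the private Encoder trait in arrow-json/src/writer/encoder.rs.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

// Hypothetical passthrough encoder for a binary column whose values are
// already valid JSON documents: bytes are copied verbatim, unescaped.
struct RawJsonEncoder<'a> {
    array: &'a BinaryArray,
}

impl Encoder for RawJsonEncoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // value(idx) yields the raw &[u8] for row idx; null handling is
        // left to the caller, as with the built-in encoders.
        out.extend_from_slice(self.array.value(idx));
    }
}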

eldargab added the enhancement label on May 7, 2024
@tustvold (Contributor) commented May 7, 2024

I think the encoder internals are the kind of thing that would be good to keep as a private implementation detail.

Perhaps we might provide a way to indicate that a given StringArray should be encoded as-is, without escaping?
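
For concreteness, a hedged sketch of how such an opt-in might surface, assuming it were added to arrow-json's EncoderOptions (the raw_json_columns field below is hypothetical and does not exist upstream):

// Sketch only: EncoderOptions exists in arrow-json, but the
// raw_json_columns field is hypothetical.
#[derive(Debug, Clone, Default)]
pub struct EncoderOptions {
    // Existing option: write nulls explicitly instead of omitting keys.
    pub explicit_nulls: bool,
    // Hypothetical: names of string columns whose values are already
    // valid JSON and should be written verbatim, without escaping.
    pub raw_json_columns: Vec<String>,
}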

@adriangb (Contributor) commented Sep 26, 2024

We also needed something like this for exactly this use case: we have string columns that are JSON.
But we also have union columns that are JSON (they come from https://github.com/datafusion-contrib/datafusion-functions-json if you do json_column->'abc'), which we would also like to turn into JSON objects.
It seems to me that would require a more flexible hook where, for each column, we can choose an encoder (the default one or a custom one), as in the sketch below.
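
Something like the following is the shape of hook I have in mind (a sketch; EncoderFactory and its signature are hypothetical, and Encoder mirrors the private trait in arrow-json):

use arrow_array::Array;
use arrow_schema::FieldRef;

// Mirror of the private Encoder trait in arrow-json/src/writer/encoder.rs.
trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

// Hypothetical hook, consulted once per column: return Some(encoder) to
// override serialization for that column, or None to fall back to the
// built-in encoder for its data type.
trait EncoderFactory {
    fn make_encoder<'a>(
        &self,
        field: &FieldRef,
        array: &'a dyn Array,
    ) -> Option<Box<dyn Encoder + 'a>>;
}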

I know these APIs can get gross; I had to deal with something similar for pgpq, because if you have an Arrow UTF8 column you have to choose whether to make it Postgres TEXT or JSONB.

@nathanielc (Contributor)

I have the same use case: I am using https://github.com/datafusion-contrib/datafusion-functions-json with the arrow_json writer. Would adding union encoding support to arrow_json be sufficient?

I imagine it would be treated as an untagged union, flattening away the union itself, so each row serializes as just the value of the selected child. Since the json functions repo produces a Sparse union, that would work well for this case. We could either error for Dense unions, as we do today, or agree on a way to flatten them, adding explicit nulls.

Thoughts? I could put up a PR for sparse unions if we agree it's a good solution.

@adriangb (Contributor)

I forgot about this issue and recently opened apache/datafusion#14297. I'm happy either for that to get across the line (it also has other use cases and would make it easy to add a union encoder) or for a PR implementing support for unions directly. I think both make sense.

@nathanielc (Contributor)

OK, if adding both makes sense, I'll try adding the union encoder directly to the arrow_json implementation and see how that works out; there might be some ugly edge cases I haven't considered yet.

As for the datafusion PR, I don't quite see how it enables adding a union encoder. Could you elaborate?

@adriangb (Contributor)

Sorry, wrong link: #7015

@nathanielc (Contributor)

OK, so it turns out that the json functions don't use the union as I was expecting: they still encode objects directly as strings, so encoding the union directly does not solve my use case. Here is my WIP diff of the changes:

diff --git a/arrow-json/src/writer/encoder.rs b/arrow-json/src/writer/encoder.rs
index ed430fe6a..876c499ab 100644
--- a/arrow-json/src/writer/encoder.rs
+++ b/arrow-json/src/writer/encoder.rs
@@ -24,7 +24,7 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
+use std::{collections::BTreeMap, io::Write};

 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
@@ -138,6 +138,28 @@ fn make_encoder_impl<'a>(
             };
             (Box::new(encoder) as _, array.nulls().cloned())
         }
+        // Sparse unions only: child arrays are row-aligned, so the row
+        // index can be passed straight through to the child encoder.
+        DataType::Union(fields, _) => {
+            let array = array.as_union();
+            let encoders = fields
+                .iter()
+                .map(|(type_id, field)| {
+                    let (encoder, nulls) = make_encoder_impl(array.child(type_id), options)?;
+                    Ok((
+                        type_id,
+                        FieldEncoder {
+                            field: field.clone(),
+                            encoder,
+                            nulls,
+                        },
+                    ))
+                })
+                .collect::<Result<BTreeMap<_, _>, ArrowError>>()?;
+
+            let encoder = UnionArrayEncoder { array, encoders };
+            (Box::new(encoder) as _, array.nulls().cloned())
+        }
         DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => {
             let options = FormatOptions::new().with_display_error(true);
             let formatter = ArrayFormatter::try_new(array, &options)?;
@@ -210,6 +232,24 @@ impl Encoder for StructArrayEncoder<'_> {
     }
 }

+struct UnionArrayEncoder<'a> {
+    array: &'a UnionArray,
+    encoders: BTreeMap<i8, FieldEncoder<'a>>,
+}
+
+impl Encoder for UnionArrayEncoder<'_> {
+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        // Assumes a sparse union: children are row-aligned, so `idx`
+        // indexes directly into the selected child array.
+        let type_id = self.array.type_id(idx);
+        // A type_id with no registered encoder writes nothing for the
+        // row; erroring here may be preferable.
+        if let Some(f) = self.encoders.get_mut(&type_id) {
+            f.encoder.encode(idx, out);
+        }
+    }
+}
+
 trait PrimitiveEncode: ArrowNativeType {
     type Buffer;

It's a pretty simple change if someone wants to take it further. Given #7015, however, its usefulness is likely diminished.
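
To make the intended behavior concrete, here is a hedged usage sketch (assuming a recent arrow-rs where UnionArray::try_new takes UnionFields, a ScalarBuffer of type ids, optional offsets, and the child arrays; field names are illustrative):

use std::sync::Arc;

use arrow_array::{ArrayRef, Float64Array, Int32Array, UnionArray};
use arrow_buffer::ScalarBuffer;
use arrow_schema::{DataType, Field, UnionFields};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sparse union over Int32 ("a") and Float64 ("b"); in a sparse
    // union every child has one slot per row.
    let fields = UnionFields::new(
        vec![0, 1],
        vec![
            Field::new("a", DataType::Int32, false),
            Field::new("b", DataType::Float64, false),
        ],
    );
    let children: Vec<ArrayRef> = vec![
        Arc::new(Int32Array::from(vec![1, 0, 3])),
        Arc::new(Float64Array::from(vec![0.0, 2.5, 0.0])),
    ];
    // Rows 0 and 2 select child "a"; row 1 selects child "b".
    let type_ids = ScalarBuffer::from(vec![0i8, 1, 0]);
    // Passing None for offsets yields a sparse union.
    let union = UnionArray::try_new(fields, type_ids, None, children)?;
    assert_eq!(union.type_id(1), 1);
    // With the union arm above, the column would serialize untagged:
    // row 0 -> 1, row 1 -> 2.5, row 2 -> 3
    Ok(())
}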
