
Add hooks to json encoder to override default encoding or add support for unsupported types #7015

Open · wants to merge 13 commits into base: main
Conversation

adriangb (Contributor)

We essentially had to vendor this module to implement this. I know this has been discussed in the past and there was pushback against exposing Encoder in the public API, but since I had to go through the work of updating our vendored version anyway, I thought I'd at least try to upstream this to make our lives easier in the future and bring value back to the community.

In my opinion, exposing this is a rather small addition to the public API, and any future breaking changes will be restricted to the JSON encoder, which is not a core part of the project. It seems worth it and addresses various requests, e.g. how to encode a binary array (a base64-encoded string and an array of ints are both common). It has also been a stable API for ~1 year now.
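To make the two representations concrete, here is a dependency-free sketch of what each target encoding looks like. This is not arrow-rs code; the function names are purely illustrative, and the base64 is hand-rolled only to keep the sketch self-contained (real code would use the `base64` crate).

```rust
/// One byte sequence, two common JSON targets. Names are illustrative only.
fn to_json_int_array(bytes: &[u8]) -> String {
    let items: Vec<String> = bytes.iter().map(|b| b.to_string()).collect();
    format!("[{}]", items.join(","))
}

/// Hand-rolled standard base64 with padding, just to keep this sketch
/// dependency-free; real code would use the `base64` crate.
fn to_json_base64_string(bytes: &[u8]) -> String {
    const ALPHABET: &[u8] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::from("\"");
    for chunk in bytes.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
        for (i, shift) in [18u32, 12, 6, 0].iter().enumerate() {
            if i <= chunk.len() {
                out.push(ALPHABET[((n >> shift) & 63) as usize] as char);
            } else {
                out.push('='); // pad short final chunks
            }
        }
    }
    out.push('"');
    out
}

fn main() {
    assert_eq!(to_json_int_array(&[222, 173]), "[222,173]");
    assert_eq!(to_json_base64_string(b"Man"), "\"TWFu\"");
    assert_eq!(to_json_base64_string(b"Ma"), "\"TWE=\"");
}
```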

@adriangb (Contributor Author)

TODO: add a test showing how to change the encoding of binary arrays.

The github-actions bot added the arrow (Changes to the arrow crate) label on Jan 24, 2025.
@adriangb (Contributor Author)

cc @tustvold

@adriangb (Contributor Author)

> TODO: add a test showing how to change the encoding of binary arrays.

Done

Comment on lines 36 to 37
type EncoderFactoryResult<'a> =
Result<Option<(Box<dyn Encoder + 'a>, Option<NullBuffer>)>, ArrowError>;
Contributor Author

Should this be Option<&'a NullBuffer>?

Contributor

It probably could be for consistency, but it isn't likely to have any material performance impact

@tustvold (Contributor) left a comment

I had a review, and I think it is probably fine to expose Encoder; it is a relatively straightforward trait that I don't see changing. It is the reader side that I'd be much more hesitant about altering, as that is quite coupled to the parsing machinery.

That being said there is one slight wrinkle, as documented here

The implementation of Encoder for ArrayFormatter assumes it does not produce
characters that would need to be escaped within a JSON string, e.g. '"'.
If support for user-provided format specifications is added, this assumption
may need to be revisited

I think we need to do the following:

  • Remove the impl of Encoder for ArrayFormatter in favour of a private newtype wrapper
  • Clearly document on Encoder the expectation that it generates valid escaped JSON

On a related note, the return type of Box<dyn Encoder + 'a>, Option<NullBuffer> is a bit of a mouthful, and honestly makes for a bit of a funky API where the semantics of an Encoder depend on a different data structure returned at construction time.

I wonder if a better API might be something like

pub trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
    fn nulls(&self) -> Option<NullBuffer>;
}

I also left a comment on potentially simplifying the factory down to a single method
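To illustrate the shape of the proposed trait, here is a self-contained sketch. NullBuffer here is a stand-in for arrow's arrow_buffer::NullBuffer, and HexBinaryEncoder is a hypothetical implementation that writes binary values as hex-encoded JSON strings; neither is actual arrow-rs code.

```rust
/// Stand-in for arrow_buffer::NullBuffer: true = valid, false = null.
#[derive(Clone)]
struct NullBuffer(Vec<bool>);

trait Encoder {
    /// Append the JSON encoding of the value at `idx` to `out`.
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
    /// Nulls for the encoded array, if any, owned by the encoder itself.
    fn nulls(&self) -> Option<NullBuffer>;
}

/// Hypothetical encoder writing binary values as hex-encoded JSON strings.
struct HexBinaryEncoder {
    values: Vec<Vec<u8>>,
    nulls: Option<NullBuffer>,
}

impl Encoder for HexBinaryEncoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // The implementor is responsible for emitting valid, escaped JSON;
        // hex output never contains characters that need escaping.
        out.push(b'"');
        for byte in &self.values[idx] {
            out.extend_from_slice(format!("{byte:02x}").as_bytes());
        }
        out.push(b'"');
    }

    fn nulls(&self) -> Option<NullBuffer> {
        self.nulls.clone()
    }
}

fn main() {
    let mut enc = HexBinaryEncoder {
        values: vec![vec![0xde, 0xad]],
        nulls: None,
    };
    let mut out = Vec::new();
    enc.encode(0, &mut out);
    assert_eq!(out, b"\"dead\"");
}
```

Folding the nulls into the Encoder itself avoids the two-part (Box<dyn Encoder>, Option<NullBuffer>) return type while letting callers hoist the null lookup out of their write loops.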

Comment on lines 48 to 49
_array: &'a dyn Array,
_data_type: &DataType,
Contributor

Why do we pass both DataType and Array?

Contributor Author

We don't need to; it just simplifies the implementation of the EncoderFactory a bit, because almost all of them are going to call array.data_type(). I can remove it if you'd prefer.

@@ -426,10 +438,13 @@ mod tests {

let actual: Vec<Option<Value>> = input
.split(|b| *b == b'\n')
.map(|s| (!s.is_empty()).then(|| serde_json::from_slice(s).unwrap()))
.map(|s| {
println!("{:?}", str::from_utf8(s));
Contributor

?

Contributor Author

upsies!

}
}
let (_, type_ids, _, buffers) = array
.as_any()
Contributor

Can we use AsArray instead of manual downcasts? It is much easier to read.


/// Make an encoder that if returned runs after all of the default encoders.
/// If this method returns None then an error will be returned.
fn make_fallback_encoder<'a>(
Contributor

Why do we need this? AFAICT we could remove it with no real loss of functionality; if anything, that would be more resilient, as adding a new default encoder wouldn't break existing overrides.

Contributor Author

I'll drop it in favor of a single method; I think that should work for our use case.

}

#[derive(Debug)]
struct UnionEncoder {
Contributor

FWIW adding a native UnionEncoder is likely not overly complicated

Contributor Author

That would be nice 😄

How about I try to write one on our end and then we can upstream it if it works properly?


@adriangb (Contributor Author)

Thank you for the review. I've pushed a (broken) version with most comments addressed, aside from the part about ArrayFormatter. To be honest, I don't understand what ArrayFormatter does or why it's special... would you mind elaborating a bit on how you propose to handle it?

@tustvold (Contributor)

ArrayFormatter is a type from arrow_cast used to format array values as strings. arrow_json uses it to format some values; however, being a general-purpose formatter, it does not implement JSON escaping. This isn't a problem for the types arrow_json uses it for, but since we're now making Encoder public, we need to ensure people don't assume ArrayFormatter can be used as an Encoder in the general case.

@adriangb (Contributor Author)

Okay I think I understand now. Looking at the implementation it seems to me that it wouldn't be that much extra work to add formatters for Date, Timestamp, Time and Decimal, then we could ditch ArrayFormatter for JSON encoding altogether. I can do that as a prequel PR. Wdyt?

@tustvold (Contributor)

A private newtype wrapper would avoid a non-trivial amount of code duplication whilst resolving the issue, and would be my preferred approach

@adriangb (Contributor Author)

Makes sense, done!

There is quite a bit of code churn from changing the Encoder trait. Would you like me to break that out into a precursor PR?

/// This can be used to override how e.g. binary data is encoded so that it is an encoded string or an array of integers.
fn make_default_encoder<'a>(
&self,
_array: &'a dyn Array,
Contributor

I'd be interested in using the changes in this PR to write GeoJSON from geoarrow-rs.

However this API would not be sufficient for me because it assumes that the physical Array is enough to know how to encode the data. This is not true for geospatial data (at least for Arrow data according to the GeoArrow specification) because the same physical layout can describe multiple types.

E.g. an array of LineString and an array of MultiPoint would both be stored as an Arrow List[FixedSizeList[2, Float64]], but the extension metadata on the Field would be necessary to know whether to write "type": "MultiPoint" or "type": "LineString" in each JSON object.

Given that the existing json Writer API writes a RecordBatch, it should be possible to access the Field and pass that through here, instead of just using the Array?
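A sketch of what passing the Field through could enable. The Field struct below is a minimal stand-in for arrow_schema::Field (only metadata matters here) and geojson_type is a hypothetical helper; the ARROW:extension:name key follows the Arrow extension-type convention and the geoarrow.* values follow the GeoArrow specification.

```rust
use std::collections::HashMap;

/// Minimal stand-in for arrow_schema::Field; only the metadata matters here.
struct Field {
    metadata: HashMap<String, String>,
}

/// Pick the GeoJSON "type" string from GeoArrow extension metadata.
/// The key follows the Arrow extension-type convention and the values follow
/// the GeoArrow specification; unknown extensions fall through to None.
fn geojson_type(field: &Field) -> Option<&'static str> {
    match field.metadata.get("ARROW:extension:name")?.as_str() {
        "geoarrow.linestring" => Some("LineString"),
        "geoarrow.multipoint" => Some("MultiPoint"),
        _ => None,
    }
}

fn main() {
    let field = Field {
        metadata: HashMap::from([(
            "ARROW:extension:name".to_string(),
            "geoarrow.multipoint".to_string(),
        )]),
    };
    assert_eq!(geojson_type(&field), Some("MultiPoint"));
}
```

This is exactly the disambiguation the comment describes: the physical layout alone cannot distinguish the two geometry types, but the Field metadata can.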

Contributor Author

Great point, yes, I think that should be possible 😄

Contributor Author

@kylebarron will this work for you?

Contributor

I believe so! As long as we can access the field that should be enough!

@adriangb (Contributor Author)

Hi @tustvold sorry to ping you. Given the use cases beyond our JSON serialization, would you mind taking another look at this PR when you have some time? Thanks!

@tustvold (Contributor)

I'm a bit swamped at the moment, but I'll try to take a look this weekend.

@adriangb (Contributor Author)

adriangb commented Feb 13, 2025

I've started trying to replace our vendored version with this plus a hook to handle the JSON union, and I have a couple of notes:

1. In order to handle unions we need make_encoder to be public: the union encoder will need to delegate to make_encoder for its children.
2. Handling dictionary types is annoying; I would consider making DictionaryEncoder public. I was able to work around it by handling the values type instead of the dictionary itself, so maybe that's not necessary. What is a bit weird is that in the dictionary case it ends up recursing (see DictionaryEncoder::try_new), but in the second pass the dictionary field gets passed in alongside the values array, so the types of the field parameter and the array parameter do not match.

Comment on lines 73 to 77
/// Check if the value at `idx` is null.
fn is_null(&self, _idx: usize) -> bool;

/// Short circuiting null checks if there are none.
fn has_nulls(&self) -> bool;
Contributor

I suspect this change will have major performance implications, as you're now adding a per-value dyn dispatch in order to check nullability

Contributor Author

Makes sense. I guess it does need to be dynamic, but I take it you'd rather have a method that returns a NullBuffer the length of the array in one go?

Contributor

I guess I'm not seeing why it needs to be changed; this wasn't necessary before...

Contributor Author

Hard to remember now, but I think the primary motivation was to encapsulate the (Encoder, NullBuffer) pair into the Encoder itself to make the whole API easier to work with.

I've now collapsed these into a single method that returns an Option<NullBuffer> which can be hoisted out of loops. I think that should be equivalent to the old implementation but with a better API for implementers.
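A minimal sketch of that hoisting, using simplified stand-ins for the arrow types (this is not the actual PR code; IntEncoder and write_all are illustrative names):

```rust
/// Simplified stand-in for arrow_buffer::NullBuffer: true = valid.
#[derive(Clone)]
struct NullBuffer(Vec<bool>);

impl NullBuffer {
    fn is_null(&self, idx: usize) -> bool {
        !self.0[idx]
    }
}

trait Encoder {
    /// Append the JSON encoding of the value at `idx` to `out`.
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
    /// Nulls for the whole array, fetched once, not per value.
    fn nulls(&self) -> Option<NullBuffer>;
}

struct IntEncoder {
    values: Vec<i64>,
    nulls: Option<NullBuffer>,
}

impl Encoder for IntEncoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        out.extend_from_slice(self.values[idx].to_string().as_bytes());
    }

    fn nulls(&self) -> Option<NullBuffer> {
        self.nulls.clone()
    }
}

/// One virtual `nulls()` call before the loop; per-value null checks are
/// then plain bit lookups rather than dyn-dispatched `is_null` calls.
fn write_all(encoder: &mut dyn Encoder, len: usize, out: &mut Vec<u8>) {
    let nulls = encoder.nulls();
    for idx in 0..len {
        if idx > 0 {
            out.push(b',');
        }
        match &nulls {
            Some(n) if n.is_null(idx) => out.extend_from_slice(b"null"),
            _ => encoder.encode(idx, out),
        }
    }
}

fn main() {
    let mut enc = IntEncoder {
        values: vec![1, 0, 3],
        nulls: Some(NullBuffer(vec![true, false, true])),
    };
    let mut out = Vec::new();
    write_all(&mut enc, 3, &mut out);
    assert_eq!(out, b"1,null,3");
}
```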


/// Configuration options for the JSON encoder.
#[derive(Debug, Clone, Default)]
Contributor

As this is now public, I think we either need to mark it #[non_exhaustive] or make it into a builder with private fields.

Contributor Author

Will do

@tustvold (Contributor) left a comment

Sorry for the delay, I have been (and still am) really swamped. I think this is getting closer to something we can merge, although the changes to null handling concern me, and I'd prefer to revert these if possible.

@adriangb (Contributor Author)

adriangb commented Mar 8, 2025

Thanks for reviewing! I'll try to work out how to do the null handling differently.

if drop_nulls && is_null {

// Collect all the field nulls buffers up front to avoid dynamic dispatch in the loop
// This creates a temporary Vec, but avoids repeated virtual calls which should be a net win
Contributor

We're still doing this dynamic dispatch to get null information on every value write... Unless we can show with benchmarks it doesn't regress performance I'm lukewarm on this

@tustvold (Contributor) left a comment

I think the changes make the null handling worse 😅

The API was designed the way it was for a reason, I think if we want to change it we need to justify doing so empirically

@@ -290,15 +344,41 @@ impl PrimitiveEncode for f16 {
}
}

/// Extension trait providing null-related methods to `Option<NullBuffer>`
pub trait NullBufferExt {
Contributor

Suggested change:
- pub trait NullBufferExt {
+ trait NullBufferExt {
