
fix(function): Support Spark legacy behavior for central moments functions and change the input type #12566

Open · wants to merge 6 commits into main
Conversation

@NEUpanning NEUpanning commented Mar 6, 2025

In the skewness and kurtosis functions, the result should be Double.NaN
instead of NULL if spark.legacy_statistical_aggregate is set to true.
Furthermore, these functions currently accept several input types,
including "smallint", "integer", "bigint", "real", and "double",
but Spark only supports double type input, see code link.

This PR includes these changes:

  1. Add a template parameter 'nullOnDivideByZero' to 'SkewnessResultAccessor'
     and 'KurtosisResultAccessor', which controls whether NULL or NaN is
     returned when dividing by zero.
  2. Change the skewness and kurtosis functions to support only double type input.

Part of: #12542

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 6, 2025

netlify bot commented Mar 6, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 1a8347e
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67ce5a55491ef10008c6654d

@NEUpanning
Contributor Author

@rui-mo Could you help to take a look? Thanks.

return std::make_unique<CentralMomentsAggregatesBase<
int64_t,
SkewnessResultAccessor<true>>>(resultType);
case TypeKind::DOUBLE:
Contributor
@zhli1142015 zhli1142015 Mar 6, 2025

I think we only need to register this function with double type raw input.

Contributor Author

Do you mean this function should only support double type input? If so, I agree with you, as Spark only supports double type input, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L58C16-L58C24

return std::make_unique<CentralMomentsAggregatesBase<
int64_t,
KurtosisResultAccessor<false>>>(resultType);
case TypeKind::DOUBLE:
Contributor

same for this function.

const std::vector<TypePtr>& argTypes,
const TypePtr& resultType,
const core::QueryConfig& config) -> std::unique_ptr<exec::Aggregate> {
VELOX_CHECK_LE(
Contributor

I think this function should require exactly one argument, rather than allowing at most one.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// if true, statistical aggregation function includes skewness, kurtosis,
Contributor

"if true" should start with an uppercase letter, and the same applies to the documentation.

Contributor Author

Thanks.

* - spark.legacy_statistical_aggregate
- bool
- false
- if true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
Contributor

Please highlight that this config is only partially honored; some functions that should honor it, such as stddev, do not yet.

Contributor

Perhaps rename it to statistical_agg_null_on_divide_by_zero and update all related functions in this PR as well.

Contributor Author

@zhli1142015 @jinchengchenghh
Other functions are not supported as Velox Spark functions yet, and Gluten currently transforms them to Velox Presto functions. Therefore, I think there is no need to add them to the doc. BTW, I will implement the Spark versions of these functions in future PRs.

Collaborator
@rui-mo rui-mo Mar 7, 2025

+1 for improving the documentation. We'd better make it clear which functions will depend on this config when adding it.

Contributor Author

We'd better make it clear which functions will depend on this config when adding it.

@rui-mo You are right. Updated. Thanks!

Contributor Author

Perhaps rename it to statistical_agg_null_on_divide_by_zero

@zhli1142015 I thought it would be clear to align with Spark.

Collaborator
@rui-mo rui-mo left a comment

Please revise the PR description, as I notice this PR removes several types' implementations and adds the template parameter 'nullOnDivideByZero' to 'SkewnessResultAccessor' and 'KurtosisResultAccessor', which controls whether NULL or NaN is returned when dividing by zero.

}

static double result(const CentralMomentsAccumulator& accumulator) {
if (accumulator.m2() == 0) {
return std::numeric_limits<double>::quiet_NaN();
Collaborator

Perhaps check nullOnDivideByZero is false.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// If true, statistical aggregation function includes skewness, kurtosis,
/// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
Collaborator

std::numeric_limits<double>::quiet_NaN() -> NaN

Contributor Author

Thanks.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// If true, statistical aggregation function includes skewness, kurtosis,
/// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
/// DivideByZero occurs during expression evaluation.
Collaborator

DivideByZero occurs during expression evaluation

dividing by zero during the aggregate result calculation

Contributor Author

Thank you.

* - spark.legacy_statistical_aggregate
- bool
- false
- If true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
Collaborator

ditto

#include "velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h"

namespace facebook::velox::functions::aggregate::sparksql {

namespace {
template <bool nullOnDivideByZero>
Collaborator

Perhaps document this variable.

inputType->toString());
if (config.sparkLegacyStatisticalAggregate()) {
if (exec::isRawInput(step)) {
switch (inputType->kind()) {
Collaborator

We might not need a switch clause if there is only one valid case.

@NEUpanning
Contributor Author

@rui-mo I've updated the PR description and resolved all the comments. Can you please take another look?

@NEUpanning NEUpanning changed the title fix(function): Support Spark legacy behavior for central moments functions when 'divide by zero' occurs during expression evaluation fix(function): Support Spark legacy behavior for central moments functions and change the input type Mar 11, 2025
@NEUpanning NEUpanning requested a review from rui-mo March 11, 2025 03:34
5 participants