
fix(function): Support Spark legacy behavior for central moments functions and change the input type #12566

Open · wants to merge 6 commits into main
Conversation

@NEUpanning NEUpanning commented Mar 6, 2025

In the skewness and kurtosis functions, the result should be Double.NaN
instead of NULL if spark.legacy_statistical_aggregate is set to true.
Furthermore, these functions currently accept several input types,
including "smallint", "integer", "bigint", "real", and "double",
but Spark only supports double type input, see code link.

This PR includes these changes:

  1. Add a template parameter 'nullOnDivideByZero' to 'SkewnessResultAccessor'
     and 'KurtosisResultAccessor', which controls whether NULL or NaN is
     returned when dividing by zero.
  2. Change the skewness and kurtosis functions to support only double type input.

Part of: #12542

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 6, 2025

netlify bot commented Mar 6, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 1a8347e
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67ce5a55491ef10008c6654d

@NEUpanning
Contributor Author

@rui-mo Could you help to take a look? Thanks.

return std::make_unique<CentralMomentsAggregatesBase<
int64_t,
SkewnessResultAccessor<true>>>(resultType);
case TypeKind::DOUBLE:
Contributor
@zhli1142015 zhli1142015 Mar 6, 2025

I think we only need to register this function with double type raw input.

Contributor Author

Do you mean this function should only support double type input? If so, I agree with you, as Spark only supports double type input, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L58C16-L58C24

return std::make_unique<CentralMomentsAggregatesBase<
int64_t,
KurtosisResultAccessor<false>>>(resultType);
case TypeKind::DOUBLE:
Contributor

same for this function.

const std::vector<TypePtr>& argTypes,
const TypePtr& resultType,
const core::QueryConfig& config) -> std::unique_ptr<exec::Aggregate> {
VELOX_CHECK_LE(
Contributor

I think this function should require exactly one argument, rather than allowing at most one.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// if true, statistical aggregation function includes skewness, kurtosis,
Contributor

"if true" should start with an uppercase letter, and the same applies to the documentation.

Contributor Author

Thanks.

* - spark.legacy_statistical_aggregate
- bool
- false
- if true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
Contributor

Please highlight that this config is only partially honored; some functions that should honor it, such as stddev, do not yet.

Contributor

Perhaps rename it to statistical_agg_null_on_divide_by_zero and update all related functions in this PR as well.

Contributor Author

@zhli1142015 @jinchengchenghh
Other functions are not supported as Velox Spark functions yet, and Gluten currently transforms them to Velox Presto functions. Therefore, I think there is no need to add them to the doc. BTW, I will implement the Spark versions of these functions in future PRs.

Collaborator
@rui-mo rui-mo Mar 7, 2025

+1 for improving the documentation. We'd better make it clear which functions will depend on this config when adding it.

Contributor Author

We'd better make it clear which functions will depend on this config when adding it.

@rui-mo You are right. Updated. Thanks!

Contributor Author

Perhaps rename it to statistical_agg_null_on_divide_by_zero

@zhli1142015 I thought it would be clear to align with Spark.

Collaborator
@rui-mo rui-mo left a comment

Please revise the PR description, as I notice this PR removes several types' implementations and adds the template parameter 'nullOnDivideByZero' to 'SkewnessResultAccessor' and 'KurtosisResultAccessor', which controls whether NULL or NaN is returned when dividing by zero.

}

static double result(const CentralMomentsAccumulator& accumulator) {
if (accumulator.m2() == 0) {
return std::numeric_limits<double>::quiet_NaN();
Collaborator

Perhaps check nullOnDivideByZero is false.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// If true, statistical aggregation function includes skewness, kurtosis,
/// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
Collaborator

std::numeric_limits<double>::quiet_NaN() -> NaN

Contributor Author

Thanks.

@@ -333,6 +333,12 @@ class QueryConfig {
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// If true, statistical aggregation function includes skewness, kurtosis,
/// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
/// DivideByZero occurs during expression evaluation.
Collaborator

DivideByZero occurs during expression evaluation

dividing by zero during the aggregate result calculation

Contributor Author

Thank you.

* - spark.legacy_statistical_aggregate
- bool
- false
- If true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
Collaborator

ditto

#include "velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h"

namespace facebook::velox::functions::aggregate::sparksql {

namespace {
template <bool nullOnDivideByZero>
Collaborator

Perhaps document this variable.

inputType->toString());
if (config.sparkLegacyStatisticalAggregate()) {
if (exec::isRawInput(step)) {
switch (inputType->kind()) {
Collaborator

We might not need a switch clause if there is only one valid case.

@NEUpanning
Contributor Author

@rui-mo I've updated the PR description and resolved all the comments. Can you please take another look?

@NEUpanning NEUpanning changed the title fix(function): Support Spark legacy behavior for central moments functions when 'divide by zero' occurs during expression evaluation fix(function): Support Spark legacy behavior for central moments functions and change the input type Mar 11, 2025
@NEUpanning NEUpanning requested a review from rui-mo March 11, 2025 03:34
5 participants