fix(function): Support Spark legacy behavior for central moments functions and change the input type #12566
base: main
Conversation
@rui-mo Could you help to take a look? Thanks.
```cpp
      return std::make_unique<CentralMomentsAggregatesBase<
          int64_t,
          SkewnessResultAccessor<true>>>(resultType);
    case TypeKind::DOUBLE:
```
I think we only need to register this function with double type raw input.
Do you mean this function should only support double type input? If so, I agree with you, as Spark only supports double type input; see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L58C16-L58C24
```cpp
      return std::make_unique<CentralMomentsAggregatesBase<
          int64_t,
          KurtosisResultAccessor<false>>>(resultType);
    case TypeKind::DOUBLE:
```
Same for this function.
```cpp
    const std::vector<TypePtr>& argTypes,
    const TypePtr& resultType,
    const core::QueryConfig& config) -> std::unique_ptr<exec::Aggregate> {
  VELOX_CHECK_LE(
```
I think this function should require exactly one argument, rather than allowing at most one.
velox/core/QueryConfig.h (Outdated)

```cpp
@@ -333,6 +333,12 @@ class QueryConfig {
  static constexpr const char* kSparkLegacyDateFormatter =
      "spark.legacy_date_formatter";

  /// if true, statistical aggregation function includes skewness, kurtosis,
```
The first letter of "if true" should be uppercase, both here and in the documentation.
Thanks.
velox/docs/configs.rst (Outdated)

```rst
* - spark.legacy_statistical_aggregate
  - bool
  - false
  - if true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
```
Please highlight that this config is only partially honored; there are still some functions that should honor it but do not yet, such as stddev.
Perhaps rename it to statistical_agg_null_on_divide_by_zero and update all related functions in this PR as well.
@zhli1142015 @jinchengchenghh
Other functions are not supported in Velox Spark functions yet, and Gluten transforms them to Velox Presto functions for now. Therefore, I think there is no need to add the doc. BTW, I will implement the Spark version of these functions in future PRs.
+1 for improving the documentation. We'd better make it clear which functions will depend on this config when adding it.
> We'd better make it clear which functions will depend on this config when adding it.
@rui-mo You are right. Updated. Thanks!
> Perhaps rename it to statistical_agg_null_on_divide_by_zero
@zhli1142015 I thought it would be clear to align with Spark.
Please revise the PR description: I notice this PR removes several types' implementations and adds the template parameter 'nullOnDivideByZero' to 'SkewnessResultAccessor' and 'KurtosisResultAccessor', which controls whether NULL or NaN is returned when dividing by zero.
```cpp
  }

  static double result(const CentralMomentsAccumulator& accumulator) {
    if (accumulator.m2() == 0) {
      return std::numeric_limits<double>::quiet_NaN();
```
Perhaps check that `nullOnDivideByZero` is false.
velox/core/QueryConfig.h (Outdated)

```cpp
@@ -333,6 +333,12 @@ class QueryConfig {
  static constexpr const char* kSparkLegacyDateFormatter =
      "spark.legacy_date_formatter";

  /// If true, statistical aggregation function includes skewness, kurtosis,
  /// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
```
`std::numeric_limits<double>::quiet_NaN()` -> `NaN`
Thanks.
velox/core/QueryConfig.h (Outdated)

```cpp
@@ -333,6 +333,12 @@ class QueryConfig {
  static constexpr const char* kSparkLegacyDateFormatter =
      "spark.legacy_date_formatter";

  /// If true, statistical aggregation function includes skewness, kurtosis,
  /// will return std::numeric_limits<double>::quiet_NaN() instead of NULL when
  /// DivideByZero occurs during expression evaluation.
```
`DivideByZero occurs during expression evaluation` -> `dividing by zero during the aggregate result calculation`
Thank you.
velox/docs/configs.rst

```rst
* - spark.legacy_statistical_aggregate
  - bool
  - false
  - If true, statistical aggregation function includes skewness, kurtosis will return std::numeric_limits<double>::quiet_NaN()
```
ditto
```cpp
#include "velox/functions/lib/aggregates/CentralMomentsAggregatesBase.h"

namespace facebook::velox::functions::aggregate::sparksql {

namespace {
template <bool nullOnDivideByZero>
```
Perhaps document this variable.
```cpp
      inputType->toString());
  if (config.sparkLegacyStatisticalAggregate()) {
    if (exec::isRawInput(step)) {
      switch (inputType->kind()) {
```
We might not need a switch clause if there is only one valid case.
@rui-mo I've updated the PR description and resolved all the comments. Can you please take another look?
In `skewness` and `kurtosis` functions, the result should be `Double.NaN` instead of `NULL` if `spark.legacy_statistical_aggregate` is set to `true`. Furthermore, these functions currently support several input types, including "smallint", "integer", "bigint", "real", and "double", but Spark only supports double type input, see code link.

This PR includes these changes:

- Adds the template parameter `nullOnDivideByZero` to `SkewnessResultAccessor` and `KurtosisResultAccessor`, which controls whether `NULL` or `NaN` is returned when dividing by zero.
- Changes the `skewness` and `kurtosis` functions to support only double type input.

Part of: #12542