Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Spark date_trunc function #11340

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

zml1206
Copy link
Contributor

@zml1206 zml1206 commented Oct 24, 2024

Adds Spark date_trunc function. Presto date_trunc function was refactored for reuse:

  1. Moves class DateTimeUnit to functions/lib/DateTimeFormatter.h.
  2. Extends function fromDateTimeUnitString by adding params allowMicro and
    allowAbbreviated for Spark, and moves to functions/lib/TimeUtils.h.
  3. Moves adjustDateTime and adjustEpoch functions to functions/lib/TimeUtils.h.
  4. Extracts new function truncateTimestamp to functions/lib/TimeUtils.h .

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 24, 2024
Copy link

netlify bot commented Oct 24, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit fe5df23
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67cebc88602f700008b54d88

@rui-mo rui-mo changed the title add date_trunc spark function Add Spark date_trunc function Oct 28, 2024
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The key idea is that by moving some of the methods to the lib for reuse, we can reduce the code duplication.

@zml1206 zml1206 requested a review from rui-mo October 29, 2024 06:17
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring.

@@ -123,4 +126,155 @@ struct InitSessionTimezone {
timeZone_ = getTimeZoneFromConfig(config);
}
};

FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add some comments for the new API? If there are unit tests for them, please move them into the lib; if not, we will need to add them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe mention it in the currently empty PR description/commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some minors. Could you also clarify what you have changed in the PR description?


.. spark:function:: date_trunc(fmt, ts) -> timestamp

Returns timestamp ts truncated to the unit specified by the format model fmt.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns timestamp ts -> Returns timestamp that ts, by the format model fmt

Returns null if ``fmt`` is invalid.
``fmt`` is case insensitive and must be one of the following:
* "YEAR", "YYYY", "YY" - truncate to the first date of the year that the ts falls in, the time part will be zero out
* "QUARTER" - truncate to the first date of the quarter that the ts falls in, the time part will be zero out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that the ts falls, please quote all the ts, thanks!

bool allowAbbreviated = false) {
static const StringView kMicrosecond("microsecond");
static const StringView kMillisecond("millisecond");
static const StringView kSecond("second");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this fields are used only once, do we need to set them? Or we could use the string directly in if condition? CC @rui-mo
For example, https://github.com/facebookincubator/velox/blob/main/velox/common/base/StatsReporter.h#L70

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original presto function code. Now it is just extracted to lib for reuse. Do we need to change it? @jinchengchenghh @rui-mo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.

}

// Calculate the correct day of the month based on the number of days
// in the adjusted month
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add . in the end.

31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
int daysInPrevMonth = daysInMonth[dateTime.tm_mon];

// Adjust for leap year if February
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

(dateTime.tm_year + 1900) % 400 == 0)) {
daysInPrevMonth = 29;
}
// Set to the correct day in the previous month
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


// Returns timestamp truncated to the unit specified by the format.
FOLLY_ALWAYS_INLINE Timestamp dateTrunc(
const DateTimeUnit unit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add the const to enum class, you could consider it as integer like.
https://stackoverflow.com/questions/44749658/c-is-it-better-to-pass-an-enum-class-as-value-or-const-reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh Thanks for review. It seems to be used this way in velox,

inline bool isTimeUnit(const DateTimeUnit unit) {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add const when passing by value.

struct DateTruncFunction {
VELOX_DEFINE_FUNCTION_TYPES(T);

const tz::TimeZone* timeZone_ = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the class member to the last of the struct as FromUnixtimeFunction.
And move it to private scope

if (!unitOption.has_value()) {
return false;
}
DateTimeUnit unit = unitOption.value();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const DateTimeUnit unit. I would prefer to drop this initialization since it is used only once.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

bool allowAbbreviated = false) {
static const StringView kMicrosecond("microsecond");
static const StringView kMillisecond("millisecond");
static const StringView kSecond("second");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.


// Returns timestamp truncated to the unit specified by the format.
FOLLY_ALWAYS_INLINE Timestamp dateTrunc(
const DateTimeUnit unit,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add const when passing by value.

@@ -518,6 +518,8 @@ void registerFunctions(const std::string& prefix) {
Varchar,
Varchar,
Varchar>({prefix + "mask"});
registerFunction<DateTruncFunction, Timestamp, Varchar, Timestamp>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, thanks.

* "MICROSECOND" - everything remains

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Returns null if ``fmt`` is invalid. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if ts is null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* "MILLISECOND" - zero out the microseconds
* "MICROSECOND" - everything remains

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the function description before argument description L56.

ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yyyy", false));
ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yy", false));

ASSERT_EQ(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valid too.

 ASSERT_EQ(
      DateTimeUnit::kMillisecond, fromDateTimeUnitString("millisecond", false));

And other public functions also need tests. I'm open for this point since the function test should have cover the function you moved, but it is better to add it.

ASSERT_EQ(
std::optional(DateTimeUnit::kYear),
fromDateTimeUnitString("yyyy", false, true, true));
ASSERT_EQ(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test invalid fmt by VELOX_ASSERT_THROW

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added some comments.

* "MINUTE"- zero out the second with fraction part
* "SECOND" - zero out the second fraction part
* "MILLISECOND" - zero out the microseconds
* "MICROSECOND" - everything remains. ::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example section does not render correctly. You could add the '::' as below to fix it.

       * "MICROSECOND" - everything remains.
        
    ::

        SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359'); -- 2015-01-01 00:00:00

SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:00:00
SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The microseconds are not zeroed out. Is this example correct?

SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05..123456'); -- 2015-01-01 09:32:05.123456
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate for millisecond.

FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString(
const StringView& unitString,
bool throwIfInvalid,
bool allowMirco = false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Mirco -> Micro

if (seconds < 0 && seconds % intervalSeconds) {
s = s - 1;
}
int64_t truncedSeconds = s * intervalSeconds;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: trunced -> truncated


/// For fixed interval like minute, hour and day,
/// we can truncate date by a simple arithmetic expression:
/// floor(seconds / intervalSeconds) * intervalSeconds.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid repeating the code in comments. Comment suggestion:

/// Returns timestamp with seconds adjusted to the nearest lower multiple of the
/// specified interval. If the given seconds is negative and not an exact
/// multiple of the interval, it adjusts further down.

FOLLY_ALWAYS_INLINE void initialize(
const std::vector<TypePtr>& /*inputTypes*/,
const core::QueryConfig& config,
const arg_type<Varchar>* format,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: format is not used either.

@zml1206 zml1206 requested a review from rui-mo November 15, 2024 01:40
Copy link

stale bot commented Feb 14, 2025

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Feb 14, 2025
@rui-mo
Copy link
Collaborator

rui-mo commented Feb 14, 2025

@zml1206 Would you like to rebase this PR? Thanks.

@stale stale bot removed the stale label Feb 14, 2025
@zml1206 zml1206 changed the title Add Spark date_trunc function feat: Add Spark date_trunc function Feb 15, 2025
@zml1206
Copy link
Contributor Author

zml1206 commented Feb 17, 2025

@zml1206 Would you like to rebase this PR? Thanks.

Rebased, CI failures seems unrelated. @rui-mo

fix style

fix

fix

fix

fix

fix

fix

update for reuse

fix style

fix

fix

fix

fix

fix style

update

update

fix

fix

update

update

fix

fix

update

update

update

update

add ut

add ut
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good overall.

Extend funciton fromDateTimeUnitString

Would be elaborate how the function is extended in the PR description? Thanks.

@zml1206
Copy link
Contributor Author

zml1206 commented Feb 27, 2025

@rui-mo all comments resolved, can you help take a look again? thanks.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks nice! Just added some nits.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just several nits.

.. spark:function:: date_trunc(fmt, ts) -> timestamp

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Returns null if ``fmt`` is invalid, microseconds and abbreviated unit string are allowed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment suggestion:

Returns null if ``fmt`` is invalid. ``fmt`` as "MICROSECOND" and abbreviated unit string are allowed.

.. spark:function:: date_trunc(fmt, ts) -> timestamp

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Returns null if ``fmt`` is invalid, microseconds and abbreviated unit string are allowed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: null -> NULL

SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123
SELECT date_trunc('MICROSECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you provide an example of invalid fmt string?

}

Timestamp truncateTimestamp(
const Timestamp& timestamp,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timestamp is fixed-width. Better to pass by value.

@@ -25,4 +25,179 @@ const folly::F14FastMap<std::string, int8_t> kDayOfWeekNames{
{"tue", 5}, {"wed", 6}, {"thursday", 0}, {"friday", 1},
{"saturday", 2}, {"sunday", 3}, {"monday", 4}, {"tuesday", 5},
{"wednesday", 6}};

std::optional<DateTimeUnit> fromDateTimeUnitString(
const StringView& unitString,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Can pass by value.

@@ -515,6 +515,40 @@ struct DateFromUnixDateFunction {
}
};

/// Truncates a timestamp to a specified time unit. Return null if the
/// format is invalid, microseconds and abbreviated unit string are allowed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return NULL if the format is invalid. Format as abbreviated unit string and "microseconds" are allowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants