IWF-208: Optimize executingStateIdMode for state transition #460

lwolczynski · 2024-10-21T20:59:37Z

Description

Checklist

Code compiles correctly
Tests for the changes have been added
All tests passing
This PR change is backwards-compatible
This PR CONTAINS a (planned) breaking change (it is not backwards compatible)

Related Issue

Closes #454

mixydubs · 2024-10-22T13:21:48Z

service/interpreter/stateExecutionCounter.go

+		shouldSkipUpsertingSAs := true
+		for _, s := range stateTransition.next {
+			if !s.StateOptions.GetSkipWaitUntil() {
+				shouldSkipUpsertingSAs = false


I think perhaps I'm missing some context or understanding of what we want to achieve here - what happens when the first StateMovement in the next array skips waitUntil, but subsequent ones do not? If I'm reading this correctly, with this change if any next StateMovement does not have a waitUntil, the SearchAttributeExecutingStateIds will not be updated...

hmm... I have that wrong don't I... if any of the next states do have a waitUntil (i.e. they don't skip wait until) then do continue on to UpsertSearchAttributes??

Right, if any of the next states don't skip waitUntil, then the SAs will be upserted. Also, if I'm not mistaken, nextStates are the states that are executed in the state transition, example:
current = A
next = [B, C, D]
means that B, C, and D are executed at the same time after A is completed.

longquanzheng · 2024-10-22T21:03:59Z

service/interpreter/stateExecutionCounter.go

 		}
 	}

 	return e.updateStateIdSearchAttribute()
 }

+func determineIfShouldSkipUpsert(currentState iwfidl.StateMovement, nextStates []iwfidl.StateMovement) bool {


I thought the signature would have pass ExecutingStateIdMode.
I think the currentState can be omitted.

func determineIfShouldSkipUpsert(mode ExecutingStateIdMode, nextStates ) bool{ // based on the mode // For ENABLED_FOR_ALL, return true as long as any of next states is going to a next state(not closing state) // For ENABLED_FOR_STATES_WITH_WAIT_UNTIL, in addition to above, also check the state has waitUntil. }

StateMovement may not go to any state, because we are doing a hack in iwf (I really regret that) 😨
We use some special StateIDs to prepresent gracefulComplete, deadEnd etc. See https://github.com/indeedeng/iwf/blob/main/service/const.go#L68
And how it's being used in https://github.com/indeedeng/iwf/blob/main/service/interpreter/workflowImpl.go#L373

I thought the signature would have pass ExecutingStateIdMode.

I'm not passing it as I only invoke determineIfShouldSkipUpsert when needed based on ExecutingStateIdMode

currentState is needed if we want to determine when a state is looping back, I believe

currentState is needed if we want to determine when a state is looping back, I believe

for that case, I belive what you have in updateStateIdSearchAttribute baesd on comparing with GetSearchAttributes will cover it?

Basically, if MarkStateExecutionCompleted not delete the SA, next time when MarkStateExecutingIfNotYet will callupdateStateIdSearchAttribute which will then skip it because the SA is still there

longquanzheng · 2024-10-23T06:13:24Z

service/interpreter/stateExecutionCounter.go

@@ -84,15 +97,17 @@ func (e *StateExecutionCounter) MarkStateIdExecutingIfNotYet(stateReqs []StateRe
 			case iwfidl.DISABLED:
 				// do nothing
 			case iwfidl.ENABLED_FOR_ALL:
-				if e.increaseStateIdCurrentlyExecutingCounts(s) {
+				e.increaseStateIdCurrentlyExecutingCounts(s)
+				if !slices.Contains(currentSAs.StringArrayValue, s.StateId) {


After reflecting into this for a while, I found issues with this partial comparison using Contains. I think just checking containing is probably not correct, it should be a full comparison.

Consider this case (assuming using ENABLED_FOR_ALL so that we don't care about the waitUntil):

S1->S2+S1

S2->S1
So there are 3 S1 executions. Assuming the first S1-1 completed, then goes to S2-1 and S1-2. But S1-2 didn't complete yet, then S2-1 completed and decided to go to S1-3.
Now S1-3 will call this MarkStateIdExecutingIfNotYet:

Because of the optimization in MarkStateExecutionCompleted, S2 is not deleted and still in the search attribute. So the SA contains both: [S1, S2]. (Assuming you are implementing the MarkStateExecutionCompleted optimization as we discussed today -- skip deleting the stateID from SA if there is a next state going to refresh the whole SA)

Now for S1-3 calling MarkStateIdExecutingIfNotYet, it will see the SA contains S1 already, but it also contains S2. Therefore, it should still refresh the SA based on stateIdCurrentlyExecutingCounts (Note: it may be a lot better to rename updateStateIdSearchAttribute -> refreshIwfExecutingStateIdSearchAttribute to be more clear.

I remember you were doing the full comparison when you put it in the updateStateIdSearchAttribute, not sure why you changed your mind 🤣

I think you may still run into NDE if putting it into that method, but instead, maybe you can create another method checkIfNeedToRefreshIwfExecutingStateIdSearchAttribute before calling refreshIwfExecutingStateIdSearchAttribute.

I will find sometime later to dig into the NDE again. It's very weird to me

Got it, I see the point and it makes sense.

I pushed another commit that should address this issue. Let's discuss it later today

longquanzheng · 2024-10-23T19:23:14Z

service/interpreter/stateExecutionCounter.go

+		if ok {
+			currentSAsValues = currentSAs.StringArrayValue
+		} else {
+			e.provider.GetLogger(e.ctx).Error("search attribute IwfExecutingStateIds is not found", err)


I feel like it's epxected to have empty/nil value to start with, so we probably just ignore this case, or just assign a empty slice so that DeepEqual can return true when comparing with another empty slice

longquanzheng · 2024-10-23T19:25:11Z

service/interpreter/stateExecutionCounter.go

@@ -135,25 +164,56 @@ func (e *StateExecutionCounter) MarkStateExecutionCompleted(state iwfidl.StateMo
 		case iwfidl.DISABLED:
 			return nil
 		case iwfidl.ENABLED_FOR_ALL:
-			e.decreaseStateIdCurrentlyExecutingCounts(state)
+			e.decreaseStateIdCurrentlyExecutingCounts(currentState)
+			shouldSkipUpsert := determineIfShouldSkipRefreshOnCompleted(nextStates, true)


Nit: shouldSkipUpsert -> shouldSkipRefresh. I feel like one of the confusion is that this is not really upserting new stateID to the SA, but deleting statID by refreshing the whole map (keys of stateIdCurrentlyExecutingCounts)

longquanzheng · 2024-10-23T19:30:45Z

service/interpreter/stateExecutionCounter.go

+		}
+	} else {
+		for _, s := range nonClosingNextStates {
+			if !s.StateOptions.GetSkipWaitUntil() {


It should be !compatibility.GetSkipWaitUntilApi(&options) like you did in #453

Btw, this double negative is hard to follow. maybe you can make it compatibility.HasWaitUntilApi(&options) which will make it more clear in this statement -- when any of the next states has waitUntil, return true to skip refresh (because the next state will refresh immediately)

Do we need !compatibility.GetSkipWaitUntilApi(&options)? This check is only for the newest version so I thought using s.StateOptions.GetSkipWaitUntil() was safe

I changed it, but would like to understand better why not use s.StateOptions.GetSkipWaitUntil() in this case. Unless it's only a safety measure

This check is only for the newest version so I thought using s.StateOptions.GetSkipWaitUntil() was safe

Oh, you are right. It's not needed because of the versioning

longquanzheng · 2024-10-23T19:31:20Z

Looks good now. Thanks for the hard work!

integ/wait_until_search_attributes_optimization_test.go

longquanzheng · 2024-10-24T03:51:22Z

service/interpreter/stateExecutionCounter.go

+			executingStateIds = append(executingStateIds, sid)
+		}
+
+		if reflect.DeepEqual(currentSAsValues, executingStateIds) {


Actually, I think we should use something like elementMatch to compare ? Because the order could be different https://stackoverflow.com/questions/36000487/check-for-equality-on-slices-without-order

I sort the slice (line 170)

But I agree doing elementMatch would be easier. I learned about it after I added sorting... 🙃

Nice. Sorry I didn’t read your latest code yet. I just thinking about it and wondering why you had NDE yesterday. This could be the reasons. Especially doing it in the update/refresh method for both MarkExecuting and MarkCompleted, it increases a lot the likelihood of NDE because of the randomness

Have you push the latest code? I don’t see the sorting ( do you mean sorting the keys of the map, right?)

~~#460 (review)~~

Oh, I see the confusion. I was focusing on the tests... my bad. Yes, I should add sorting or elementMatch here.

Now with this sorting, WDYT try move the whole optimization back to refreshIwfExecutingStateIdSearchAttribute again? I think if this solves the NDE, it will be also help the issue of #460 (comment)

lwolczynski · 2024-10-24T03:56:00Z

integ/wait_until_search_attributes_optimization_test.go

+		assertions.Equal([]string{"S3", "S6", "S7"}, historyEventSAs(upsertSAEvents[5]))
+		assertions.Equal([]string{"S3", "S6"}, historyEventSAs(upsertSAEvents[6]))
+		// TODO: This is unexpected; should not upsert the same SAs -- happens after "_SYS_GRACEFUL_COMPLETING_WORKFLOW"
+		assertions.Equal([]string{"S3"}, historyEventSAs(upsertSAEvents[7]))
+		assertions.Equal([]string{"S3"}, historyEventSAs(upsertSAEvents[8]))
+		assertions.Equal([]string{"null"}, historyEventSAs(upsertSAEvents[9]))


Technically we should not upsert "S3" twice. Seems that we always upsert on _SYS_GRACEFUL_COMPLETING_WORKFLOW

SAs -> "S3", "S6", "S7"

S7-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW

SAs -> "S3", "S6"

S6-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW

SAs -> "S3"

S3-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW

SAs -> "S3"

S3-2 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW

SAs -> null

Is that because of we miss an existing optimization— on MarkCompletion, we should only refresh when any stateID reaches to zero (needs to delete)

Okay, this should be the last thing to tackle. I'll take care of it in the morning

Latest thought: If the NDE is actually because of sorting and we can fix it and then move the logic back to the refresh method, then I guess it should also solve this issue of Upsert duplicate S3 -- we don’t need to check if a StateId count reaches zero like previous version

Fixed this issue now and tests are passing as expected. We can try moving back to the refresh method

IWF-208: Optimize executingStateIdMode for state transition

db96c7b

mixydubs reviewed Oct 22, 2024

View reviewed changes

lwolczynski added 3 commits October 22, 2024 10:47

IWF-208: Skip upserting SAs if state transition goes in loop

d44ab10

IWF-208: Add further upsert SAs rules

4468bdd

IWF-208: Fix lint issue

4426149

longquanzheng reviewed Oct 22, 2024

View reviewed changes

IWF-208: Refactor

f27d960

lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 0358c1b to f27d960 Compare October 22, 2024 22:03

longquanzheng reviewed Oct 23, 2024

View reviewed changes

IWF-208: Change logic from checking if any state skips waitUntil to all

0c83e1a

lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 76f6f15 to 0c83e1a Compare October 23, 2024 15:56

lwolczynski added 3 commits October 23, 2024 11:18

IWF-208: Refactor

d1d7612

IWF-208: Refactor

6d524e4

IWF-208: Add GetApiService method to client

3b7e065

lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 4d0f7c5 to 3b7e065 Compare October 23, 2024 18:29

longquanzheng reviewed Oct 23, 2024

View reviewed changes

lwolczynski added 2 commits October 23, 2024 16:18

IWF-208: Add test

d09bce8

IWF-208: Refactor

13668fb

lwolczynski marked this pull request as ready for review October 23, 2024 21:26

lwolczynski requested review from longquanzheng and mixydubs October 23, 2024 21:26

lwolczynski commented Oct 23, 2024

View reviewed changes

integ/wait_until_search_attributes_optimization_test.go Outdated Show resolved Hide resolved

IWF-208: Make test more complex

f831664

lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from daa3bd8 to f831664 Compare October 24, 2024 02:28

IWF-208: Update test

ce4c5af

longquanzheng reviewed Oct 24, 2024

View reviewed changes

lwolczynski commented Oct 24, 2024

View reviewed changes

lwolczynski added 2 commits October 23, 2024 23:20

IWF-208: Add sorting

fe55ff5

IWF-208: Fix a case where a state has multiple executions

67d6c90

longquanzheng approved these changes Oct 24, 2024

View reviewed changes

lwolczynski added 2 commits October 24, 2024 12:38

IWF-208: Refactor

dc126ca

IWF-208: Add replay test

0ab8063

lwolczynski merged commit 55f45a0 into jira/lwolczynski/IWF-129 Oct 24, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IWF-208: Optimize executingStateIdMode for state transition #460

IWF-208: Optimize executingStateIdMode for state transition #460

lwolczynski commented Oct 21, 2024 •

edited

Loading

mixydubs Oct 22, 2024

mixydubs Oct 22, 2024

lwolczynski Oct 22, 2024

longquanzheng Oct 22, 2024

lwolczynski Oct 22, 2024 •

edited

Loading

longquanzheng Oct 22, 2024 •

edited

Loading

longquanzheng Oct 22, 2024 •

edited

Loading

longquanzheng Oct 23, 2024 •

edited

Loading

lwolczynski Oct 23, 2024

longquanzheng Oct 23, 2024

lwolczynski Oct 23, 2024

longquanzheng Oct 23, 2024

longquanzheng Oct 23, 2024 •

edited

Loading

lwolczynski Oct 23, 2024

lwolczynski Oct 23, 2024

longquanzheng Oct 23, 2024

longquanzheng commented Oct 23, 2024

longquanzheng Oct 24, 2024

lwolczynski Oct 24, 2024

lwolczynski Oct 24, 2024

longquanzheng Oct 24, 2024

longquanzheng Oct 24, 2024

lwolczynski Oct 24, 2024 •

edited

Loading

lwolczynski Oct 24, 2024

lwolczynski Oct 24, 2024

longquanzheng Oct 24, 2024 •

edited

Loading

lwolczynski Oct 24, 2024

longquanzheng Oct 24, 2024

lwolczynski Oct 24, 2024

longquanzheng Oct 24, 2024 •

edited

Loading

lwolczynski Oct 24, 2024

IWF-208: Optimize executingStateIdMode for state transition #460

IWF-208: Optimize executingStateIdMode for state transition #460

Conversation

lwolczynski commented Oct 21, 2024 • edited Loading

Description

Checklist

Related Issue

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lwolczynski Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

longquanzheng Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

longquanzheng Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

longquanzheng Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

longquanzheng Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

longquanzheng commented Oct 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lwolczynski Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

longquanzheng Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

longquanzheng Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lwolczynski commented Oct 21, 2024 •

edited

Loading

lwolczynski Oct 22, 2024 •

edited

Loading

longquanzheng Oct 22, 2024 •

edited

Loading

longquanzheng Oct 22, 2024 •

edited

Loading

longquanzheng Oct 23, 2024 •

edited

Loading

longquanzheng Oct 23, 2024 •

edited

Loading

lwolczynski Oct 24, 2024 •

edited

Loading

longquanzheng Oct 24, 2024 •

edited

Loading

longquanzheng Oct 24, 2024 •

edited

Loading