Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IWF-208: Optimize executingStateIdMode for state transition #460

Merged
merged 17 commits into from
Oct 24, 2024

Conversation

lwolczynski
Copy link
Contributor

@lwolczynski lwolczynski commented Oct 21, 2024

Description

Checklist

  • Code compiles correctly
  • Tests for the changes have been added
  • All tests passing
  • This PR change is backwards-compatible
  • This PR CONTAINS a (planned) breaking change (it is not backwards compatible)

Related Issue

Closes #454

shouldSkipUpsertingSAs := true
for _, s := range stateTransition.next {
if !s.StateOptions.GetSkipWaitUntil() {
shouldSkipUpsertingSAs = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think perhaps I'm missing some context or understanding of what we want to achieve here - what happens when the first StateMovement in the next array skips waitUntil, but subsequent ones do not? If I'm reading this correctly, with this change if any next StateMovement does not have a waitUntil, the SearchAttributeExecutingStateIds will not be updated...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... I have that wrong don't I... if any of the next states do have a waitUntil (i.e. they don't skip wait until) then do continue on to UpsertSearchAttributes??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if any of the next states don't skip waitUntil, then the SAs will be upserted. Also, if I'm not mistaken, nextStates are the states that are executed in the state transition, example:
current = A
next = [B, C, D]
means that B, C, and D are executed at the same time after A is completed.

}
}

return e.updateStateIdSearchAttribute()
}

func determineIfShouldSkipUpsert(currentState iwfidl.StateMovement, nextStates []iwfidl.StateMovement) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the signature would have pass ExecutingStateIdMode.
I think the currentState can be omitted.

func determineIfShouldSkipUpsert(mode ExecutingStateIdMode, nextStates ) bool{
  // based on the mode
  // For ENABLED_FOR_ALL, return true as long as any of next states is going to a next state(not closing state)
  // For ENABLED_FOR_STATES_WITH_WAIT_UNTIL, in addition to above, also check the state has waitUntil.
}

StateMovement may not go to any state, because we are doing a hack in iwf (I really regret that) 😨
We use some special StateIDs to prepresent gracefulComplete, deadEnd etc. See https://github.com/indeedeng/iwf/blob/main/service/const.go#L68
And how it's being used in https://github.com/indeedeng/iwf/blob/main/service/interpreter/workflowImpl.go#L373

Copy link
Contributor Author

@lwolczynski lwolczynski Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the signature would have pass ExecutingStateIdMode.

I'm not passing it as I only invoke determineIfShouldSkipUpsert when needed based on ExecutingStateIdMode

currentState is needed if we want to determine when a state is looping back, I believe

Copy link
Contributor

@longquanzheng longquanzheng Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currentState is needed if we want to determine when a state is looping back, I believe

for that case, I belive what you have in updateStateIdSearchAttribute baesd on comparing with GetSearchAttributes will cover it?

Copy link
Contributor

@longquanzheng longquanzheng Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if MarkStateExecutionCompleted not delete the SA, next time when MarkStateExecutingIfNotYet will callupdateStateIdSearchAttribute which will then skip it because the SA is still there

@lwolczynski lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 0358c1b to f27d960 Compare October 22, 2024 22:03
@@ -84,15 +97,17 @@ func (e *StateExecutionCounter) MarkStateIdExecutingIfNotYet(stateReqs []StateRe
case iwfidl.DISABLED:
// do nothing
case iwfidl.ENABLED_FOR_ALL:
if e.increaseStateIdCurrentlyExecutingCounts(s) {
e.increaseStateIdCurrentlyExecutingCounts(s)
if !slices.Contains(currentSAs.StringArrayValue, s.StateId) {
Copy link
Contributor

@longquanzheng longquanzheng Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reflecting into this for a while, I found issues with this partial comparison using Contains. I think just checking containing is probably not correct, it should be a full comparison.

Consider this case (assuming using ENABLED_FOR_ALL so that we don't care about the waitUntil):

  • S1->S2+S1
  • S2->S1
    So there are 3 S1 executions. Assuming the first S1-1 completed, then goes to S2-1 and S1-2. But S1-2 didn't complete yet, then S2-1 completed and decided to go to S1-3.
    Now S1-3 will call this MarkStateIdExecutingIfNotYet:
  • Because of the optimization in MarkStateExecutionCompleted, S2 is not deleted and still in the search attribute. So the SA contains both: [S1, S2]. (Assuming you are implementing the MarkStateExecutionCompleted optimization as we discussed today -- skip deleting the stateID from SA if there is a next state going to refresh the whole SA)
  • Now for S1-3 calling MarkStateIdExecutingIfNotYet, it will see the SA contains S1 already, but it also contains S2. Therefore, it should still refresh the SA based on stateIdCurrentlyExecutingCounts (Note: it may be a lot better to rename updateStateIdSearchAttribute -> refreshIwfExecutingStateIdSearchAttribute to be more clear.

I remember you were doing the full comparison when you put it in the updateStateIdSearchAttribute, not sure why you changed your mind 🤣

I think you may still run into NDE if putting it into that method, but instead, maybe you can create another method checkIfNeedToRefreshIwfExecutingStateIdSearchAttribute before calling refreshIwfExecutingStateIdSearchAttribute.

I will find sometime later to dig into the NDE again. It's very weird to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I see the point and it makes sense.

I pushed another commit that should address this issue. Let's discuss it later today

@lwolczynski lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 76f6f15 to 0c83e1a Compare October 23, 2024 15:56
@lwolczynski lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from 4d0f7c5 to 3b7e065 Compare October 23, 2024 18:29
if ok {
currentSAsValues = currentSAs.StringArrayValue
} else {
e.provider.GetLogger(e.ctx).Error("search attribute IwfExecutingStateIds is not found", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's epxected to have empty/nil value to start with, so we probably just ignore this case, or just assign a empty slice so that DeepEqual can return true when comparing with another empty slice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -135,25 +164,56 @@ func (e *StateExecutionCounter) MarkStateExecutionCompleted(state iwfidl.StateMo
case iwfidl.DISABLED:
return nil
case iwfidl.ENABLED_FOR_ALL:
e.decreaseStateIdCurrentlyExecutingCounts(state)
e.decreaseStateIdCurrentlyExecutingCounts(currentState)
shouldSkipUpsert := determineIfShouldSkipRefreshOnCompleted(nextStates, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: shouldSkipUpsert -> shouldSkipRefresh. I feel like one of the confusion is that this is not really upserting new stateID to the SA, but deleting statID by refreshing the whole map (keys of stateIdCurrentlyExecutingCounts)

}
} else {
for _, s := range nonClosingNextStates {
if !s.StateOptions.GetSkipWaitUntil() {
Copy link
Contributor

@longquanzheng longquanzheng Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be !compatibility.GetSkipWaitUntilApi(&options) like you did in #453

Btw, this double negative is hard to follow. maybe you can make it compatibility.HasWaitUntilApi(&options) which will make it more clear in this statement -- when any of the next states has waitUntil, return true to skip refresh (because the next state will refresh immediately)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need !compatibility.GetSkipWaitUntilApi(&options)? This check is only for the newest version so I thought using s.StateOptions.GetSkipWaitUntil() was safe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it, but would like to understand better why not use s.StateOptions.GetSkipWaitUntil() in this case. Unless it's only a safety measure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is only for the newest version so I thought using s.StateOptions.GetSkipWaitUntil() was safe

Oh, you are right. It's not needed because of the versioning

@longquanzheng
Copy link
Contributor

Looks good now. Thanks for the hard work!

@lwolczynski lwolczynski marked this pull request as ready for review October 23, 2024 21:26
@lwolczynski lwolczynski force-pushed the jira/lwolczynski/IWF-208 branch from daa3bd8 to f831664 Compare October 24, 2024 02:28
executingStateIds = append(executingStateIds, sid)
}

if reflect.DeepEqual(currentSAsValues, executingStateIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we should use something like elementMatch to compare ? Because the order could be different https://stackoverflow.com/questions/36000487/check-for-equality-on-slices-without-order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort the slice (line 170)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I agree doing elementMatch would be easier. I learned about it after I added sorting... 🙃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Sorry I didn’t read your latest code yet. I just thinking about it and wondering why you had NDE yesterday. This could be the reasons. Especially doing it in the update/refresh method for both MarkExecuting and MarkCompleted, it increases a lot the likelihood of NDE because of the randomness

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you push the latest code? I don’t see the sorting ( do you mean sorting the keys of the map, right?)

Copy link
Contributor Author

@lwolczynski lwolczynski Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see the confusion. I was focusing on the tests... my bad. Yes, I should add sorting or elementMatch here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added now

Copy link
Contributor

@longquanzheng longquanzheng Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now with this sorting, WDYT try move the whole optimization back to refreshIwfExecutingStateIdSearchAttribute again? I think if this solves the NDE, it will be also help the issue of #460 (comment)

Comment on lines 140 to 145
assertions.Equal([]string{"S3", "S6", "S7"}, historyEventSAs(upsertSAEvents[5]))
assertions.Equal([]string{"S3", "S6"}, historyEventSAs(upsertSAEvents[6]))
// TODO: This is unexpected; should not upsert the same SAs -- happens after "_SYS_GRACEFUL_COMPLETING_WORKFLOW"
assertions.Equal([]string{"S3"}, historyEventSAs(upsertSAEvents[7]))
assertions.Equal([]string{"S3"}, historyEventSAs(upsertSAEvents[8]))
assertions.Equal([]string{"null"}, historyEventSAs(upsertSAEvents[9]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we should not upsert "S3" twice. Seems that we always upsert on _SYS_GRACEFUL_COMPLETING_WORKFLOW

  • SAs -> "S3", "S6", "S7"
  • S7-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW
  • SAs -> "S3", "S6"
  • S6-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW
  • SAs -> "S3"
  • S3-1 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW
  • SAs -> "S3"
  • S3-2 -> _SYS_GRACEFUL_COMPLETING_WORKFLOW
  • SAs -> null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that because of we miss an existing optimization— on MarkCompletion, we should only refresh when any stateID reaches to zero (needs to delete)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this should be the last thing to tackle. I'll take care of it in the morning

Copy link
Contributor

@longquanzheng longquanzheng Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest thought: If the NDE is actually because of sorting and we can fix it and then move the logic back to the refresh method, then I guess it should also solve this issue of Upsert duplicate S3 -- we don’t need to check if a StateId count reaches zero like previous version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this issue now and tests are passing as expected. We can try moving back to the refresh method

@lwolczynski lwolczynski merged commit 55f45a0 into jira/lwolczynski/IWF-129 Oct 24, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants