-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
skip expensive diff operations when possible #5342
Conversation
CodSpeed Performance ReportMerging #5342 will improve performances by 35.91%Comparing Summary
Benchmarks breakdown
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plan to go through and add some more comments/doc strings b/c some of these changes are unfortunately, a little hard to understand at first glance
@@ -70,7 +70,7 @@ async def rebase_branch(branch: str) -> None: | |||
service=service, | |||
) | |||
diff_repository = await component_registry.get_component(DiffRepository, db=db, branch=obj) | |||
enriched_diff = await diff_coordinator.update_branch_diff(base_branch=base_branch, diff_branch=obj) | |||
enriched_diff = await diff_coordinator.update_branch_diff_and_return(base_branch=base_branch, diff_branch=obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made this change in a number of places b/c there are now two version of this method
update_branch_diff
might returnEnrichedDiffsMetadata
(no actual diff data) instead ofEnrichedDiffs
update_branch_diff_and_return
will always returnEnrichedDiffs
@@ -320,6 +320,7 @@ def _combine_relationships( | |||
combined_relationship = EnrichedDiffRelationship( | |||
name=later_relationship.name, | |||
label=later_relationship.label, | |||
identifier=later_relationship.identifier, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the identifier
of a relationship to the diff b/c it is easier and cleaner than always getting it out of the SchemaBranch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where most of the changes are. needed a lot of refactoring to support EnrichedDiffs
and EnrichedDiffsMetadata
) | ||
if not isinstance(enriched_diffs, EnrichedDiffs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this conditional shows up a lot. it is to check if the enriched_diffs
variable is an instance of EnrichedDiffsMetadata
or EnrichedDiffs
in a manner that mypy
accepts. EnrichedDiffs
inherits from EnrichedDiffsMetadata
for relationship in node.relationships: | ||
relationship_schema = node_schema.get_relationship(name=relationship.name) | ||
specifiers.add(NodeFieldSpecifier(node_uuid=node.uuid, field_name=relationship_schema.get_identifier())) | ||
return specifiers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with a simple database query
) | ||
|
||
data_conflicts = await self._get_data_conflicts(enriched_diff=enriched_diff) | ||
enriched_conflicts_map = self._get_enriched_conflicts_map(enriched_diff=enriched_diff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes to handle the case when synchronize
receives an EnrichedDiffsMetadata
(which does not include conflicts data)
the has_validator
conditional is used to determine if this is a new ProposedChange that needs to have its conflict data set for the first time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new query to get the node_uuid-attribute_name/relationship_identifier tuples that the diff calculation query uses to identify nodes to include
hopefully can just be incorporated into the diff calculation query so that we can skip the step of getting it into memory completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, might be good to have someone from the CS team do some additional tests before merging it
274883c
to
def1b18
Compare
converting to a draft to fix an issue that can crash that database on a very large diff |
def1b18
to
ffa586b
Compare
IFC-1004
the goal of this PR is to skip expensive diff operations (loading, calculating, saving, and combining) where possible. These operations are not too expensive when small, but diffs can be very big and the operations become slower as the size of the diff increases.
the previous approach for calculating an incremental diff (meaning at least one diff covering some of the requested time period already exists) was to load the existing diff, calculate the missing piece, add them together, repeat for all the missing time ranges, then save the whole updated diff. Sometimes this is still required, but we now try very hard to make sure that any of these expensive operations are absolutely necessary before running them.
this PR introduces the the
EnrichedDiffMetadata
class, which is what is sounds like: all of the data about a diff without the actual data, so from_time, to_time, branches, tracking ID, uuid, but no nodes. we try to use this class for as long as possible and only hydrate theEnrichedDiffMetadata
instance if we really need to add to calculated diffs together.new checks that we perform to avoid hydrating a diff:
this PR got bigger than I would have liked b/c
DiffCoordinator
assumed that theEnrichedDiffs
instances always had full data and needed a lot of changes to support the new "only get the diff data if you really need it" logic