Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip expensive diff operations when possible #5342

Merged
merged 18 commits into from
Jan 9, 2025

Conversation

ajtmccarty
Copy link
Contributor

@ajtmccarty ajtmccarty commented Dec 31, 2024

IFC-1004

the goal of this PR is to skip expensive diff operations (loading, calculating, saving, and combining) where possible. These operations are not too expensive when small, but diffs can be very big and the operations become slower as the size of the diff increases.

the previous approach for calculating an incremental diff (meaning at least one diff covering some of the requested time period already exists) was to load the existing diff, calculate the missing piece, add them together, repeat for all the missing time ranges, then save the whole updated diff. Sometimes this is still required, but we now try very hard to make sure that any of these expensive operations are absolutely necessary before running them.

this PR introduces the the EnrichedDiffMetadata class, which is what is sounds like: all of the data about a diff without the actual data, so from_time, to_time, branches, tracking ID, uuid, but no nodes. we try to use this class for as long as possible and only hydrate the EnrichedDiffMetadata instance if we really need to add to calculated diffs together.

new checks that we perform to avoid hydrating a diff:

  • quick query to check if there are any changes on either branch involved in the diff in the time frame. if none, then we can just adjust the time frame of the aggregated diff to include this time frame with no changes
  • if we run the diff calculation for a time frame and it returns no changes, we can do basically the same thing

this PR got bigger than I would have liked b/c DiffCoordinator assumed that the EnrichedDiffs instances always had full data and needed a lot of changes to support the new "only get the diff data if you really need it" logic

@github-actions github-actions bot added the group/backend Issue related to the backend (API Server, Git Agent) label Dec 31, 2024
Copy link

codspeed-hq bot commented Dec 31, 2024

CodSpeed Performance Report

Merging #5342 will improve performances by 35.91%

Comparing ajtm-12312024-skip-diff-no-changes (e10dc77) with stable (929df94)

Summary

⚡ 1 improvements
✅ 9 untouched benchmarks

Benchmarks breakdown

Benchmark stable ajtm-12312024-skip-diff-no-changes Change
test_schemabranch_duplicate 443.6 µs 326.4 µs +35.91%

@ajtmccarty ajtmccarty changed the title WIP skip diff calculations when possible skip expensive diff operations when possible Jan 3, 2025
Copy link
Contributor Author

@ajtmccarty ajtmccarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plan to go through and add some more comments/doc strings b/c some of these changes are unfortunately, a little hard to understand at first glance

@@ -70,7 +70,7 @@ async def rebase_branch(branch: str) -> None:
service=service,
)
diff_repository = await component_registry.get_component(DiffRepository, db=db, branch=obj)
enriched_diff = await diff_coordinator.update_branch_diff(base_branch=base_branch, diff_branch=obj)
enriched_diff = await diff_coordinator.update_branch_diff_and_return(base_branch=base_branch, diff_branch=obj)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this change in a number of places b/c there are now two version of this method

  • update_branch_diff might return EnrichedDiffsMetadata (no actual diff data) instead of EnrichedDiffs
  • update_branch_diff_and_return will always return EnrichedDiffs

@@ -320,6 +320,7 @@ def _combine_relationships(
combined_relationship = EnrichedDiffRelationship(
name=later_relationship.name,
label=later_relationship.label,
identifier=later_relationship.identifier,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the identifier of a relationship to the diff b/c it is easier and cleaner than always getting it out of the SchemaBranch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where most of the changes are. needed a lot of refactoring to support EnrichedDiffs and EnrichedDiffsMetadata

)
if not isinstance(enriched_diffs, EnrichedDiffs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this conditional shows up a lot. it is to check if the enriched_diffs variable is an instance of EnrichedDiffsMetadata or EnrichedDiffs in a manner that mypy accepts. EnrichedDiffs inherits from EnrichedDiffsMetadata

for relationship in node.relationships:
relationship_schema = node_schema.get_relationship(name=relationship.name)
specifiers.add(NodeFieldSpecifier(node_uuid=node.uuid, field_name=relationship_schema.get_identifier()))
return specifiers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced with a simple database query

)

data_conflicts = await self._get_data_conflicts(enriched_diff=enriched_diff)
enriched_conflicts_map = self._get_enriched_conflicts_map(enriched_diff=enriched_diff)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes to handle the case when synchronize receives an EnrichedDiffsMetadata (which does not include conflicts data)
the has_validator conditional is used to determine if this is a new ProposedChange that needs to have its conflict data set for the first time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new query to get the node_uuid-attribute_name/relationship_identifier tuples that the diff calculation query uses to identify nodes to include
hopefully can just be incorporated into the diff calculation query so that we can skip the step of getting it into memory completely

@ajtmccarty ajtmccarty marked this pull request as ready for review January 3, 2025 02:54
@ajtmccarty ajtmccarty requested a review from a team January 3, 2025 02:54
Copy link
Collaborator

@dgarros dgarros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, might be good to have someone from the CS team do some additional tests before merging it

@ajtmccarty ajtmccarty force-pushed the ajtm-12312024-skip-diff-no-changes branch 2 times, most recently from 274883c to def1b18 Compare January 6, 2025 14:33
@ajtmccarty ajtmccarty marked this pull request as draft January 7, 2025 22:01
@ajtmccarty
Copy link
Contributor Author

converting to a draft to fix an issue that can crash that database on a very large diff

@ajtmccarty ajtmccarty force-pushed the ajtm-12312024-skip-diff-no-changes branch from def1b18 to ffa586b Compare January 8, 2025 22:10
@ajtmccarty ajtmccarty marked this pull request as ready for review January 9, 2025 01:20
@dgarros dgarros merged commit a120fb8 into stable Jan 9, 2025
34 checks passed
@dgarros dgarros deleted the ajtm-12312024-skip-diff-no-changes branch January 9, 2025 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
group/backend Issue related to the backend (API Server, Git Agent)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants