Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff calculation performance improvement #5478

Merged
merged 2 commits into from
Jan 16, 2025
Merged

Conversation

ajtmccarty
Copy link
Contributor

@ajtmccarty ajtmccarty commented Jan 15, 2025

IFC-1129

updates the DiffAllPathsQuery to speed up calculating an incremental diff on the base branch (main) portion of the diff. my tests for calculating/updating very large diffs indicate that this is a bottleneck that slows down updating a diff even when there are very few or no changes on the main branch.

my local testing indicates that this change speeds up calculating a diff on main when there are no/very few changes on main by a factor of 4, which is pretty good, but I think there is still more that we can do

@github-actions github-actions bot added the group/backend Issue related to the backend (API Server, Git Agent) label Jan 15, 2025
Copy link

codspeed-hq bot commented Jan 15, 2025

CodSpeed Performance Report

Merging #5478 will not alter performance

Comparing ajtm-01152025-diff-perf (8344849) with stable (9ec036d)

Summary

✅ 10 untouched benchmarks

@ajtmccarty ajtmccarty marked this pull request as ready for review January 16, 2025 00:44
@ajtmccarty ajtmccarty requested a review from a team January 16, 2025 00:44
@@ -61,7 +61,6 @@ async def get(
include_empty: bool = False,
) -> list[EnrichedDiffRoot]:
final_max_depth = config.SETTINGS.database.max_depth_search_hierarchy
final_limit = limit or config.SETTINGS.database.query_size_limit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will prevent updating diffs with over 5000 nodes correctly

WITH nfs[0] AS uuid, nfs[1] AS field_name
WITH collect(DISTINCT uuid) as uuids, collect(DISTINCT field_name) AS field_names
RETURN uuids AS node_ids_list, field_names AS field_names_list
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when calculating a diff on the base branch (you can think of it as main b/c it almost always is) we need to tell the query which nodes and attributes/relationships it should process b/c the main diff only needs to look at nodes and attributes/relationships that are included in the branch diff
in the case of a big diff (think thousands of nodes), this list of "node_field_specifiers", can be over 100,000 elements. each element is just a UUID and a name, but this is still quite a long list

the previous filtering logic in this query, checked if the UUID of a Node node and the name/identifier of an Attribute or Relationship node was in this potentially huge list of node field specifiers. it could take a very long time to check every potential node against this big list

this change uses two smaller lists of unique node UUIDs and unique attribute names/relationship identifiers as a pre-check that can be executed much more quickly than the check against the node field specifier list

@ajtmccarty ajtmccarty merged commit 7cf1310 into stable Jan 16, 2025
32 checks passed
@ajtmccarty ajtmccarty deleted the ajtm-01152025-diff-perf branch January 16, 2025 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
group/backend Issue related to the backend (API Server, Git Agent)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants