Autodiff Memory Management: BFS #1710

louisfd · 2024-04-29T23:45:44Z

Use a breadth-first search instead of pure recursion in autodiff memory management, because in some settings nodes could be visited recursively far too many times.
Fixes #1702
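
To illustrate the motivation, here is a minimal sketch of a breadth-first walk with a visited set. It is not the code from this PR: `NodeId`, `visit_ancestors` and `diamond` are hypothetical names, and the graph is assumed to be stored as a map from each node to its direct parents. In a diamond-shaped graph, plain recursion reaches the shared ancestor once per path, and stacking n diamonds reaches the bottom node 2^n times. The visited set caps this at one expansion per node.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical node identifier; the real crate defines its own id type.
type NodeId = u64;

/// Visits `start` and each of its ancestors exactly once, breadth first.
/// `parents` maps a node to its direct parents (assumed layout).
fn visit_ancestors(parents: &HashMap<NodeId, Vec<NodeId>>, start: NodeId) -> Vec<NodeId> {
    let mut visited: HashSet<NodeId> = HashSet::new();
    let mut to_visit: VecDeque<NodeId> = VecDeque::new();
    let mut order = Vec::new();

    visited.insert(start);
    to_visit.push_back(start);

    while let Some(node) = to_visit.pop_front() {
        order.push(node);
        if let Some(node_parents) = parents.get(&node) {
            for &parent in node_parents {
                // `insert` returns false when the parent was already queued,
                // so a shared ancestor is never expanded twice.
                if visited.insert(parent) {
                    to_visit.push_back(parent);
                }
            }
        }
    }
    order
}

/// A diamond: node 3 has parents 1 and 2, which both have parent 0.
fn diamond() -> HashMap<NodeId, Vec<NodeId>> {
    HashMap::from([(3, vec![1, 2]), (1, vec![0]), (2, vec![0])])
}
```

Calling `visit_ancestors(&diamond(), 3)` expands node 0 once, even though two paths lead to it.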

crates/burn-autodiff/src/runtime/memory_management.rs

nathanielsimard · 2024-04-30T12:40:45Z

crates/burn-autodiff/src/runtime/memory_management.rs

+                if !visited.contains(&parent) {
+                    to_visit.push((parent, next_mode.clone()));
+                }


As I understand it, we can still visit the same parent multiple times with different modes, but once a parent has been visited, we can't register new modes for it. Wouldn't it make sense to register the modes as well, with a priority (TagAsUseful > Explore)?

The visited collection could be a HashMap that stores the mode used. You couldn't register a parent with Explore once it had already been visited, but you could register it with TagAsUseful if it had only been visited with Explore.
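
A rough sketch of that suggestion, under assumptions: `NodeId`, `Mode` and `register` are illustrative names, not the crate's API, and the priority is encoded through the declaration order of the enum variants.

```rust
use std::collections::HashMap;

// Hypothetical node identifier.
type NodeId = u64;

// Deriving `Ord` orders variants by declaration, so Explore < TagAsUseful.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Mode {
    Explore,
    TagAsUseful,
}

/// Returns true when `node` should be queued (again) with `mode`.
/// A node is re-registered only with a strictly higher-priority mode.
fn register(visited: &mut HashMap<NodeId, Mode>, node: NodeId, mode: Mode) -> bool {
    let seen = visited.get(&node).copied();
    match seen {
        // Already visited with the same or a higher-priority mode: skip.
        Some(previous) if previous >= mode => false,
        // Never visited, or visited only with a lower-priority mode.
        _ => {
            visited.insert(node, mode);
            true
        }
    }
}
```

With this rule each node can be queued at most twice: once with Explore and once with TagAsUseful.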

louisfd · 2024-04-30

Unlike NodeMemoryStatus, for which there is one status per node (Unknown at first, then possibly Unavailable after the first propagation, then possibly Useful after this propagation), there is not one Mode per node.

The Mode made more sense in the previous form of the algorithm. We started in exploration until we found a node to tag as useful. For that node to be usable, all of its parents had to be tagged as useful as well, so the algorithm switched to TagAsUseful for that whole branch. With the visited approach, however, we have to remember to go back to this mode when we reach a parent of a useful node. The mode therefore becomes tied to the nodes in the vec, which is not very elegant, since it was supposed to be a state of the algorithm, not of the node. I think I can remove the concept of Mode altogether; it will be clearer.
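
For readers who have not seen the earlier version, the recursive shape described above might look roughly like the sketch below. This is a reconstruction from the description only, not the previous code: `walk`, `seeds` and `NodeId` are hypothetical, and `seeds` stands in for whatever condition made the algorithm decide to tag a node as useful.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical node identifier.
type NodeId = u64;

#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Explore,
    TagAsUseful,
}

/// Recursive walk in which `mode` is state of the traversal, not of a node.
/// `seeds` are the nodes the walk tags on its own (assumed input).
fn walk(
    node: NodeId,
    mode: Mode,
    parents: &HashMap<NodeId, Vec<NodeId>>,
    seeds: &HashSet<NodeId>,
    useful: &mut HashSet<NodeId>,
) {
    // Once a node is tagged, the whole branch above it is tagged too.
    let mode = if mode == Mode::TagAsUseful || seeds.contains(&node) {
        Mode::TagAsUseful
    } else {
        Mode::Explore
    };
    if mode == Mode::TagAsUseful {
        useful.insert(node);
    }
    if let Some(node_parents) = parents.get(&node) {
        for &parent in node_parents {
            // No visited set here: a shared parent is walked once per path.
            walk(parent, mode, parents, seeds, useful);
        }
    }
}

/// Diamond graph 3 -> {1, 2} -> 0, where only node 1 is a seed.
fn demo() -> Vec<NodeId> {
    let parents: HashMap<NodeId, Vec<NodeId>> =
        HashMap::from([(3, vec![1, 2]), (1, vec![0]), (2, vec![0])]);
    let seeds: HashSet<NodeId> = HashSet::from([1]);
    let mut useful = HashSet::new();
    walk(3, Mode::Explore, &parents, &seeds, &mut useful);
    let mut tagged: Vec<NodeId> = useful.into_iter().collect();
    tagged.sort();
    tagged
}
```

In `demo`, node 0 is reached twice, once in each mode, which is exactly the situation a plain visited set cannot represent without storing a mode per queued node.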

codecov · 2024-04-30T13:42:23Z

Codecov Report

Attention: Patch coverage is 99.23077%, with 1 line in your changes missing coverage. Please review.

Project coverage is 86.44%. Comparing base (5d959e2) to head (fd772ac).

Files	Patch %	Lines
...tes/burn-autodiff/src/runtime/memory_management.rs	98.82%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1710      +/-   ##
==========================================
+ Coverage   86.42%   86.44%   +0.01%     
==========================================
  Files         697      697              
  Lines       82645    82729      +84     
==========================================
+ Hits        71429    71513      +84     
  Misses      11216    11216


nathanielsimard · 2024-05-01T13:06:02Z

crates/burn-autodiff/src/runtime/memory_management.rs

+                if !visited_as_useful.contains(&parent)
+                    && (Some(&NodeMemoryStatus::Useful) == self.statuses.get(&node_id)
+                        || !visited_as_unknown.contains(&parent))


I would refactor that a bit, since it is very hard to read:

    if visited_as_useful.contains(&parent) {
        continue;
    }
    let is_useful = Some(&NodeMemoryStatus::Useful) == self.statuses.get(&node_id);
    if is_useful || !visited_as_unknown.contains(&parent) {
        to_visit.push((parent, Some(node_id)));
    }

And I actually think there is still a performance problem and even a bug. I think it should be:

    if visited_as_useful.contains(&parent) {
        continue;
    }
    let is_useful = Some(&NodeMemoryStatus::Useful) == self.statuses.get(&node_id);
    if is_useful {
        to_visit.push((parent, Some(node_id)));
    }

Since the vector to_visit is already filled with all nodes, you don't need to keep track of nodes that were visited as unknown.
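
A self-contained sketch of this idea, under assumptions: the queue is seeded with every known node, only useful nodes queue their parents, and a single `visited_as_useful` set bounds the work. `Status`, `NodeId` and `propagate_useful` are hypothetical stand-ins rather than the crate's types, and the Unavailable status is left out to keep the example short.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical node identifier.
type NodeId = u64;

// Simplified stand-in for the per-node status.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Unknown,
    Useful,
}

fn propagate_useful(
    parents: &HashMap<NodeId, Vec<NodeId>>,
    statuses: &mut HashMap<NodeId, Status>,
) {
    // Seed the queue with every known node (assumption taken from the comment above).
    let mut to_visit: VecDeque<NodeId> = statuses.keys().copied().collect();
    let mut visited_as_useful: HashSet<NodeId> = statuses
        .iter()
        .filter(|(_, status)| **status == Status::Useful)
        .map(|(id, _)| *id)
        .collect();

    while let Some(node_id) = to_visit.pop_front() {
        let is_useful = statuses.get(&node_id) == Some(&Status::Useful);
        if !is_useful {
            // Unknown nodes never queue their parents, so no second set is needed.
            continue;
        }
        if let Some(node_parents) = parents.get(&node_id) {
            for &parent in node_parents {
                // Skip parents that were already tagged and queued as useful.
                if !visited_as_useful.insert(parent) {
                    continue;
                }
                statuses.insert(parent, Status::Useful);
                to_visit.push_back(parent);
            }
        }
    }
}

/// Chain 2 -> 1 -> 0 with node 2 useful, plus an unrelated pair 5 -> 4.
fn demo() -> Vec<(NodeId, Status)> {
    let parents: HashMap<NodeId, Vec<NodeId>> =
        HashMap::from([(2, vec![1]), (1, vec![0]), (5, vec![4])]);
    let mut statuses: HashMap<NodeId, Status> = HashMap::from([
        (0, Status::Unknown),
        (1, Status::Unknown),
        (2, Status::Useful),
        (4, Status::Unknown),
        (5, Status::Unknown),
    ]);
    propagate_useful(&parents, &mut statuses);
    let mut result: Vec<(NodeId, Status)> = statuses.into_iter().collect();
    result.sort_by_key(|(id, _)| *id);
    result
}
```

Each node enters the queue at most twice, once from the seed and once as a newly tagged parent, so the result does not depend on the iteration order of the seed.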

louisfd · 2024-05-02

I rewrote the algorithm; it is now more efficient and more elegant.

nathanielsimard · 2024-05-01T13:06:56Z

crates/burn-autodiff/src/runtime/memory_management.rs

-                let parents = self.nodes.get(&node_id).cloned().unwrap_or(vec![]);
-                for parent in parents {
-                    self.identify_leaves_and_deletables(parent, new_leaves, to_delete)
+        let mut visited = HashSet::new();


This should be initialized with the right capacity, which is new_leaves.len().

louisfd · 2024-05-02

Hmm, new_leaves is filled dynamically during this algorithm, so its length is 0 at that point.
Also, we can't really know the length of visited in advance, because it depends on whether the nodes the algorithm sees are useful.
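
As a small illustration of this exchange, using only the standard library: `HashSet::with_capacity` is a preallocation hint, a hint of 0 behaves like `HashSet::new()`, and the set grows on demand either way. The helper names `visited_capacity` and `grow_past_hint` are made up for this sketch.

```rust
use std::collections::HashSet;

/// Capacity actually reserved when `visited` is sized from a hint.
fn visited_capacity(hint: usize) -> usize {
    let visited: HashSet<u64> = HashSet::with_capacity(hint);
    visited.capacity()
}

/// The set keeps growing when the hint turns out to be too small.
fn grow_past_hint(hint: usize, inserts: u64) -> usize {
    let mut visited: HashSet<u64> = HashSet::with_capacity(hint);
    for id in 0..inserts {
        visited.insert(id);
    }
    visited.len()
}
```

So sizing the set from an empty new_leaves would reserve nothing, and a wrong guess costs only the usual reallocations.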

louisfd added 2 commits April 29, 2024 19:20

refactor identify leaves and deletables

709522e

bfs strategy

c59ddf2

louisfd mentioned this pull request Apr 29, 2024

loss.backward() hangs after burn update 0.12 -> 0.13 #1702

Closed

nathanielsimard reviewed Apr 30, 2024

remove mode

5748f5f

louisfd added 7 commits April 30, 2024 14:06

fix potential but obscure bug

6f807ed

clippy

6d6cdd4

clippy

ab8cee6

fmt?

715b590

fix repeat in jit

499b569

minor refactor

5e46a42

repeat default implementation refactored and fixed

fbdac3e

nathanielsimard requested changes May 1, 2024

louisfd and others added 5 commits May 2, 2024 14:37

Merge branch 'main' into fix/autodiff_mm/revisit_nodes

308ebbc

better useful propagation algorithm

e941557

important to separate explored and tagged as useful

f8a3b13

Merge branch 'main' into fix/autodiff_mm/revisit_nodes

365bf45

Cleanup

fd772ac

nathanielsimard approved these changes May 3, 2024

nathanielsimard merged commit a8661a2 into main May 3, 2024
15 checks passed

nathanielsimard deleted the fix/autodiff_mm/revisit_nodes branch May 3, 2024 13:45

nathanielsimard pushed a commit that referenced this pull request May 3, 2024

Autodiff Memory Management: BFS (#1710)

3bb0b8f

vampire3232 approved these changes May 11, 2024
