Keep NMS on XPU #1403

Merged
merged 3 commits into from
Mar 5, 2025


frost-intel
Contributor

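Based on the NMS updates in pytorch/vision#8766, this PR moves the gather-keep section of the nms op from CPU to XPU. This causes a very minor slowdown for small inputs (num_boxes < 400) but substantially improves performance for large num_boxes by eliminating the data transfer between XPU and CPU. In the timings below, the median latency at num_boxes = 200000 drops from 457.85 ms to 85.42 ms for _batched_nms_coordinate_trick and from 254.54 ms to 36.35 ms for _batched_nms_vanilla. Since the number of boxes is typically > 1000, this is a reasonable trade-off.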
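For readers unfamiliar with the gather-keep step, here is a minimal pure-Python sketch of the logic. It assumes the common two-phase NMS layout: a pairwise suppression mask is computed first, and a greedy scan then collects the indices to keep. The function names `iou`, `suppression_mask`, and `gather_keep` are hypothetical and chosen for illustration; the actual op works on device tensors and is not reproduced here.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def suppression_mask(boxes, iou_threshold):
    """mask[i][j] is True when box i would suppress the lower-ranked box j.

    Assumes the boxes are already sorted by descending score.
    """
    n = len(boxes)
    return [
        [j > i and iou(boxes[i], boxes[j]) > iou_threshold for j in range(n)]
        for i in range(n)
    ]


def gather_keep(mask):
    """Greedy scan: keep a box unless an earlier kept box has removed it."""
    n = len(mask)
    removed = [False] * n
    keep = []
    for i in range(n):
        if removed[i]:
            continue
        keep.append(i)
        # Only a kept box is allowed to suppress the boxes ranked after it.
        for j in range(i + 1, n):
            removed[j] = removed[j] or mask[i][j]
    return keep


boxes = [
    (0.0, 0.0, 10.0, 10.0),    # highest score
    (0.5, 0.5, 10.5, 10.5),    # overlaps box 0 heavily (IoU ~ 0.82)
    (20.0, 20.0, 30.0, 30.0),  # disjoint from the others
]
keep = gather_keep(suppression_mask(boxes, iou_threshold=0.7))
print(keep)  # prints [0, 2]
```

The scan in `gather_keep` is inherently sequential over kept boxes. If it runs on the host, the whole mask has to be copied off the device first, which is the transfer this PR avoids.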

Details

XPU New Code Timings
num_boxes = 10 med = 0.60ms +- 0.10  # _batched_nms_coordinate_trick
num_boxes = 10 med = 2.73ms +- 0.04  # _batched_nms_vanilla
num_boxes = 100 med = 0.60ms +- 0.03
num_boxes = 100 med = 2.75ms +- 0.07
num_boxes = 200 med = 0.60ms +- 0.03
num_boxes = 200 med = 2.77ms +- 0.05
num_boxes = 400 med = 0.61ms +- 0.03
num_boxes = 400 med = 2.80ms +- 0.03
num_boxes = 800 med = 0.61ms +- 0.02
num_boxes = 800 med = 2.81ms +- 0.03
num_boxes = 1000 med = 0.62ms +- 0.01
num_boxes = 1000 med = 2.15ms +- 0.12
num_boxes = 2000 med = 0.54ms +- 0.01
num_boxes = 2000 med = 2.15ms +- 0.01
num_boxes = 10000 med = 1.76ms +- 0.02
num_boxes = 10000 med = 3.25ms +- 0.02
num_boxes = 20000 med = 2.83ms +- 0.03
num_boxes = 20000 med = 4.74ms +- 0.02
num_boxes = 80000 med = 17.79ms +- 0.05
num_boxes = 80000 med = 12.27ms +- 0.03
num_boxes = 100000 med = 25.76ms +- 0.04
num_boxes = 100000 med = 15.43ms +- 0.04
num_boxes = 200000 med = 85.42ms +- 0.26
num_boxes = 200000 med = 36.35ms +- 0.04

XPU - main
num_boxes = 10 med = 0.47ms +- 0.08  # _batched_nms_coordinate_trick
num_boxes = 10 med = 2.35ms +- 0.07  # _batched_nms_vanilla
num_boxes = 100 med = 0.59ms +- 0.03
num_boxes = 100 med = 2.40ms +- 0.09
num_boxes = 200 med = 0.60ms +- 0.04
num_boxes = 200 med = 2.46ms +- 0.06
num_boxes = 400 med = 0.60ms +- 0.03
num_boxes = 400 med = 2.98ms +- 0.03
num_boxes = 800 med = 0.61ms +- 0.01
num_boxes = 800 med = 2.98ms +- 0.02
num_boxes = 1000 med = 0.62ms +- 0.01
num_boxes = 1000 med = 3.01ms +- 0.02
num_boxes = 2000 med = 0.66ms +- 0.01
num_boxes = 2000 med = 3.34ms +- 0.02
num_boxes = 10000 med = 3.82ms +- 3.67
num_boxes = 10000 med = 5.31ms +- 1.82
num_boxes = 20000 med = 20.92ms +- 1.70
num_boxes = 20000 med = 7.22ms +- 1.43
num_boxes = 80000 med = 119.85ms +- 5.65
num_boxes = 80000 med = 90.21ms +- 3.99
num_boxes = 100000 med = 168.14ms +- 4.02
num_boxes = 100000 med = 123.07ms +- 1.49
num_boxes = 200000 med = 457.85ms +- 70.04
num_boxes = 200000 med = 254.54ms +- 5.27

Benchmark script

import torch
from time import perf_counter_ns
from torchvision.ops.boxes import _batched_nms_coordinate_trick, _batched_nms_vanilla

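def bench(f, *args, num_exp=1000, warmup=0, **kwargs):
    # Untimed warmup runs absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        f(*args, **kwargs)

    times = []
    for _ in range(num_exp):
        # Drain pending device work so it is not charged to this iteration.
        torch.xpu.synchronize()
        start = perf_counter_ns()
        f(*args, **kwargs)
        # XPU kernels launch asynchronously; wait for completion before stopping the clock.
        torch.xpu.synchronize()
        end = perf_counter_ns()
        times.append(end - start)
    # Per-iteration wall-clock times in nanoseconds.
    return torch.tensor(times).float()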

def report_stats(times, unit="ms", prefix=""):
    # Conversion factor from nanoseconds (as returned by bench) to the requested unit.
    mul = {
        "ns": 1,
        "µs": 1e-3,
        "ms": 1e-6,
        "s": 1e-9,
    }[unit]
    times = times * mul
    std = times.std().item()
    med = times.median().item()
    print(f"{prefix}{med = :.2f}{unit} +- {std:.2f}")
    return med


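def make_boxes(num_boxes, num_classes=4, device="xpu"):
    # Top-left corners fall in [0, 1) and bottom-right corners in [10, 11), so every box is valid.
    boxes = torch.cat((torch.rand(num_boxes, 2), torch.rand(num_boxes, 2) + 10), dim=1).to(device)
    # Use tensor reductions; the builtin max()/min() would iterate the device tensor element by element.
    assert boxes[:, 0].max() < boxes[:, 2].min()  # x1 < x2
    assert boxes[:, 1].max() < boxes[:, 3].min()  # y1 < y2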

    scores = torch.rand(num_boxes).to(device)
    idxs = torch.randint(0, num_classes, size=(num_boxes,)).to(device)
    return boxes, scores, idxs

NUM_EXP = 30
for num_boxes in (10, 100, 200, 400, 600, 800, 1000, 1400, 2000, 10000, 20_000, 80_000, 100000, 200_000):
    for f in (_batched_nms_coordinate_trick, _batched_nms_vanilla):
        boxes, scores, idxs = make_boxes(num_boxes)
        times = bench(f, boxes, scores, idxs, iou_threshold=.7, warmup=1, num_exp=NUM_EXP)
        report_stats(times, prefix=f"{num_boxes = } ")
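
To compare the two timing listings directly, a small sketch can compute main / new-code ratios for a few of the reported medians. The numbers are copied from the listings above. The dictionary layout and the `speedups` helper are illustrative names, not part of the PR.

```python
# Median latencies in ms copied from the two timing listings:
# num_boxes -> (new code, main)
coordinate_trick = {
    10: (0.60, 0.47),
    1000: (0.62, 0.62),
    10000: (1.76, 3.82),
    200000: (85.42, 457.85),
}
vanilla = {
    10: (2.73, 2.35),
    1000: (2.15, 3.01),
    10000: (3.25, 5.31),
    200000: (36.35, 254.54),
}


def speedups(table):
    """Return {num_boxes: main / new}; values above 1.0 mean the new code is faster."""
    return {n: main / new for n, (new, main) in table.items()}


for name, table in (("coordinate_trick", coordinate_trick), ("vanilla", vanilla)):
    for n, s in speedups(table).items():
        print(f"{name:>16} num_boxes = {n:<6} speedup = {s:.2f}x")
```

Ratios below 1.0 at num_boxes = 10 correspond to the small-input slowdown mentioned in the description, while the ratios at 200000 boxes show where removing the XPU-to-CPU transfer pays off.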

@frost-intel frost-intel requested a review from Stonepia March 4, 2025 12:36
@frost-intel frost-intel added this pull request to the merge queue Mar 5, 2025
Merged via the queue into main with commit b8c05de Mar 5, 2025
8 of 9 checks passed
@frost-intel frost-intel deleted the frost/nms_xpu_only branch March 5, 2025 15:24
2 participants