Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
Summary:
As `cuBLAS` workspaces are already allocated per-stream, kernels sharing a workspace cannot execute concurrently, so there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurt performance, and we encountered difficulty increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker: pytorch/pytorch#139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace appears to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt`, without a separate `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
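For reference, a minimal sketch of how a `CUBLAS_WORKSPACE_CONFIG` string such as `:4096:8` can be interpreted as (size-in-KiB, count) pairs. The helper name is hypothetical and this is not PyTorch's actual parser (which lives in ATen's cuBLAS bindings); it only illustrates the `:SIZE:COUNT[:SIZE:COUNT...]` format that the environment variable uses:

```python
import os


def parse_cublas_workspace_config(cfg: str) -> list[tuple[int, int]]:
    """Interpret a CUBLAS_WORKSPACE_CONFIG string such as ":4096:8" as
    (size_in_kib, count) pairs.

    Hypothetical illustration only; PyTorch has its own parser.
    """
    # Drop empty segments from leading/trailing colons, keep the numbers.
    nums = [int(part) for part in cfg.split(":") if part]
    # Numbers come in pairs: :SIZE:COUNT[:SIZE:COUNT...]
    return list(zip(nums[0::2], nums[1::2]))


# Example using the default from the diff in this commit; one common
# reading (hedged; see the cuBLAS docs) is COUNT workspaces of SIZE KiB.
cfg = os.environ.get("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
pairs = parse_cublas_workspace_config(cfg)
```

With the default `:4096:8`, this yields `[(4096, 8)]`, i.e. a single size/count pair.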

X-link: pytorch/pytorch#145130
Approved by: https://github.com/ngimel

Reviewed By: jeanschmidt

Differential Revision: D70075331

fbshipit-source-id: cf4d0d687b299c942793a758c6fec4b064c44227
generatedunixname499836121 authored and facebook-github-bot committed Feb 24, 2025
1 parent f6726de commit 818315c
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions userbenchmark/dynamo/dynamobench/common.py
@@ -3576,6 +3576,15 @@ def run(runner, args, original_dir=None):
         # some of the models do not support use_deterministic_algorithms
         torch.use_deterministic_algorithms(True)
         os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+        if args.only is not None and args.only in {
+            "DebertaForQuestionAnswering",
+            "RobertaForQuestionAnswering",
+            "nvidia_deeprecommender",
+            "volo_d1_224",
+        }:
+            # These seem unhappy with numerics of larger cuBLASLt workspace
+            # sizes following #145130 (due to enabling split-k?)
+            torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
         torch.backends.cudnn.deterministic = True
         torch.backends.cudnn.allow_tf32 = False
         torch.backends.cudnn.benchmark = False
