
Skip certain unit tests on NAVI #1950

Merged

Skipped unit tests in test_fsdp_sharded_grad_scaler.py

eb38990
ROCm Repo Management API / Tests / Tests / Test Inductor / Run pytorch_inductor failed Mar 10, 2025 in 0s

failed: 12, skipped: 23198, passed: 25882


Details

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:940: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:328.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:195: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
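The tolerance failure above can be reproduced in miniature with `torch.testing.assert_close`. This is an illustrative sketch, not the actual test: the tensor values are invented, chosen so the greatest absolute difference is 0.0078125 (2**-7, one half-precision ULP near 1.0), which exceeds the default float32 atol of 1e-05 but passes with relaxed tolerances.

```python
import torch

# Hypothetical tensors mimicking the failure mode above: a max absolute
# difference of 0.0078125 exceeds assert_close's default float32 tolerances
# (rtol=1.3e-6, atol=1e-5).
expected = torch.ones(4)
actual = expected.clone()
actual[0] += 0.0078125  # inject a one-half-precision-ULP-scale discrepancy

try:
    torch.testing.assert_close(actual, expected)  # default tolerances
    print("passed")
except AssertionError:
    print("mismatch detected")  # this branch is taken

# Loosening tolerances to half-precision scale makes the check pass,
# which is one common remedy when a backend computes SDPA in reduced precision:
torch.testing.assert_close(actual, expected, atol=1e-2, rtol=1e-2)
```

Whether relaxing atol/rtol or skipping the test is appropriate depends on whether the NAVI result is numerically acceptable, which the log alone does not establish.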

AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from check_gpu_arch at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:76 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 pytorch_flash::mha_fwd(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor>&, std::optional<at::Tensor>&, float, float, bool, int, int, bool, std::optional<at::Generator>) from ??:0
#8 at::native::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#9 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#10 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&> >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#11 std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> c10::callUnboxedKernelFunction<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt&&, c10::SymInt&&, double&&, bool&&, bool&&, std::optional<double>&&, std::optional<c10::SymInt>&&, std::optional<c10::SymInt>&&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) [clone .isra.0] from Operators_0.cpp:0
#12 at::_ops::_flash_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#13 at::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#14 at::native::_scaled_dot_product_flash_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#15 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#17 at::_ops::_scaled_dot_product_flash_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#18 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from VariableType_1.cpp:0
#19 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_1.cpp:0
#20 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#21 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#22 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#23 torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}::operator()(pybind11::args const&, pybind11::kwargs const&) const from init.cpp:0
#24 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&))::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#25 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#26 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#27 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#30 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#31 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#32 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#35 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#40 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#41 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#42 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#44 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#48 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#61 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#62 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#63 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#64 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#69 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#70 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#71 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#72 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#81 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#83 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#85 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#87 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#89 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#90 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#91 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#92 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#93 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#94 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#95 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#96 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#97 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#98 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#99 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#100 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#101 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#103 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#109 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#110 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#111 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#112 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#117 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#118 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#119 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#120 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#121 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#122 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#123 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#124 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#125 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#126 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#127 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#128 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#129 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#130 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#131 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#132 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#133 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#134 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#135 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#136 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#137 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#138 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#139 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#140 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#141 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#142 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#143 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#144 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#145 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#146 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#147 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#148 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#149 _start from ??:0


To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1089, in test_fallback_kernel_with_symexpr_output
    torch.testing.assert_close(m(*inputs), aot_model(*inputs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1066, in forward
    res = torch.ops.aten._scaled_dot_product_flash_attention.default(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 722, in __call__
    return self._op(*args, **kwargs)
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from check_gpu_arch at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:76 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 pytorch_flash::mha_fwd(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor>&, std::optional<at::Tensor>&, float, float, bool, int, int, bool, std::optional<at::Generator>) from ??:0
#8 at::native::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#9 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#10 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&> >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#11 std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> c10::callUnboxedKernelFunction<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt&&, c10::SymInt&&, double&&, bool&&, bool&&, std::optional<double>&&, std::optional<c10::SymInt>&&, std::optional<c10::SymInt>&&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) [clone .isra.0] from Operators_0.cpp:0
#12 at::_ops::_flash_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#13 at::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#14 at::native::_scaled_dot_product_flash_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#15 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#17 at::_ops::_scaled_dot_product_flash_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#18 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from VariableType_1.cpp:0
#19 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_1.cpp:0
#20 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#21 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#22 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#23 torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}::operator()(pybind11::args const&, pybind11::kwargs const&) const from init.cpp:0
#24 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&))::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#25 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#26 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#27 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#30 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#31 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#32 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#35 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#40 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#41 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#42 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#44 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#48 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#61 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#62 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#63 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#64 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#69 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#70 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#71 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#72 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#81 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#83 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#85 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#87 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#89 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#90 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#91 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#92 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#93 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#94 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#95 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#96 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#97 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#98 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#99 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#100 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#101 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#103 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#109 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#110 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#111 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#112 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#117 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#118 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#119 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#120 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#121 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#122 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#123 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#124 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#125 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#126 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#127 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#128 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#129 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#130 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#131 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#132 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#133 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#134 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#135 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#136 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#137 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#138 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#139 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#140 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#141 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#142 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#143 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#144 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#145 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#146 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#147 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#148 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#149 _start from ??:0


To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
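The `Tensor-likes are not close!` failures above come from an elementwise tolerance check of the form `|actual - expected| <= atol + rtol * |expected|` (the rule `torch.testing.assert_close` applies, with `atol=1e-05` and `rtol=0.016` as reported in the message). A minimal plain-Python sketch with hypothetical helper names, which also shows why the greatest *relative* difference can be `inf`: the reference value at that index is exactly zero while the computed value is not.

```python
def allclose_report(actual, expected, atol=1e-5, rtol=0.016):
    """Hypothetical sketch of an assert_close-style tolerance check.

    Mirrors the rule |actual - expected| <= atol + rtol * |expected|
    and reports the greatest absolute and relative differences,
    matching the shape of the failure messages in this log.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # The relative difference is infinite when the reference is 0
        # but the computed value is not (cf. "Greatest relative
        # difference: inf" in the failures above).
        if e != 0:
            rel_diff = abs_diff / abs(e)
        else:
            rel_diff = 0.0 if abs_diff == 0 else float("inf")
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, rel_diff)
    return mismatched, max_abs, max_rel

# A zero reference paired with a nonzero result yields an infinite
# relative difference, as in the test_sdpa_2_cuda failure.
print(allclose_report([0.0078125, 1.0], [0.0, 1.0]))  # → (1, 0.0078125, inf)
```

The first element mismatches because its absolute error (0.0078125) exceeds `atol + rtol * 0 = 1e-5`; this is the same magnitude as the greatest absolute difference reported above, consistent with low-precision (half/bfloat16-scale) rounding in the SDPA kernel rather than a tolerance misconfiguration.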

AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 12048, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 3048, in test_scaled_dot_product_efficient_attention
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 182, in check_model
    expected = ref_model(*ref_inputs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 3038, in forward
    return torch.ops.aten._scaled_dot_product_efficient_attention(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1122, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
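The `[AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs` errors above indicate the kernel is gated on the GPU's gfx architecture string. A minimal, hypothetical sketch of such a gate — the architecture names are taken from the error message itself (MI200 = gfx90a, MI300X = gfx942, Navi31 = gfx1100); the real check lives inside AOTriton/PyTorch:

```python
# gfx families accepted per the AOTriton error message above.
SUPPORTED_SDPA_ARCHS = ("gfx90a", "gfx942", "gfx1100")

def sdpa_supported(gcn_arch_name: str) -> bool:
    """Hypothetical guard: does this GPU's gfx arch string
    (on ROCm, e.g. torch.cuda.get_device_properties(0).gcnArchName,
    which looks like 'gfx90a:sramecc+:xnack-') name a supported family?
    Feature suffixes after ':' are ignored for the family check."""
    base = gcn_arch_name.split(":")[0]
    return base in SUPPORTED_SDPA_ARCHS

print(sdpa_supported("gfx90a:sramecc+:xnack-"))  # → True
print(sdpa_supported("gfx1101"))                 # → False (other NAVI parts)
```

On ROCm CI, a test that reaches this path on an unsupported part would typically be wrapped in a guard such as `@unittest.skipIf(not sdpa_supported(arch), ...)` rather than left to fail at runtime — the pattern this PR's NAVI skip changes follow (inferred from the PR title, not from the log itself).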

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:940: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:328.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:195: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]

AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1089, in test_fallback_kernel_with_symexpr_output
    torch.testing.assert_close(m(*inputs), aot_model(*inputs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1066, in forward
    res = torch.ops.aten._scaled_dot_product_flash_attention.default(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 722, in __call__
    return self._op(*args, **kwargs)
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]

more test results are not shown here, view them on Jenkins