
Skip certain unit tests on NAVI #1950

Merged

Skipped unit tests in test_fsdp_sharded_grad_scaler.py

eb38990
ROCm Repo Management API / Tests / Tests / Test Inductor / Run pytorch_inductor failed Mar 10, 2025 in 0s

failed: 12, skipped: 23198, passed: 25882


Details

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:940: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:328.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:195: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
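The tolerance failure above can be reproduced in miniature with `torch.testing.assert_close`. This is an illustrative sketch, not the actual test: the tensor values are invented, chosen so the greatest absolute difference is 0.0078125 (2**-7, one half-precision ULP near 1.0), which exceeds the default float32 atol of 1e-05 but passes with relaxed tolerances.

```python
import torch

# Hypothetical tensors mimicking the failure mode above: a max absolute
# difference of 0.0078125 exceeds assert_close's default float32 tolerances
# (rtol=1.3e-6, atol=1e-5).
expected = torch.ones(4)
actual = expected.clone()
actual[0] += 0.0078125  # inject a one-half-precision-ULP-scale discrepancy

try:
    torch.testing.assert_close(actual, expected)  # default tolerances
    print("passed")
except AssertionError:
    print("mismatch detected")  # this branch is taken

# Loosening tolerances to half-precision scale makes the check pass,
# which is one common remedy when a backend computes SDPA in reduced precision:
torch.testing.assert_close(actual, expected, atol=1e-2, rtol=1e-2)
```

Whether relaxing atol/rtol or skipping the test is appropriate depends on whether the NAVI result is numerically acceptable, which the log alone does not establish.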

AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from check_gpu_arch at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:76 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 pytorch_flash::mha_fwd(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor>&, std::optional<at::Tensor>&, float, float, bool, int, int, bool, std::optional<at::Generator>) from ??:0
#8 at::native::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#9 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#10 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&> >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#11 std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> c10::callUnboxedKernelFunction<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt&&, c10::SymInt&&, double&&, bool&&, bool&&, std::optional<double>&&, std::optional<c10::SymInt>&&, std::optional<c10::SymInt>&&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) [clone .isra.0] from Operators_0.cpp:0
#12 at::_ops::_flash_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#13 at::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#14 at::native::_scaled_dot_product_flash_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#15 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#17 at::_ops::_scaled_dot_product_flash_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#18 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from VariableType_1.cpp:0
#19 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_1.cpp:0
#20 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#21 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#22 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#23 torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}::operator()(pybind11::args const&, pybind11::kwargs const&) const from init.cpp:0
#24 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&))::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#25 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#26 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#27 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#30 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#31 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#32 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#35 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#40 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#41 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#42 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#44 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#48 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#61 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#62 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#63 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#64 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#69 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#70 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#71 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#72 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#81 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#83 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#85 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#87 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#89 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#90 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#91 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#92 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#93 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#94 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#95 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#96 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#97 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#98 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#99 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#100 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#101 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#103 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#109 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#110 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#111 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#112 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#117 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#118 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#119 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#120 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#121 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#122 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#123 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#124 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#125 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#126 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#127 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#128 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#129 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#130 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#131 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#132 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#133 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#134 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#135 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#136 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#137 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#138 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#139 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#140 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#141 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#142 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#143 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#144 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#145 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#146 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#147 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#148 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#149 _start from ??:0


To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1089, in test_fallback_kernel_with_symexpr_output
    torch.testing.assert_close(m(*inputs), aot_model(*inputs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1066, in forward
    res = torch.ops.aten._scaled_dot_product_flash_attention.default(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 722, in __call__
    return self._op(*args, **kwargs)
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)
Exception raised from check_gpu_arch at /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:76 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 pytorch_flash::mha_fwd(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor>&, std::optional<at::Tensor>&, float, float, bool, int, int, bool, std::optional<at::Generator>) from ??:0
#8 at::native::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#9 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#10 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___flash_attention_forward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&> >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from RegisterCUDA.cpp:0
#11 std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> c10::callUnboxedKernelFunction<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor>, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt&&, c10::SymInt&&, double&&, bool&&, bool&&, std::optional<double>&&, std::optional<c10::SymInt>&&, std::optional<c10::SymInt>&&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) [clone .isra.0] from Operators_0.cpp:0
#12 at::_ops::_flash_attention_forward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, double, bool, bool, std::optional<double>, std::optional<c10::SymInt>, std::optional<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#13 at::_flash_attention_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, double, bool, bool, std::optional<double>, std::optional<long>, std::optional<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&) from ??:0
#14 at::native::_scaled_dot_product_flash_attention_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#15 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention(at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA___scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from RegisterCUDA.cpp:0
#17 at::_ops::_scaled_dot_product_flash_attention::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from ??:0
#18 torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>) from VariableType_1.cpp:0
#19 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double>), &torch::autograd::VariableType::(anonymous namespace)::_scaled_dot_product_flash_attention>, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, c10::SymInt, c10::SymInt, at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, bool, bool, std::optional<double> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableType_1.cpp:0
#20 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] from register_c10_ops.cpp:0
#21 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args const&, pybind11::kwargs const&, std::optional<c10::DispatchKey>) from ??:0
#22 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args const&, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) from ??:0
#23 torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}::operator()(pybind11::args const&, pybind11::kwargs const&) const from init.cpp:0
#24 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#215}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&))::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
#25 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#26 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
#27 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5917
#29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#30 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#31 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#32 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#35 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#39 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#40 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#41 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#42 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#44 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#48 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#58 PyVectorcall_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:267
#59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#61 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#62 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#63 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#64 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#69 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#70 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#71 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#72 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
#79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#81 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#83 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#85 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#87 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#89 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#90 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#91 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#92 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#93 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#94 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#95 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#96 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#97 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#98 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#99 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#100 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#101 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#103 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#109 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#110 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#111 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#112 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#117 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#118 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#119 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
#120 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#121 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#122 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#123 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#124 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#125 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.16/Objects/call.c:153
#126 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
#127 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
#128 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
#129 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
#130 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#131 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#132 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#133 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#134 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#135 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#136 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#137 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
#138 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
#139 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
#140 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
#141 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
#142 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
#143 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
#144 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
#145 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
#146 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
#147 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
#148 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
#149 _start from ??:0


To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard out
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
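The `Tensor-likes are not close!` failures above come from an elementwise tolerance check of the form `|actual - expected| <= atol + rtol * |expected|` (the rule `torch.testing.assert_close` applies, with `atol=1e-05` and `rtol=0.016` as reported in the message). A minimal plain-Python sketch with hypothetical helper names, which also shows why the greatest *relative* difference can be `inf`: the reference value at that index is exactly zero while the computed value is not.

```python
def allclose_report(actual, expected, atol=1e-5, rtol=0.016):
    """Hypothetical sketch of an assert_close-style tolerance check.

    Mirrors the rule |actual - expected| <= atol + rtol * |expected|
    and reports the greatest absolute and relative differences,
    matching the shape of the failure messages in this log.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # The relative difference is infinite when the reference is 0
        # but the computed value is not (cf. "Greatest relative
        # difference: inf" in the failures above).
        if e != 0:
            rel_diff = abs_diff / abs(e)
        else:
            rel_diff = 0.0 if abs_diff == 0 else float("inf")
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, rel_diff)
    return mismatched, max_abs, max_rel

# A zero reference paired with a nonzero result yields an infinite
# relative difference, as in the test_sdpa_2_cuda failure.
print(allclose_report([0.0078125, 1.0], [0.0, 1.0]))  # → (1, 0.0078125, inf)
```

The first element mismatches because its absolute error (0.0078125) exceeds `atol + rtol * 0 = 1e-5`; this is the same magnitude as the greatest absolute difference reported above, consistent with low-precision (half/bfloat16-scale) rounding in the SDPA kernel rather than a tolerance misconfiguration.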

AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 12048, in new_test
    return value(self)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 3048, in test_scaled_dot_product_efficient_attention
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 182, in check_model
    expected = ref_model(*ref_inputs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 3038, in forward
    return torch.ops.aten._scaled_dot_product_efficient_attention(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 1122, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_scaled_dot_product_efficient_attention_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
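The `[AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs` errors above indicate the kernel is gated on the GPU's gfx architecture string. A minimal, hypothetical sketch of such a gate — the architecture names are taken from the error message itself (MI200 = gfx90a, MI300X = gfx942, Navi31 = gfx1100); the real check lives inside AOTriton/PyTorch:

```python
# gfx families accepted per the AOTriton error message above.
SUPPORTED_SDPA_ARCHS = ("gfx90a", "gfx942", "gfx1100")

def sdpa_supported(gcn_arch_name: str) -> bool:
    """Hypothetical guard: does this GPU's gfx arch string
    (on ROCm, e.g. torch.cuda.get_device_properties(0).gcnArchName,
    which looks like 'gfx90a:sramecc+:xnack-') name a supported family?
    Feature suffixes after ':' are ignored for the family check."""
    base = gcn_arch_name.split(":")[0]
    return base in SUPPORTED_SDPA_ARCHS

print(sdpa_supported("gfx90a:sramecc+:xnack-"))  # → True
print(sdpa_supported("gfx1101"))                 # → False (other NAVI parts)
```

On ROCm CI, a test that reaches this path on an unsupported part would typically be wrapped in a guard such as `@unittest.skipIf(not sdpa_supported(arch), ...)` rather than left to fail at runtime — the pattern this PR's NAVI skip changes follow (inferred from the PR title, not from the log itself).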

AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 951, in test_sdpa_2
    self.check_model(Model(), example_inputs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor_utils.py", line 194, in check_model
    self.assertEqual(actual, expected, atol=atol, rtol=rtol)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4040, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2851 / 196608 (1.5%)
Greatest absolute difference: 0.0078125 at index (0, 9, 2, 37) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 3, 46) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_sdpa_2_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py:940: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:328.)
  t = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:195: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2), ('benchmarking.TritonBenchmarker.triton_do_bench', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 9), ('pattern_matcher_count', 8), ('pattern_matcher_nodes', 8), ('extern_calls', 2)]

AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Stack trace
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1089, in test_fallback_kernel_with_symexpr_output
    torch.testing.assert_close(m(*inputs), aot_model(*inputs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/var/lib/jenkins/pytorch/test/inductor/test_aot_inductor.py", line 1066, in forward
    res = torch.ops.aten._scaled_dot_product_flash_attention.default(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 722, in __call__
    return self._op(*args, **kwargs)
RuntimeError: [AOTriton] Accelerated SDPA only supports MI200/MI300X/Navi31 GPUs (gfx90a:sramecc+:xnack-/gfx942:sramecc+:xnack-/gfx1100)

To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_aot_inductor.py AOTInductorTestABICompatibleGpu.test_fallback_kernel_with_symexpr_output_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Standard error
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_config_module.py:440: UserWarning: Skipping serialization of skipfiles_inline_module_allowlist value {}
  warnings.warn(
Standard out
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]
inductor [('pattern_matcher_count', 3), ('pattern_matcher_nodes', 3), ('extern_calls', 2)]

more test results are not shown here, view them on Jenkins