Desktop (please complete the following information):
OS: WSL Ubuntu 22.04.5
Python version: 3.10.12
NVFlare version: 2.5.2
TensorFlow version: 2.18.0
CUDA version: V11.5.119
Additional context
I was able to work around the issue by removing the tf.debugging.enable_check_numerics() line from the Scaffold file; however, I am not sure whether this is the ideal solution.
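A less drastic alternative to deleting the line might be to make the numerics check opt-in. The sketch below is only an illustration under stated assumptions: the `DEBUG_NUMERICS` environment variable and both helper names are hypothetical and are not part of NVFlare or the Scaffold example. Only `tf.debugging.enable_check_numerics()` is the real TensorFlow call from the script.

```python
import os


def should_check_numerics(env=None):
    """Return True only when the hypothetical DEBUG_NUMERICS variable is truthy."""
    env = os.environ if env is None else env
    return env.get("DEBUG_NUMERICS", "0").lower() in {"1", "true", "yes"}


def configure_numerics_debugging(env=None):
    """Enable check_numerics only on request; return whether it was enabled."""
    if not should_check_numerics(env):
        return False
    # Imported lazily so the default path never touches the debugging ops.
    import tensorflow as tf

    tf.debugging.enable_check_numerics()
    return True


# Default behaviour: the check stays off, so no DebugNumericSummaryV2 op is added.
numerics_debug_enabled = configure_numerics_debugging({})
```

When the check is switched on, the model would presumably also need to avoid XLA, for example by passing `jit_compile=False` to `model.compile(...)`, since the trace shows that DebugNumericSummaryV2 has no XLA_GPU_JIT kernel. I have not verified that this combination works in the Scaffold example.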
Stack Trace:
2025-02-14 21:16:45.748055: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at xla_ops.cc:577 : INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at:
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
2025-02-14 21:16:45.748117: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at:
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[[StatefulPartitionedCall]]
2025-02-14 21:16:45.749185: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at xla_ops.cc:577 : INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at:
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
2025-02-14 21:16:45.749242: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at:
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[[StatefulPartitionedCall]]
2025-02-14 21:16:45,778 - TaskScriptRunner - ERROR - Traceback (most recent call last):
File "/mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
runpy.run_path(self.script_full_path, run_name="__main__")
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
main()
File "/tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
_, test_global_acc = model.evaluate(x=test_ds, verbose=2)
File "/mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/mnt/c/.../python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node DebugNumericSummaryV2 defined at (most recent call last):
<stack traces unavailable>
Detected at node DebugNumericSummaryV2 defined at (most recent call last):
<stack traces unavailable>
Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at:
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[[StatefulPartitionedCall]] [Op:__inference_multi_step_on_iterator_2141]
.........
Describe the bug
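Running Scaffold fails with a graph execution error caused by the tf.debugging.enable_check_numerics() call on line 25. The DebugNumericSummaryV2 op it adds has no registered kernel for XLA_GPU_JIT devices, so XLA compilation fails when model.evaluate() is called.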