[BUG] Graph Execution Error with Scaffold #3232

Open
RaminKahidi opened this issue Feb 15, 2025 · 1 comment
Labels: bug (Something isn't working)

@RaminKahidi

Describe the bug
Running the Scaffold example fails with a graph execution error caused by the tf.debugging.enable_check_numerics() call on line 25 of the Scaffold script (cifar10_tf_fl_alpha_split_scaffold.py).

To Reproduce

  1. Go to 'NVFlare/examples/advanced/job_api/tf/run_jobs.sh'
  2. Run the scaffold example:
     python ./tf_fl_script_runner_cifar10.py --algo scaffold --n_clients 2 --num_rounds 2 --batch_size 64 --epochs 1 --alpha 0.1
  3. See the error:
     tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error: ...

Desktop

  • OS: WSL Ubuntu 22.04.5
  • Python version: 3.10.12
  • NVFlare version: 2.5.2
  • TensorFlow version: 2.18.0
  • CUDA version: V11.5.119

Additional context
I was able to work around the issue by removing the tf.debugging.enable_check_numerics() line from the Scaffold file. However, I am not sure whether this is the ideal solution.
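
One possible alternative to deleting the line outright, sketched here under assumptions: keep the numerics check opt-in and skip it whenever XLA JIT compilation is active, since the trace reports that DebugNumericSummaryV2 has no kernel for XLA_GPU_JIT. The environment variable SCAFFOLD_CHECK_NUMERICS and both helper names are hypothetical and are not part of NVFlare or the example script; only tf.debugging.enable_check_numerics() is an existing TensorFlow call.

```python
import os

# Hypothetical opt-in switch; not defined anywhere in NVFlare or the example.
CHECK_NUMERICS_ENV = "SCAFFOLD_CHECK_NUMERICS"


def numerics_check_requested(env=None):
    """Return True only if the opt-in variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get(CHECK_NUMERICS_ENV, "0").lower() in {"1", "true", "yes"}


def configure_debugging(jit_compile_enabled, env=None):
    """Enable numerics checking only when requested and XLA JIT is off.

    Returns True if tf.debugging.enable_check_numerics() was called.
    """
    if not numerics_check_requested(env):
        return False
    if jit_compile_enabled:
        # Per the stack trace, the DebugNumericSummaryV2 op inserted by the
        # check cannot be compiled for XLA_GPU_JIT, so skip it here.
        print("Skipping enable_check_numerics(): XLA JIT is enabled.")
        return False
    # Imported lazily so the helper can be exercised without TensorFlow.
    import tensorflow as tf

    tf.debugging.enable_check_numerics()
    return True
```

In the Scaffold script this could stand in for the unconditional call, for example `configure_debugging(jit_compile_enabled=True)` for a GPU run like the one in the trace, where XLA JIT is in use. Keras also accepts `jit_compile=False` in `model.compile()`, which might allow the check to run without XLA, but that has not been verified against this example.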

Stack Trace:

2025-02-14 21:16:45.748055: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at xla_ops.cc:577 : INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at: 
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
        tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
2025-02-14 21:16:45.748117: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at: 
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
        tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
         [[StatefulPartitionedCall]]
2025-02-14 21:16:45.749185: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at xla_ops.cc:577 : INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at: 
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
        tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
2025-02-14 21:16:45.749242: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at: 
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-2/simulate_job/app_site-2/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
        tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
         [[StatefulPartitionedCall]]
2025-02-14 21:16:45,778 - TaskScriptRunner - ERROR - Traceback (most recent call last):
  File "/mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
    runpy.run_path(self.script_full_path, run_name="__main__")
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
    main()
  File "/tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
    _, test_global_acc = model.evaluate(x=test_ds, verbose=2)
  File "/mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/c/.../python3.10/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node DebugNumericSummaryV2 defined at (most recent call last):
<stack traces unavailable>
Detected at node DebugNumericSummaryV2 defined at (most recent call last):
<stack traces unavailable>
Detected unsupported operations when trying to compile graph __inference_one_step_on_data_2092[] on XLA_GPU_JIT: DebugNumericSummaryV2 (No registered 'DebugNumericSummaryV2' OpKernel for XLA_GPU_JIT devices compatible with node {{node DebugNumericSummaryV2}}){{node DebugNumericSummaryV2}}
The op is created at: 
File "usr/lib/python3.10/threading.py", line 973, in _bootstrap
File "usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
File "usr/lib/python3.10/threading.py", line 953, in run
File "mnt/c/.../python3.10/site-packages/nvflare/app_common/executors/task_script_runner.py", line 55, in run
File "usr/lib/python3.10/runpy.py", line 289, in run_path
File "usr/lib/python3.10/runpy.py", line 96, in _run_module_code
File "usr/lib/python3.10/runpy.py", line 86, in _run_code
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 294, in <module>
File "tmp/nvflare/jobs/cifar10_tf_scaffold_alpha0.1/site-1/simulate_job/app_site-1/custom/src/cifar10_tf_fl_alpha_split_scaffold.py", line 221, in main
File "mnt/c/.../python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 484, in evaluate
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function
File "mnt/c/.../python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/polymorphism/function_type.py", line 356, in placeholder_arguments
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 250, in placeholder_value
File "mnt/c/.../python3.10/site-packages/tensorflow/core/function/trace_type/default_types.py", line 251, in <listcomp>
        tf2xla conversion failed while converting __inference_one_step_on_data_2092[]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
         [[StatefulPartitionedCall]] [Op:__inference_multi_step_on_iterator_2141]
         .........
RaminKahidi added the bug label Feb 15, 2025
@chesterxgchen
Collaborator

@holgerroth, can you take a look?
