From a4dbbb8475e53860c71fcdc7fc2be5a2ad96d98f Mon Sep 17 00:00:00 2001
From: Michael Yuan
Date: Sun, 19 May 2024 14:56:53 -0500
Subject: [PATCH 1/2] Update cuda.md

---
 docs/node-guide/tasks/cuda.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/node-guide/tasks/cuda.md b/docs/node-guide/tasks/cuda.md
index f145efe..74a8b61 100644
--- a/docs/node-guide/tasks/cuda.md
+++ b/docs/node-guide/tasks/cuda.md
@@ -76,13 +76,12 @@ Cuda compilation tools, release 12.2, V12.2.140
 Build cuda_12.2.r12.2/compiler.33191640_0
 ```
 
-After that, use the following command line to set up the environment path.
+After that, use the following command to set up the environment path. You should also add this line to your `~/.bashrc` or `~/.zshrc` file so that new terminals and future logins can still find the CUDA library files.
 
 ```
 export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
 ```
 
-
 ## More resources
 
 Here are more scripts that could help you in case you are stuck.

From 509adbad3031c6d1cd5624882c5b3a0c4c6d51c1 Mon Sep 17 00:00:00 2001
From: Michael Yuan
Date: Sun, 19 May 2024 15:30:12 -0500
Subject: [PATCH 2/2] Update troubleshooting.md

---
 docs/node-guide/troubleshooting.md | 53 ++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/docs/node-guide/troubleshooting.md b/docs/node-guide/troubleshooting.md
index 068fe3c..c1d6e61 100644
--- a/docs/node-guide/troubleshooting.md
+++ b/docs/node-guide/troubleshooting.md
@@ -4,8 +4,24 @@ sidebar_position: 8
 
 # Troubleshooting
 
+## The system cannot find CUDA libraries
 
-## Resolving "Too many open files" Error on macOS when Initializing Gaianet Node
+Sometimes, the CUDA toolkit is installed in a non-standard location. In that case, the error message often says that the system cannot find `libcu*12`. For example, you might have CUDA installed with your Python setup. The following commands would install CUDA into Python's environment.
+
+```
+sudo apt install python3-pip -y
+pip3 install --upgrade fschat accelerate autoawq vllm
+```
+
+The easiest fix is to link those non-standard CUDA libraries into the standard location, like this.
+
+```
+ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublas.so.12 /usr/lib/libcublas.so.12
+ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12 /usr/lib/libcudart.so.12
+ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 /usr/lib/libcublasLt.so.12
+```
+
+## The "Too many open files" Error on macOS
 
 When running `gaianet init` to initialize a new node on macOS, you may encounter an error related to snapshot recovery if your snapshot contains a large amount of text. The error message may be the following:
 
@@ -22,6 +38,39 @@ To resolve this issue, you can increase the default FD limit on your system. To
 ulimit -n 10000
 ```
 
-This will temporarily set the FD limit to 10,000. Next, use `gaianet init` to init your Gaianet node.
+This temporarily sets the FD limit to 10,000 for the current shell session. Next, run the `gaianet init` and `gaianet start` commands in the SAME terminal.
+
+## File I/O error
+
+```
+ * Import the Qdrant collection snapshot ...
+ The process may take a few minutes. Please wait ...
+ * [Error] Failed to recover from the collection snapshot. An error occurred processing field `snapshot`: File I/O error: Operation not permitted (os error 1)
+```
+
+This error typically indicates that the Qdrant instance was not shut down properly before you tried to initialize it again with a new snapshot. The solution is to stop the GaiaNet node first.
+
+```
+gaianet stop
+```
+
+Alternatively, you could manually kill the processes from the terminal or in the OS's Activity Monitor.
+
+```
+sudo pkill -9 qdrant
+sudo pkill -9 wasmedge
+sudo pkill -9 frpc
+```
+
+Then you can run `gaianet init` and then `gaianet start` again.
+## The "Failed to open the file" Error
+
+```
+Warning: Failed to open the file
+Warning: https://huggingface.co/datasets/max-id/gaianet-qdrant-snapshot/resolve
+Warning: /main/consensus/consensus.snapshot: No such file or directory
+curl: (23) Failure writing output to destination
+```
+This type of error occurs because the comments in `config.json` are run when `gaianet init` executes. The solution is to delete the comments in `config.json` and re-run the `gaianet init` command.
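
Note on the last troubleshooting entry: before deleting comments from `config.json` by hand, it may help to list the lines that look like comments. The sketch below is an illustration only and is not part of the patch. The helper name `find_comment_lines`, the default path `~/gaianet/config.json`, and the assumption that comments are whole lines starting with `//` or `#` are all hypothetical; adjust them to match your node.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gaianet CLI): print, with line numbers,
# every line whose first non-blank characters are "//" or "#".
# Assumption: comments occupy whole lines; trailing comments are not detected.
find_comment_lines() {
    grep -nE '^[[:space:]]*(//|#)' "$1"
}

# Assumed location of the node configuration; override with CONFIG=/path/to/file.
CONFIG="${CONFIG:-$HOME/gaianet/config.json}"

if [ ! -f "$CONFIG" ]; then
    echo "No config file found at $CONFIG"
elif find_comment_lines "$CONFIG"; then
    echo "Delete the lines listed above, then re-run: gaianet init"
else
    echo "No comment-style lines found in $CONFIG"
fi
```

The pattern is anchored to the start of the line so that URLs inside values, which also contain `//`, are not reported.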