Merge pull request #5 from GaiaNet-AI/better-CUDA-instruction
Update cuda.md
alabulei1 authored May 20, 2024
2 parents dc1f3cb + 509adba commit bb56c32
Showing 2 changed files with 52 additions and 4 deletions.
3 changes: 1 addition & 2 deletions docs/node-guide/tasks/cuda.md
@@ -76,13 +76,12 @@ Cuda compilation tools, release 12.2, V12.2.140
```
Build cuda_12.2.r12.2/compiler.33191640_0
```

After that, use the following command line to set up the environment path. Add this line to your `~/.bashrc` or `~/.zshrc` file so that new terminals and future logins can still find the CUDA library files.

```
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```
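
One way to add the line without duplicating it on repeated runs is sketched below. The helper name `add_cuda_path` is hypothetical and not part of GaiaNet or CUDA; it assumes the startup file is passed as an argument.

```bash
# Hypothetical helper: append the CUDA library path export to a shell startup
# file, but only if that exact line is not already present.
add_cuda_path() {
  rc_file="$1"
  # Single quotes keep the ${...} expansion literal so it is evaluated at login.
  line='export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}'
  touch "$rc_file"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  if ! grep -qxF -- "$line" "$rc_file"; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}
```

For example, run `add_cuda_path ~/.bashrc` (or pass `~/.zshrc`), then open a new terminal or `source` the file for the change to take effect.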


## More resources

Here are more scripts that could help you in case you are stuck.
53 changes: 51 additions & 2 deletions docs/node-guide/troubleshooting.md
@@ -4,8 +4,24 @@ sidebar_position: 8

# Troubleshooting

## The system cannot find CUDA libraries

Sometimes, the CUDA toolkit is installed in a non-standard location. In that case, the error message typically reports that a `libcu*12` library cannot be found. For example, you might have CUDA installed as part of your Python setup. The following commands install the CUDA libraries into Python's environment.

```
sudo apt install python3-pip -y
pip3 install --upgrade fschat accelerate autoawq vllm
```

The easiest fix is to link those non-standard CUDA libraries to the standard location, like this.

```
# Writing to /usr/lib requires root, hence sudo. Adjust python3.10 to match your Python version.
sudo ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublas.so.12 /usr/lib/libcublas.so.12
sudo ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12 /usr/lib/libcudart.so.12
sudo ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 /usr/lib/libcublasLt.so.12
```
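
The `python3.10` directory in the paths above depends on your Python version and on how pip was run, so the libraries may live elsewhere on your system. A minimal sketch for locating them, assuming a hypothetical helper named `find_cuda_libs` and a default search root of `/usr/local/lib`:

```bash
# Hypothetical helper: list the three CUDA libraries named above under a directory.
# The default root is an assumption; pass your own site-packages or
# dist-packages directory if Python is installed elsewhere.
find_cuda_libs() {
  root="${1:-/usr/local/lib}"
  find "$root" \( -name 'libcublas.so.12' \
    -o -name 'libcublasLt.so.12' \
    -o -name 'libcudart.so.12' \) 2>/dev/null
}
```

Each path it prints can be used as the source argument of the matching `ln -s` command.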

## The "Too many open files" Error on macOS

When running `gaianet init` to initialize a new node on macOS, you may encounter an error related to snapshot recovery if your snapshot contains a large amount of text. The error message may be the following:

@@ -22,6 +38,39 @@ To resolve this issue, you can increase the default FD limit on your system. To
```
ulimit -n 10000
```

This will temporarily set the FD limit to 10,000. Next, run the `gaianet init` and `gaianet start` commands in the same terminal, because the new limit applies only to that shell session.
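
Because `ulimit` changes only the current shell session, it can help to check the limit before raising it. The following is a sketch that assumes 10,000 is a sufficient target, as in the command above; it leaves the limit untouched if it is already higher.

```bash
# Raise the soft FD limit for this shell only if it is below the target.
target=10000
current="$(ulimit -n)"
if [ "$current" != "unlimited" ] && [ "$current" -lt "$target" ]; then
  # This can fail if the hard limit is lower than the target.
  ulimit -n "$target" || echo "Could not raise the FD limit to $target" >&2
fi
echo "FD limit for this shell: $(ulimit -n)"
```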

## File I/O error

```
* Import the Qdrant collection snapshot ...
The process may take a few minutes. Please wait ...
* [Error] Failed to recover from the collection snapshot. An error occurred processing field `snapshot`: File I/O error: Operation not permitted (os error 1)
```

This error typically indicates that the Qdrant instance was not shut down properly before you tried to initialize it again with a new snapshot. The solution is to stop the GaiaNet node first.

```
gaianet stop
```

Alternatively, you could manually kill the processes from the terminal or in the OS's Activity Monitor.

```
sudo pkill -9 qdrant
sudo pkill -9 wasmedge
sudo pkill -9 frpc
```
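
To confirm that nothing is still running before you initialize the node again, you could check for the same three process names. This is a sketch, not a GaiaNet command: the function name `check_gaianet_processes` is hypothetical, and it assumes `pgrep` is available on your system.

```bash
# Hypothetical helper: report whether each process named above is still running.
check_gaianet_processes() {
  for name in qdrant wasmedge frpc; do
    # pgrep exits with status 0 when at least one matching process exists.
    if pgrep "$name" >/dev/null 2>&1; then
      echo "$name: still running"
    else
      echo "$name: not running"
    fi
  done
}

check_gaianet_processes
```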

Then you can run `gaianet init` and then `gaianet start` again.

## The "Failed to open the file" Error

```
Warning: Failed to open the file
Warning: https://huggingface.co/datasets/max-id/gaianet-qdrant-snapshot/resolve
Warning: /main/consensus/consensus.snapshot: No such file or directory
curl: (23) Failure writing output to destination
```

This type of error occurs because `gaianet init` tries to process the comments in `config.json`. The solution is to delete the comments in `config.json` and re-run the `gaianet init` command.
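
A quick way to see whether any comments are left is to check that the file parses as strict JSON, which does not allow comments. The sketch below assumes `python3` is installed and uses a hypothetical helper name `check_json`; pass it the path to your own `config.json`.

```bash
# Hypothetical helper: report whether a file is strict JSON.
# Python's json.tool rejects comments, so a commented file is reported as invalid.
check_json() {
  if python3 -m json.tool "$1" >/dev/null 2>&1; then
    echo "valid JSON"
  else
    echo "not valid JSON"
  fi
}
```

If it prints `not valid JSON`, comments or other syntax errors remain in the file.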
