LlamaEdge is an app development platform. Use it to create your own Assistants APIs or apps that tie together multiple custom fine-tuned models, specialized multimodal models, document ingestion algorithms, vector databases, search algorithms, prompt assembly algorithms, and external SaaS API calls. LlamaEdge apps are portable even across GPU vendors: you can develop and test your app on a MacBook and deploy it on an Nvidia device.
To learn more about how to create your own LLM inference apps using Rust and LlamaEdge, check out our examples.
For simple use cases, it also provides an OpenAI-compatible API server out of the box through the llama-api-server project, which lets it serve as a backend for frameworks like LangChain. The key benefits of LlamaEdge over other OpenAI-compatible servers are easy deployment and management.
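For instance, once a server is up, any OpenAI-style client can talk to it over the standard chat completions endpoint. The port and model name below are assumptions; match them to your own deployment.

```bash
# Query the OpenAI-compatible chat completions endpoint of a running
# llama-api-server. Port 8080 and the model name are assumptions.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "llama-2-7b-chat",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is LlamaEdge?"}
        ]
      }'
```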
- Lightweight and simple. There are no complex Python dependencies. The total package, with all dependencies, is less than 30MB, as opposed to 5GB for a PyTorch container.
- Easy to set up. A simple installer covers a variety of CPU and GPU devices, from server to edge (see the sketch after this list).
- Selection of backends. The server can switch between inference frameworks (e.g., llama.cpp or MLX, plain CUDA or TensorRT) through runtime configuration to optimize performance for the specific use case.
- Cloud-native orchestration. The server itself is already sandboxed and ready for Kubernetes without the hassle of special containers, shims and binary builds.
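As a rough sketch of that setup flow on Linux, the LlamaEdge quick start installs the WasmEdge runtime with its GGML (llama.cpp) plugin. The installer URL and plugin name below should be verified against the current LlamaEdge docs.

```bash
# Install the WasmEdge runtime with the wasi_nn-ggml plugin, which
# provides llama.cpp-based inference for llama-api-server.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh \
  | bash -s -- --plugin wasi_nn-ggml
```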
LlamaEdge makes it possible to offer rich API services from a network of heterogeneous devices with a variety of CPUs, GPUs, NPUs, and OSes. In this repo, we will showcase how to run a scalable inference server with multiple LlamaEdge workers on the backend serving concurrent API users. We provide instructions, configuration files, and testing scripts for running your own LlamaEdge API server in production. The server will support multiple concurrent inference workers distributed across:
- A GPU with a large VRAM (e.g., an Nvidia Jetson 64GB device can run 10 concurrent workers for 7B models)
- Multiple GPUs on a server (e.g., a GPU server could have 4x H100 GPUs, supporting 50+ concurrent workers for 7B models)
- Multiple physical servers
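For example, on a single machine you might launch several workers, each pinned to its own port. This is a minimal sketch assuming a typical wasmedge invocation of llama-api-server; the .wasm file name, model file, and flag names vary by release, so check your version's --help.

```bash
# Start N llama-api-server workers on ports 8080, 8081, ...
# The .wasm binary, model file, and flag names below are assumptions;
# verify them against the llama-api-server release you are running.
N=4
for i in $(seq 0 $((N - 1))); do
  port=$((8080 + i))
  nohup wasmedge --dir .:. \
    --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
    llama-api-server.wasm \
    --prompt-template llama-2-chat \
    --socket-addr 0.0.0.0:$port \
    > "worker-$port.log" 2>&1 &
done
```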
We will use the default llama-api-server as an example. The approach is to create an Nginx reverse proxy that load balances across the llama-api-server workers, each listening on a unique ipaddr:port combination (a config sketch follows below).
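A minimal load-balancing config might look like the sketch below. The worker ports, config path, and timeout value are assumptions for illustration; the configuration files in this repo are authoritative.

```bash
# Write a minimal Nginx config that balances requests across two local
# llama-api-server workers; ports and the config path are assumptions.
sudo tee /etc/nginx/conf.d/llamaedge.conf > /dev/null <<'EOF'
upstream llamaedge_workers {
    # One entry per worker, each with a unique ipaddr:port combination.
    least_conn;
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
server {
    listen 80;
    location / {
        proxy_pass http://llamaedge_workers;
        # LLM responses can be slow to complete; relax the read timeout.
        proxy_read_timeout 300s;
    }
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```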
- Setup and stress testing instructions for Nginx on Ubuntu 22.04
- Bash script for stress testing (a minimal sketch follows below)
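The stress test follows the same basic pattern as this hedged sketch: fire a batch of concurrent chat requests at the proxy and time each one. The URL, model name, and prompt are assumptions.

```bash
#!/usr/bin/env bash
# Send CONCURRENCY simultaneous chat requests through the Nginx proxy
# and report per-request wall time; endpoint and payload are assumptions.
CONCURRENCY=8
URL=http://localhost/v1/chat/completions
for i in $(seq 1 "$CONCURRENCY"); do
  curl -s "$URL" \
    -H 'Content-Type: application/json' \
    -d '{"model":"llama-2-7b-chat","messages":[{"role":"user","content":"Tell me a joke."}]}' \
    -o /dev/null -w "request $i: %{time_total}s\n" &
done
wait
```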
All models are llama2-chat with Q5_K_M quantization.
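To fetch such a model, something like the following works; the Hugging Face repo and file name are assumptions, so substitute any llama2-chat GGUF file at Q5_K_M quantization.

```bash
# Download a Q5_K_M-quantized llama2-chat model in GGUF format.
# The repo and file name are assumptions; point this at your own GGUF.
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf
```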
- pp - Prompt processing tokens/sec for each worker at max concurrency
- tg - Text generation tokens/sec for each worker at max concurrency
- Throughput - Max text generation tokens/sec for a single worker
| Provider | Processor | RAM/VRAM | LLM | LlamaEdge workers | Concurrent API clients | pp | tg | Throughput |
|---|---|---|---|---|---|---|---|---|
| Azure NC4as T4 v3 | Nvidia Tesla T4 | 16GB | 7B | 2 | 4 | 152 | 17 | 34 |
| AWS g5.xlarge | Nvidia A10G | 24GB | 7B | 4 | 8 | ... | 17 | 71 |
| OpenBayes | Nvidia RTX 4090 | 24GB | 7B | 4 | 8 | 667 | 32 | 129 |
| OpenBayes | Nvidia RTX 4090 | 24GB | 13B | 2 | 4 | 666 | 31 | 76 |
| Jetson Orin 64GB | Nvidia AGX | 64GB | 7B | 10 | 20 | ... | ... | ... |
| Jetson Orin 64GB | Nvidia AGX | 64GB | 13B | 6 | 12 | ... | ... | ... |
| MacBook | Apple M2 | 16GB | 7B | 2 | 4 | ... | ... | ... |
| MacBook | Apple M2 | 32GB | 7B | 6 | 12 | ... | ... | ... |
| MacBook | Apple M2 | 32GB | 13B | 3 | 6 | ... | ... | ... |
| MacBook | Apple M3 | 64GB | 7B | 10 | 20 | ... | ... | ... |
| MacBook | Apple M3 | 64GB | 13B | 6 | 12 | ... | ... | ... |
| OpenBayes | Nvidia A100 | 80GB | 7B | 12 | 24 | ... | ... | ... |
| OpenBayes | Nvidia A100 | 80GB | 13B | 6 | 12 | ... | ... | ... |
| OpenBayes | Nvidia A6000 x4 | 192GB | 7B | 32 | 64 | ... | ... | ... |
| OpenBayes | Nvidia A6000 x4 | 192GB | 13B | 16 | 32 | ... | ... | ... |
| OpenBayes | Nvidia RTX 4090 x4 | 96GB | 7B | 16 | 32 | ... | ... | ... |
| OpenBayes | Nvidia RTX 3090 x4 | 96GB | 7B | 16 | 32 | ... | ... | ... |