feat(model): Support llama.cpp server deploy (#2263)

Fangyin Cheng
2025-01-02 16:50:53 +08:00
committed by GitHub
parent 576da34e92
commit 0b2af2e9a2
14 changed files with 823 additions and 44 deletions

@@ -0,0 +1,40 @@
# Llama.cpp Server
DB-GPT supports the native [llama.cpp server](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md),
which handles concurrent requests and performs continuous batching during inference.
## Install dependencies
```bash
pip install -e ".[llama_cpp_server]"
```
If you have a GPU and want to accelerate inference, install the dependencies with CUDA enabled:
```bash
CMAKE_ARGS="-DGGML_CUDA=ON" pip install -e ".[llama_cpp_server]"
```
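Before relying on the CUDA build, you can confirm that a CUDA-capable GPU is visible to the system. This is an optional sketch, assuming the NVIDIA driver (which ships `nvidia-smi`) is installed:
```bash
# List visible NVIDIA GPUs with driver and CUDA versions; if this command
# fails, the CUDA-enabled build will not be able to offload work to the GPU.
nvidia-smi
```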
## Download the model
Here, we use the `qwen2.5-0.5b-instruct` model as an example. You can download it from [Hugging Face](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF).
```bash
wget "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf?download=true" -O /tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf
```
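Alternatively, the same file can be fetched with `huggingface-cli`. This is an optional sketch, assuming the `huggingface_hub` package (which provides the CLI) is installed:
```bash
# Download the GGUF file into /tmp using the Hugging Face CLI
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir /tmp
```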
## Modify the configuration file
In the `.env` configuration file, set the model type to `llama_cpp_server` so that the model is served through the llama.cpp server:
```bash
LLM_MODEL=qwen2.5-0.5b-instruct
LLM_MODEL_PATH=/tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf
MODEL_TYPE=llama_cpp_server
```
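As a quick sanity check, verify that `LLM_MODEL_PATH` points at the downloaded file before starting the server (a minimal sketch; adjust the path if you saved the model elsewhere):
```bash
# Confirm the GGUF file exists and is non-empty
test -s /tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  && echo "model file OK" \
  || echo "model file missing or empty"
```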
## Start the DB-GPT server
```bash
python dbgpt/app/dbgpt_server.py
```
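Once the server is up, you can probe that it is listening. The port below assumes DB-GPT's usual default web port of 5670; adjust it if your deployment uses a different one:
```bash
# Print the HTTP status code from the DB-GPT web endpoint
# (5670 is assumed to be the default port here)
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5670
```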

@@ -271,6 +271,10 @@ const sidebars = {
type: 'doc',
id: 'installation/advanced_usage/vLLM_inference',
},
{
type: 'doc',
id: 'installation/advanced_usage/Llamacpp_server',
},
{
type: 'doc',
id: 'installation/advanced_usage/OpenAI_SDK_call',