feat(model): Support llama.cpp server deploy (#2263)

Fangyin Cheng
2025-01-02 16:50:53 +08:00
committed by GitHub
parent 576da34e92
commit 0b2af2e9a2
14 changed files with 823 additions and 44 deletions

@@ -0,0 +1,40 @@
# Llama.cpp Server
DB-GPT supports the native [llama.cpp server](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md),
which handles concurrent requests and performs continuous batching during inference.
## Install dependencies
```bash
pip install -e ".[llama_cpp_server]"
```
If you have a GPU and want to accelerate inference, install the dependencies with CUDA enabled:
```bash
CMAKE_ARGS="-DGGML_CUDA=ON" pip install -e ".[llama_cpp_server]"
```
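Before relying on the CUDA build, you can confirm that a CUDA-capable GPU is visible to the system. This is an optional sketch, assuming the NVIDIA driver (which ships `nvidia-smi`) is installed:
```bash
# List visible NVIDIA GPUs with driver and CUDA versions; if this command
# fails, the CUDA-enabled build will not be able to offload work to the GPU.
nvidia-smi
```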
## Download the model
Here, we use the `qwen2.5-0.5b-instruct` model as an example. You can download it from [Hugging Face](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF).
```bash
wget "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf?download=true" -O /tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf
```
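Alternatively, the same file can be fetched with `huggingface-cli`. This is an optional sketch, assuming the `huggingface_hub` package (which provides the CLI) is installed:
```bash
# Download the GGUF file into /tmp using the Hugging Face CLI
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  qwen2.5-0.5b-instruct-q4_k_m.gguf --local-dir /tmp
```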
## Modify the configuration file
In the `.env` configuration file, set the model type to `llama_cpp_server` so that the model is served through the llama.cpp server:
```bash
LLM_MODEL=qwen2.5-0.5b-instruct
LLM_MODEL_PATH=/tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf
MODEL_TYPE=llama_cpp_server
```
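As a quick sanity check, verify that `LLM_MODEL_PATH` points at the downloaded file before starting the server (a minimal sketch; adjust the path if you saved the model elsewhere):
```bash
# Confirm the GGUF file exists and is non-empty
test -s /tmp/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  && echo "model file OK" \
  || echo "model file missing or empty"
```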
## Start the DB-GPT server
```bash
python dbgpt/app/dbgpt_server.py
```
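Once the server is up, you can probe that it is listening. The port below assumes DB-GPT's usual default web port of 5670; adjust it if your deployment uses a different one:
```bash
# Print the HTTP status code from the DB-GPT web endpoint
# (5670 is assumed to be the default port here)
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5670
```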

@@ -271,6 +271,10 @@ const sidebars = {
type: 'doc',
id: 'installation/advanced_usage/vLLM_inference',
},
{
type: 'doc',
id: 'installation/advanced_usage/Llamacpp_server',
},
{
type: 'doc',
id: 'installation/advanced_usage/OpenAI_SDK_call',