
llama.cpp

DB-GPT already supports llama.cpp via llama-cpp-python.

Running llama.cpp

Preparing Model Files

To use llama.cpp, you need to prepare a model file in GGUF format. There are two common ways to obtain one; choose either:

  1. Download a pre-converted model file.

Suppose you want to use Vicuna 13B v1.5. You can download the already-converted file from TheBloke/vicuna-13B-v1.5-GGUF; only one file is needed. Download it to the models directory and rename it to ggml-model-q4_0.gguf (a quick check is sketched after this list).

wget https://huggingface.co/TheBloke/vicuna-13B-v1.5-GGUF/resolve/main/vicuna-13b-v1.5.Q4_K_M.gguf -O models/ggml-model-q4_0.gguf

  2. Convert the model yourself.

You can convert a model file yourself by following the instructions in llama.cpp#prepare-data--run, then put the converted file in the models directory and rename it to ggml-model-q4_0.gguf.
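
Either way, the file should end up at models/ggml-model-q4_0.gguf relative to the DB-GPT root. As a rough sanity check, assuming you run the commands from the repository root (the roughly 8 GB size applies to the Q4_K_M Vicuna 13B file above):

# Create the models directory if it does not exist yet
mkdir -p models
# Confirm the GGUF file is in place and roughly the expected size
ls -lh models/ggml-model-q4_0.gguf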

Installing Dependencies

llama.cpp is an optional dependency in DB-GPT, and you can manually install it using the following command:

pip install -e ".[llama_cpp]"
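
If you want to verify the installation before wiring it into DB-GPT, you can run a quick check from the shell. This is only a sketch: it assumes llama-cpp-python exposes its __version__ attribute and Llama class as in releases from this period, and that the model file prepared earlier is in place.

# Check that the Python binding imports and report its version
python -c "import llama_cpp; print(llama_cpp.__version__)"
# Optionally try loading the model itself (this reads the whole multi-GB file)
python -c "from llama_cpp import Llama; Llama(model_path='models/ggml-model-q4_0.gguf', n_ctx=512)"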

Modifying the Configuration File

Next, you can directly modify your .env file to enable llama.cpp.

LLM_MODEL=llama-cpp
llama_cpp_prompt_template=vicuna_v1.1

Then you can start DB-GPT by following the instructions in Run.
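
For reference, DB-GPT releases from around this time were usually started with the command below; treat it as a sketch and check the Run document for the entrypoint that matches your version:

python pilot/server/dbgpt_server.py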

More Configurations

In DB-GPT, model configuration is done through environment variables of the form {model name}_{config key}.

| Environment Variable Key | Default | Description |
| --- | --- | --- |
| llama_cpp_prompt_template | None | Prompt template name. Supported values: zero_shot, vicuna_v1.1, alpaca, llama-2, baichuan-chat, internlm-chat. If None, the prompt template is determined automatically from the model path. |
| llama_cpp_model_path | None | Model path. |
| llama_cpp_n_gpu_layers | 1000000000 | Number of layers to offload to the GPU. Set this to 1000000000 to offload all layers. If your GPU VRAM is not enough, set a lower number, e.g. 10. |
| llama_cpp_n_threads | None | Number of threads to use. If None, the number of threads is determined automatically. |
| llama_cpp_n_batch | 512 | Maximum number of prompt tokens to batch together when calling llama_eval. |
| llama_cpp_n_gqa | None | Grouped-query attention. Must be 8 for Llama-2 70B. |
| llama_cpp_rms_norm_eps | 5e-06 | RMS norm epsilon; 5e-6 is a good value for Llama-2 models. |
| llama_cpp_cache_capacity | None | Maximum cache capacity. Examples: 2000MiB, 2GiB. |
| llama_cpp_prefer_cpu | False | If a GPU is available it is preferred by default; set this to True to prefer the CPU instead. |
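
For example, a .env that points at the local model file, offloads only part of the model to the GPU, and enables the cache could look like the sketch below (the values are illustrative, not recommendations):

LLM_MODEL=llama-cpp
llama_cpp_prompt_template=vicuna_v1.1
llama_cpp_model_path=models/ggml-model-q4_0.gguf
llama_cpp_n_gpu_layers=10
llama_cpp_cache_capacity=2GiB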

GPU Acceleration

GPU acceleration is supported by default. If you encounter any issues, you can uninstall the dependency packages with the following command:

pip uninstall -y llama-cpp-python llama_cpp_python_cuda

Then install llama-cpp-python according to the instructions in llama-cpp-python.
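
For example, at the time of writing llama-cpp-python documented building with cuBLAS support via CMake flags along these lines; the exact flag names change between releases, so treat this as a sketch and follow the linked instructions for your version:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir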

Mac Usage

Note: if you are using an Apple Silicon (M1) Mac, it is highly recommended to install an arm64 build of Python, for example via Miniforge:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
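
Afterwards you can confirm that the interpreter is an arm64 build (so llama-cpp-python compiles natively instead of running under Rosetta); this check is a suggestion rather than part of the official instructions:

# Should print arm64 on a native Apple Silicon Python
python -c "import platform; print(platform.machine())"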

Windows Usage

Usage on the Windows platform has not been rigorously tested and verified, but you are welcome to try it. If you run into any problems, please create an issue or contact us directly.