[inference] Refactor inference architecture (#5057)
* [inference] support only TP (#4998)
* support only tp
* enable tp
* add support for bloom (#5008)
* [refactor] refactor gptq and smoothquant llama (#5012)
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023)
  merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022)
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035)
* [Refactor] refactor policy search and quant type controling in inference
* [inference] update readme (#5051)
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] udpate example (#5053)
* udpate example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer

---------

Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
colossalai/legacy/inference/serving/torch_serve/README.md (new file, 109 lines added)
# Colossal-Inference with TorchServe
## Overview
This demo is used for testing and demonstrating how to deploy Colossal Inference from `colossalai.inference` with TorchServe. It imports inference modules from colossalai and is based on https://github.com/hpcaitech/ColossalAI/tree/3e05c07bb8921f2a8f9736b6f6673d4e9f1697d0. For now, only single-GPU inference serving is supported.
## Environment for testing
### Option #1: Use Conda Env
The steps below create a conda env for testing locally. We might want to use Docker or configure the env on a cloud platform later.

*NOTE*: TorchServe requires a JDK and the `JAVA_HOME` environment variable to be set. We recommend installing OpenJDK 17 (please refer to https://openjdk.org/projects/jdk/17/).
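If a JDK is not already available, a minimal sketch of one way to install it and set `JAVA_HOME` on Ubuntu/Debian is shown below; the package name and install path are assumptions and may differ on your system.

```bash
# install OpenJDK 17 (Ubuntu/Debian example; adjust for your distro)
sudo apt-get update && sudo apt-get install -y openjdk-17-jdk

# point JAVA_HOME at the installed JDK (the exact path may differ)
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
java -version
```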
```bash
# use python 3.8 or 3.9
conda create -n infer python=3.9

# use torch 1.13+cuda11.6 for inference
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

# conda cuda toolkit (e.g. nvcc, etc.)
conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit

# install colossalai with PyTorch extensions
cd <path_to_ColossalAI_repo>
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-test.txt
CUDA_EXT=1 pip install -e .

# install torchserve
cd <path_to_torch_serve_repo>
python ./ts_scripts/install_dependencies.py --cuda=cu116
pip install torchserve torch-model-archiver torch-workflow-archiver
```
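As an optional sanity check (a rough sketch, assuming the `infer` conda env above), you can verify that the pieces the serving stack relies on are visible in the environment:

```bash
# all of these should succeed inside the conda env
java -version                 # JDK required by TorchServe
nvcc --version                # CUDA toolkit installed via conda
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import colossalai; print(colossalai.__version__)"
torchserve --version
```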
### Option #2: Use Docker
You can build the Docker image for Colossal-Inference serving using the provided [Dockerfile](./docker/Dockerfile).
```bash
# build from dockerfile
cd ColossalAI/examples/inference/serving/torch_serve/docker
docker build -t hpcaitech/colossal-infer-ts:0.2.0 .
```
Once the image is ready, you can launch a container with the following command:
```bash
cd ColossalAI/examples/inference/serving/torch_serve

# run the docker container
docker run --rm \
    -it --gpus all \
    --name <name_you_assign> \
    -v <your-data-dir>:/data/scratch \
    -w <ColossalAI_dir> \
    hpcaitech/colossal-infer-ts:0.2.0 \
    /bin/bash
```
## Steps to deploy a model
### 1. Download/prepare a model
We will download a BLOOM model and then zip the downloaded model. You can download the model from [HuggingFace](https://huggingface.co/models) manually, or refer to the script [download_model.py](https://github.com/pytorch/serve/blob/c3ca2599b4d36d2b61302064b02eab1b65e1908d/examples/large_models/utils/Download_model.py) provided by the pytorch-serve team to download a snapshot of the model.
```bash
# download snapshots
cd <path_to_torch_serve>/examples/large_models/utils/
huggingface-cli login
python download_model.py --model_name bigscience/bloom-560m -o <path_to_store_downloaded_model>

# zip the model repo
cd <path_to_store_downloaded_model>/models--bigscience--bloom-560m/snapshots/<specific_revision>
zip -r <path_to_place_zipped_model>/model.zip *
```
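Before archiving, it may be worth confirming that the zip contains the model repo files (config, tokenizer, and weight files) at its top level, since the handler will unpack it as-is. This check is optional:

```bash
# list the contents of the zipped model repo
unzip -l <path_to_place_zipped_model>/model.zip
```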
> **_NOTE:_** The torch archiver and the server will use the `/tmp/` folder. Depending on your disk quota, using torch-model-archiver might raise an OSError "Disk quota exceeded". To prevent this, set the tmp dir environment variables as follows:
`export TMPDIR=<dir_with_enough_space>/tmp` and `export TEMP=<dir_with_enough_space>/tmp`,
or use relatively small models (as we did) for local testing.
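For convenience, the same workaround as a runnable snippet (the directory is a placeholder you need to choose yourself):

```bash
# redirect temp files used by torch-model-archiver/TorchServe to a disk with enough space
export TMPDIR=<dir_with_enough_space>/tmp
export TEMP=<dir_with_enough_space>/tmp
mkdir -p "$TMPDIR"
```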
### 2. Archive the model
With torch-model-archiver, we pack the model file (.zip) and the handler file (.py) together into a .mar file, which TorchServe then unpacks during serving. Relevant model configs and inference configs can be set in `model-config.yaml`.
```bash
cd ./ColossalAI/examples/inference/serving/torch_serve

# create a folder under the current directory to store the packed model created by torch archiver
mkdir model_store
torch-model-archiver --model-name bloom --version 0.1 --handler Colossal_Inference_Handler.py --config-file model-config.yaml --extra-files <dir_zipped_model>/model.zip --export-path ./model_store/
```
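After this step, the packed archive (named after `--model-name`) should appear in the model store:

```bash
# the .mar file produced by torch-model-archiver
ls ./model_store
# expected to contain: bloom.mar
```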
### 3. Launch serving
Modify `load_models` in `config.properties` to select the model(s) stored in the `<model_store>` directory to be deployed. By default we use `load_models=all` to load and deploy all the models (.mar) we have.
```bash
torchserve --start --ncs --ts-config config.properties
```
We can set the inference, management, and metrics addresses and other TorchServe settings in `config.properties`.
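For reference, a minimal `config.properties` consistent with the commands in this guide might look like the sketch below; the addresses and ports are assumptions (the inference port matches the `8084` used in the curl examples), so adapt them to your setup and to the `config.properties` shipped with this example.

```bash
# shown for reference only; edit the existing config.properties rather than overwriting it
cat <<'EOF'
inference_address=http://0.0.0.0:8084
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8086
model_store=./model_store
load_models=all
EOF
```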
TorchServe will create a folder `logs/` under the current directory to store ts, model, and metrics logs.
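The `logs/` folder typically contains files such as `ts_log.log`, `model_log.log`, `ts_metrics.log`, and `model_metrics.log` (names may vary with your log4j configuration); following the model log is a quick way to check that the handler loaded the model successfully:

```bash
# follow the model worker log while the model is loading (file name may vary)
tail -f logs/model_log.log
```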
### 4. Run inference
```bash
# check inference status
curl http://0.0.0.0:8084/ping

curl -X POST http://localhost:8084/predictions/bloom -T sample_text.txt
```
To stop TorchServe, run `torchserve --stop`.