Deploy FULLY PRIVATE & FAST LLM Chatbots!
Deploy text-generation-inference
- clone the project locally
## the commit hash I installed: a5def7c222174e03d815f890093584f3e815c5ce
## Date: Wed Nov 8 10:34:38 2023 -0600
git clone https://github.com/huggingface/text-generation-inference.git
- clone the LLM locally
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696
# if you want to clone without large files - just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
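For example (a sketch using the same model repo as above), you can clone the pointers first and fetch the real weights later with git lfs pull:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696
cd falcon-7b-sft-top1-696
git lfs pull  # fetches the actual weight files behind the pointers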
I tested this model in fp16 on an RTX 4090 with 24GB VRAM; it consumes about 22GB when running, is pretty fast, and the results are just fine.
# nvidia-smi
Wed Nov 15 22:18:02 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off| 00000000:02:00.0 Off | Off |
| 0% 45C P8 29W / 450W| 22758MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 Off| 00000000:03:00.0 Off | N/A |
| 0% 48C P8 15W / 170W| 8MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2173 G /usr/lib/xorg/Xorg 167MiB |
| 0 N/A N/A 2453 G /usr/bin/gnome-shell 16MiB |
| 0 N/A N/A 2703657 C /opt/conda/bin/python3.9 22570MiB |
| 1 N/A N/A 2173 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
- install docker and nvidia-docker2
Better to mention the GPU driver info of my Ubuntu 20.04 LTS server:
- Driver Version: 530.30.02
- CUDA Version: 12.1
- Linux Kernel Version: 5.15.0-79-generic
install nvidia-docker2
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee \
/etc/apt/sources.list.d/nvidia-docker.list
# Install nvidia-docker2
sudo apt update
sudo apt list -a nvidia-docker2
sudo apt install nvidia-docker2=2.13.0-1
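Optionally, hold the pinned version so a later apt upgrade won't replace it (plain apt-mark, nothing nvidia-specific):
sudo apt-mark hold nvidia-docker2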
sudo tee /etc/docker/daemon.json <<EOF
{
    "registry-mirrors": ["https://2efk2pit.mirror.aliyuncs.com"],
    "dns": ["223.5.5.5", "119.29.29.29"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2"
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
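To verify the nvidia runtime actually works inside Docker, a quick sanity check (the CUDA base image tag here is my assumption; pick any tag compatible with your CUDA 12.1 driver):
sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi
# you should see the same GPU table as on the host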
- download image and run it
The image can be downloaded directly from the internet; no proxy is needed.
docker pull ghcr.io/huggingface/text-generation-inference:1.1.0
# REPOSITORY TAG IMAGE ID CREATED SIZE
# ghcr.io/huggingface/text-generation-inference 1.1.0 a111faa3a21b 6 weeks ago 10.2GB
I like to use the local model downloaded by git. If the model path is /home/finuks/Project/models/OpenAssistant/falcon-7b-sft-top1-696, you can use this:
docker run -d --gpus all \
    --name text-generation-inference \
    --shm-size 1g \
    -p 30000:80 \
    -e NUM_SHARD=1 \
    -e LOG_LEVEL=info,text_generation_router=debug \
    -e MAX_TOTAL_TOKENS=4097 \
    -e MAX_INPUT_LENGTH=4096 \
    -e MAX_BATCH_TOTAL_TOKENS=4097 \
    -v /home/finuks/Project/models:/usr/src \
    -v /home/finuks/Project/text-generation-inference/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id OpenAssistant/falcon-7b-sft-top1-696
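Once the container is up, you can sanity-check TGI's /generate endpoint with curl (the prompt uses the same OpenAssistant tokens configured for chat-ui below; the parameter values are just an example):
curl http://localhost:30000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "<|prompter|>What is Deep Learning?</s><|assistant|>", "parameters": {"max_new_tokens": 64}}'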
Deploy chat-ui
- download mongodb image and run it
docker pull mongo:7.0.2
# REPOSITORY TAG IMAGE ID CREATED SIZE
# mongo 7.0.2 ee3b4d1239f1 4 weeks ago 748MB
docker run -d -p 27017:27017 --name mongo-chatui mongo:7.0.2
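A quick way to check the container is alive (mongosh ships inside the mongo:7 image):
docker exec mongo-chatui mongosh --quiet --eval 'db.runCommand({ ping: 1 })'
# expected output: { ok: 1 }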
Some basic MongoDB operations:
- mongosh: the mongo shell
- mongodump:
  - back up a database: mongodump --host localhost:27017 --db mydb --out /backup/dir
  - back up all databases: mongodump --out /backup/dir
- mongorestore:
  - restore a database: mongorestore --db mydb /backup/mongo/mydb
  - delete all data before restoring it: mongorestore --db mydb --drop /backup/mongo/mydb
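Since mongo runs in a container here, the dump/restore tools live inside it too (the official image bundles the database tools, as far as I know). For example (a sketch; I assume chat-ui's default database name is "chat-ui", adjust if yours differs):
docker exec mongo-chatui mongodump --db chat-ui --out /tmp/backup  # dump inside the container
docker cp mongo-chatui:/tmp/backup ./mongo-backup                  # copy the dump to the host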
- install nodejs
When I wrote this blog, I chose the following Node.js version to install:
$ cd ~
$ wget https://nodejs.org/dist/v18.18.2/node-v18.18.2-linux-x64.tar.gz
$ sudo tar -xzf node-v18.18.2-linux-x64.tar.gz -C /opt/
$ sudo ln -s /opt/node-v18.18.2-linux-x64 /opt/nodejs
$ echo 'export PATH=/opt/nodejs/bin:$PATH' >> ~/.bashrc
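Then open a new shell (or source ~/.bashrc) and confirm the toolchain is on PATH:
$ source ~/.bashrc
$ node -v  # should print v18.18.2
$ npm -v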
- clone the project locally. I use the v0.5 version, because the latest version 0.6 depends on something that can't be downloaded inside my special China GFW network.
# 0.5 commit version: 6fc4a59a8b7b5432191cd05ebb619b3cb0009725
git clone https://github.com/huggingface/chat-ui.git
- modify the .env.local file
For version 0.5, just adding "endpoints" pointing at the TGI port is enough.
You can also set "PUBLIC_ORIGIN" to access the UI by hostname.
MONGODB_URL=mongodb://localhost:27017
#PUBLIC_ORIGIN="http://chatbot.private.ui:30001"
# 'name', 'userMessageToken', 'assistantMessageToken' are required
MODELS=`[
  {
    "name": "OpenAssistant/falcon-7b-sft-top1-696",
    "description": "A good alternative to ChatGPT",
    "userMessageToken": "<|prompter|>",
    "assistantMessageToken": "<|assistant|>",
    "messageEndToken": "</s>",
    "preprompt": "Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.\n-----\n",
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      }, {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      }, {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "parameters": {
      "temperature": 0.9,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024
    },
    "endpoints": [
      {
        "url": "http://localhost:30000"
      }
    ]
  }
]`
- run project
cd ~/chat-ui
npm run dev
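Note that on a fresh clone you need to install the dependencies once before the dev server will start (standard npm, nothing chat-ui specific):
cd ~/chat-ui
npm install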
- nginx reverse proxy for the project URL
server {
    listen 30001;
    location / {
        proxy_pass http://localhost:5173;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header Referer $http_referer;  # pass the Referer header
        proxy_cache_bypass $http_upgrade;
    }
}
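To activate it, save the server block above as an nginx site config, e.g. /etc/nginx/conf.d/chat-ui.conf (the path is my assumption; adjust to your distro's layout), then:
sudo nginx -t                # check the config syntax
sudo systemctl reload nginx  # apply it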