Deploy FULLY PRIVATE & FAST LLM Chatbots!

Deploy text-generation-inference

GitHub official link: https://github.com/huggingface/text-generation-inference

  1. clone project to local
# the commit hash I installed: a5def7c222174e03d815f890093584f3e815c5ce
# commit date: Wed Nov 8 10:34:38 2023 -0600
git clone https://github.com/huggingface/text-generation-inference.git
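If you want to match exactly what I tested, check out the recorded commit:

cd text-generation-inference
git checkout a5def7c222174e03d815f890093584f3e815c5ce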
  2. clone the LLM to local
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696

# if you want to clone without large files (just their pointers),
# prepend your git clone with the GIT_LFS_SKIP_SMUDGE env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696

I tested this model in fp16 on an RTX 4090 with 24GB of VRAM; it consumes about 22GB when running, is pretty fast, and the results are just fine.

# nvidia-smi
Wed Nov 15 22:18:02 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:02:00.0 Off |                  Off |
|  0%   45C    P8               29W / 450W|  22758MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060         Off| 00000000:03:00.0 Off |                  N/A |
|  0%   48C    P8               15W / 170W|      8MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2173      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A      2453      G   /usr/bin/gnome-shell                         16MiB |
|    0   N/A  N/A   2703657      C   /opt/conda/bin/python3.9                  22570MiB |
|    1   N/A  N/A      2173      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
  3. install docker and nvidia-docker2

For reference, my GPU driver info on the Ubuntu 20.04 LTS server:

  • Driver Version: 530.30.02
  • CUDA Version: 12.1
  • Linux Kernel Version: 5.15.0-79-generic

install nvidia-docker2

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee \
/etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2
sudo apt update
sudo apt list -a nvidia-docker2
sudo apt install nvidia-docker2=2.13.0-1

sudo tee /etc/docker/daemon.json <<EOF
{
  "registry-mirrors": ["https://2efk2pit.mirror.aliyuncs.com"],
  "dns": ["223.5.5.5","119.29.29.29"],
  "default-runtime": "nvidia",
  "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
  "max-size": "100m"
  },
  "storage-driver": "overlay2"
} 
EOF

sudo systemctl daemon-reload
sudo systemctl restart docker
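
After restarting docker, it's worth confirming that containers can actually see the GPUs. Any CUDA base image works; the tag below is just an example:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi

You should see the same GPU table as on the host.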
  4. download the image and run it

The image can be pulled directly from the internet; no proxy is needed:

docker pull ghcr.io/huggingface/text-generation-inference:1.1.0
# REPOSITORY                                      TAG       IMAGE ID       CREATED       SIZE
# ghcr.io/huggingface/text-generation-inference   1.1.0     a111faa3a21b   6 weeks ago   10.2GB

I like to use a local model downloaded by git.
If the model path is /home/finuks/Project/models/OpenAssistant/falcon-7b-sft-top1-696, you can use this:

docker run -d --gpus all \
  --name text-generation-inference \
  --shm-size 1g \
  -p 30000:80 \
  -e NUM_SHARD=1 \
  -e LOG_LEVEL=info,text_generation_router=debug \
  -e MAX_TOTAL_TOKENS=4097 \
  -e MAX_INPUT_LENGTH=4096 \
  -e MAX_BATCH_TOTAL_TOKENS=4097 \
  -v /home/finuks/Project/models:/usr/src \
  -v /home/finuks/Project/text-generation-inference/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.1.0 \
  --model-id OpenAssistant/falcon-7b-sft-top1-696
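
Once the container is up, you can smoke-test TGI's /generate endpoint directly; the prompt here uses the same <|prompter|> / <|assistant|> / </s> tokens as the chat-ui config later in this post:

curl http://localhost:30000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<|prompter|>What is deep learning?</s><|assistant|>", "parameters": {"max_new_tokens": 64}}'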

Deploy chat-ui

GitHub official link: https://github.com/huggingface/chat-ui

  1. download mongodb image and run it
docker pull mongo:7.0.2
# REPOSITORY                                      TAG       IMAGE ID       CREATED       SIZE
# mongo                                           7.0.2     ee3b4d1239f1   4 weeks ago   748MB

docker run -d -p 27017:27017 --name mongo-chatui mongo:7.0.2
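
To confirm MongoDB is reachable (mongosh ships inside the mongo:7 image):

docker exec -it mongo-chatui mongosh --eval 'db.runCommand({ ping: 1 })'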

some basic mongodb operations (a containerized backup example follows this list):

  • mongosh: the mongo shell
  • mongodump:
    • back up a database: mongodump --host localhost:27017 --db mydb --out /backup/dir
    • back up all databases: mongodump --out /backup/dir
  • mongorestore:
    • restore a database: mongorestore --db mydb /backup/mongo/mydb
    • delete all data before restoring: mongorestore --db mydb --drop /backup/mongo/mydb
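Since mongo runs in docker here, the same tools can be invoked through docker exec; a minimal sketch, assuming chat-ui's default database name of "chat-ui":

# "chat-ui" is an assumption (chat-ui's default DB name); adjust if yours differs
docker exec mongo-chatui mongodump --db chat-ui --out /backup
# copy the dump out of the container to the host
docker cp mongo-chatui:/backup ./mongo-backup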
  2. install nodejs

When I wrote this blog, I chose the Node.js version below to install:

$ cd ~
$ wget https://nodejs.org/dist/v18.18.2/node-v18.18.2-linux-x64.tar.gz
$ sudo tar -xzf node-v18.18.2-linux-x64.tar.gz -C /opt/
$ sudo ln -s /opt/node-v18.18.2-linux-x64 /opt/nodejs
$ echo 'export PATH=/opt/nodejs/bin:$PATH' >> ~/.bashrc
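Reload the shell and verify the toolchain is on PATH:

$ source ~/.bashrc
$ node -v   # should print v18.18.2
$ npm -v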
  3. clone project to local

I use the v0.5 release, because the latest v0.6 depends on something that can't be downloaded through my China GFW network.

# v0.5 commit: 6fc4a59a8b7b5432191cd05ebb619b3cb0009725
git clone https://github.com/huggingface/chat-ui.git
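Then pin to the v0.5 commit noted above:

cd chat-ui
git checkout 6fc4a59a8b7b5432191cd05ebb619b3cb0009725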
  4. modify .env.local file

For version 0.5, it's enough to add an "endpoints" entry pointing at the TGI port.
You can also set "PUBLIC_ORIGIN" to access the UI by hostname.

MONGODB_URL=mongodb://localhost:27017
#PUBLIC_ORIGIN="http://chatbot.private.ui:30001"

# 'name', 'userMessageToken', 'assistantMessageToken' are required
MODELS=`[
  {
    "name": "OpenAssistant/falcon-7b-sft-top1-696",
    "description": "A good alternative to ChatGPT",
    "userMessageToken": "<|prompter|>",
    "assistantMessageToken": "<|assistant|>",
    "messageEndToken": "</s>",
    "preprompt": "Below are a series of dialogues between various people and an AI assistant. The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable. The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed. It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer. That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.\n-----\n",
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      }, {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      }, {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "parameters": {
      "temperature": 0.9,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024
    },
    "endpoints": [
      {
        "url": "http://localhost:30000"
      }
    ]
  }
]`
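For reference, with the tokens above chat-ui assembles each request to TGI roughly like this (an illustrative sketch, not verbatim output):

{preprompt}<|prompter|>first user message</s><|assistant|>first reply</s><|prompter|>next message</s><|assistant|>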
  5. run project
cd ~/chat-ui
npm install   # install dependencies first
npm run dev
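By default the dev server (Vite) listens on port 5173, which is what the nginx block in the next step proxies to; a quick liveness check:

curl -I http://localhost:5173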
  6. nginx reverse proxy project url
server{
    listen 30001;

    location / {
        proxy_pass http://localhost:5173;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header Referer $http_referer;  # Pass the Referer header
        proxy_cache_bypass $http_upgrade;
    }
}
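
Drop this server block into your nginx config (e.g. /etc/nginx/conf.d/chat-ui.conf; the path is just an example), then validate and reload:

sudo nginx -t
sudo systemctl reload nginx

The chat UI should now be reachable at http://<server-ip>:30001.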