To keep this easy, I'm going with a particular setup that provides basic LLM chatbot capability and a selection of models to choose from. The goal is quick, inexpensive, offline access to play with some LLMs. Even using the Rockchip NPU, this will not be a fast or ChatGPT-like experience.
Operating system
Installing the necessary kernel and drivers can be challenging and is more work than the scope of an "Easy" guide. So, I used a prebuilt version of Armbian with all the components necessary to enable the NPU. This build is from Pelochus and was created using the Armbian build system.
https://github.com/Pelochus/armbian-build-rknpu-updates/releases/tag/02-11-2024
Install the OS. An SD card will work for testing, but larger models will load very slowly; I recommend using eMMC or an NVMe SSD for longer-term use.
Once booted, verify the kernel sees the NPU:
sudo dmesg | grep "Initialized rknpu"
If the driver loaded, grep will return a line confirming the rknpu driver initialized; if you get no output, the NPU is not enabled.
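The exact text varies with the kernel and driver build, but the line should look something like this (your version, date, and device address may differ):

[drm] Initialized rknpu 0.9.6 20240322 for fdab0000.npu on minor 1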
Since this is a fresh OS install, let's update.
sudo apt update && sudo apt upgrade -y
We do have one thing to install before we run the next script. It could be installed automatically if we ran the script with sudo, but I have found the script works better when run as our own user, so let's install this dependency manually.
sudo apt install python3.10-venv
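To confirm the venv module is actually available (assuming python3 on this image points at Python 3.10), this should print usage text rather than an error:

python3 -m venv --help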
Let's download the WebGUI we are going to use.
git clone https://github.com/c0zaut/rkllm-gradio && cd rkllm-gradio
This will set up a virtual Python environment for us with everything else we need. We don't want to use sudo because we want the environment under our user.
bash setup.sh
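If you're curious, a rough sketch of what setup.sh is doing (based on where the environment ends up; check the script itself for the exact steps) is:

python3 -m venv ~/.venv/rkllm-gradio
~/.venv/rkllm-gradio/bin/pip install -r requirements.txt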
source ~/.venv/rkllm-gradio/bin/activate
Your command prompt should now be prefixed with the environment name, something like (rkllm-gradio) user@hostname:~$. This shows us we are in the virtual Python environment that setup.sh created.
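If you ever want to leave the virtual environment and return to your normal shell, just run:

deactivate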
Now let's grab some models to play with. Head over to CozAut's profile on Hugging Face. We are looking for ones that have rk3588 in the name; these are already converted to the format required by the RK3588's NPU. It cannot run the standard models you find on Hugging Face or Ollama; they must be converted first.
https://huggingface.co/c01zaut
Here are the ones I'll cover, but feel free to play around. The files are quite large, and deepseek-coder-7b will require a board with at least 16GB of RAM.
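You can check how much RAM your board has before committing to the larger downloads:

free -h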
cd ~/rkllm-gradio/models/
c01zaut/Qwen2.5-3B-Instruct-rk3588-1.1.1
- 3.5GB download
- 3.9 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 4.8 seconds load time from a class 10 microSD card
- 0 seconds load time from eMMC
- 7.7 Tokens per second
wget -O Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-1.0.rkllm https://huggingface.co/c01zaut/Qwen2.5-3B-Instruct-rk3588-1.1.1/resolve/main/Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-1.0.rkllm?download=true
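Once a download finishes, it's worth confirming the file size roughly matches the figure listed above:

ls -lh ~/rkllm-gradio/models/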
c01zaut/Llama-3.2-1B-Instruct-rk3588-1.1.1
- 1.7GB download
- 2.18 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 7.7 seconds load time from a class 10 microSD card
- seconds load time from eMMC
- 18.3 Tokens per second
wget -O Llama-3.2-1B-Instruct-rk3588-w8a8-opt-1-hybrid-ratio-1.0.rkllm https://huggingface.co/c01zaut/Llama-3.2-1B-Instruct-rk3588-1.1.1/resolve/main/Llama-3.2-1B-Instruct-rk3588-w8a8-opt-1-hybrid-ratio-1.0.rkllm?download=true
c01zaut/deepseek-coder-7b-instruct-v1.5-rk3588-1.1.1
- 6.9GB download
- 9.03 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 115 seconds load time from a class 10 microSD card
- seconds load time from eMMC
- 3.6 Tokens per second (note: this same model runs at 8.6 tokens per second on my Ryzen 5800X3D CPU, so this is pretty impressive for the NPU)
wget -O deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm https://huggingface.co/c01zaut/deepseek-coder-7b-instruct-v1.5-rk3588-1.1.1/resolve/main/deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm?download=true
If you choose to use the deepseek model above, or select another model from CozAut's list of pre-converted models, it may not automatically show up in the drop-down. In that case, you can add it to the model_configs.py file, as I will do for the deepseek one below.
nano ~/rkllm-gradio/model_configs.py
"deepseek-coder-7b": {"filename": "deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm"}
Let's make sure we have the device's IP address; we will need it in a few steps.
ip addr
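If the full output is busy, the brief form is easier to scan for your IPv4 address:

ip -4 -brief addr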
Now let's start our UI; this may take a bit.
python3 ~/rkllm-gradio/rkllm_server_gradio.py
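If it starts cleanly, Gradio will print the URL it is serving on, something like this (the exact address and port depend on how the script calls launch()):

Running on local URL:  http://0.0.0.0:8080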
From a browser on the same network, open http://<ip address from ip addr>:8080 (for example, http://192.168.1.50:8080).
Select our model. If you don't see it, go back and update the ~/rkllm-gradio/model_configs.py file.
Once the model loads, go to the Txt2Txt tab, where you can chat with the model.
Yes, the UI is basic, but you can edit the ~/rkllm-gradio/rkllm_server_gradio.py file to change the name or just about anything else. Here is the documentation for Gradio: https://www.gradio.app/docs
Now let's create an alias so we can launch our LLM with one command.
echo "alias rkllm='source ~/.venv/rkllm-gradio/bin/activate && cd ~/rkllm-gradio && python3 rkllm_server_gradio.py'" >> ~/.bashrcexec bash
Now we can launch our LLM with just
rkllm