To keep this easy, I'm going with a particular setup that provides basic LLM chatbot capability and a selection of models to choose from. The goal is quick, inexpensive, offline access to play with some LLMs. Even using the Rockchip NPU, this will not be a fast or ChatGPT-like experience.
Operating system
Installing the necessary kernel and drivers can be challenging and is more work than the scope of an "Easy" guide. So, I used a prebuilt version of Armbian with all the components necessary to enable the NPU. This build is from Pelochus and was created using the Armbian build system.
https://github.com/Pelochus/armbian-build-rknpu-updates/releases/tag/02-11-2024
Install the OS. An SD card will work for testing, but larger models will load very slowly; I recommend using eMMC or an NVMe SSD for longer-term use.
Once booted, verify the kernel sees the NPU:
sudo dmesg | grep "Initialized rknpu"
If the driver loaded, grep will return a line confirming the rknpu driver initialized; if you get no output, the NPU is not enabled.
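The exact text varies with the kernel and driver build, but the line should look something like this (your version, date, and device address may differ):

[drm] Initialized rknpu 0.9.6 20240322 for fdab0000.npu on minor 1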
Since this is a fresh OS install, let's update.
sudo apt update && sudo apt upgrade -y
We do have one thing to install before we run the next script. It could be installed automatically if we ran the script with sudo, but I have found the script works better when run as our own user, so let's install this dependency manually.
sudo apt install python3.10-venv
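To confirm the venv module is actually available (assuming python3 on this image points at Python 3.10), this should print usage text rather than an error:

python3 -m venv --help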
Let's download the WebGUI we are going to use.
git clone https://github.com/c0zaut/rkllm-gradio && cd rkllm-gradio
This will set up a virtual Python environment for us with everything else we need. We don't want to use sudo because we want the environment under our user.
bash setup.sh
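If you're curious, a rough sketch of what setup.sh is doing (based on where the environment ends up; check the script itself for the exact steps) is:

python3 -m venv ~/.venv/rkllm-gradio
~/.venv/rkllm-gradio/bin/pip install -r requirements.txt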
source ~/.venv/rkllm-gradio/bin/activate
Your command prompt should now be prefixed with the environment name, something like (rkllm-gradio) user@hostname:~$. This shows us we are in the virtual Python environment that setup.sh created.
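If you ever want to leave the virtual environment and return to your normal shell, just run:

deactivate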
Now let's grab some models to play with. Head over to CozAut's profile on Hugging Face. We are looking for ones that have rk3588 in the name; these are already converted to the format required by the RK3588's NPU. It cannot run the standard models you find on Hugging Face or Ollama; they must be converted first.
https://huggingface.co/c01zaut
Here are the ones I'll cover, but feel free to play around. The files are quite large, and deepseek-coder-7b will require a board with at least 16GB of RAM.
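You can check how much RAM your board has before committing to the larger downloads:

free -h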
cd ~/rkllm-gradio/models/
c01zaut/Qwen2.5-3B-Instruct-rk3588-1.1.1
- 3.5GB download
- 3.9 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 4.8 seconds load time from a class 10 microSD card
- 0 seconds load time from eMMC
- 7.7 Tokens per second
wget -O Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-1.0.rkllm https://huggingface.co/c01zaut/Qwen2.5-3B-Instruct-rk3588-1.1.1/resolve/main/Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-1.0.rkllm?download=true
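Once a download finishes, it's worth confirming the file size roughly matches the figure listed above:

ls -lh ~/rkllm-gradio/models/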
c01zaut/Llama-3.2-1B-Instruct-rk3588-1.1.1
- 1.7GB download
- 2.18 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 7.7 seconds load time from a class 10 microSD card
- seconds load time from eMMC
- 18.3 Tokens per second
wget -O Llama-3.2-1B-Instruct-rk3588-w8a8-opt-1-hybrid-ratio-1.0.rkllm https://huggingface.co/c01zaut/Llama-3.2-1B-Instruct-rk3588-1.1.1/resolve/main/Llama-3.2-1B-Instruct-rk3588-w8a8-opt-1-hybrid-ratio-1.0.rkllm?download=true
c01zaut/deepseek-coder-7b-instruct-v1.5-rk3588-1.1.1
- 6.9GB download
- 9.03 GB of RAM when the model is loaded (0.3 GB OS usage subtracted)
- 115 seconds load time from a class 10 microSD card
- seconds load time from eMMC
- 3.6 Tokens per second (note: this same model runs at 8.6 tokens per second on my Ryzen 5800X3D CPU, so this is pretty impressive for the NPU)
wget -O deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm https://huggingface.co/c01zaut/deepseek-coder-7b-instruct-v1.5-rk3588-1.1.1/resolve/main/deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm?download=true
If you choose to use the deepseek model above, or select another model from CozAut's list of pre-converted models, it may not automatically show up in the drop-down. In that case, you can add it to the model_configs.py file, as I will do for the deepseek one below.
nano ~/rkllm-gradio/model_configs.py
"deepseek-coder-7b": {"filename": "deepseek-coder-7b-instruct-v1.5-rk3588-w8a8-opt-0-hybrid-ratio-0.0.rkllm"}
Let's make sure we have the device's IP address; we will need it in a few steps.
ip addr
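If the full output is busy, the brief form is easier to scan for your IPv4 address:

ip -4 -brief addr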
Now let's start our UI; this may take a bit.
python3 ~/rkllm-gradio/rkllm_server_gradio.py
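If it starts cleanly, Gradio will print the URL it is serving on, something like this (the exact address and port depend on how the script calls launch()):

Running on local URL:  http://0.0.0.0:8080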
From a browser on the same network, open http://<ip address from ip addr>:8080 (for example, http://192.168.1.50:8080).
Select our model. If you don't see it, go back and update the ~/rkllm-gradio/model_configs.py file.
Once the model loads, go to the Txt2Txt tab, where you can chat with the model.
Yes, the UI is basic, but you can edit the ~/rkllm-gradio/rkllm_server_gradio.py file to change the name or just about anything else. Here is the documentation for Gradio: https://www.gradio.app/docs
Now let's create an alias so we can launch our LLM with one command.
echo "alias rkllm='source ~/.venv/rkllm-gradio/bin/activate && cd ~/rkllm-gradio && python3 rkllm_server_gradio.py'" >> ~/.bashrcexec bash
Now we can launch our LLM with just
rkllm