Run a Large Language Model locally on Mixtile Blade 3 NPU
Learn how to deploy a chatbot using a large language model (LLM) on the Mixtile Blade 3 (RK3588) NPU
Mixtile Blade 3 ×1
Mixtile Blade 3 Case ×1
Asus ZenScreen Touch ×1
This project aims to implement a local chatbot application using a large language model (LLM) on the Rockchip NPU. An NPU (Neural Processing Unit) is a specialized processor that speeds up neural network computations. Utilizing the NPU’s capabilities, the chatbot will provide real-time, efficient, and privacy-focused interactions without relying on cloud services. Key steps include selecting and optimizing an LLM for the Rockchip NPU, integrating it into the chatbot application, and ensuring robust performance.
Large language models (LLMs) are advanced AI systems that understand and generate human-like text. They are trained on vast amounts of data, enabling them to perform various tasks such as answering questions, translating languages, and creating content. LLMs, like GPT-4, leverage deep learning techniques to predict and generate coherent and contextually relevant text, making them powerful tools for applications in customer service, content creation, and more. Their ability to process and generate natural language makes them invaluable in enhancing human-computer interactions.
The network architecture of a large language model (LLM) typically involves several key components: an embedding layer that maps input tokens to vectors, a deep stack of transformer decoder blocks that combine multi-head self-attention with feed-forward layers, and an output layer that predicts the probability of the next token.
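As a rough illustration of one such decoder block, the following PyTorch snippet shows the repeated unit in simplified form. It is a minimal sketch for intuition only, not the actual Phi-3 or RKLLM implementation, and the dimensions are arbitrary.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Multi-head self-attention over the token sequence
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # The causal mask ensures each token attends only to earlier tokens
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

A full LLM stacks dozens of such blocks between the embedding layer and the output projection.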
We will use a Mixtile Blade 3, a low-power SBC based on the 8 nm Rockchip RK3588 processor. The RK3588 features an NPU (Neural Processing Unit) with a maximum performance of 6 TOPS.
We will use the Mixtile Blade 3 Case, which has a built-in fan and also functions as a heatsink to keep the board cool.
For the initial setup, we will need a monitor and a keyboard.
We will use the RKLLM software stack to deploy LLMs onto the Rockchip NPU. The framework provides a streamlined workflow for converting models and running them on the NPU.
To convert and quantize a Hugging Face model to the RKLLM format, we first need to install the RKLLM Toolkit on an x86 Linux machine. Once the conversion is complete, we can perform inference on the Mixtile Blade 3 using the RKLLM C API. The RKLLM Runtime provides an API for the Rockchip NPU, allowing RKLLM models to be deployed and LLM applications to be accelerated. The runtime uses the RKNPU kernel driver to interact with the NPU hardware.
First, clone the RKNN-LLM GitHub repository, and then create a virtual environment using the provided commands.
$ git clone https://github.com/airockchip/rknn-llm.git
$ cd rknn-llm/
$ virtualenv --python=python3.8 .
$ source bin/activate
Execute the following commands to install the RKLLM Toolkit.
$ pip3 install pytz
$ pip3 install ./rkllm-toolkit/packages/rkllm_toolkit-1.0.1-cp38-cp38-linux_x86_64.whl
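To verify that the toolkit installed correctly, try importing it inside the virtual environment (the import path matches the one used by the conversion script later in this guide):
$ python3 -c "from rkllm.api import RKLLM; print('RKLLM Toolkit is ready')"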
We will use the Microsoft Phi-3-Mini-4K-Instruct model for this project. The model has 3.8 billion parameters and was trained on the Phi-3 datasets, which include synthetic data as well as filtered publicly available website data. It is part of the Phi-3 family and has a context length of 4K tokens.
To download the model, run the following command.
$ git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
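The model weights on Hugging Face are stored with Git LFS. If the clone only fetches small pointer files instead of the multi-gigabyte weight files, install Git LFS first and clone again (example commands for Ubuntu):
$ sudo apt install git-lfs
$ git lfs install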
$ cd ~/rknn-llm/rkllm-runtime/examples/rkllm_api_demo
We used the following Python script to convert the model to the RKLLM format so that it can be deployed to the Rockchip RK3588 NPU.
from rkllm.api import RKLLM

# Path to the Hugging Face model downloaded earlier
modelpath = '/home/user/Phi-3-mini-4k-instruct'

llm = RKLLM()

# Load the Hugging Face model
ret = llm.load_huggingface(model=modelpath)
if ret != 0:
    print('Load model failed!')
    exit(ret)

# Quantize to w8a8 (8-bit weights and activations) and build for the RK3588 NPU
ret = llm.build(do_quantization=True, optimization_level=1,
                quantized_dtype='w8a8', target_platform='rk3588')
if ret != 0:
    print('Build model failed!')
    exit(ret)

# Export the quantized model in the RKLLM format
ret = llm.export_rkllm("./Phi-3-mini-4k-instruct.rkllm")
if ret != 0:
    print('Export model failed!')
    exit(ret)
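Save the script (for example as export_rkllm.py; the file name is arbitrary) and run it inside the virtual environment on the x86 machine. The conversion and quantization can take several minutes and consume a significant amount of RAM.
$ python3 export_rkllm.py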
The Mixtile Blade 3’s default OS installation ships with an outdated NPU driver, but RKLLM requires NPU driver version 0.9.6 or above. To resolve this, we need to recompile the kernel from source and install it, as described below.
$ git clone https://github.com/mixtile-rockchip/ubuntu-rockchip.git
$ cd ubuntu-rockchip
$ git checkout mixtile-blade3
The 5.10 kernel source is missing a few helper functions that the newer RKNPU driver expects. To fix this, append the following code to the build/linux-rockchip/include/linux/mm.h file.
static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
    vma->vm_flags |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
{
    vma->vm_flags &= ~flags;
}
At the beginning of the build/linux-rockchip/drivers/rknpu/rknpu_devfreq.c file, add the following stub function definition.
static inline void rockchip_uninit_opp_table(struct device *dev,
                                             struct rockchip_opp_info *info)
{
}
To build the kernel image, execute the following command.
$ sudo ./build.sh --board=mixtile-blade3 -k
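When the build completes, the kernel is packaged as Debian archives; the file we need in the following steps is the linux-image package. Its location within the repository may vary, so search for it if it is not in the top-level directory:
$ find . -name 'linux-image-*.deb'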
First, download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu and go to the rkllm-runtime/examples/rkllm_api_demo directory.
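The toolchain is available from Arm’s GNU-A downloads page. A minimal sketch, assuming the archive URL below is still current and unpacking into ~/rknn-llm/ to match the compiler path configured later:
$ wget https://developer.arm.com/-/media/Files/downloads/gnu-a/10.2-2020.11/binrel/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.tar.xz
$ tar -xJf gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.tar.xz -C ~/rknn-llm/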
$ cd rknn-llm/rkllm-runtime/examples/rkllm_api_demo
Modify the src/main.cpp as shown in the diff below:
--- a/rkllm-runtime/examples/rkllm_api_demo/src/main.cpp
+++ b/rkllm-runtime/examples/rkllm_api_demo/src/main.cpp
@@ -71,7 +71,7 @@ int main(int argc, char **argv)
// Set parameters and initialize
RKLLMParam param = rkllm_createDefaultParam();
param.model_path = rkllm_model.c_str();
- param.num_npu_core = 2;
+ param.num_npu_core = 3;
param.top_k = 1;
param.max_new_tokens = 256;
param.max_context_len = 512;
@@ -115,8 +115,8 @@ int main(int argc, char **argv)
cout << input_str << endl;
}
}
- // string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
- string text = input_str;
+ string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
+ //string text = input_str;
Modify the GCC_COMPILER_PATH in the build-linux.sh compilation script.
GCC_COMPILER_PATH=~/rknn-llm/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
To compile the application, execute the following command.
$ bash build-linux.sh
After compilation, the executable can be found at the following path: build/build_linux_aarch64_Release/llm_demo
Copy the kernel image file linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb, the converted model file Phi-3-mini-4k-instruct.rkllm, and the application executable file llm_demo to the Mixtile Blade 3.
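For example, the files can be transferred over SSH; the username ubuntu and the address 192.168.1.100 below are placeholders for your board’s actual credentials:
$ scp linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb Phi-3-mini-4k-instruct.rkllm build/build_linux_aarch64_Release/llm_demo ubuntu@192.168.1.100:~/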
To install the kernel, execute the following command on the Mixtile Blade 3.
$ sudo dpkg -i linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb
To verify the NPU driver version, execute the following command.
$ sudo cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.9.6
First, execute the commands below to set up the RKLLM runtime library on the Mixtile Blade 3.
$ git clone https://github.com/airockchip/rknn-llm.git
$ export LD_LIBRARY_PATH=rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH
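The export above only lasts for the current shell session. To make the runtime library available permanently, one option is to copy it into a system library path; the shared library is assumed here to be named librkllmrt.so, so check the aarch64 directory for the exact file name:
$ sudo cp rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64/librkllmrt.so /usr/local/lib/
$ sudo ldconfig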
Before running the application, we need to raise the open-file limit using the following command; otherwise, the NPU memory allocation will fail.
$ ulimit -n 102400
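To make this limit persist across reboots, the same value can be set in /etc/security/limits.conf; the username ubuntu below is a placeholder for the account that runs the demo:
ubuntu soft nofile 102400
ubuntu hard nofile 102400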
To run the application, execute the following command.
$ ./llm_demo Phi-3-mini-4k-instruct.rkllm
When the application starts, it will take a few seconds to load the model. Once ready, it will present a user prompt, allowing us to input any questions or instructions.
This project demonstrates a conversational AI solution running on a low-power edge device. The chatbot is responsive and efficient and does not rely on cloud services, although its answers are not always accurate. It also shows the benefits of running sophisticated AI applications locally by making use of the Rockchip NPU’s capabilities.