Run a Large Language Model Locally on the Mixtile Blade 3 NPU

Learn how to deploy a chatbot using a large language model (LLM) on the Mixtile Blade 3 (RK3588) NPU.
Things used in this project

Hardware components

Mixtile Blade 3 ×1
Mixtile Blade 3 Case ×1
Asus ZenScreen Touch ×1

Software apps and online services

Ubuntu 22.04 (Rockchip)
Hugging Face
RK-LLM

Story


This project aims to implement a local chatbot application using a large language model (LLM) on the Rockchip NPU. An NPU (Neural Processing Unit) is a specialized processor that speeds up neural network computations. Utilizing the NPU’s capabilities, the chatbot will provide real-time, efficient, and privacy-focused interactions without relying on cloud services. Key steps include selecting and optimizing an LLM for the Rockchip NPU, integrating it into the chatbot application, and ensuring robust performance.

 

What is a Large Language model?

Large language models (LLMs) are advanced AI systems that understand and generate human-like text. They are trained on vast amounts of data, enabling them to perform various tasks such as answering questions, translating languages, and creating content. LLMs, like GPT-4, leverage deep learning techniques to predict and generate coherent and contextually relevant text, making them powerful tools for applications in customer service, content creation, and more. Their ability to process and generate natural language makes them invaluable in enhancing human-computer interactions.

The network architecture of a large language model (LLM) typically involves several key components (a minimal code sketch follows the list):

  • Input Layer: This layer processes the input text data, converting it into a format that the model can understand, usually through tokenization.
  • Embedding Layer: Converts tokens into dense vectors that capture semantic meanings.
  • Transformer Blocks: The core of the LLM, consisting of multiple stacked transformer layers. Each block includes multi-head attention, which lets the model attend to different parts of the input sequence simultaneously; a feed-forward neural network, which processes the attention output; and layer normalization, which stabilizes and accelerates training by normalizing the inputs.
  • Output Layer: Converts the processed data back into a human-readable format, generating the final text output.
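
To make these components concrete, here is a minimal, self-contained sketch of a single transformer block in NumPy. It is purely illustrative: the weights are random rather than learned, there is no tokenizer or causal masking, and real LLMs stack dozens of such blocks.

# Minimal sketch of one transformer block (toy example, random weights)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, num_heads=4):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Random projections stand in for learned Q/K/V weight matrices
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) * 0.02 for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = softmax(q @ k.T / np.sqrt(d_head))   # attention weights
        heads.append(scores @ v)                      # weighted sum of values
    return np.concatenate(heads, axis=-1)

def feed_forward(x, hidden=256):
    d_model = x.shape[-1]
    W1 = np.random.randn(d_model, hidden) * 0.02
    W2 = np.random.randn(hidden, d_model) * 0.02
    return np.maximum(x @ W1, 0) @ W2                 # simple ReLU MLP

def transformer_block(x):
    x = layer_norm(x + multi_head_attention(x))       # attention + residual + norm
    x = layer_norm(x + feed_forward(x))               # MLP + residual + norm
    return x

tokens = np.random.randn(8, 64)                       # 8 tokens, 64-dim embeddings
print(transformer_block(tokens).shape)                # (8, 64)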

Hardware Setup

We will use a Mixtile Blade 3, a low-power single-board computer (SBC) based on the 8nm Rockchip RK3588 processor. The RK3588 features an NPU (Neural Processing Unit) with a maximum performance of 6 TOPS.

We will also use the Mixtile Blade 3 Case, which has a built-in fan and doubles as a heatsink to keep the board cool.

For the initial setup, we will need a monitor and a keyboard.

Model Conversion

We will use the RKLLM software stack to deploy the model onto the Rockchip NPU. This framework offers a streamlined approach to converting, quantizing, and running LLMs on Rockchip hardware.

Image courtesy of Rockchip

To convert and quantize a Hugging Face model to the RKLLM format, we first need to install the RKLLM Toolkit on an x86 Linux machine. Once the conversion is complete, we can perform inference on the Mixtile Blade 3 using the RKLLM C API. The RKLLM Runtime provides this API for the Rockchip NPU, allowing RKLLM models to be deployed and LLM applications to be accelerated. The runtime uses the RKNPU kernel driver to interact with the NPU hardware.

First, clone the RKNN-LLM GitHub repository, and then create a virtual environment using the provided commands.

$ git clone https://github.com/airockchip/rknn-llm.git
$ cd rknn-llm/
$ virtualenv --python=python3.8 .
$ source bin/activate

Execute the following commands to install the RKLLM Toolkit.

$ pip3 install pytz 
$ pip3 install ./rkllm-toolkit/packages/rkllm_toolkit-1.0.1-cp38-cp38-linux_x86_64.whl
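
To confirm that the toolkit installed correctly, you can try importing it inside the activated virtual environment. This is just a quick sanity check using the same class the conversion script below relies on.

# Sanity check: this import fails if the RKLLM Toolkit wheel did not install correctly
from rkllm.api import RKLLM

print("RKLLM Toolkit imported successfully:", RKLLM)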

We will be utilizing the Microsoft Phi-3-Mini-4K-Instruct model for this project. This model comprises 3.8 billion parameters and is trained using the Phi-3 datasets, which encompass synthetic data as well as filtered publicly available website data. The model is part of the Phi-3 family with a context length of 4K tokens.

To download the model, run the following command.

$ git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
$ cd ~/rknn-llm/rkllm-runtime/examples/rkllm_api_demo
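
Optionally, before converting, you can sanity-check the downloaded checkpoint on the x86 machine with the Hugging Face transformers library. This step is not required for the RKLLM workflow and is only a quick test; CPU inference of a 3.8-billion-parameter model is slow and memory-hungry, and the model path below is an assumption you should adjust to wherever you cloned the model.

# Optional: quick CPU test of the downloaded Phi-3 checkpoint (not needed for RKLLM)
# Requires: pip install torch transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/user/Phi-3-mini-4k-instruct"  # adjust to your clone location

tokenizer = AutoTokenizer.from_pretrained(model_path)
# Older transformers releases may need trust_remote_code=True for Phi-3
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("What is an NPU?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))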

We will use the following Python script, run on the x86 machine, to convert the model to the RKLLM format so that it can be deployed to the Rockchip RK3588 NPU.

from rkllm.api import RKLLM

modelpath = '/home/user/Phi-3-mini-4k-instruct'
llm = RKLLM()

# Load the Hugging Face checkpoint
ret = llm.load_huggingface(model=modelpath)
if ret != 0:
    print('Load model failed!')
    exit(ret)

# Quantize to 8-bit weights and activations (w8a8) and build for the RK3588 NPU
ret = llm.build(do_quantization=True, optimization_level=1, quantized_dtype='w8a8', target_platform='rk3588')
if ret != 0:
    print('Build model failed!')
    exit(ret)

# Export the converted model in the RKLLM format
ret = llm.export_rkllm("./Phi-3-mini-4k-instruct.rkllm")
if ret != 0:
    print('Export model failed!')
    exit(ret)

The Mixtile Blade 3’s default OS image ships with an outdated NPU driver, but RKLLM requires NPU driver version 0.9.6 or above. To resolve this, we need to recompile the kernel from source and install it. Follow the instructions below to build and install it.
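
Before rebuilding the kernel, you can check which driver version is currently installed. Here is a minimal Python sketch that reads the same debugfs entry we will use again after the upgrade; run it with sudo on the board (debugfs must be mounted).

# check_npu_driver.py -- run on the Mixtile Blade 3 with sudo
from pathlib import Path

version_file = Path("/sys/kernel/debug/rknpu/version")
print(version_file.read_text().strip())   # e.g. "RKNPU driver: v0.9.6" after the upgrade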

$ git clone https://github.com/mixtile-rockchip/ubuntu-rockchip.git
$ cd ubuntu-rockchip
$ git checkout mixtile-blade3

The kernel source is missing a few function definitions. To fix this, append the following code to the build/linux-rockchip/include/linux/mm.h file.

/* Helpers for setting and clearing VMA flags (not present in this kernel version) */
static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
        vma->vm_flags |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
{
        vma->vm_flags &= ~flags;
}

At the beginning of the build/linux-rockchip/drivers/rknpu/rknpu_devfreq.c file, add the following function definition.

/* Empty stub for the missing rockchip_uninit_opp_table symbol */
static inline void rockchip_uninit_opp_table(struct device *dev,
                                             struct rockchip_opp_info *info)
{
}

To build the kernel image, execute the following command. The build produces a Debian package (linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb) that we will install on the board later.

$ sudo ./build.sh --board=mixtile-blade3 -k

Compile Demo Application

First, download the gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu cross-compilation toolchain (available from Arm's developer downloads) and go to the rkllm-runtime/examples/rkllm_api_demo directory.

$ cd rknn-llm/rkllm-runtime/examples/rkllm_api_demo

Modify the src/main.cpp as shown in the diff below:

--- a/rkllm-runtime/examples/rkllm_api_demo/src/main.cpp
+++ b/rkllm-runtime/examples/rkllm_api_demo/src/main.cpp
@@ -71,7 +71,7 @@ int main(int argc, char **argv)
// Set parameters and initialize
RKLLMParam param = rkllm_createDefaultParam();
param.model_path = rkllm_model.c_str();
- param.num_npu_core = 2;
+ param.num_npu_core = 3;
param.top_k = 1;
param.max_new_tokens = 256;
param.max_context_len = 512;
@@ -115,8 +115,8 @@ int main(int argc, char **argv)
cout << input_str << endl;
}
}
- // string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
- string text = input_str;
+ string text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
+ //string text = input_str;

Modify the GCC_COMPILER_PATH in the build-linux.sh compilation script.

GCC_COMPILER_PATH=~/rknn-llm/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu

To compile the application, execute the following command.

$ bash build-linux.sh 

The resulting executable can be found at build/build_linux_aarch64_Release/llm_demo.

Model Deployment

Copy the kernel image file linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb, the converted model file Phi-3-mini-4k-instruct.rkllm, and the application executable llm_demo to the Mixtile Blade 3 (for example, over the network with scp).

To install the kernel, execute the following command on the Mixtile Blade 3.

$ sudo dpkg -i linux-image-5.10.160-rockchip_5.10.160-21_arm64.deb

To verify the NPU driver version, execute the following command.

$ sudo cat /sys/kernel/debug/rknpu/version

RKNPU driver: v0.9.6

Run Application

First, execute the commands below to set up the RKLLM runtime library on the Mixtile Blade 3.

$ git clone https://github.com/airockchip/rknn-llm.git
$ export LD_LIBRARY_PATH=rknn-llm/rkllm-runtime/runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH

Before running the application, we need to raise the open file descriptor limit with the following command; otherwise, NPU memory allocation will fail.

$ ulimit -n 102400

To run the application, execute the following command.

$ ./llm_demo Phi-3-mini-4k-instruct.rkllm

When the application starts, it will take a few seconds to load the model. Once ready, it will present a user prompt, allowing us to input any questions or instructions.

Demo

Conclusion

This project demonstrates an advanced conversational AI solution running on a low-power edge device. The chatbot is responsive and efficient and does not rely on cloud-based services, although its answers are not always accurate. It also shows the advantages of running sophisticated AI applications locally by making use of the Rockchip NPU’s capabilities.


 

Credits


Naveen Kumar

Bioinformatician, Researcher, Programmer, Maker, Tinkerer, Community contributor Machine Learning Tokyo