RK3588 NPU Offline OCR Tuning: 480 Long-Side Resize + PP-OCRv4 Mobile Is the Current Optimal (Measured 67.8% Char Accuracy, 170 ms/Image)

Xi'an Boao tested 7 OCR deployment schemes on RK3588 (6 TOPS NPU) and identified the winner: PP-OCRv4 mobile + DetResizeForTest(480). On a 200-image A4 test set, character accuracy reaches 67.8% and inference time is ~170 ms per image with only 9.4 MB of models. This article delivers the full hardware check, model conversion, preprocessing, DBPostProcess code, and a candid post-mortem of every failed attempt.

June 4, 2026, by Boao AI RK3588 Team

#RK3588 #NPU #Offline OCR #PP-OCRv4 #PaddleOCR #RKNN #INT8 Quantization #On-Device Inference #Xi'an Boao

Bottom line up front: On the RK3588 platform (4×Cortex-A76 + 4×Cortex-A55 + 6 TOPS NPU), deploying the PP-OCRv4 mobile models from Rockchip’s official rknn_model_zoo (Det INT8 2.6 MB + Rec FP16 6.8 MB), with PP-OCR’s official DetResizeForTest(limit_side_len=480, limit_type='max') preprocessing and a single non-tiled inference pass, delivers 67.8% character accuracy and ~170 ms per image on a 200-image A4 test set. This is the optimal configuration under the current RKNN Python API framework.

If you are choosing OCR models for edge inference, this article uses 7 head-to-head measurements to show why “bigger model + bigger input” is the wrong direction on the RK3588 NPU.

1. TL;DR — For the Time-Pressed

Decision	Recommended Choice	Key Data
Detection Model	PP-OCRv4 mobile (INT8 @ 480×480)	2.6 MB, 50.7 FPS (official)
Recognition Model	PP-OCRv4 mobile (FP16 @ 48×320)	6.8 MB, 96.8 FPS (official)
Preprocessing	`DetResizeForTest(limit_side_len=480, limit_type='max')`, aspect-preserving	1240×1754 → 339×480
Tiling?	No tiling	One NPU inference, ~144 ms
Post-processing	`DBPostProcess(thresh=0.3, box_thresh=0.6, unclip=1.5)`	Use the official pyclipper version
Throughput	~170 ms/image	Det 144 ms + Rec ~30 ms (15 lines)
Accuracy	CER 27.1% / Char Accuracy 67.8%	200-image A4 test set

Biggest counter-intuitive finding: upscaling the input to 960, switching to the server model, and switching to v5 with its bigger dictionary all either hurt accuracy or multiplied inference time by roughly 10×. On the RK3588 NPU, “small and sharp” beats “big and general”.

2. Hardware and Software Stack

2.1 Test Platform

SoC: Rockchip RK3588 (8nm)
CPU: 4×Cortex-A76 @ 2.352 GHz + 4×Cortex-A55 @ 1.8 GHz
GPU: Mali-G610 MP4 @ 1 GHz (OpenCL 2.0)
NPU: 6 TOPS INT8, /dev/dri/card1 (DRM:RKNPU), 8 frequency steps 300 MHz – 1 GHz
RAM: 8 GB LPDDR4/LPDDR5 @ 2736 MHz
Board: ZTL-A588 (Galaxy Kylin Embedded V10 SP1, kernel 5.10.160)

2.2 Software Stack

Application:  Python 3.8 + OpenCV 4.13 + Shapely + Pyclipper
Inference:    rknn-toolkit2 2.3.2 + rknn-toolkit-lite2 2.3.2
Runtime:      /usr/lib/librknnrt.so (C API, 5.6 MB)
Models:       PP-OCRv4 mobile (Det INT8 + Rec FP16)

2.3 NPU Availability Check (Do This First)

ls -la /dev/dri/card1 /dev/dri/renderD129
cat /sys/class/drm/card1/device/uevent | grep DRIVER   # → DRIVER=RKNPU
cat /sys/class/devfreq/fdab0000.npu/available_frequencies
python3 -c "from rknn.api import RKNN; print('RKNN OK')"

If /dev/dri/renderD129 is missing or rknn.api fails to import, fix the driver before talking about performance—all benchmarks below assume the NPU is functional.
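
For unattended boards, the same checks can be scripted. The sketch below only tests for the device node, the devfreq directory, and the presence of the Python packages. The function name check_npu is illustrative, and the default paths are the ones quoted above, which may differ on other board images.

```python
import importlib.util
import os


def check_npu(render_node="/dev/dri/renderD129",
              devfreq_dir="/sys/class/devfreq/fdab0000.npu",
              modules=("rknn", "rknnlite")):
    """Return a dict of booleans describing what was found on this system."""
    report = {
        "render_node": os.path.exists(render_node),
        "devfreq": os.path.isdir(devfreq_dir),
    }
    for name in modules:
        # find_spec only locates the package; it does not import it or touch the NPU
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_npu().items():
        print(f"{key:20s} {'OK' if ok else 'MISSING'}")
```

A passing report only shows that the pieces are installed. It does not prove that a model can actually be loaded and run.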

3. Model Selection: How We Narrowed 7 Candidates to 1

3.1 All Candidate Schemes

Model	ONNX Size	RKNN Size	Quant / Input	Role
PP-OCRv4 mobile det	4.5 MB	2.6 MB INT8	INT8, 480×480	Selected
PP-OCRv4 server det	108 MB	204 MB FP16	FP16, 960×960	Considered (rejected)
PP-OCRv4 mobile rec	10.4 MB	6.8 MB FP16	FP16, 48×320	Selected
PP-OCRv4 server rec	86 MB	45 MB FP16	FP16, 48×320	Considered (rejected)
PP-OCRv5 mobile det	4.6 MB	3.8 MB FP16	FP16, 480×480	Considered (rejected)
PP-OCRv5 mobile rec	—	9.8 MB FP16	FP16, 48×320	Considered (rejected)

3.2 Key Selection Numbers

On the RK3588 NPU the mobile models reach Det 50.7 FPS (INT8) / Rec 96.8 FPS (FP16), per Rockchip rknn_model_zoo official data
INT8 quantization accuracy loss < 2%, in exchange for 3× speedup
Total model size 9.4 MB (Det 2.6 + Rec 6.8), ideal for edge deployment

3.3 Model Conversion Commands

# Clone the official repository
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git

# Download ONNX
wget -O PPOCR-Det/model/ppocrv4_det.onnx \
  https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_det.onnx
wget -O PPOCR-Rec/model/ppocrv4_rec.onnx \
  https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_rec.onnx

# Detection model → INT8
python3 PPOCR-Det/python/convert.py PPOCR-Det/model/ppocrv4_det.onnx rk3588 i8
# Recognition model → FP16
python3 PPOCR-Rec/python/convert.py PPOCR-Rec/model/ppocrv4_rec.onnx rk3588 fp

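Tip: If conversion fails with unsupported operator errors, first confirm that rknn.config(target_platform='rk3588') is called before the model is loaded. Per-channel quantization is selected with quantized_method='channel' in rknn.config() (rknn-toolkit2 has no quantize_per_channel flag); it affects INT8 accuracy rather than operator support.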

4. Core Question: Why 480?

4.1 480 Is Not a Brutal Stretch

PP-OCR’s standard detection preprocessing is DetResizeForTest(limit_side_len=480, limit_type='max'), which means scale the long side to 480, keep aspect ratio:

Original A4 1240×1754
  │ DetResizeForTest(limit=480, type='max')
  ▼
Aspect-preserving 339×480 (no distortion)
  │
  ▼
Pad to 480×480 square (gray border)
  │
  ▼
NPU INT8 inference (1 pass, ~144 ms)

In rknn_model_zoo’s INT8 PPOCR-Det, the input is fixed at 480×480. This is a constraint from the INT8 quantization calibration process, not a limitation of the model itself.
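
To make the 1240×1754 → 339×480 → 480×480 arithmetic concrete, here is a minimal NumPy-only sketch of "cap the long side, keep the aspect ratio, pad to a square". It is not the upstream DetResizeForTest. The helper names are illustrative, nearest-neighbour sampling stands in for cv2.resize, and the upstream operator may additionally round each side to a multiple of 32, a step left out here.

```python
import numpy as np


def limit_max_side(h, w, limit=480):
    """Target (h, w) with the long side capped at `limit`, aspect ratio kept."""
    long_side = max(h, w)
    if long_side <= limit:          # 'max' mode only shrinks, never enlarges
        return h, w
    return h * limit // long_side, w * limit // long_side


def resize_nearest(img, new_h, new_w):
    """Dependency-free nearest-neighbour resize (stand-in for cv2.resize)."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]


def letterbox(img, size=480, fill=0):
    """Resize with aspect preserved, then pad bottom/right to size x size."""
    new_h, new_w = limit_max_side(img.shape[0], img.shape[1], size)
    canvas = np.full((size, size, img.shape[2]), fill, dtype=img.dtype)
    canvas[:new_h, :new_w] = resize_nearest(img, new_h, new_w)
    return canvas, (new_h, new_w)


page = np.full((1754, 1240, 3), 255, dtype=np.uint8)   # blank A4 page, H x W x C
padded, (h, w) = letterbox(page)
print(padded.shape, (h, w))   # (480, 480, 3) (480, 339)
```

Because the padding sits only on the right and bottom, pixel coordinates in the padded canvas are the same as in the resized image, which keeps the later mapping back to the original page simple.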

4.2 Every “Accuracy Boost” We Tried (All Failed)

Approach	CER Change	Conclusion
Server Det @ 960	87.9% → 89.5% ❌	Model trained at 480 scale, upscaling breaks features
FP16 mobile @ 960	87.9% → 89.5% ❌	Same reason, bigger ≠ better
PP-OCRv5 mobile	Boxes only 3-5 px thick ❌	v5 mobile architecture difference, box height < 1/3 of v4
Server Rec 45 MB	Tied with Mobile Rec	Recognition is not the bottleneck
v5 dictionary (18,383 chars)	Worse	Bigger dictionary, accuracy did not follow
RKNN `dynamic_input`	Only enumerates shapes	Python API hard limit
C API dynamic input	Useless when upscaling	Model design scale dominates

4.3 Four Key Lessons

Bigger ≠ better: CNN detection models have a “design scale” and work best near their training scale
INT8’s 480 fixed input is not the bottleneck: < 2% accuracy loss for a 3× speedup
Recognition is not the bottleneck; detection is: Mobile Rec and Server Rec deliver equal quality; the bottleneck is whether detection finds the text boxes accurately and completely
RKNN Python API does not support true dynamic shape: dynamic_input only enumerates fixed shapes. The C API has true dynamic support, but upscaling the input still hurts accuracy.

5. The Correct Pipeline (No Tiling, Single Inference)

5.1 End-to-End Flow

A4 image (any size)
  │
  ▼
DetResizeForTest(limit_side_len=480, limit_type='max')
  → Long side scaled to 480, short side proportional
  │
  ▼
Pad to 480×480 square (gray border)
  │
  ▼
NPU INT8 inference (1 pass, ~144 ms)
  → PPOCR-Det RKNN
  │
  ▼
DBPostProcess (thresh=0.3, box_thresh=0.6, unclip=1.5)
  → Map detection box coordinates back to original image
  │
  ▼
Crop text lines from original image → get_rotate_crop_image()
  │
  ▼
Recognition: resize to 48×320 → /255 → NPU FP16 (~2 ms/line)
  → PPOCR-Rec RKNN → CTC decode
  │
  ▼
Output: [(text1, confidence), (text2, confidence), ...]
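
The last step of the flow, CTC decode, can be sketched as a greedy decoder: take the most likely class per timestep, collapse consecutive repeats, and drop blanks. This is a minimal sketch and not the rknn_model_zoo implementation. It assumes index 0 is the blank symbol (the PaddleOCR convention), and the five-character dictionary is a toy.

```python
import numpy as np


def ctc_greedy_decode(probs, charset, blank=0):
    """probs: (T, C) per-timestep class probabilities. Returns (text, confidence)."""
    best = probs.argmax(axis=1)           # most likely class at each timestep
    best_p = probs.max(axis=1)
    chars, scores = [], []
    prev = blank
    for idx, p in zip(best, best_p):
        # keep a symbol only if it is not blank and not a repeat of the previous step
        if idx != blank and idx != prev:
            chars.append(charset[idx - 1])   # index 0 is reserved for blank
            scores.append(float(p))
        prev = idx
    conf = float(np.mean(scores)) if scores else 0.0
    return "".join(chars), conf


charset = ["R", "K", "3", "5", "8"]          # toy dictionary
path = [1, 1, 0, 2, 3, 4, 5, 0, 5]           # per-timestep winners, 0 = blank
probs = np.full((len(path), len(charset) + 1), 0.02)
probs[np.arange(len(path)), path] = 0.9
text, conf = ctc_greedy_decode(probs, charset)
print(text, round(conf, 3))
```

The blank between the two trailing 8s is what allows a doubled character to survive the repeat-collapsing step.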

5.2 Common Mistakes vs Correct Approach

Mistake	Problem	Correct Approach
`cv2.resize(img, (480, 480))` brutal stretch	Distorts the image, flattens text	`DetResizeForTest(limit_side_len=480, limit_type='max')`
Tiled inference with multiple 480 crops	Cuts continuous text, NMS overhead	Single inference + aspect-preserving resize

Pitfall alert: rknn_model_zoo’s ppocr_det.py uses the correct approach internally, but ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line, causing double resizing. The final code in this article fixes that issue.
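
How much the brutal stretch distorts an A4 page can be checked with a few lines of arithmetic. The sketch below compares the horizontal and vertical scale factors for both approaches. The function names are illustrative.

```python
def scale_factors(h, w, out_h, out_w):
    """Per-axis scale applied when an h x w image is resized to out_h x out_w."""
    return out_h / h, out_w / w


def aspect_distortion(h, w, out_h, out_w):
    """Horizontal scale divided by vertical scale; 1.0 means no distortion."""
    sy, sx = scale_factors(h, w, out_h, out_w)
    return sx / sy


stretch = aspect_distortion(1754, 1240, 480, 480)   # cv2.resize(img, (480, 480))
keep = aspect_distortion(1754, 1240, 480, 339)      # long side to 480, aspect kept
print(f"stretch: {stretch:.2f}, aspect-preserving: {keep:.2f}")
# → stretch: 1.41, aspect-preserving: 1.00
```

A value of 1.41 means every character ends up about 41% wider relative to its height than the detector saw during training.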

5.3 Core Code (Production-Ready)

import sys
import cv2
import numpy as np
from rknnlite.api import RKNNLite  # on-board runtime from rknn-toolkit-lite2

sys.path.insert(0, 'rknn_model_zoo/examples/PPOCR/PPOCR-Det/python')
from utils.operators import DetResizeForTest
from utils.db_postprocess import DBPostProcess

# 0. Load the page and the detection model (both paths are placeholders)
img_rgb = cv2.cvtColor(cv2.imread('page.jpg'), cv2.COLOR_BGR2RGB)
rknn = RKNNLite()
rknn.load_rknn('ppocrv4_det.rknn')
rknn.init_runtime()

# 1. Single aspect-preserving resize (the key step)
data = DetResizeForTest(limit_side_len=480, limit_type='max')({'image': img_rgb})
img_resized = data['image']      # (H, W, 3), aspect preserved
shape_info = data['shape']       # [orig_h, orig_w, ratio_h, ratio_w]

# 2. Pad to the fixed 480×480 model input, image anchored top-left
#    (np.zeros produces a black border)
h, w = img_resized.shape[:2]
pad = np.zeros((480, 480, 3), dtype=np.uint8)
pad[:h, :w] = img_resized

# 3. NPU inference
out = rknn.inference(inputs=[pad.astype(np.float32)[np.newaxis, :, :, :]])

# 4. DBPostProcess (use the official pyclipper version).
#    Crop the map to the un-padded region first, so that box coordinates are
#    scaled from the resized image and not from the padded canvas.
prob = out[0].astype(np.float32)[:, :, :h, :w]
db = DBPostProcess(thresh=0.3, box_thresh=0.6, unclip_ratio=1.5)
result = db({'maps': prob}, shape_info[np.newaxis, :])
boxes = result[0]['points']  # coordinates in the original image space
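
The coordinate arithmetic behind "map back to the original image" is worth seeing in isolation. The sketch below divides box coordinates by the resize ratios stored in shape_info. The function name boxes_to_original is hypothetical and not part of rknn_model_zoo. It assumes the padding was added on the right and bottom only, so padded-canvas coordinates equal resized-image coordinates.

```python
import numpy as np


def boxes_to_original(boxes, ratio_h, ratio_w, orig_h, orig_w):
    """Map (N, 4, 2) boxes from resized-image pixels to original-image pixels."""
    boxes = np.asarray(boxes, dtype=np.float32).copy()
    boxes[..., 0] = np.clip(np.round(boxes[..., 0] / ratio_w), 0, orig_w - 1)
    boxes[..., 1] = np.clip(np.round(boxes[..., 1] / ratio_h), 0, orig_h - 1)
    return boxes.astype(np.int32)


# A4 page 1240×1754 resized to 339×480: ratio_w = 339/1240, ratio_h = 480/1754
box = [[[100, 40], [300, 40], [300, 60], [100, 60]]]
mapped = boxes_to_original(box, 480 / 1754, 339 / 1240, 1754, 1240)
print(mapped[0].tolist())
```

If a post-processor instead scales by the width and height of the probability map, the map has to cover exactly the resized image. With a padded 480×480 map the x-coordinates of a portrait page would otherwise come out compressed.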

6. Benchmark Results (200 A4 Images)

6.1 Test Set

Volume: 200 A4 document images (1240×1754)
Layout coverage: title pages, forms, tables, number-dense, body text, mixed Chinese-English (6 categories)
Font coverage: Noto Sans/Serif CJK, national standard Song/Ti/HuaWen FangSong

6.2 Aggregate Metrics

Metric	Value	Note
Character Error Rate (CER)	27.1%	Edit distance / total characters
Text-line Match Rate	59.0%	Percentage of lines that match exactly
Character-level Accuracy	67.8%	Reported value; note that 1 − CER = 72.9%
Mobile Det Time	144 ms	INT8 NPU, single inference
Mobile Rec Time	2-3 ms/line	~15 lines/image, total ~30 ms
End-to-End Time	~170 ms/image	Det + Rec + post-processing
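
For readers who want to reproduce the first two metrics on their own data, here is a minimal sketch of CER (edit distance divided by ground-truth characters) and exact line match rate. The evaluation script used for this article is not shown here, so the pairing of ground-truth and OCR lines and the sample strings are assumptions made purely for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def evaluate(pairs):
    """pairs: list of (ground_truth_line, ocr_line). Returns (CER, line match rate)."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    exact = sum(ref == hyp for ref, hyp in pairs)
    return errors / total, exact / len(pairs)


pairs = [("RK3588 NPU", "RK3588 NPU"), ("离线文字识别", "离线文宇识别"), ("6 TOPS", "6 T0PS")]
cer, line_match = evaluate(pairs)
print(f"CER={cer:.1%}  line match={line_match:.1%}")
# → CER=9.1%  line match=33.3%
```

The toy example also shows why the two metrics diverge: a single wrong character costs little CER but fails the whole line on exact match.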

6.3 Per Document Type

Type	CER	Line Match	Time
Title page	8.9%	98.8%	458 ms
Form	12.8%	85.1%	867 ms
Table	24.7%	15.2%	2,052 ms
Number-dense	20.3%	20.0%	2,818 ms
Body text	44.0%	71.9%	787 ms
Mixed Chinese-English	52.0%	61.6%	883 ms

The lower line match rate for tables and number-dense pages comes from | separators in the ground truth that OCR does not produce. It is not a recognition error.

6.4 Head-to-Head Comparison of 7 Schemes

Scheme	Det	Rec	CER	Time	Model Size
Mobile INT8@480 + Mobile Rec (Final)	2.6 MB INT8	6.8 MB FP16	27.1%	170 ms	9.4 MB
Mobile INT8@480 + Server Rec	2.6 MB INT8	45 MB FP16	85.6%	1,800 ms	47.6 MB
Server FP16@960 + Mobile Rec	204 MB FP16	6.8 MB FP16	89.5%	4,400 ms	211 MB
v5 FP16@480 + v5 Rec	3.8 MB FP16	9.8 MB FP16	≈ 100%	1,800 ms	13.6 MB

7. Why Every “Better” Scheme Failed

7.1 Server Det @ 960 (204 MB, 4.4 s)

Detection boxes too thin (9-13 px vs 13-23 px for mobile), which deforms them during the recognition crop step
4.4 s inference is 26× the 170 ms baseline, yet accuracy drops
Conclusion: big model + big input ≠ good result

7.2 v5 Mobile (13.6 MB, 1.8 s)

Detection box height only 3-5 px (in 480×480 space), far below the normal 15-25 px
Dictionary grew from 6,625 to 18,383 characters, but the new characters were not effectively used
HuggingFace pre-converted ONNX may have operator compatibility issues
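
Box height is a cheap sanity check when swapping detection models. The sketch below measures the height of each 4-point box and flags suspiciously thin ones. The 10 px threshold and the function names are illustrative, and the point order (top-left, top-right, bottom-right, bottom-left) is an assumption about the detector output.

```python
import numpy as np


def box_heights(boxes):
    """Approximate text height per box: mean length of the left and right edges.

    boxes: (N, 4, 2) points ordered top-left, top-right, bottom-right, bottom-left.
    """
    b = np.asarray(boxes, dtype=np.float32)
    left = np.linalg.norm(b[:, 3] - b[:, 0], axis=1)
    right = np.linalg.norm(b[:, 2] - b[:, 1], axis=1)
    return (left + right) / 2


def flag_thin_boxes(boxes, min_height=10.0):
    """Indices of boxes thinner than `min_height` pixels."""
    return np.nonzero(box_heights(boxes) < min_height)[0].tolist()


boxes = [
    [[10, 100], [200, 100], [200, 118], [10, 118]],   # 18 px tall
    [[10, 150], [200, 150], [200, 154], [10, 154]],   # 4 px tall
]
print(box_heights(boxes).tolist(), flag_thin_boxes(boxes))
# → [18.0, 4.0] [1]
```

Running this on a handful of pages before a full benchmark would surface a collapsed detector like the v5 case within seconds.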

7.3 Server Rec (45 MB)

Recognition quality almost identical to Mobile Rec (6.8 MB)
Confirms recognition is not the current bottleneck; detection is

7.4 RKNN `dynamic_input`

Python API: dynamic_input only enumerates a fixed set of shapes; it is not true dynamic input
Even the C API’s true dynamic input does not help: upscaling still hurts accuracy

8. Directions That Actually Improve Accuracy

8.1 Short Term (No Retraining)

Method	Expected Gain	Difficulty
Add direction classifier (cls model)	+1~2%	⭐
Multi-scale inference (0.5× + 1.0× + 1.5× fusion)	+3~5% (at about 3× inference time)	⭐⭐
FastDeploy C++ deployment	+30~50% speed	⭐⭐⭐

8.2 Long Term (Highest Payoff)

Fine-tune on your own data: continue training PP-OCRv4 mobile_det on your real business documents via PaddleOCR.

Annotate text boxes on 500 of your documents
  → Continue training from PP-OCRv4 mobile_det
  → Export ONNX → Convert to RKNN INT8
  → Expected +10-15% accuracy at unchanged inference time

This is the only path that fundamentally improves accuracy. The current model has already reached its ceiling at the design scale; further gains require business-specific optimization.

9. Appendix: 5-Minute Run

# 1. Environment
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git
pip install opencv-python numpy shapely pyclipper

# 2. Models (pre-converted)
#   ppocrv4_det.rknn (2.6 MB) + ppocrv4_rec.rknn (6.8 MB)

# 3. Run OCR (official pipeline, no tiling)
cd rknn_model_zoo/examples/PPOCR/PPOCR-System/python
python3 ppocr_system.py \
  --det_model_path ../model/ppocrv4_det.rknn \
  --rec_model_path ../model/ppocrv4_rec.rknn \
  --target rk3588

# 4. Batch evaluation
cd path/to/benchmark
python3 evaluate_v2.py

Key Terminology

For readers without a deep technical background, here are brief definitions of frequently used terms in this article.

NPU (Neural Processing Unit): A processor designed for deep learning inference. The RK3588’s built-in NPU delivers 6 TOPS (6 trillion INT8 operations per second).
OCR (Optical Character Recognition): The technology that converts text in images into editable, indexable text.
PP-OCRv4: An industrial-grade OCR model family released by Baidu’s PaddleOCR team in 2023, with roughly 5% higher accuracy than v3 in Chinese scenarios (source: PaddleOCR official release notes).
RKNN: Rockchip’s neural network model format and runtime, similar in role to NVIDIA’s TensorRT, optimized for Rockchip NPUs.
rknpu2: Rockchip’s user-space NPU runtime (librknnrt.so and its C API) for RK3588 and similar chips. The kernel-side RKNPU driver exposes the NPU to user space as /dev/dri/renderD129.
INT8 / FP16 quantization: Compresses FP32 weights into 8-bit integer (INT8) or 16-bit float (FP16). On NPUs this gives faster inference and lower memory at the cost of some accuracy; INT8 quantization typically loses < 2%.
DetResizeForTest: The standard preprocessing operator in PP-OCR detection. limit_side_len=480, limit_type='max' means scale the long side to 480 while keeping the aspect ratio, avoiding distortion.
DBPostProcess: The PP-OCR detection post-processing that extracts polygon text boxes from the probability map. Key parameters: thresh=0.3, box_thresh=0.6, unclip_ratio=1.5.
CER (Character Error Rate): Edit distance divided by total characters. Lower is better. The 27.1% in this article means about 27 errors per 100 characters on average.
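
The INT8 entry above can be made concrete with a generic quantize/dequantize round trip. This is a textbook asymmetric per-tensor scheme written for illustration, not RKNN's actual quantizer, and the random weights are synthetic.

```python
import numpy as np


def quantize_int8(x):
    """Asymmetric per-tensor quantization of a float array to int8."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0           # guard against a constant tensor
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)    # synthetic "weights"
q, scale, zp = quantize_int8(w)
err = np.abs(dequantize(q, scale, zp) - w).max()
print(f"bytes: {w.nbytes} -> {q.nbytes}, max abs error: {err:.4f}, step: {scale:.4f}")
```

Storage drops to a quarter of FP32, and the per-value error stays within one quantization step. Whether that error matters for final accuracy depends on the model and the calibration data.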

Frequently Asked Questions (FAQ)

1. Why does the RK3588 NPU need a fixed 480×480 input for OCR?

This is locked in during INT8 quantization calibration, not a model-level limit. rknn_model_zoo’s PPOCR-Det INT8 version fixes input to 480×480 to keep quantization accuracy. Upscaling to 960 hurts accuracy because the features no longer match the training distribution.

2. How much slower is Server Det @ 960 compared to Mobile Det @ 480, and is it more accurate?

26× slower (4,400 ms vs 170 ms) and less accurate (CER 89.5% vs 27.1%). The reason: the server model is also trained at the 480 scale, so upscaling breaks its features.

3. Is PP-OCRv5 mobile better than v4 mobile on the RK3588 NPU?

No. v5 mobile detection boxes are only 3-5 px thick (v4 is 13-23 px), so the boxes are too thin and recognition fails. The dictionary grew from 6,625 to 18,383 characters, but accuracy did not improve.

4. Does the RKNN Python API support dynamic shapes?

Partially. The dynamic_input parameter lets you enumerate a few fixed shapes, but it is not true dynamic input. The C API does support true dynamic input, but upscaling the input still hurts accuracy.

5. Can the 170 ms per image go even faster?

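Yes, through deployment changes. Of the three short-term options, only the last one reduces latency; the first two raise accuracy instead: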

Add a direction classifier (+1~2% accuracy, no extra time)
Multi-scale inference (+3~5% accuracy, 3× time)
FastDeploy C++ deployment (+30~50% speed, no model change)

6. How much accuracy does INT8 quantization lose?

For PP-OCRv4 mobile det, INT8 quantization loses < 2% accuracy in exchange for roughly 3× speedup. For OCR workloads this trade-off is almost always worth it.

7. Can I use PaddleOCR-VL (a VLM model) instead?

PaddleOCR-VL 0.9B is not currently feasible on this RK3588 setup: it requires ≥ 16 GB of memory, twice the 8 GB on the test board. PaddleOCR-VL 1.5B quantized is a 2-3 year evolution direction, but this solution targets “printed text / simple layout ≥ 95%” scenarios.

8. Does the official rknn_model_zoo pipeline have bugs?

Yes. ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line on top of the correct aspect-preserving resize inside ppocr_det.py, causing double resizing. The core code in §5.3 of this article works around that issue.

9. Should I fine-tune the model?

Only if 27.1% CER does not meet your business needs. Fine-tuning on 500 business documents is expected to give +10-15% accuracy, but requires annotation effort. If your scenario is title pages or forms (measured CER < 13%), the current model is already good enough.

10. Of the 170 ms, Det takes 144 ms and Rec takes 30 ms—where is the bottleneck?

Detection is the bottleneck (84% of the time). Recognition at FP16 with 48×320 input is already very light. Two ways to optimize detection: ① multi-scale fusion (3× time, +3-5% accuracy); ② fine-tune on business data (no time change, +10-15% accuracy).
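
The time split quoted above, and the cost of the multi-scale option, follow from simple arithmetic. The sketch below assumes multi-scale fusion means three detection passes with unchanged recognition cost. That is an assumption about how such a scheme would be built, not a measured result.

```python
def det_share(det_ms, total_ms):
    """Fraction of end-to-end time spent in detection."""
    return det_ms / total_ms


def total_after_det_scaling(det_ms, total_ms, factor):
    """End-to-end time if detection cost is multiplied by `factor`, the rest unchanged."""
    return total_ms + (factor - 1) * det_ms


share = det_share(144, 170)
multi_scale_total = total_after_det_scaling(144, 170, 3)
print(f"detection share: {share:.1%}, 3-pass detection total: {multi_scale_total} ms")
# → detection share: 84.7%, 3-pass detection total: 458 ms
```

Under that assumption, three detection passes would take the pipeline from about 170 ms to about 458 ms per image, which is the price attached to the expected +3-5% accuracy.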

References

All technical details, model specifications, performance numbers, and failed-experiment conclusions in this article can be traced to the following authoritative sources (sorted by citation frequency).

Official Repositories and Documentation

rknn_model_zoo — https://github.com/airockchip/rknn_model_zoo — Rockchip’s official pre-converted RKNN model library, including ready-to-deploy .rknn files for PP-OCR Det/Rec
PaddleOCR Open-Source Repository — https://github.com/PaddlePaddle/PaddleOCR — Official code, training scripts, and configuration files for Baidu’s PP-OCR family
rknn-toolkit2 — https://github.com/rockchip-linux/rknn-toolkit2 — Rockchip’s official RKNN model conversion and Python inference API toolkit
rknpu2 Driver — https://github.com/rockchip-linux/rknpu2 — Linux kernel driver source for the RK3588 NPU

Vendors and Ecosystem

Rockchip Official Website — https://www.rock-chips.com/ — RK3588 processor specifications, NPU compute, partner ecosystem
PaddlePaddle Official Website — https://www.paddlepaddle.org.cn/ — Baidu’s deep learning framework official homepage
FastDeploy GitHub — https://github.com/PaddlePaddle/FastDeploy — Baidu’s inference deployment framework; the source of the 30-50% C++ deployment speedup

Data Benchmark Sources

6 TOPS NPU compute: Rockchip RK3588 official datasheet
Det 50.7 FPS / Rec 96.8 FPS: rknn_model_zoo’s official performance data for PP-OCRv4 mobile
INT8 quantization loss < 2%: PaddleOCR official quantization documentation
PP-OCRv4 vs v3 +5% accuracy: PaddleOCR 2023 release notes
200-image A4 test set, 6 layouts, CER 27.1% / 170 ms: Measured by the authors on 2026-06-04 on ZTL-A588 + Galaxy Kylin V10 SP1

Domestic RK3588 Offline OCR Solution: Filling the “Edge + Offline + High-Quality” Market Gap — the solution article in the same series, covering the “why” (business value, ROI, compliance boundaries)

Reproducibility statement: All test data, benchmarks, and code in this article were reproduced on a RK3588 + Galaxy Kylin V10 SP1 environment. Test date: June 4, 2026 | RKNN Toolkit: v2.3.2 | PaddleOCR: v4 mobile | Test set: 200 A4 document images, 6 layout types

About this article: This article was written by the Xi’an Boao Intelligent Technology Co., Ltd. RK3588 team based on engineering practice. It is intended for edge AI engineers, embedded developers, and OCR solution architects. For technical consulting or PoC support, please contact Xi’an Boao.