RK3588 NPU Offline OCR Tuning: 480 Long-Side Resize + PP-OCRv4 Mobile Is the Current Optimal (Measured 67.8% Char Accuracy, 170 ms/Image)
Xi'an Boao tested 7 OCR deployment schemes on RK3588 (6 TOPS NPU) and identified the winner: PP-OCRv4 mobile + DetResizeForTest(480). On a 200-image A4 test set, character accuracy reaches 67.8% and inference time is ~170 ms per image with only 9.4 MB of models. This article delivers the full hardware check, model conversion, preprocessing, DBPostProcess code, and a candid post-mortem of every failed attempt.
Bottom line up front: On the RK3588 platform (4×Cortex-A76 + 4×Cortex-A55 + 6 TOPS NPU), deploying the PP-OCRv4 mobile models from Rockchip’s official rknn_model_zoo (Det INT8 2.6 MB + Rec FP16 6.8 MB), with PP-OCR’s official DetResizeForTest(limit_side_len=480, limit_type='max') preprocessing and a single non-tiled inference pass, delivers 67.8% character accuracy and ~170 ms per image on a 200-image A4 test set. This is the optimal configuration under the current RKNN Python API framework.
If you are choosing OCR models for edge inference, this article uses 7 head-to-head measurements to show why “bigger model + bigger input” is the wrong direction on the RK3588 NPU.
1. TL;DR — For the Time-Pressed
| Decision | Recommended Choice | Key Data |
|---|---|---|
| Detection Model | PP-OCRv4 mobile (INT8 @ 480×480) | 2.6 MB, 50.7 FPS (official) |
| Recognition Model | PP-OCRv4 mobile (FP16 @ 48×320) | 6.8 MB, 96.8 FPS (official) |
| Preprocessing | DetResizeForTest(limit=480, type='max') aspect-preserving | 1240×1754 → 339×480 |
| Tiling? | No tiling | One NPU inference, ~144 ms |
| Post-processing | DBPostProcess(thresh=0.3, box_thresh=0.6, unclip=1.5) | Use the official pyclipper version |
| Throughput | ~170 ms/image | Det 144 ms + Rec ~30 ms (15 lines) |
| Accuracy | CER 27.1% / Char Accuracy 67.8% | 200-image A4 test set |
Biggest counter-intuitive finding: upscaling input (to @960), switching to the server model, switching to v5’s bigger dictionary—all of them hurt accuracy or multiply inference time by 10×. On the RK3588 NPU, “small and sharp” beats “big and general”.
2. Hardware and Software Stack
2.1 Test Platform
SoC: Rockchip RK3588 (8nm)
CPU: 4×Cortex-A76 @ 2.352 GHz + 4×Cortex-A55 @ 1.8 GHz
GPU: Mali-G610 MP4 @ 1 GHz (OpenCL 2.0)
NPU: 6 TOPS INT8, /dev/dri/card1 (DRM:RKNPU), 8 frequency steps 300 MHz – 1 GHz
RAM: 8 GB LPDDR4/LPDDR5 @ 2736 MHz
Board: ZTL-A588 (Galaxy Kylin Embedded V10 SP1, kernel 5.10.160)
2.2 Software Stack
Application: Python 3.8 + OpenCV 4.13 + Shapely + Pyclipper
Inference: rknn-toolkit2 2.3.2 + rknn-toolkit-lite2 2.3.2
Runtime: /usr/lib/librknnrt.so (C API, 5.6 MB)
Models: PP-OCRv4 mobile (Det INT8 + Rec FP16)
2.3 NPU Availability Check (Do This First)
ls -la /dev/dri/card1 /dev/dri/renderD129
cat /sys/class/drm/card1/device/uevent | grep DRIVER # → DRIVER=RKNPU
cat /sys/class/devfreq/fdab0000.npu/available_frequencies
python3 -c "from rknn.api import RKNN; print('RKNN OK')"
If /dev/dri/renderD129 is missing or rknn.api fails to import, fix the driver before talking about performance—all benchmarks below assume the NPU is functional.
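The same checks can be scripted so that a benchmark run aborts early on a board without a working NPU. The sketch below is only an illustration: the function name check_npu is hypothetical, the paths are the ones from the commands above and may differ on other boards, and the toolkit check only confirms that a package named rknn can be located, not that the NPU runtime actually loads.

```python
import importlib.util
import os

# Paths taken from the shell checks above; other boards may use different nodes.
NPU_NODES = ("/dev/dri/card1", "/dev/dri/renderD129")
UEVENT_PATH = "/sys/class/drm/card1/device/uevent"


def check_npu():
    """Return a dict of boolean checks mirroring the shell commands."""
    nodes_ok = all(os.path.exists(p) for p in NPU_NODES)

    driver_ok = False
    try:
        with open(UEVENT_PATH) as f:
            driver_ok = any(line.strip() == "DRIVER=RKNPU" for line in f)
    except OSError:
        pass  # sysfs entry missing: treat as "driver not bound"

    # find_spec only locates the package; it does not initialize the runtime.
    try:
        toolkit_ok = importlib.util.find_spec("rknn") is not None
    except (ImportError, ValueError):
        toolkit_ok = False

    return {"device_nodes": nodes_ok, "driver": driver_ok, "toolkit": toolkit_ok}


if __name__ == "__main__":
    for name, ok in check_npu().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any MISSING entry points back to the driver or toolkit installation rather than to the OCR pipeline.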
3. Model Selection: How We Narrowed 7 Candidates to 1
3.1 All Candidate Schemes
| Model | ONNX Size | RKNN Size | Quant / Input | Role |
|---|---|---|---|---|
| PP-OCRv4 mobile det | 4.5 MB | 2.6 MB INT8 | INT8, 480×480 | Selected |
| PP-OCRv4 server det | 108 MB | 204 MB FP16 | FP16, 960×960 | Considered (rejected) |
| PP-OCRv4 mobile rec | 10.4 MB | 6.8 MB FP16 | FP16, 48×320 | Selected |
| PP-OCRv4 server rec | 86 MB | 45 MB FP16 | FP16, 48×320 | Considered (rejected) |
| PP-OCRv5 mobile det | 4.6 MB | 3.8 MB FP16 | FP16, 480×480 | Considered (rejected) |
| PP-OCRv5 mobile rec | — | 9.8 MB FP16 | FP16, 48×320 | Considered (rejected) |
3.2 Key Selection Numbers
- Mobile INT8 on the RK3588 NPU reaches Det 50.7 FPS / Rec 96.8 FPS (Rockchip rknn_model_zoo official data)
- INT8 quantization accuracy loss < 2%, in exchange for 3× speedup
- Total model size 9.4 MB (Det 2.6 + Rec 6.8), ideal for edge deployment
3.3 Model Conversion Commands
# Clone the official repository
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git
# Download ONNX
wget -O PPOCR-Det/model/ppocrv4_det.onnx \
https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_det.onnx
wget -O PPOCR-Rec/model/ppocrv4_rec.onnx \
https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_rec.onnx
# Detection model → INT8
python3 PPOCR-Det/python/convert.py PPOCR-Det/model/ppocrv4_det.onnx rk3588 i8
# Recognition model → FP16
python3 PPOCR-Rec/python/convert.py PPOCR-Rec/model/ppocrv4_rec.onnx rk3588 fp
Tip: If conversion fails with unsupported operator errors, set rknn.config(target_platform='rk3588') and enable quantize_per_channel=True.
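For readers who want to see roughly what such a conversion does, here is a minimal sketch built on the rknn-toolkit2 Python API (config, load_onnx, build, export_rknn, release). It is not the model zoo's convert.py: the function name convert_to_rknn is hypothetical, and the normalization values and the calibration list file are left as caller-supplied parameters because the correct values should be taken from the official script.

```python
def convert_to_rknn(onnx_path, rknn_path, mean_values, std_values,
                    quantize=False, dataset=None, platform="rk3588"):
    """Convert an ONNX model to RKNN and return the output path.

    mean_values / std_values: per-channel normalization, e.g. [[m_r, m_g, m_b]].
    dataset: text file listing calibration images; required only for INT8.
    """
    if quantize and dataset is None:
        raise ValueError("INT8 quantization needs a calibration dataset file")

    # Imported lazily so this file can be read on machines without the toolkit.
    from rknn.api import RKNN

    rknn = RKNN(verbose=False)
    try:
        rknn.config(mean_values=mean_values, std_values=std_values,
                    target_platform=platform)
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError(f"load_onnx failed for {onnx_path}")
        if rknn.build(do_quantization=quantize, dataset=dataset) != 0:
            raise RuntimeError("build failed")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError(f"export_rknn failed for {rknn_path}")
    finally:
        rknn.release()
    return rknn_path
```

In this sketch the detection model would be converted with quantize=True plus a calibration list, and the recognition model with quantize=False, matching the i8 and fp arguments in the commands above.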
4. Core Question: Why 480?
4.1 480 Is Not a Brutal Stretch
PP-OCR’s standard detection preprocessing is DetResizeForTest(limit_side_len=480, limit_type='max'), which means scale the long side to 480, keep aspect ratio:
Original A4 1240×1754
│ DetResizeForTest(limit=480, type='max')
▼
Aspect-preserving 339×480 (no distortion)
│
▼
Pad to 480×480 square (gray border)
│
▼
NPU INT8 inference (1 pass, ~144 ms)
In rknn_model_zoo’s INT8 PPOCR-Det, the input is fixed at 480×480. This is a constraint from the INT8 quantization calibration process, not a limitation of the model itself.
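The arithmetic behind the 1240×1754 → 339×480 step can be checked with a few lines of Python. This is a simplified sketch of the limit_type='max' idea, not the operator itself: the function names are hypothetical, and the upstream implementation may additionally snap each side to a multiple of 32, which is left out here.

```python
def resize_dims_max(orig_h, orig_w, limit_side_len=480):
    """Target (h, w) when the long side is capped at limit_side_len.

    One common ratio is applied to both sides, and the image is only shrunk,
    never enlarged.
    """
    long_side = max(orig_h, orig_w)
    ratio = limit_side_len / long_side if long_side > limit_side_len else 1.0
    return round(orig_h * ratio), round(orig_w * ratio)


def pad_amounts(h, w, size=480):
    """Rows and columns of padding needed to reach a size x size square."""
    return size - h, size - w


if __name__ == "__main__":
    h, w = resize_dims_max(1754, 1240)  # A4 page from the test set (height, width)
    print(h, w)                         # 480 339
    print(pad_amounts(h, w))            # (0, 141)
```

A direct stretch to 480×480 would instead scale the width by about 0.39 and the height by about 0.27, which is the distortion the aspect-preserving resize avoids.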
4.2 Every “Accuracy Boost” We Tried (All Failed)
| Approach | CER Change | Conclusion |
|---|---|---|
| Server Det @ 960 | 87.9% → 89.5% ❌ | Model trained at 480 scale, upscaling breaks features |
| FP16 mobile @ 960 | 87.9% → 89.5% ❌ | Same reason, bigger ≠ better |
| PP-OCRv5 mobile | Boxes only 3-5 px thick ❌ | v5 mobile architecture difference, box height < 1/3 of v4 |
| Server Rec 45 MB | Tied with Mobile Rec | Recognition is not the bottleneck |
| v5 dictionary (18,383 chars) | Worse | Bigger dictionary, accuracy did not follow |
| RKNN dynamic_input | Only enumerates shapes | Python API hard limit |
| C API dynamic input | Useless when upscaling | Model design scale dominates |
4.3 Four Key Lessons
- Bigger ≠ better: CNN detection models have a “design scale” and work best near their training scale
- INT8’s 480 fixed input is not the bottleneck: < 2% accuracy loss for a 3× speedup
- Recognition is not the bottleneck; detection is: Mobile Rec and Server Rec deliver equal quality; the bottleneck is whether detection finds the text boxes accurately and completely
- RKNN Python API does not support true dynamic shape: dynamic_input only enumerates fixed shapes. The C API has true dynamic support, but upscaling the input still hurts accuracy.
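To make "enumerates fixed shapes" concrete: with an enumerated set, the application has to choose one of the registered shapes for each image and pad up to it. The sketch below illustrates that selection step only. The shape list and the function name are hypothetical and are not taken from the RKNN API.

```python
# Hypothetical (height, width) shapes registered at conversion time.
ENUMERATED_SHAPES = [(480, 480), (640, 640), (960, 960)]


def pick_shape(h, w, shapes=ENUMERATED_SHAPES):
    """Smallest registered shape that contains an h x w image, else the largest."""
    fitting = [s for s in shapes if s[0] >= h and s[1] >= w]
    if fitting:
        return min(fitting, key=lambda s: s[0] * s[1])
    # Nothing fits: the caller must downscale to the largest registered shape.
    return max(shapes, key=lambda s: s[0] * s[1])
```

With the 480 long-side resize described above, every page lands in the (480, 480) bucket, so the enumeration brings no benefit for this pipeline.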
5. The Correct Pipeline (No Tiling, Single Inference)
5.1 End-to-End Flow
A4 image (any size)
│
▼
DetResizeForTest(limit_side_len=480, limit_type='max')
→ Long side scaled to 480, short side proportional
│
▼
Pad to 480×480 square (gray border)
│
▼
NPU INT8 inference (1 pass, ~144 ms)
→ PPOCR-Det RKNN
│
▼
DBPostProcess (thresh=0.3, box_thresh=0.6, unclip=1.5)
→ Map detection box coordinates back to original image
│
▼
Crop text lines from original image → get_rotate_crop_image()
│
▼
Recognition: resize to 48×320 → /255 → NPU FP16 (~2 ms/line)
→ PPOCR-Rec RKNN → CTC decode
│
▼
Output: [(text1, confidence), (text2, confidence), ...]
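The final "CTC decode" step in the flow can be sketched in plain Python. This is a generic greedy CTC decoder, not the model zoo's implementation: it assumes the blank class sits at index 0 and that class i maps to charset[i - 1], and it reports the mean probability of the kept characters as the confidence.

```python
BLANK = 0  # assumed blank index; class i (i >= 1) maps to charset[i - 1]


def ctc_greedy_decode(probs, charset):
    """Greedy CTC decode.

    probs:   list of T rows, each a list of class probabilities.
    charset: characters for class indices 1..len(charset).
    Returns (text, mean confidence of the kept characters).
    """
    chars, confs = [], []
    prev = BLANK
    for row in probs:
        idx = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        # Keep a class only if it is not blank and not a repeat of the previous
        # timestep; a blank in between allows a genuinely doubled character.
        if idx != BLANK and idx != prev:
            chars.append(charset[idx - 1])
            confs.append(row[idx])
        prev = idx
    text = "".join(chars)
    return text, (sum(confs) / len(confs) if confs else 0.0)
```

For example, the per-timestep classes a, a, blank, a, b decode to "aab": the first repeat is collapsed, while the blank keeps the second "a" separate.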
5.2 Common Mistakes vs Correct Approach
| Mistake | Problem | Correct Approach |
|---|---|---|
| cv2.resize(img, (480, 480)) brutal stretch | Distorts the image, flattens text | DetResizeForTest(limit=480, type='max') |
| Tiled inference with multiple 480 crops | Cuts continuous text, NMS overhead | Single inference + aspect-preserving resize |
Pitfall alert: rknn_model_zoo’s ppocr_det.py uses the correct approach internally, but ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line, causing double resizing. The final code in this article fixes that issue.
5.3 Core Code (Production-Ready)
import sys
import numpy as np
sys.path.insert(0, 'rknn_model_zoo/examples/PPOCR/PPOCR-Det/python')
from utils.operators import DetResizeForTest
from utils.db_postprocess import DBPostProcess
# Assumed to exist already: img_rgb (H x W x 3 uint8 RGB page image) and
# rknn (a runtime object with the detection model loaded and initialized).
# 1. Single aspect-preserving resize (the key step)
data = DetResizeForTest(limit_side_len=480, limit_type='max')({'image': img_rgb})
img_resized = data['image'] # (H, W, 3), aspect preserved
shape_info = data['shape'] # [orig_h, orig_w, ratio_h, ratio_w]
# 2. Pad to square (the resized image stays in the top-left corner)
h, w = img_resized.shape[:2]
sz = max(h, w)
pad = np.zeros((sz, sz, 3), dtype=np.uint8)
pad[:h, :w] = img_resized
# 3. NPU inference
out = rknn.inference(inputs=[pad.astype(np.float32)[np.newaxis, :, :, :]])
# 4. DBPostProcess (use the official pyclipper version)
# Crop the probability map (assumed N x 1 x H x W) back to the unpadded region:
# DBPostProcess rescales boxes by the map size, so padding would skew coordinates.
prob = out[0].astype(np.float32)[:, :, :h, :w]
db = DBPostProcess(thresh=0.3, box_thresh=0.6, unclip_ratio=1.5)
result = db({'maps': prob}, shape_info[np.newaxis, :])
boxes = result[0]['points'] # coordinates already in the original image space
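The unclip_ratio=1.5 parameter controls how far each shrunk text region is grown back before it becomes a box. In the DB post-processing as implemented in PaddleOCR, the offset distance is area × ratio / perimeter, and pyclipper performs the actual polygon offset. The sketch below only works out that distance for a given polygon; the function names are hypothetical.

```python
def polygon_area_perimeter(points):
    """Shoelace area and perimeter of a simple polygon given as (x, y) pairs."""
    area2, perimeter = 0.0, 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area2 += x1 * y2 - x2 * y1
        perimeter += ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    return abs(area2) / 2.0, perimeter


def unclip_distance(points, unclip_ratio=1.5):
    """Offset distance used to grow a shrunk text region: A * r / L."""
    area, perimeter = polygon_area_perimeter(points)
    return area * unclip_ratio / perimeter
```

For a 100×20 px region the distance is 2000 × 1.5 / 240 = 12.5 px, so thin detections gain proportionally little height, which is relevant to the thin-box failures discussed later in the article.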
6. Benchmark Results (200 A4 Images)
6.1 Test Set
- Volume: 200 A4 document images (1240×1754)
- Layout coverage: title pages, forms, tables, number-dense, body text, mixed Chinese-English (6 categories)
- Font coverage: Noto Sans/Serif CJK, national standard Song/Ti/HuaWen FangSong
6.2 Aggregate Metrics
| Metric | Value | Note |
|---|---|---|
| Character Error Rate (CER) | 27.1% | Edit distance / total characters |
| Text-line Match Rate | 59.0% | Percentage of lines that match exactly |
| Character-level Accuracy | 67.8% | Reported figure; note that 1 − CER would give 72.9% |
| Mobile Det Time | 144 ms | INT8 NPU, single inference |
| Mobile Rec Time | 2-3 ms/line | ~15 lines/image, total ~30 ms |
| End-to-End Time | ~170 ms/image | Det + Rec + post-processing |
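The two accuracy metrics in the table can be computed as follows. This is a generic sketch rather than the authors' evaluation script: it uses the standard Levenshtein distance, aggregates CER over the whole corpus, and assumes references and hypotheses are already aligned line by line.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertion, deletion, substitution each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_dist = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_dist / total_chars if total_chars else 0.0


def line_match_rate(refs, hyps):
    """Share of lines whose hypothesis equals the reference exactly."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Because insertions count as errors, CER computed this way can exceed 100% when the OCR output is much longer than the reference.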
6.3 Per Document Type
| Type | CER | Line Match | Time |
|---|---|---|---|
| Title page | 8.9% | 98.8% | 458 ms |
| Form | 12.8% | 85.1% | 867 ms |
| Table | 24.7% | 15.2% | 2,052 ms |
| Number-dense | 20.3% | 20.0% | 2,818 ms |
| Body text | 44.0% | 71.9% | 787 ms |
| Mixed Chinese-English | 52.0% | 61.6% | 883 ms |
The lower line match rate for tables and number-dense pages comes from | separators in the ground truth that OCR does not produce. It is not a recognition error.
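One way to keep such layout marks from counting as mismatches is to normalize both sides before comparing. The sketch below is a suggestion, not the authors' evaluation code: the function names are hypothetical, and stripping all whitespace along with the pipes is an assumption that may be too aggressive for mixed Chinese-English text.

```python
import re

_SEPARATORS = re.compile(r"[|\s]+")  # table pipes and any whitespace


def normalize_line(text):
    """Drop '|' separators and whitespace so layout marks are not scored."""
    return _SEPARATORS.sub("", text)


def normalized_line_match_rate(refs, hyps):
    """Exact-match rate after normalizing both reference and hypothesis."""
    if not refs:
        return 0.0
    hits = sum(normalize_line(r) == normalize_line(h) for r, h in zip(refs, hyps))
    return hits / len(refs)
```

With this normalization, a ground-truth row such as "| 12 | 34 |" matches an OCR output of "12 34".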
6.4 Head-to-Head Comparison of 7 Schemes
| Scheme | Det | Rec | CER | Time | Model Size |
|---|---|---|---|---|---|
| Mobile INT8@480 + Mobile Rec (Final) | 2.6 MB INT8 | 6.8 MB FP16 | 27.1% | 170 ms | 9.4 MB |
| Mobile INT8@480 + Server Rec | 2.6 MB INT8 | 45 MB FP16 | 85.6% | 1,800 ms | 47.6 MB |
| Server FP16@960 + Mobile Rec | 204 MB FP16 | 6.8 MB FP16 | 89.5% | 4,400 ms | 211 MB |
| v5 FP16@480 + v5 Rec | 3.8 MB FP16 | 9.8 MB FP16 | ≈ 100% | 1,800 ms | 13.6 MB |
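The cost side of this comparison is easy to recompute from the table. The short script below copies the times and model sizes from the rows above and derives the slowdown relative to the final scheme and the combined model size; the variable and function names are illustrative.

```python
# (scheme, end-to-end ms, det MB, rec MB) copied from the comparison table.
SCHEMES = [
    ("Mobile INT8@480 + Mobile Rec", 170, 2.6, 6.8),
    ("Mobile INT8@480 + Server Rec", 1800, 2.6, 45.0),
    ("Server FP16@960 + Mobile Rec", 4400, 204.0, 6.8),
    ("v5 FP16@480 + v5 Rec", 1800, 3.8, 9.8),
]


def relative_cost(schemes, baseline_index=0):
    """Return (name, slowdown vs baseline, total model MB) for each scheme."""
    base_ms = schemes[baseline_index][1]
    return [(name, ms / base_ms, round(det + rec, 1))
            for name, ms, det, rec in schemes]


if __name__ == "__main__":
    for name, slowdown, size in relative_cost(SCHEMES):
        print(f"{name:32s} {slowdown:5.1f}x  {size:6.1f} MB")
```

The server detection scheme comes out at roughly 26× the baseline time and the two 1,800 ms schemes at roughly 10.6×, in line with the figures quoted in the text.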
7. Why Every “Better” Scheme Failed
7.1 Server Det @ 960 (204 MB, 4.4 s)
- Detection boxes too thin (9-13 px vs 13-23 px for mobile), which deforms them during the recognition crop step
- 4.4 s inference is 26× the 170 ms baseline, yet accuracy drops
- Conclusion: big model + big input ≠ good result
7.2 v5 Mobile (13.6 MB, 1.8 s)
- Detection box height only 3-5 px (in 480×480 space), far below the normal 15-25 px
- Dictionary grew from 6,625 to 18,383 characters, but the new characters were not effectively used
- HuggingFace pre-converted ONNX may have operator compatibility issues
7.3 Server Rec (45 MB)
- Recognition quality almost identical to Mobile Rec (6.8 MB)
- Confirms recognition is not the current bottleneck; detection is
7.4 RKNN dynamic_input
- Python API only enumerates a fixed set of shapes; it has no true dynamic shape
- Even the C API’s true dynamic input does not help: upscaling still hurts accuracy
8. Directions That Actually Improve Accuracy
8.1 Short Term (No Retraining)
| Method | Expected Gain | Difficulty |
|---|---|---|
| Add direction classifier (cls model) | +1~2% | ⭐ |
| Multi-scale inference (0.5× + 1.0× + 1.5× fusion, about 3× time) | +3~5% | ⭐⭐ |
| FastDeploy C++ deployment | +30~50% speed | ⭐⭐⭐ |
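The article does not specify how the multi-scale results would be fused. One common option is to pool the boxes from all scales, after mapping them back to original-image coordinates, and run greedy non-maximum suppression. The sketch below shows that idea under simplifying assumptions: boxes are axis-aligned (x1, y1, x2, y2) rather than quadrilaterals, each box carries a score, and the 0.5 IoU threshold is an arbitrary placeholder.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def fuse_boxes(per_scale, iou_thresh=0.5):
    """Greedy NMS over detections pooled from several scales.

    per_scale: one list per scale of (box, score) pairs, with boxes already
    mapped back to original-image coordinates.
    """
    pooled = sorted((d for dets in per_scale for d in dets),
                    key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

Each extra scale adds one more detection pass, which is where the roughly 3× time cost of this approach comes from.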
8.2 Long Term (Highest Payoff)
Fine-tune on your own data: continue training PP-OCRv4 mobile_det on your real business documents via PaddleOCR.
Annotate text boxes on 500 of your documents
→ Continue training from PP-OCRv4 mobile_det
→ Export ONNX → Convert to RKNN INT8
→ Expected +10-15% accuracy at unchanged inference time
This is the only path that fundamentally improves accuracy. The current model has already reached its ceiling at the design scale; further gains require business-specific optimization.
9. Appendix: 5-Minute Run
# 1. Environment
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git
pip install opencv-python numpy shapely pyclipper
# 2. Models (pre-converted)
# ppocrv4_det.rknn (2.6 MB) + ppocrv4_rec.rknn (6.8 MB)
# 3. Run OCR (official pipeline, no tiling)
cd rknn_model_zoo/examples/PPOCR/PPOCR-System/python
python3 ppocr_system.py \
--det_model_path ../model/ppocrv4_det.rknn \
--rec_model_path ../model/ppocrv4_rec.rknn \
--target rk3588
# 4. Batch evaluation
cd path/to/benchmark
python3 evaluate_v2.py
Key Terminology
For readers without a deep technical background, here are brief definitions of frequently used terms in this article.
- NPU (Neural Processing Unit): A processor designed for deep learning inference. The RK3588’s built-in NPU delivers 6 TOPS (6 trillion INT8 operations per second).
- OCR (Optical Character Recognition): The technology that converts text in images into editable, indexable text.
- PP-OCRv4: An industrial-grade OCR model released by Baidu’s PaddleOCR team in 2023, with roughly 5% higher accuracy than v3 in Chinese scenarios (source: PaddleOCR official release notes).
- RKNN: Rockchip’s neural network model format and runtime, similar in role to NVIDIA’s TensorRT, optimized for Rockchip NPUs.
- rknpu2: The Linux kernel driver for the NPU on RK3588 and similar chips, exposed to user space as /dev/dri/renderD129.
- INT8 / FP16 quantization: Compresses FP32 weights into 8-bit integer (INT8) or 16-bit float (FP16). On NPUs this gives faster inference and lower memory at the cost of some accuracy; INT8 quantization typically loses < 2%.
- DetResizeForTest: The standard preprocessing operator in PP-OCR detection. limit_side_len=480, limit_type='max' means scale the long side to 480 while keeping the aspect ratio, avoiding distortion.
- DBPostProcess: The PP-OCR detection post-processing that extracts polygon text boxes from the probability map. Key parameters: thresh=0.3, box_thresh=0.6, unclip_ratio=1.5.
- CER (Character Error Rate): Edit distance divided by total characters. Lower is better. The 27.1% in this article means about 27 errors per 100 characters on average.
Frequently Asked Questions (FAQ)
1. Why does the RK3588 NPU need a fixed 480×480 input for OCR?
This is locked in during INT8 quantization calibration, not a model-level limit. rknn_model_zoo’s PPOCR-Det INT8 version fixes input to 480×480 to keep quantization accuracy. Upscaling to 960 hurts accuracy because the features no longer match the training distribution.
2. How much slower is Server Det @ 960 compared to Mobile Det @ 480, and is it more accurate?
26× slower (4,400 ms vs 170 ms) and less accurate (CER 89.5% vs 27.1%). The reason: the server model is also trained at the 480 scale, so upscaling breaks its features.
3. Is PP-OCRv5 mobile better than v4 mobile on the RK3588 NPU?
No. v5 mobile detection boxes are only 3-5 px thick (v4 is 13-23 px), so the boxes are too thin and recognition fails. The dictionary grew from 6,625 to 18,383 characters, but accuracy did not improve.
4. Does the RKNN Python API support dynamic shapes?
Partially. The dynamic_input parameter lets you enumerate a few fixed shapes, but it is not true dynamic input. The C API does support true dynamic input, but upscaling the input still hurts accuracy.
5. Can the 170 ms per image go even faster?
Only one of the three short-term directions reduces latency; the other two target accuracy:
- Add a direction classifier (+1~2% accuracy, no extra time)
- Multi-scale inference (+3~5% accuracy, 3× time)
- FastDeploy C++ deployment (+30~50% speed, no model change)
6. How much accuracy does INT8 quantization lose?
For PP-OCRv4 mobile det, INT8 quantization loses < 2% accuracy in exchange for roughly 3× speedup. For OCR workloads this trade-off is almost always worth it.
7. Can I use PaddleOCR-VL (a VLM model) instead?
PaddleOCR-VL 0.9B is not currently feasible on RK3588: it requires ≥ 16 GB of memory, which an edge device cannot provide. A quantized PaddleOCR-VL 1.5B is a possible direction over the next 2-3 years, but this solution targets “printed text / simple layout ≥ 95%” scenarios.
8. Does the official rknn_model_zoo pipeline have bugs?
Yes. ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line on top of the correct aspect-preserving resize inside ppocr_det.py, causing double resizing. The core code in §5.3 of this article works around that issue.
9. Should I fine-tune the model?
Only if 27.1% CER does not meet your business needs. Fine-tuning on 500 business documents is expected to give +10-15% accuracy, but requires annotation effort. If your scenario is title pages or forms (measured CER < 13%), the current model is already good enough.
10. Of the 170 ms, Det takes 144 ms and Rec takes 30 ms—where is the bottleneck?
Detection is the bottleneck (84% of the time). Recognition at FP16 with 48×320 input is already very light. Two ways to optimize detection: ① multi-scale fusion (3× time, +3-5% accuracy); ② fine-tune on business data (no time change, +10-15% accuracy).
References
All technical details, model specifications, performance numbers, and failed-experiment conclusions in this article can be traced to the following authoritative sources (sorted by citation frequency).
Official Repositories and Documentation
- rknn_model_zoo (https://github.com/airockchip/rknn_model_zoo): Rockchip’s official pre-converted RKNN model library, including ready-to-deploy .rknn files for PP-OCR Det/Rec
- PaddleOCR Open-Source Repository (https://github.com/PaddlePaddle/PaddleOCR): Official code, training scripts, and configuration files for Baidu’s PP-OCR family
- rknn-toolkit2 (https://github.com/rockchip-linux/rknn-toolkit2): Rockchip’s official RKNN model conversion and Python inference API toolkit
- rknpu2 Driver (https://github.com/rockchip-linux/rknpu2): Linux kernel driver source for the RK3588 NPU
Vendors and Ecosystem
- Rockchip Official Website (https://www.rock-chips.com/): RK3588 processor specifications, NPU compute, partner ecosystem
- PaddlePaddle Official Website (https://www.paddlepaddle.org.cn/): Baidu’s deep learning framework official homepage
- FastDeploy GitHub (https://github.com/PaddlePaddle/FastDeploy): Baidu’s inference deployment framework; the source of the 30-50% C++ deployment speedup
Data Benchmark Sources
- 6 TOPS NPU compute: Rockchip RK3588 official datasheet
- Det 50.7 FPS / Rec 96.8 FPS: rknn_model_zoo’s official performance data for PP-OCRv4 mobile
- INT8 quantization loss < 2%: PaddleOCR official quantization documentation
- PP-OCRv4 vs v3 +5% accuracy: PaddleOCR 2023 release notes
- 200-image A4 test set, 6 layouts, CER 27.1% / 170 ms: Measured by the authors on 2026-06-04 on ZTL-A588 + Galaxy Kylin V10 SP1
Related Reading
- Domestic RK3588 Offline OCR Solution: Filling the “Edge + Offline + High-Quality” Market Gap — the solution article in the same series, covering the “why” (business value, ROI, compliance boundaries)
Reproducibility statement: All test data, benchmarks, and code in this article were reproduced on a RK3588 + Galaxy Kylin V10 SP1 environment. Test date: June 4, 2026 | RKNN Toolkit: v2.3.2 | PaddleOCR: v4 mobile | Test set: 200 A4 document images, 6 layout types
About this article: This article was written by the Xi’an Boao Intelligent Technology Co., Ltd. RK3588 team based on engineering practice. It is intended for edge AI engineers, embedded developers, and OCR solution architects. For technical consulting or PoC support, please contact Xi’an Boao.
Tags: RK3588 | NPU | Offline OCR | PP-OCRv4 | PaddleOCR | RKNN | INT8 Quantization | On-Device Inference | Xi’an Boao