RK3588 NPU Offline OCR Tuning: 480 Long-Side Resize + PP-OCRv4 Mobile Is the Current Optimal (Measured 67.8% Char Accuracy, 170 ms/Image)

Xi'an Boao tested 7 OCR deployment schemes on RK3588 (6 TOPS NPU) and identified the winner: PP-OCRv4 mobile + DetResizeForTest(480). On a 200-image A4 test set, character accuracy reaches 67.8% and inference time is ~170 ms per image with only 9.4 MB of models. This article delivers the full hardware check, model conversion, preprocessing, DBPostProcess code, and a candid post-mortem of every failed attempt.

Author: Boao AI RK3588 Team
#RK3588 #NPU #Offline OCR #PP-OCRv4 #PaddleOCR #RKNN #INT8 Quantization #On-Device Inference #Xi'an Boao

Bottom line up front: On the RK3588 platform (4×Cortex-A76 + 4×Cortex-A55 + 6 TOPS NPU), the best configuration we measured uses the PP-OCRv4 mobile models from Rockchip’s official rknn_model_zoo (Det INT8 2.6 MB + Rec FP16 6.8 MB), PP-OCR’s official DetResizeForTest(limit_side_len=480, limit_type='max') preprocessing, and a single non-tiled inference pass. On a 200-image A4 test set it delivers 67.8% character accuracy at ~170 ms per image. Of the configurations we tested under the current RKNN Python API, none did better.

If you are choosing OCR models for edge inference, this article uses 7 head-to-head measurements to show why “bigger model + bigger input” is the wrong direction on the RK3588 NPU.

1. TL;DR — For the Time-Pressed

| Decision | Recommended Choice | Key Data |
| --- | --- | --- |
| Detection Model | PP-OCRv4 mobile (INT8 @ 480×480) | 2.6 MB, 50.7 FPS (official) |
| Recognition Model | PP-OCRv4 mobile (FP16 @ 48×320) | 6.8 MB, 96.8 FPS (official) |
| Preprocessing | DetResizeForTest(limit=480, type='max'), aspect-preserving | 1240×1754 → 339×480 |
| Tiling? | No tiling | One NPU inference, ~144 ms |
| Post-processing | DBPostProcess(thresh=0.3, box_thresh=0.6, unclip=1.5) | Use the official pyclipper version |
| Throughput | ~170 ms/image | Det 144 ms + Rec ~30 ms (15 lines) |
| Accuracy | CER 27.1% / Char Accuracy 67.8% | 200-image A4 test set |

Biggest counter-intuitive finding: upscaling the input to 960, switching to the server model, and switching to v5 with its bigger dictionary all either hurt accuracy or multiplied inference time by 10× or more. On the RK3588 NPU, “small and sharp” beats “big and general”.

2. Hardware and Software Stack

2.1 Test Platform

SoC: Rockchip RK3588 (8nm)
CPU: 4×Cortex-A76 @ 2.352 GHz + 4×Cortex-A55 @ 1.8 GHz
GPU: Mali-G610 MP4 @ 1 GHz (OpenCL 2.0)
NPU: 6 TOPS INT8, /dev/dri/card1 (DRM:RKNPU), 8 frequency steps 300 MHz – 1 GHz
RAM: 8 GB LPDDR4/LPDDR5 @ 2736 MHz
Board: ZTL-A588 (Galaxy Kylin Embedded V10 SP1, kernel 5.10.160)

2.2 Software Stack

Application:  Python 3.8 + OpenCV 4.13 + Shapely + Pyclipper
Inference:    rknn-toolkit2 2.3.2 + rknn-toolkit-lite2 2.3.2
Runtime:      /usr/lib/librknnrt.so (C API, 5.6 MB)
Models:       PP-OCRv4 mobile (Det INT8 + Rec FP16)

2.3 NPU Availability Check (Do This First)

ls -la /dev/dri/card1 /dev/dri/renderD129
cat /sys/class/drm/card1/device/uevent | grep DRIVER   # → DRIVER=RKNPU
cat /sys/class/devfreq/fdab0000.npu/available_frequencies
python3 -c "from rknn.api import RKNN; print('RKNN OK')"

If /dev/dri/renderD129 is missing or rknn.api fails to import, fix the driver before talking about performance—all benchmarks below assume the NPU is functional.
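The same checks can be scripted so that a benchmark run aborts early on a broken driver. The following is a minimal sketch, assuming the device paths from the shell check above; `npu_status` is a hypothetical helper name, not part of any RKNN package.

```python
import importlib.util
import os

# Device nodes taken from the shell check above; adjust them if your
# board enumerates the NPU under different names.
NPU_NODES = ("/dev/dri/card1", "/dev/dri/renderD129")


def npu_status(nodes=NPU_NODES):
    """Return a dict telling which NPU prerequisites are present."""
    status = {node: os.path.exists(node) for node in nodes}
    # find_spec only locates the package; it does not load librknnrt.so.
    try:
        status["rknn.api"] = importlib.util.find_spec("rknn.api") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "rknn" is not installed at all.
        status["rknn.api"] = False
    return status


if __name__ == "__main__":
    for name, ok in npu_status().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

A passing result only shows that the nodes and the Python package exist. The driver and frequency checks in the shell commands are still worth running by hand.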

3. Model Selection: How We Narrowed 7 Candidates to 1

3.1 All Candidate Schemes

| Model | ONNX Size | RKNN Size | Quant / Input | Role |
| --- | --- | --- | --- | --- |
| PP-OCRv4 mobile det | 4.5 MB | 2.6 MB INT8 | INT8, 480×480 | Selected |
| PP-OCRv4 server det | 108 MB | 204 MB FP16 | FP16, 960×960 | Considered (rejected) |
| PP-OCRv4 mobile rec | 10.4 MB | 6.8 MB FP16 | FP16, 48×320 | Selected |
| PP-OCRv4 server rec | 86 MB | 45 MB FP16 | FP16, 48×320 | Considered (rejected) |
| PP-OCRv5 mobile det | 4.6 MB | 3.8 MB FP16 | FP16, 480×480 | Considered (rejected) |
| PP-OCRv5 mobile rec | | 9.8 MB FP16 | FP16, 48×320 | Considered (rejected) |

3.2 Key Selection Numbers

3.3 Model Conversion Commands

# Clone the official repository
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git

# Download ONNX
wget -O PPOCR-Det/model/ppocrv4_det.onnx \
  https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_det.onnx
wget -O PPOCR-Rec/model/ppocrv4_rec.onnx \
  https://ftrg.zbox.filez.com/v2/delivery/data/95f00b0fc900458ba134f8b180b3f7a1/examples/PPOCR/ppocrv4_rec.onnx

# Detection model → INT8
python3 PPOCR-Det/python/convert.py PPOCR-Det/model/ppocrv4_det.onnx rk3588 i8
# Recognition model → FP16
python3 PPOCR-Rec/python/convert.py PPOCR-Rec/model/ppocrv4_rec.onnx rk3588 fp

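Tip: If conversion fails with unsupported-operator errors, first confirm that the conversion script calls rknn.config(target_platform='rk3588'). For INT8 builds, keep per-channel quantization enabled; in rknn-toolkit2 this is set with rknn.config(quantized_method='channel').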
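For readers who want to see roughly what such a conversion does, here is a minimal sketch built on the rknn-toolkit2 Python API (`RKNN`, `config`, `load_onnx`, `build`, `export_rknn`). It is not the official convert.py: the function name `convert_onnx_to_rknn` is hypothetical, and any mean/std normalization that the official script configures is omitted here.

```python
def convert_onnx_to_rknn(onnx_path, rknn_path, quantize,
                         dataset=None, platform="rk3588"):
    """Convert one ONNX model to .rknn and return the output path.

    quantize=True requests INT8 and needs `dataset`, a text file that
    lists calibration images; quantize=False keeps floating point.
    """
    if quantize and dataset is None:
        raise ValueError("INT8 quantization needs a calibration dataset list")

    # Imported lazily: rknn-toolkit2 is only installed on the conversion host.
    from rknn.api import RKNN

    rknn = RKNN(verbose=False)
    try:
        rknn.config(target_platform=platform)
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError(f"load_onnx failed for {onnx_path}")
        if rknn.build(do_quantization=quantize, dataset=dataset) != 0:
            raise RuntimeError("build failed")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError(f"export_rknn failed for {rknn_path}")
    finally:
        rknn.release()
    return rknn_path
```

Under these assumptions, the detection model would be converted with `quantize=True` plus a calibration list, and the recognition model with `quantize=False`.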

4. Core Question: Why 480?

4.1 480 Is Not a Brutal Stretch

PP-OCR’s standard detection preprocessing is DetResizeForTest(limit_side_len=480, limit_type='max'), which means scale the long side to 480, keep aspect ratio:

Original A4 1240×1754
  │ DetResizeForTest(limit=480, type='max')
  ▼
Aspect-preserving 339×480 (no distortion)
  │
  ▼
Pad to 480×480 square (gray border)
  │
  ▼
NPU INT8 inference (1 pass, ~144 ms)

In rknn_model_zoo’s INT8 PPOCR-Det, the input is fixed at 480×480. This is a constraint from the INT8 quantization calibration process, not a limitation of the model itself.
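The size arithmetic behind the 1240×1754 → 339×480 step can be sketched in a few lines. This is an illustration rather than the official operator: `limit_max_side` and `pad_to_square` are hypothetical names, the fill value 127 is an assumed stand-in for the "gray border", and upstream implementations may additionally snap each side to a multiple of 32.

```python
import numpy as np


def limit_max_side(h, w, limit=480):
    """Target (h, w) after scaling so the longer side is at most `limit`."""
    longest = max(h, w)
    ratio = limit / longest if longest > limit else 1.0
    return round(h * ratio), round(w * ratio)


def pad_to_square(img, size=480, fill=127):
    """Place an (h, w, c) image in the top-left corner of a size×size canvas."""
    h, w = img.shape[:2]
    canvas = np.full((size, size, img.shape[2]), fill, dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas


if __name__ == "__main__":
    h, w = limit_max_side(1754, 1240)   # A4 page: height 1754, width 1240
    print(h, w)
    page = np.zeros((h, w, 3), dtype=np.uint8)
    print(pad_to_square(page).shape)
```

Both axes share one ratio, so glyph proportions are unchanged. Only the padded strip on the right is "wasted" NPU work.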

4.2 Every “Accuracy Boost” We Tried (All Failed)

| Approach | CER Change | Conclusion |
| --- | --- | --- |
| Server Det @ 960 | 87.9% → 89.5% ❌ | Model trained at 480 scale, upscaling breaks features |
| FP16 mobile @ 960 | 87.9% → 89.5% ❌ | Same reason, bigger ≠ better |
| PP-OCRv5 mobile | Boxes only 3-5 px thick ❌ | v5 mobile architecture difference, box height < 1/3 of v4 |
| Server Rec 45 MB | Tied with Mobile Rec | Recognition is not the bottleneck |
| v5 dictionary (18,383 chars) | Worse | Bigger dictionary, accuracy did not follow |
| RKNN dynamic_input | Only enumerates shapes | Python API hard limit |
| C API dynamic input | Useless when upscaling | Model design scale dominates |

4.3 Four Key Lessons

  1. Bigger ≠ better: CNN detection models have a “design scale” and work best near their training scale
  2. INT8’s 480 fixed input is not the bottleneck: < 2% accuracy loss for a 3× speedup
  3. Recognition is not the bottleneck; detection is: Mobile Rec and Server Rec deliver equal quality; the bottleneck is whether detection finds the text boxes accurately and completely
  4. RKNN Python API does not support true dynamic shape: dynamic_input only enumerates fixed shapes. The C API has true dynamic support, but upscaling the input still hurts accuracy.
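Lesson 4 has a practical consequence: when shapes are enumerated rather than truly dynamic, the caller has to choose one of the pre-declared shapes for every image. A minimal sketch of that selection step follows. `ENUMERATED_SIDES` and `pick_side` are hypothetical names, and the candidate sizes are illustrative, not values from the benchmark.

```python
# Hypothetical set of square input sides declared at conversion time.
ENUMERATED_SIDES = (320, 480, 640)


def pick_side(h, w, sides=ENUMERATED_SIDES):
    """Smallest declared side that holds the image, else the largest one.

    An image larger than every declared side still has to be downscaled
    to the largest side before inference.
    """
    longest = max(h, w)
    for side in sorted(sides):
        if longest <= side:
            return side
    return max(sides)
```

Even with such a helper, the measurements above suggest staying at the 480 design scale, since larger inputs did not improve accuracy.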

5. The Correct Pipeline (No Tiling, Single Inference)

5.1 End-to-End Flow

A4 image (any size)
  │
  ▼
DetResizeForTest(limit_side_len=480, limit_type='max')
  → Long side scaled to 480, short side proportional
  │
  ▼
Pad to 480×480 square (gray border)
  │
  ▼
NPU INT8 inference (1 pass, ~144 ms)
  → PPOCR-Det RKNN
  │
  ▼
DBPostProcess (thresh=0.3, box_thresh=0.6, unclip=1.5)
  → Map detection box coordinates back to original image
  │
  ▼
Crop text lines from original image → get_rotate_crop_image()
  │
  ▼
Recognition: resize to 48×320 → /255 → NPU FP16 (~2 ms/line)
  → PPOCR-Rec RKNN → CTC decode
  │
  ▼
Output: [(text1, confidence), (text2, confidence), ...]
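The final "CTC decode" step turns per-time-step class scores into a string. Below is a minimal greedy-decoding sketch, assuming class 0 is the CTC blank (the usual PP-OCR convention) and that confidence is the mean score of the kept steps. `ctc_greedy_decode` is a hypothetical name, not the function used in rknn_model_zoo.

```python
import numpy as np


def ctc_greedy_decode(probs, classes, blank=0):
    """Greedy CTC decoding.

    probs:   (T, C) array of per-time-step class probabilities
    classes: sequence mapping class index -> character (index `blank` unused)
    Returns (text, confidence).
    """
    ids = probs.argmax(axis=1)
    scores = probs.max(axis=1)
    chars, kept = [], []
    prev = None
    for idx, score in zip(ids, scores):
        # Keep a step only if it is not blank and not a repeat of the
        # previous step; a blank in between allows doubled characters.
        if idx != blank and idx != prev:
            chars.append(classes[idx])
            kept.append(float(score))
        prev = idx
    confidence = sum(kept) / len(kept) if kept else 0.0
    return "".join(chars), confidence
```

For example, the step sequence `a a <blank> a b b` decodes to `aab`: the repeated `a` collapses, and the blank separates it from the next `a`.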

5.2 Common Mistakes vs Correct Approach

| Mistake | Problem | Correct Approach |
| --- | --- | --- |
| cv2.resize(img, (480, 480)) brutal stretch | Distorts the image, flattens text | DetResizeForTest(limit=480, type='max') |
| Tiled inference with multiple 480 crops | Cuts continuous text, NMS overhead | Single inference + aspect-preserving resize |
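How much the brutal stretch in the first row distorts a page can be quantified by comparing the horizontal and vertical scale factors. A small sketch, with `stretch_distortion` as a hypothetical helper name:

```python
def stretch_distortion(h, w, target=480):
    """Horizontal scale divided by vertical scale when an h×w image is
    forced to target×target. 1.0 means no distortion."""
    scale_x = target / w
    scale_y = target / h
    return scale_x / scale_y


if __name__ == "__main__":
    # A4 scan from the article: height 1754, width 1240.
    print(round(stretch_distortion(1754, 1240), 2))
```

For the 1240×1754 page the factor is about 1.41, meaning every glyph ends up roughly 41% wider relative to its height. An aspect-preserving resize keeps the factor at 1.0 by construction.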

Pitfall alert: rknn_model_zoo’s ppocr_det.py uses the correct approach internally, but ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line, causing double resizing. The final code in this article fixes that issue.

5.3 Core Code (Production-Ready)

import sys
import numpy as np
sys.path.insert(0, 'rknn_model_zoo/examples/PPOCR/PPOCR-Det/python')
from utils.operators import DetResizeForTest
from utils.db_postprocess import DBPostProcess

# Assumed to exist before this snippet runs:
#   img_rgb: the page as an (H, W, 3) uint8 RGB array
#   rknn:    an initialized RKNN runtime with the detection model loaded

# 1. Single aspect-preserving resize (the key step)
data = DetResizeForTest(limit_side_len=480, limit_type='max')({'image': img_rgb})
img_resized = data['image']      # (H, W, 3), aspect preserved
shape_info = data['shape']       # [orig_h, orig_w, ratio_h, ratio_w]
h, w = img_resized.shape[:2]

# 2. Pad to a square canvas (zero-filled), image anchored at the top-left corner
sz = max(h, w)
pad = np.zeros((sz, sz, 3), dtype=np.uint8)
pad[:h, :w] = img_resized

# 3. NPU inference (NHWC batch of one)
out = rknn.inference(inputs=[pad.astype(np.float32)[np.newaxis, :, :, :]])

# 4. DBPostProcess (use the official pyclipper version)
# Crop the probability map back to the unpadded h×w region first: box
# coordinates are rescaled from the map size to the original image size,
# so leaving the padding in would skew the shorter axis.
prob = out[0].astype(np.float32)[:, :, :h, :w]
db = DBPostProcess(thresh=0.3, box_thresh=0.6, unclip_ratio=1.5)
result = db({'maps': prob}, shape_info[np.newaxis, :])
boxes = result[0]['points']  # coordinates already in the original image space
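The official DBPostProcess returns boxes in original-image coordinates. If you ever post-process the probability map yourself, the detail that matters is that the scale factors come from the resized, unpadded image, not from the 480×480 canvas. A minimal sketch with a hypothetical helper `to_original`:

```python
import numpy as np


def to_original(points, resized_hw, orig_hw):
    """Scale (N, 2) xy points from resized-image space to original space.

    resized_hw: (h, w) of the aspect-preserving resize, before padding
    orig_hw:    (h, w) of the source image
    """
    rh, rw = resized_hw
    oh, ow = orig_hw
    scale = np.array([ow / rw, oh / rh], dtype=np.float32)  # (x, y) factors
    pts = np.asarray(points, dtype=np.float32) * scale
    pts[:, 0] = np.clip(pts[:, 0], 0, ow)
    pts[:, 1] = np.clip(pts[:, 1], 0, oh)
    return pts
```

Using the padded width (480) instead of the resized width (339 for the A4 example) would shrink every x coordinate by about 30%, and the recognition crops would miss part of each line.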

6. Benchmark Results (200 A4 Images)

6.1 Test Set

6.2 Aggregate Metrics

| Metric | Value | Note |
| --- | --- | --- |
| Character Error Rate (CER) | 27.1% | Edit distance / total characters |
| Text-line Match Rate | 59.0% | Percentage of lines that match exactly |
| Character-level Accuracy | 67.8% | 1 − CER |
| Mobile Det Time | 144 ms | INT8 NPU, single inference |
| Mobile Rec Time | 2-3 ms/line | ~15 lines/image, total ~30 ms |
| End-to-End Time | ~170 ms/image | Det + Rec + post-processing |
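The first two metrics can be computed as sketched below. The evaluation script itself is not shown in this article, so this is one plausible reading of "edit distance / total characters" and "lines that match exactly", not the team's actual code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Total edit distance divided by total reference characters."""
    total = sum(len(r) for r in refs)
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total if total else 0.0


def line_match_rate(refs, hyps):
    """Fraction of line pairs that are exactly equal."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Because insertions count as errors, CER computed this way can exceed 100% when the OCR output is much longer than the reference.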

6.3 Per Document Type

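| Type | CER | Line Match | Time |
| --- | --- | --- | --- |
| Title page | 8.9% | 98.8% | 458 ms |
| Form | 12.8% | 85.1% | 867 ms |
| Table | 24.7% | 15.2% | 2,052 ms |
| Number-dense | 20.3% | 20.0% | 2,818 ms |
| Body text | 44.0% | 71.9% | 787 ms |
| Mixed Chinese-English | 52.0% | 61.6% | 883 ms |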

The lower line match rate for tables and number-dense pages comes from | separators in the ground truth that OCR does not produce. It is not a recognition error.
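One way to keep such separators from dominating the line match rate is to normalize both sides before comparing. A minimal sketch, assuming `|` is the only separator involved; `strip_separators` is a hypothetical helper, not part of the evaluation script.

```python
def strip_separators(line, separators="|"):
    """Replace table separators with spaces and collapse whitespace."""
    for sep in separators:
        line = line.replace(sep, " ")
    return " ".join(line.split())


if __name__ == "__main__":
    truth = "Qty | 12 | 3.50"
    ocr = "Qty 12 3.50"
    print(strip_separators(truth) == strip_separators(ocr))
```

Reporting the match rate both with and without this normalization makes it clear how much of the gap is a formatting artifact.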

6.4 Head-to-Head Comparison of 7 Schemes

| Scheme | Det | Rec | CER | Time | Model Size |
| --- | --- | --- | --- | --- | --- |
| Mobile INT8@480 + Mobile Rec (Final) | 2.6 MB INT8 | 6.8 MB FP16 | 27.1% | 170 ms | 9.4 MB |
| Mobile INT8@480 + Server Rec | 2.6 MB INT8 | 45 MB FP16 | 85.6% | 1,800 ms | 47.6 MB |
| Server FP16@960 + Mobile Rec | 204 MB FP16 | 6.8 MB FP16 | 89.5% | 4,400 ms | 211 MB |
| v5 FP16@480 + v5 Rec | 3.8 MB FP16 | 9.8 MB FP16 | ≈ 100% | 1,800 ms | 13.6 MB |
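The size and slowdown figures follow directly from the table. The short script below recomputes them from the Det, Rec, and Time columns; the numbers are copied from the table, and `summarize` is a hypothetical helper name.

```python
# (scheme, det_mb, rec_mb, time_ms), copied from the comparison table.
SCHEMES = [
    ("Mobile INT8@480 + Mobile Rec", 2.6, 6.8, 170),
    ("Mobile INT8@480 + Server Rec", 2.6, 45.0, 1800),
    ("Server FP16@960 + Mobile Rec", 204.0, 6.8, 4400),
    ("v5 FP16@480 + v5 Rec", 3.8, 9.8, 1800),
]


def summarize(schemes, baseline_ms=170):
    """Return (name, total_mb, slowdown relative to the baseline) rows."""
    return [(name, round(det + rec, 1), round(ms / baseline_ms, 1))
            for name, det, rec, ms in schemes]


if __name__ == "__main__":
    for name, total_mb, slowdown in summarize(SCHEMES):
        print(f"{name}: {total_mb} MB, {slowdown}x the final scheme's time")
```

The server detector comes out at roughly 26 times the end-to-end time of the final scheme, and the two 1,800 ms schemes at roughly 10.6 times.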

7. Why Every “Better” Scheme Failed

7.1 Server Det @ 960 (204 MB, 4.4 s)

7.2 v5 Mobile (13.6 MB, 1.8 s)

7.3 Server Rec (45 MB)

7.4 RKNN dynamic_input

8. Directions That Actually Improve Accuracy

8.1 Short Term (No Inference Time Increase)

| Method | Expected Gain | Difficulty |
| --- | --- | --- |
| Add direction classifier (cls model) | +1~2% | |
| Multi-scale inference (0.5× + 1.0× + 1.5× fusion) | +3~5% | ⭐⭐ |
| FastDeploy C++ deployment | +30~50% speed | ⭐⭐⭐ |
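The article does not specify how the multi-scale row would fuse its three passes. One simple option is to resample each scale's probability map to a common size and average them before DBPostProcess. The sketch below assumes mean fusion and nearest-neighbour resampling; `resize_nearest` and `fuse_maps` are hypothetical helpers.

```python
import numpy as np


def resize_nearest(prob_map, out_h, out_w):
    """Nearest-neighbour resample of a 2-D map to (out_h, out_w)."""
    in_h, in_w = prob_map.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return prob_map[rows][:, cols]


def fuse_maps(prob_maps, out_hw):
    """Average per-scale probability maps after bringing them to one size."""
    out_h, out_w = out_hw
    stack = [resize_nearest(m, out_h, out_w) for m in prob_maps]
    return np.mean(stack, axis=0)
```

Each scale needs its own detection pass, so detection time grows with the number of scales. Taking the maximum instead of the mean is another option when recall matters more than precision.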

8.2 Long Term (Highest Payoff)

Fine-tune on your own data: continue training PP-OCRv4 mobile_det on your real business documents via PaddleOCR.

Annotate text boxes on 500 of your documents
  → Continue training from PP-OCRv4 mobile_det
  → Export ONNX → Convert to RKNN INT8
  → Expected +10-15% accuracy at unchanged inference time

This is the only path that fundamentally improves accuracy. The current model has already reached its ceiling at the design scale; further gains require business-specific optimization.

9. Appendix: 5-Minute Run

# 1. Environment
git clone --depth 1 https://github.com/airockchip/rknn_model_zoo.git
pip install opencv-python numpy shapely pyclipper

# 2. Models (pre-converted)
#   ppocrv4_det.rknn (2.6 MB) + ppocrv4_rec.rknn (6.8 MB)

# 3. Run OCR (official pipeline, no tiling)
cd rknn_model_zoo/examples/PPOCR/PPOCR-System/python
python3 ppocr_system.py \
  --det_model_path ../model/ppocrv4_det.rknn \
  --rec_model_path ../model/ppocrv4_rec.rknn \
  --target rk3588

# 4. Batch evaluation
cd path/to/benchmark
python3 evaluate_v2.py

Key Terminology

For readers without a deep technical background, here are brief definitions of frequently used terms in this article.

Frequently Asked Questions (FAQ)

1. Why does the RK3588 NPU need a fixed 480×480 input for OCR?

This is locked in during INT8 quantization calibration, not a model-level limit. rknn_model_zoo’s PPOCR-Det INT8 version fixes input to 480×480 to keep quantization accuracy. Upscaling to 960 hurts accuracy because the features no longer match the training distribution.

2. How much slower is Server Det @ 960 compared to Mobile Det @ 480, and is it more accurate?

26× slower (4,400 ms vs 170 ms) and less accurate (CER 89.5% vs 27.1%). The reason: the server model is also trained at the 480 scale, so upscaling breaks its features.

3. Is PP-OCRv5 mobile better than v4 mobile on the RK3588 NPU?

No. v5 mobile detection boxes are only 3-5 px thick (v4 is 13-23 px), so the boxes are too thin and recognition fails. The dictionary grew from 6,625 to 18,383 characters, but accuracy did not improve.

4. Does the RKNN Python API support dynamic shapes?

Partially. The dynamic_input parameter lets you enumerate a few fixed shapes, but it is not true dynamic input. The C API does support true dynamic input, but upscaling the input still hurts accuracy.

5. Can the 170 ms per image go even faster?

Yes. Three directions:

6. How much accuracy does INT8 quantization lose?

For PP-OCRv4 mobile det, INT8 quantization loses < 2% accuracy in exchange for roughly 3× speedup. For OCR workloads this trade-off is almost always worth it.

7. Can I use PaddleOCR-VL (a VLM model) instead?

PaddleOCR-VL 0.9B is not currently feasible on RK3588: it requires ≥ 16 GB of memory, which an edge device cannot provide. A quantized PaddleOCR-VL 1.5B is a possible direction over the next 2-3 years, but this solution targets “printed text / simple layout ≥ 95%” scenarios.

8. Does the official rknn_model_zoo pipeline have bugs?

Yes. ppocr_system.py adds an extra cv2.resize(img, (480, 480)) line on top of the correct aspect-preserving resize inside ppocr_det.py, causing double resizing. The core code in §5.3 of this article works around that issue.

9. Should I fine-tune the model?

Only if 27.1% CER does not meet your business needs. Fine-tuning on 500 business documents is expected to give +10-15% accuracy, but requires annotation effort. If your scenario is title pages or forms (measured CER < 13%), the current model is already good enough.

10. Of the 170 ms, Det takes 144 ms and Rec takes 30 ms—where is the bottleneck?

Detection is the bottleneck (84% of the time). Recognition at FP16 with 48×320 input is already very light. Two ways to optimize detection: ① multi-scale fusion (3× time, +3-5% accuracy); ② fine-tune on business data (no time change, +10-15% accuracy).

References

All technical details, model specifications, performance numbers, and failed-experiment conclusions in this article can be traced to the following authoritative sources (sorted by citation frequency).

Official Repositories and Documentation

  1. rknn_model_zoo (https://github.com/airockchip/rknn_model_zoo): Rockchip’s official pre-converted RKNN model library, including ready-to-deploy .rknn files for PP-OCR Det/Rec
  2. PaddleOCR Open-Source Repository (https://github.com/PaddlePaddle/PaddleOCR): Official code, training scripts, and configuration files for Baidu’s PP-OCR family
  3. rknn-toolkit2 (https://github.com/rockchip-linux/rknn-toolkit2): Rockchip’s official RKNN model conversion and Python inference API toolkit
  4. rknpu2 Driver (https://github.com/rockchip-linux/rknpu2): Linux kernel driver source for the RK3588 NPU

Vendors and Ecosystem

  1. Rockchip Official Website (https://www.rock-chips.com/): RK3588 processor specifications, NPU compute, partner ecosystem
  2. PaddlePaddle Official Website (https://www.paddlepaddle.org.cn/): Baidu’s deep learning framework official homepage
  3. FastDeploy GitHub (https://github.com/PaddlePaddle/FastDeploy): Baidu’s inference deployment framework; the source of the 30-50% C++ deployment speedup

Data Benchmark Sources


Reproducibility statement: All test data, benchmarks, and code in this article were reproduced on an RK3588 + Galaxy Kylin V10 SP1 environment. Test date: June 4, 2026 | RKNN Toolkit: v2.3.2 | PaddleOCR: v4 mobile | Test set: 200 A4 document images, 6 layout types

About this article: This article was written by the Xi’an Boao Intelligent Technology Co., Ltd. RK3588 team based on engineering practice. It is intended for edge AI engineers, embedded developers, and OCR solution architects. For technical consulting or PoC support, please contact Xi’an Boao.
