Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition


📄 English Summary

Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition

Online handwriting recognition based on inertial measurement units (IMUs) can recognize writing from signals collected on different writing surfaces, but it is hampered by uneven character distributions and by variability between writers. This study systematically investigates two strategies for addressing these problems: sub-word tokenization and concatenation-based data augmentation. Experiments on the OnHW-Words500 dataset show that the two kinds of variance, inter-writer and intra-writer, respond differently. On the writer-independent split, the structural abstraction provided by bigram tokenization improves recognition of unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. On the writer-dependent split, the outcome is different.
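To make the terms in the summary concrete, here is a minimal Python sketch of bigram tokenization, concatenation-based augmentation, and a WER calculation. The summary does not specify the study's exact procedures, so everything here is an assumption made for illustration: the function names are hypothetical, bigrams are taken as non-overlapping character pairs, augmentation joins two samples end to end, and WER is computed as the fraction of isolated words that are not matched exactly.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (IMU signal values, word label); assumed layout


def bigram_tokens(word: str) -> List[str]:
    """Split a word into non-overlapping character pairs (assumed scheme).

    A word of odd length ends with a single-character token.
    """
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def concat_augment(first: Sample, second: Sample) -> Tuple[List[float], str]:
    """Build a new training sample by joining two samples end to end.

    Signals are concatenated in time and labels are concatenated as strings.
    """
    (signal_a, label_a), (signal_b, label_b) = first, second
    return list(signal_a) + list(signal_b), label_a + label_b


def word_error_rate(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of isolated words whose prediction is not an exact match."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    errors = sum(p != t for p, t in zip(predictions, targets))
    return errors / len(targets)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before
```

Applied to the figures reported above, `relative_reduction(15.40, 12.99)` is about 0.156, so the drop of 2.41 percentage points on the writer-independent split corresponds to roughly a 15.6% relative reduction in WER.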
