Conference papers

HIT-MW Dataset for Offline Chinese Handwritten Text Recognition

Abstract : A Chinese handwritten text dataset, HIT-MW, is presented to facilitate offline Chinese handwritten text recognition. Texts for handcopying are sampled from the China Daily corpus in a stratified random manner. To collect naturally written handwriting, forms are distributed by postal mail or through a middleman instead of face to face. The current version of HIT-MW includes 853 forms and 186,444 characters, written by more than 780 participants under unconstrained conditions without preprinted character boxes. Its 3,041 distinct characters give a lexical coverage of about 99.33%, measured on the China Daily corpus of about 80 million characters. The handwritten texts of HIT-MW, mainly written by college students, follow a balanced distribution in both sex and department. The dataset can be used for Chinese text line segmentation and segmentation-free recognition, and to verify the effect of statistical language models in a real handwriting situation.
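
The lexical coverage figure quoted in the abstract can be illustrated with a minimal sketch. It assumes coverage means the share of corpus character occurrences whose character also appears in the dataset; the abstract does not spell out the exact definition, so this reading, the function name `lexical_coverage`, and the toy strings are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def lexical_coverage(dataset_text: str, corpus_text: str) -> float:
    """Share of corpus character occurrences covered by the dataset's character set.

    Assumption: coverage is weighted by occurrence counts in the corpus,
    and whitespace is ignored. The paper may define it differently.
    """
    # Distinct non-whitespace characters that occur in the dataset.
    charset = {ch for ch in dataset_text if not ch.isspace()}
    # Occurrence count of every non-whitespace character in the reference corpus.
    corpus_counts = Counter(ch for ch in corpus_text if not ch.isspace())
    total = sum(corpus_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for ch, n in corpus_counts.items() if ch in charset)
    return covered / total


# Toy data only; these strings stand in for the dataset and the reference corpus.
dataset = "中国日报"
corpus = "中国日报中国新闻"
print(f"coverage = {lexical_coverage(dataset, corpus):.2%}")
```

Under this reading, a small character set can still cover most of a corpus because frequent characters dominate the occurrence counts.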

Contributor : Anne Jaigu
Submitted on : Thursday, October 5, 2006 - 11:01:49 AM
Last modification on : Thursday, October 5, 2006 - 11:20:04 AM
Long-term archiving on : Tuesday, April 6, 2010 - 6:21:41 PM


  • HAL Id : inria-00103725, version 1



Tonghua Su, Tianwen Zhang, Dejun Guan. HIT-MW Dataset for Offline Chinese Handwritten Text Recognition. Tenth International Workshop on Frontiers in Handwriting Recognition, Université de Rennes 1, Oct 2006, La Baule (France). ⟨inria-00103725⟩


