Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HIT-OR3C)

Latest revision as of 18:04, 27 January 2011

Created: 2010-04-30

Last updated: 2011-01-27

Contact Author

Dr Qingcai Chen
Shenzhen Graduate School,
Harbin Institute of Technology.
Shenzhen, P.R. China 518055
Email: qingcai.chen@hitsz.edu.cn
Tel: +86 755 26033475

Shusen Zhou
Harbin Institute of Technology.
Shenzhen, P.R. China 518055
Email: zhoushusen@gmail.com

Current Version

1.1

Keywords

HIT-OR3C, Online handwriting, Chinese, characters, documents, Character recognition corpus

Description

Example of characters in the dataset

HIT-OR3C is a dataset of handwritten Chinese characters that provides both online and offline information. The characters were collected on a handwriting pad and were recorded and labelled automatically by the handwriting document collection software, the OR3C Toolkit. The collection software is also made available (supplied version is in Chinese).

The dataset is organised in 5 subsets: 4 subsets of characters [Digit (1-10), Letter (11-62), GB1 (63-3817), GB2 (3818-6825)], and 1 subset of documents.
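The index ranges above can be expressed as a small lookup. The following is a minimal sketch, assuming the class indices are 1-based and inclusive exactly as listed; the names `SUBSET_RANGES`, `subset_of` and `subset_size` are illustrative and are not taken from the supplied source code.

```python
# Inclusive, 1-based class-index ranges as listed in the description.
# SUBSET_RANGES, subset_of and subset_size are hypothetical names.
SUBSET_RANGES = {
    "Digit": (1, 10),
    "Letter": (11, 62),
    "GB1": (63, 3817),
    "GB2": (3818, 6825),
}


def subset_of(class_index: int) -> str:
    """Return the character subset that a class index falls into."""
    for name, (first, last) in SUBSET_RANGES.items():
        if first <= class_index <= last:
            return name
    raise ValueError(f"class index {class_index} is outside 1..6825")


def subset_size(name: str) -> int:
    """Number of classes in a subset (the range is inclusive)."""
    first, last = SUBSET_RANGES[name]
    return last - first + 1
```

Under these assumptions the four subsets hold 10, 52, 3,755 and 3,008 classes respectively, which sum to 6,825. The GB1 and GB2 sizes match the number of level 1 and level 2 Chinese characters in GB2312.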

The 4 subsets of characters contain 6,825 classes produced by 122 subjects, for 832,650 samples in total. For each subject, one file holds the online data and another holds the offline data (see below for the file format used). The different subsets are defined as index ranges within these files.

The document corpus consists of 10 news articles containing 77,168 samples in total, drawn from 2,442 classes and produced by 20 subjects. The captured document data have been post-processed and split into individual characters. Each character is resized to 128 x 128 pixels, and the characters are stored sequentially in a single image file and a single vector file, as in the first four subsets.

The dataset contains 909,818 images. The total size of the dataset is 15.5 GB (1125 Mb compressed).
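The counts quoted in this description are mutually consistent, which can be checked with a few lines of arithmetic. This is only a sanity check on the published figures; the constant names are illustrative, and the reading that every subject wrote each class exactly once is an inference from the numbers rather than a statement from the authors.

```python
# Figures quoted in the description (constant names are illustrative).
CHARACTER_CLASSES = 6825
CHARACTER_SUBJECTS = 122
CHARACTER_SAMPLES = 832_650
DOCUMENT_SAMPLES = 77_168
TOTAL_IMAGES = 909_818

# Inference (not stated by the authors): each subject wrote every
# character class exactly once, so classes x subjects = samples.
assert CHARACTER_CLASSES * CHARACTER_SUBJECTS == CHARACTER_SAMPLES
samples_per_subject = CHARACTER_SAMPLES // CHARACTER_SUBJECTS

# Character samples plus document samples give the total image count.
assert CHARACTER_SAMPLES + DOCUMENT_SAMPLES == TOTAL_IMAGES
```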

The data are stored in three custom file formats, which are described in the accompanying documentation (see the File Format Specification under Submitted Files). The individual character images are 128 x 128 greyscale.

Metadata

A label is provided for each image. The labels of digits and letters are encoded in ASCII; the labels of Chinese characters are encoded in GB2312-80. Each folder contains a label file named “labels.txt”.
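Because GB2312 in its usual byte encoding leaves single-byte ASCII values unchanged, one codec can decode both kinds of label. The sketch below assumes a label has already been read as a byte string (one byte for a digit or letter, two bytes for a Chinese character). How labels are delimited inside “labels.txt” is defined in the File Format Specification and is not reproduced here, and `decode_label` is a hypothetical helper name.

```python
def decode_label(raw: bytes) -> str:
    """Decode one label given as raw bytes.

    Python's built-in "gb2312" codec maps single-byte ASCII values to
    themselves, so it covers digit/letter labels as well as two-byte
    Chinese character labels.
    """
    return raw.decode("gb2312")


print(decode_label(b"7"))         # an ASCII digit label
print(decode_label(b"\xb0\xa1"))  # first level 1 character in GB2312
```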

Related Ground Truth Data

N/A

Related Tasks

Handwriting recognition for Chinese characters

References

S. Zhou, Q. Chen, X. Wang, “HIT-OR3C: An Opening Recognition Corpus for Chinese Characters”, DAS 2010, to appear

Version Correspondence

Dataset	Task
V1.0	V1.0
V1.1	V1.0

Submitted Files

Version 1.1

Files

Offline Characters (807 Mb)

Offline Documents (120 Mb)

Online Characters (146 Mb)

Online Documents (21 Mb)

File Format Specification (English or Chinese)

Source Code for using the dataset (C++, Java, MATLAB)

HIT-OR3C Toolkit (English, Chinese)

HIT-OR3C Toolkit manual (English, Chinese)

