ICDAR 2003




This page lists the ICDAR 2003 datasets available for download on this site: Robust Reading, Robust Word Recognition, Robust OCR, Text Locating and Cursive Script.

Please note that the Page Segmentation and Table Segmentation competitions have their own separate datasets and procedures.

Robust Reading Datasets

These datasets were collected and tagged by the ICDAR 2003 Robust Reading Dataset Collection Team (photo; clockwise from left: Shirley Wong, Simon Lucas, Alex Panaretos, Luis Sosa Velazquez, Robert Young, Anthony Tang).

The datasets are organized into Sample, Trial and Competition datasets.

Sample datasets are provided to give you a quick impression of the data, and also to allow function testing of your software. That is, you can run tests on the sample data to check that your software works with the data, but the results won't mean much.

Trial datasets serve two purposes. First, use them to get results for your ICDAR 2003 papers: they are partitioned into two sets, TrialTrain and TrialTest, so that you can train or tune your algorithms on TrialTrain and then quote results on TrialTest. Second, for the competitions, you should train or tune your system on the entire Trial set.

Competition datasets will be used to measure the performance of your algorithms for the competitions. These will be kept private until the ICDAR 2003 conference, when they will be made public.

Robust Reading and Text Locating

Each dataset is provided as a zip file, and contains a set of JPEG scene images, and three XML tag files: locations.xml, words.xml and segmentation.xml.

locations.xml is for the Text Locating problem, and contains the path to each image and the set of rectangles for each image.
words.xml is for the Robust Reading competition: it tags each image with the bounding rectangles of each word in the image, together with the text in each rectangle.
segmentation.xml is like words.xml, except that each word is also given its segmentation points, in case this information is useful to your algorithm (e.g. it may be used to speed up EM).
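
As a rough illustration of how the tag files described above might be read, the following Python sketch parses a small inline sample into a mapping from image path to tagged rectangles. The element and attribute names (tagset, image, imageName, taggedRectangle, x, y, width, height, tag) and the sample paths are assumptions made for this example, not taken from this page; check them against the XML files in the zip before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of words.xml. All element and
# attribute names here are assumptions; verify against the real files.
SAMPLE_WORDS_XML = """<tagset>
  <image>
    <imageName>scene/img001.jpg</imageName>
    <taggedRectangles>
      <taggedRectangle x="10" y="20" width="100" height="30">
        <tag>EXIT</tag>
      </taggedRectangle>
      <taggedRectangle x="15" y="70" width="80" height="25">
        <tag>ONLY</tag>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>"""


def parse_tagged_images(xml_text):
    """Return {image path: [(x, y, width, height, text_or_None), ...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        path = image.findtext("imageName")
        rects = []
        for rect in image.iter("taggedRectangle"):
            box = tuple(float(rect.get(k)) for k in ("x", "y", "width", "height"))
            # A locations-style file may carry no text; findtext then yields None.
            rects.append(box + (rect.findtext("tag"),))
        result[path] = rects
    return result


words = parse_tagged_images(SAMPLE_WORDS_XML)
for path, rects in words.items():
    print(path, rects)
```

The same function would serve for a locations-style file under these assumptions, since a rectangle without a text child simply yields None in the last position.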

Robust Word Recognition

Each dataset is provided as a zip file, and contains a set of JPEG images of single words and an XML tag file.

Robust OCR

Each dataset is provided as a zip file, and contains a set of JPEG images of single characters and an XML tag file.
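
For the single-word and single-character datasets, the XML tag file pairs each image with its label. A minimal sketch of loading such a file is shown below; it assumes a flat list of image elements with file and tag attributes, and both the layout and the sample paths are hypothetical, so compare them with the actual tag file before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample tag file. The imagelist/image layout and the
# file/tag attribute names are assumptions, not documented on this page.
SAMPLE_TAG_XML = """<imagelist>
  <image file="word/1/1.jpg" tag="EXIT"/>
  <image file="word/1/2.jpg" tag="ONLY"/>
</imagelist>"""


def load_labels(xml_text):
    """Return a list of (image path, label) pairs in file order."""
    root = ET.fromstring(xml_text)
    return [(img.get("file"), img.get("tag")) for img in root.iter("image")]


labels = load_labels(SAMPLE_TAG_XML)
for path, label in labels:
    print(path, label)
```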

Cursive Script Recognition

Off-line cursive script datasets are available here.

Hosted with kind thanks to the University of Essex, 2002.
