Data Sets

The competition will be carried out on the LAMIS-MSHD, CERUG and WDAD data sets. The first data set contains Arabic and French handwritten texts, the second contains Chinese and English texts, and the third contains Farsi and English samples.

The LAMIS-MSHD data set comprises 1200 handwritten text images written by 100 different writers. All 100 writers are adults, to ensure that each has an established, characteristic handwriting style. Each writer produced 12 handwritten documents: the first 6 pages contain Arabic handwritten text and the remaining 6 pages contain French handwritten text. The text on each of the 12 pages is different, and all writers copied the same texts.

The CERUG data set contains handwritten documents collected from 105 Chinese subjects, predominantly students from China. Some of them live in China and the rest study in the Netherlands. Every subject was required to write four different A4 pages. On page 1, the participants were asked to copy a text of two paragraphs in Chinese. On page 2, the subjects described, in their own words in Chinese, topics of their choice. Page 3 contains English text copied from two paragraphs. This page is split into two sub-pages, each containing one paragraph. In total, there are four handwritten samples from each writer, two in Chinese and two in English. All the documents were scanned at 300 dpi, 8 bits/pixel, gray-scale.

The WDAD data set comprises 800 handwritten text images written by 200 different writers. All the forms in the data set were filled in by writers of different ages, handedness, educational levels, and genders. Each writer produced 4 handwritten documents: the first 2 pages contain Farsi handwritten text and the remaining 2 pages contain English handwritten text. The text on each of the 4 pages is different, and all writers copied the same texts.

a) Dataset Distribution for Tasks 1 and 2

Tasks 1 and 2 will be carried out on all writing samples of the 105 writers of the CERUG data set. For each task, 50 writing samples will be provided as a validation set, while 80 test samples per task will be used to evaluate system performance.

The training data comprises 80 samples in Chinese and 80 in English from a total of 40 different writers, while the validation data contains Chinese and English handwriting samples from 25 different writers. The naming convention of the images is AAA_B, where AAA represents the writer ID and B represents the sample number. The training and validation data will be grouped by task.
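The naming convention can be turned into a small parsing helper. The following is a minimal sketch, not an official tool of the competition: the function name `parse_image_name`, the example file name and its `.png` extension are hypothetical, and the code assumes only that the writer ID and the sample number are separated by a single underscore.

```python
import re
from pathlib import Path

# Assumed pattern: <writer ID>_<sample number>, with no other underscores.
# The file extension, if any, is ignored.
_NAME_PATTERN = re.compile(r"^(?P<writer>[^_]+)_(?P<sample>[^_]+)$")


def parse_image_name(filename):
    """Return (writer_id, sample_number) as strings for a name like AAA_B."""
    stem = Path(filename).stem  # drop the directory and the extension
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected image name: {filename!r}")
    return match.group("writer"), match.group("sample")


# Hypothetical file name used only for illustration.
writer_id, sample_number = parse_image_name("012_3.png")
print(writer_id, sample_number)
```

Both parts are kept as strings so that leading zeros in the writer ID are preserved. The same helper would apply to the CCC_D and EEE_F conventions of the other two data sets, since they have the same shape.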

The test set will comprise 160 unlabeled handwritten images, 80 in Chinese and 80 in English. The test data will also be grouped by task and will be provided to the participants so that they can evaluate their systems and submit the results.

b) Dataset Distribution for Tasks 3 and 4

Tasks 3 and 4 will be carried out on all writing samples of the 100 writers of the LAMIS-MSHD data set. For each task, 120 writing samples will be provided as a validation set, while 240 test samples per task will be used to evaluate system performance.

The training data comprises 240 samples in Arabic and 240 in French from a total of 40 different writers, while the validation data contains Arabic and French handwriting samples from 20 different writers. The naming convention of the images is CCC_D, where CCC represents the writer ID and D represents the sample number. The training and validation data will be grouped by task.

The test set will comprise 480 unlabeled handwritten images, 240 in Arabic and 240 in French. The test data will also be grouped by task and will be provided to the participants so that they can evaluate their systems and submit the results.
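The LAMIS-MSHD figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function `split_summary` is hypothetical, and it assumes that every writer contributes all 6 samples per language to exactly one of the training, validation and test sets, which the description implies but does not state.

```python
def split_summary(samples_per_writer, train_writers, val_writers, test_samples):
    """Per-language sample counts and implied writer counts for one data set.

    Assumes each writer appears in exactly one split with all of their samples.
    """
    test_writers, remainder = divmod(test_samples, samples_per_writer)
    if remainder:
        raise ValueError("test samples are not a whole number of writers")
    return {
        "train_samples": train_writers * samples_per_writer,
        "val_samples": val_writers * samples_per_writer,
        "test_writers": test_writers,
        "total_writers": train_writers + val_writers + test_writers,
    }


# LAMIS-MSHD: 6 pages per language, 40 training writers, 20 validation
# writers and 240 test samples per language, as stated above.
lamis = split_summary(samples_per_writer=6, train_writers=40,
                      val_writers=20, test_samples=240)
print(lamis)
# {'train_samples': 240, 'val_samples': 120, 'test_writers': 40, 'total_writers': 100}
```

Under that assumption the numbers agree with the text: 240 training and 120 validation samples per language, and 40 + 20 + 40 = 100 writers in total.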

c) Dataset Distribution for Tasks 5 and 6

Tasks 5 and 6 will be carried out on all writing samples of the 200 writers of the WDAD data set. For each task, 80 writing samples will be provided as a validation set, while 80 test samples per task will be used to evaluate system performance.

The training data comprises 160 samples in Farsi and 160 in English from a total of 160 different writers, while the validation data contains Farsi and English handwriting samples from 160 different writers. The naming convention of the images is EEE_F, where EEE represents the writer ID and F represents the sample number. The training and validation data will be grouped by task.

The test set will comprise 320 unlabeled handwritten images, 160 in Farsi and 160 in English. The test data will also be grouped by task and will be provided to the participants so that they can evaluate their systems and submit the results.
