	-*- outline -*-

Dataset name: LRDE Document Binarization Dataset (LRDE DBD)
Authors: Guillaume Lazzara and Thierry Géraud


* Introduction
* Content
* Tools
* Benchmark
* Requirements
* Running the benchmark
* Getting results
* Cleaning directories
* FAQ
** How to run this benchmark for specific implementations?
** How to run this benchmark for my own algorithm on Linux?
** How to run this benchmark for my own algorithm NOT RUNNING ON LINUX?
* Updates and Fixes
* License and Copyright
* Acknowledgements
* Contacts
* Related Links




* Introduction

  This benchmark has been developed to evaluate the quality of
  binarization algorithms.

  It provides all the tools necessary to reproduce the results
  reported in papers using this dataset and to extend the benchmark to
  other methods.

  Feel free to use it and share your results in your publications; in
  that case, please cite our paper (see section "License and
  Copyright").

  Comments and improvements are welcome!

  Thanks and enjoy!



* Content

  src/
    eval_gt.cc		Source code of the binarization groundtruth
  	     		evaluation program.

    edit_dist/  	Source code of the program computing OCR
    			output edit distance.

    line_maker/		Source code of the program extracting text
    			lines images from binarization outputs.

  bench/
    bench.py		The Python script running the evaluations.

    bin.conf		Files configuring each binarization algorithm
			variants (parameters, binaries, data conversion, ...)

    lib/
      html/ 		A python library used to generate HTML outputs.



* Tools

  This benchmark provides implementations and binaries for the
  following algorithms:

  - Kim
  - Niblack
  - Otsu
  - Sauvola
  - Sauvola Multiscale
  - Wolf

  These implementations are based on the Olena image processing
  platform. They are released under the GNU GPLv2 license.



* Benchmark

The bench.py Python script runs the full benchmark process, which is
performed in two steps:

1/ The quality of the binarization of the "clean" documents is
evaluated.

2/ For each binarization algorithm, a selection of text lines is
passed to the OCR. The OCR output is then compared to the groundtruth
and evaluated using the mean edit distance. Lines are grouped by
x-height (small, medium and large); results are given for each size
and document quality ("clean", "scanned" and "orig").



* Requirements

  - 5 GB of free disk space.
  - Python 2.7 or later.



* Running the benchmark

  python bench.py

  Now it's time for coffee. :)

  If you encounter any error, please send an email to the authors
  including the generated bench.log file.

  Note that the benchmark script does not recompute image outputs
  unless you force it to with the '--force-regen-output' option.  This
  is intended to limit computation time and to allow the use of
  pre-computed data (for Windows implementations, for example).



* Getting results

  At the end of the benchmark, you can check that everything went well
  by comparing your results to those reported in the paper: they
  should be identical.

  The results are located in the 'result' directory.

  It contains the following files:

    - bin_evaluation_[implementation].{csv,html}:

	 Pixel-based evaluation results.

    - ocr_evaluation_per_file_[text_size]_[quality].csv:

	Edit distance for each implementation on each line, with text
	of size [text_size] and from images of quality [quality].

    - *.html files summarize the results in a more readable form.


  The pixel-based evaluation is performed only on 'clean' documents.

  All output images and OCR outputs are stored in the 'output'
  directory.
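The CSV files can be post-processed with a few lines of Python. The
sketch below assumes a simple layout with 'implementation' and
'edit_distance' columns; these column names are assumptions, so check
them against the actual headers in your files:

```python
import csv
import io
from collections import defaultdict

def mean_distance_per_impl(csv_text):
    """Average the (assumed) 'edit_distance' column per implementation."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sums[row["implementation"]] += float(row["edit_distance"])
        counts[row["implementation"]] += 1
    return {impl: sums[impl] / counts[impl] for impl in sums}

# Hypothetical sample data mimicking the assumed CSV layout.
sample = "implementation,edit_distance\nsauvola,2\nsauvola,4\notsu,1\n"
```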



* Cleaning directories

There is currently no way to properly clean the directories. There
are two options:

- Delete this directory and set it up again using the setup script.

- Remove the "result" directory and launch bench/bench.py with the
  "--force-regen-output" option.



* FAQ

** How to run this benchmark for specific implementations?

For instance, to run this benchmark with Sauvola only:

  cd bench ; python ./bench.py --use-impl 'sauvola'



** How to run this benchmark for my own algorithm on Linux?

You can include your own algorithm in this benchmark.

0/ Make sure your program:
   - Reads and writes PNG for input and output.
   - Has a usage matching the following syntax:
         ./my_algo [options] <input.png> <output.png>

1/ Copy a binary of your program 'my_algo' into bench/bin.

2/ In bench/bin.conf, create a file with the same name,
'my_algo.conf'. It describes how to run your binary and should look
something like this:

  my_algo --my-option1 toto --my-option2 plop

Your algorithm will then be launched with the input and output paths
appended as extra arguments:

  my_algo --my-option1 toto --my-option2 plop input.png output.png

3/ Run the benchmark for your algorithm:

   cd bench; python bench.py --use-implementation "my_algo"

4/ Get the results in bench/result
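The way the .conf file from step 2 is expanded into a command line can
be pictured as follows. This is a hypothetical sketch of the
mechanism, not the actual bench.py code:

```python
import shlex

def build_command(conf_line, input_png, output_png):
    """Split the .conf command line and append the input and output
    paths as extra arguments, as the benchmark does when launching an
    algorithm."""
    return shlex.split(conf_line) + [input_png, output_png]

cmd = build_command("my_algo --my-option1 toto --my-option2 plop",
                    "input.png", "output.png")
# cmd could then be run with subprocess.call(cmd)
```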



** How to run this benchmark for my own algorithm NOT RUNNING ON LINUX?

No problem. If your algorithm runs on Windows, under MATLAB or with
other tools, you can simply feed the benchmark with pre-computed
results.

1/ Compute a binarization of each document in bench/input. Generated
files must have the same names as the inputs and must be stored in
bench/output/my_algo.

2/ Create an empty file named 'my_algo.conf' in bench/bin.conf.

3/ Run the benchmark for your algorithm:

   cd bench; python bench.py --use-implementation "my_algo"

Make sure *NOT* to use the --force-regen-output option while
benchmarking algorithms with pre-computed results.

4/ Get the results in bench/result
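To check that your pre-computed outputs follow the naming rule from
step 1 above, a small helper can map each input to its required output
path. The helper itself is hypothetical; only the bench/output/my_algo
layout comes from the instructions above:

```python
import os

def expected_output_path(input_path, impl="my_algo"):
    """Pre-computed results must keep the input file name and live
    under bench/output/<impl>/."""
    return os.path.join("bench", "output", impl,
                        os.path.basename(input_path))
```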



* Updates and Fixes

This tool is meant to be updated and improved by the community.  You
are invited to send bug fixes, suggestions and comments on both the
tools and the dataset.

From time to time, we invite you to check for the latest version of
this benchmark using the setup script available here:

http://www.lrde.epita.fr/dload/olena/datasets/dbd/setup.py



* License and Copyright

LRDE is the copyright holder of all the images included in the
dataset, except for the original documents subset, which is
copyrighted by Le Nouvel Observateur. This work is based on the French
magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

You are allowed to reuse these documents for research purposes, for
evaluation and illustration. If you do so, please include the
following copyright notice: "Copyright (c) 2012. EPITA Research and
Development Laboratory (LRDE) with permission from Le Nouvel
Observateur". You are not allowed to redistribute this dataset.

If you use this dataset, please also cite the most appropriate paper
from this list:

- Efficient Multiscale Sauvola's Binarization. In the International
  Journal of Document Analysis and Recognition, 2013.

- The SCRIBO Module of the Olena Platform: a Free Software Framework
  for Document Image Analysis. In the proceedings of the 11th
  International Conference on Document Analysis and Recognition
  (ICDAR), 2011.

This data set is provided "as is" and without any express or implied
warranties, including, without limitation, the implied warranties of
merchantability and fitness for a particular purpose.



* Acknowledgements

The LRDE is very grateful to Yan Gilbert, who accepted that we use and
publish as data some pages from the French magazine "Le Nouvel
Observateur" (issue 2402, November 18th-24th, 2010) for our
experiments.



* Contacts

Guillaume Lazzara - z@lrde.epita.fr	- Maintainer
Thierry Géraud 	  - theo@lrde.epita.fr



* Related Links

** All Resources Related to the Paper
http://publications.lrde.epita.fr/201209-IJDAR

** Dataset Resources Page
http://olena.lrde.epita.fr/Datasets

** Olena - A Generic Image Processing Platform
http://olena.lrde.epita.fr

** Olena Demos
http://caroussel.lrde.epita.fr/olena/demos

*** Sauvola Multiscale Online Demo
http://caroussel.lrde.epita.fr/olena/demos/sauvola_ms.php

** Publications Related to Olena
http://olena.lrde.epita.fr/Publications

** EPITA Research and Development Laboratory (LRDE)
http://www.lrde.epita.fr
