OCR Evaluation for LRDE DBD
Keywords: scanned, magazine, documents, OCR
OCR evaluation: lines are extracted from the binarization outputs and OCR (Tesseract) is run on them so that the result can be compared to the OCR ground truth. The evaluation is performed on binarizations of the “clean”, “scanned” and “original” documents.
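The comparison step can be sketched as a character-level score between Tesseract's output and the ground-truth transcription. The function names and the accuracy convention below are illustrative assumptions, not the benchmark's actual code:

```python
# Hedged sketch: character accuracy between an OCR result and its
# ground-truth transcription, based on Levenshtein edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ocr_text, ground_truth):
    """1.0 means a perfect match; lower means more OCR errors."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    dist = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - float(dist) / len(ground_truth))
```

For example, `char_accuracy("binarizatlon", "binarization")` scores one substitution error over twelve ground-truth characters.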
Purpose of the three document qualities:
- Original: evaluate binarization quality on perfect documents mixing text and images.
- Clean: evaluate binarization quality on perfect documents with text only.
- Scanned: evaluate binarization quality on slightly degraded documents with text only.
Lines used for the OCR evaluation are also grouped by size: small, medium and large (0 < small <= 30 < medium <= 55 < large < +inf). This shows how robust a binarization algorithm is to objects of different sizes within a single document.
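Assuming the size of a line refers to its height in pixels (the unit is an assumption; the thresholds are taken from the ranges above), the grouping could be implemented as:

```python
def size_group(height):
    """Map a text-line height to its size bucket.

    Thresholds follow the ranges given above:
    0 < small <= 30 < medium <= 55 < large < +inf.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    if height <= 30:
        return "small"
    if height <= 55:
        return "medium"
    return "large"
```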
Tools are provided to read and process all the data.
A setup script is provided to download and configure the benchmarking environment.
A Python script is provided to launch the benchmark and compute scores.
C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.
Six binarization algorithms (with their respective C++ sources) are provided and compiled so that the benchmark can be run on their results.
A setup script is available to download and set up the benchmark system. This is the recommended way to run this benchmark. Note that this script can also update the dataset when a new version is released.
Minimum requirements: 5 GB of free disk space, Linux (Ubuntu, Debian, …).
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).
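On Ubuntu or Debian, installing these dependencies by hand presumably amounts to something like the following (package names are taken from the list above; root privileges via sudo are assumed):

```shell
# Illustrative only: manual installation of the listed dependencies
# on an Ubuntu/Debian system (the setup script does this automatically).
sudo apt-get install python2.7 tesseract-ocr tesseract-ocr-fra git \
    libgraphicsmagick++1-dev graphicsmagick-imagemagick-compat \
    graphicsmagick-libmagick-dev-compat build-essential libtool \
    automake autoconf g++-4.6 libqt4-dev
```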
Related Ground Truth Data
- Tools for processing (0.08 MB)