The DocLab Dataset for Evaluating Table Interpretation Methods
Raghav Krishna Padmanabhan, 89, 14th Street, Apt #1, Troy, NY 12180, USA. Email: raghav.krishna[at]gmail.com. Tel: 979-571-5551
Augmentations, Aggregates, Evaluation, Footnotes, Table interpretation
The dataset is a collection of 165 files culled from 9 websites in the geopolitical domain. Each file is in one of the following formats: HTML (77), Excel (67), or CSV (20). Each file contains at least one table, and the dataset contains 172 tables in total.
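As a rough illustration of how the per-format file counts above could be checked against a local copy of the dataset, the sketch below tallies file names by format. The extension-to-format mapping and the function name `count_formats` are assumptions made for this example; the description does not state which file extensions the dataset actually uses.

```python
from collections import Counter
from pathlib import PurePath

# Assumed mapping from file extension to format label. The extensions
# actually used in the dataset are not stated in this description.
FORMAT_BY_EXTENSION = {
    ".html": "HTML",
    ".htm": "HTML",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".csv": "CSV",
}


def count_formats(file_names):
    """Return a Counter of format labels for the given file names.

    Files whose extension is not in FORMAT_BY_EXTENSION are counted
    under the label "other".
    """
    counts = Counter()
    for name in file_names:
        suffix = PurePath(name).suffix.lower()
        counts[FORMAT_BY_EXTENSION.get(suffix, "other")] += 1
    return counts
```

Passing the list of file names from a local copy to `count_formats` gives totals that can be compared with the counts quoted in the description.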
DATASET CONSTRUCTION: The files comprising the dataset were selected based on the following constraints on the tables they contained.
1. Tables with rectilinear structure only.
2. Tables with text in English only.
3. Tables that do not contain graphic symbols or figures.
4. Non-recursive tables, i.e., no table with a table as one of its content cells.
5. Non-concatenated tables (no tables formed by concatenating two or more tables).
6. Tables that do not span more than one HTML page or Excel sheet.
Statistics for each table are provided in an Excel file. The information recorded is table size (number of rows and columns), augmentations (aggregates, footnotes, units), Wang dimensionality, and source website.
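One way to hold a row of these per-table statistics in code is a small record type. The sketch below is hypothetical: the field names, the use of booleans for the augmentations, and the integer type for Wang dimensionality are assumptions for illustration, not the actual column layout of the Excel file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical record for one table's statistics.

    Field names and types are assumptions; the real Excel file may
    record the augmentations as counts or text rather than flags.
    """

    rows: int
    columns: int
    has_aggregates: bool
    has_footnotes: bool
    has_units: bool
    wang_dimensionality: int
    source_website: str

    @property
    def cell_count(self) -> int:
        # Upper bound on the number of cells in a rectilinear table.
        return self.rows * self.columns
```

A record can then be built from one spreadsheet row, for example `TableStats(10, 4, True, False, True, 2, "example.org")`, where all values shown are made up.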
Related Ground Truth Data
- Padmanabhan, R., Jandhyala, R. C., Krishnamoorthy, M., Nagy, G., Seth, S., Silversmith, W.: Interactive Conversion of Web Tables. In: Procs. Eighth IAPR International Workshop on Graphics Recognition (GREC 2009), City University of La Rochelle, France, Lecture Notes in Computer Science, 6020, Springer, Heidelberg (In Press) (2010)
- Seth, S., Jandhyala, R. C., Krishnamoorthy, M., Nagy, G.: Analysis and Taxonomy of Column Header Categories for Web Tables (Oral Presentation). In: Procs. Ninth IAPR International Workshop on Document Analysis Systems, Boston, Massachusetts (2010), ID: 73
- Nagy, G., Padmanabhan, R., Jandhyala, R. C., Silversmith, W., Krishnamoorthy, M.: Table Metadata: Headers, Augmentations and Aggregates. In: Procs. Ninth IAPR International Workshop on Document Analysis Systems, Boston, Massachusetts (2010), ID: 77
- Padmanabhan, R.: Table Abstraction Tool, Master’s Thesis, Rensselaer Polytechnic Institute, May 2009.