DAS-Discussion: Datasets, Benchmarks, Competition, and Continuity of Research

Last updated: 2015-01-02

DAS Working Subgroup Meeting: Datasets and Benchmarks

Authors:

Bart Lamiroy (Secretary) – Université de Lorraine

Further Participants:

Elisa Barney Smith – Boise State University
Abdel Belaïd – Université de Lorraine
John Fletcher – Canon
Liangcai Gao – Peking University
Albert Gordo – CVC Barcelona
Masakazu Iwamura – Osaka Prefecture University
Dan Lopresti – Lehigh University
Tomohsa Matsushita – Tokyo University of Agriculture and Technology
Jean-Yves Ramel – Université de Tours
Marc-Peter Schambach – Siemens
Ray Smith (Moderator) – Google Inc.

Context

The goal of this discussion group is to address the availability, use and dissemination of benchmarks, datasets and ground truth in order to promote objective and reproducible assessment of document analysis methods, collaboration and exchange of research results in the document analysis domain. The main idea is that “what you measure is what improves”, and that it is difficult to obtain reliable measures expressing the global progress of the state-of-the-art.

Topic Discussion History

For the evolution of this topic as discussed during other DAS editions, we refer the interested reader to the TC-11 website. In 2010 the discussion focused mainly on making datasets and other reference material available to the community: how to provide centralized access to it, how to credit and value contributors, and how to maintain a level of control (data curation, availability over time, …) that would ensure that the data and algorithms remain usable and useful for as long as possible. The reported discussions were essentially concerned with the feasibility of these concepts, rather than their impact, and focused on the TC-11 initiative of data collection and the DAE platform (http://dae.cse.lehigh.edu).

Discussion Topics

During the DAS 2012 edition, the following potential discussion topics were identified after a short brainstorming session, ranked by order of (subjectively) perceived importance:

1. When is a problem stated? Should CFPs be more specific about which topics to address and how they should (could) be measured? How does this relate to hosting competitions? How does it interact with whole or end-to-end evaluation systems?
2. What are the fundamental reasons for the perceived difficulties in sharing data sets? (public vs. copyright vs. privacy)
3. Would it be a good idea to more formally integrate the availability of data sets and reports of benchmarking into the acceptance criteria for publications?
4. Is there a risk of data sets directing research? Is this good or bad?
5. Open binaries/open source?

When is a Problem Stated?

The discussion panel members consider this question an essential preliminary step to D. Lopresti and G. Nagy's paper “When is a Problem Solved” in ICDAR 2011. It relates to the initially identified issue of how difficult it is to measure the overall contribution of individual research results to the improvement of the global state-of-the-art. Stating a problem is related to measuring some level of achievement, and is therefore directly tied to expressing ground truth. One may conjecture that a problem is stated when there is consensus on the ground truth on the one hand, and a data set collection of statistically proven significance on the other. Measurement of advancement toward solving a stated problem would then consist of:

a track record of results over time,
best practices defined by the community.
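
As a purely illustrative sketch of what a "track record of results over time" could look like in practice, the following Python snippet keeps only those results that improved on everything reported before them. The `Result` type, its fields, the method names and the scores are all hypothetical placeholders, not data from any real benchmark; the sketch also assumes a single score per result where higher is better.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One reported result on an agreed-upon benchmark (hypothetical fields)."""
    method: str
    reported_on: date
    score: float  # assumption: higher is better, e.g. accuracy in [0, 1]


def best_so_far(results):
    """Return, in date order, the results that beat every earlier result."""
    record = []
    best = float("-inf")
    for r in sorted(results, key=lambda r: r.reported_on):
        if r.score > best:
            best = r.score
            record.append(r)
    return record


# Placeholder values, invented for illustration only.
results = [
    Result("method A", date(2009, 7, 1), 0.81),
    Result("method B", date(2010, 6, 1), 0.79),
    Result("method C", date(2011, 9, 1), 0.86),
]
track_record = best_so_far(results)
for r in track_record:
    print(r.reported_on.year, r.method, r.score)
```

Such a record only makes sense once the ground truth and the data set are agreed upon, which is exactly the condition conjectured above for a problem to be "stated".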

This means that the evolution of the best practices (and the track record of the results) could give a more precise view of the improvement of the agreed-upon state-of-the-art. It also means that the community needs to comment on and annotate the reference data sets, and that individual research results may need to be evaluated within the scope of broader criteria (e.g. contribution in end-to-end application evaluation).

The general consensus of the discussion panel is that there might be an interest in experimenting with a more formal approach to managing conference tracks and acceptance criteria for particular events or publications, by clearly stating (at the time of the CFP) the benchmark against which contributions need to be measured. This could consist of:

specific problem statements,
hosting competitions in direct relation with the track or conference and creating strong incentives for all submissions to compete,
ensuring continuity of data sets, ground truth, and algorithm availability year after year,
requiring that reviewers have reasonable access to the data sets and have the means of checking the reported results.

However, it is extremely important to stress that this should never be the sole criterion for acceptance and publication of papers, since there is a significant risk of limiting innovative non-mainstream approaches and the emergence of investigations into new (previously not considered, or considered uninteresting) problems. This is discussed in one of the items developed below.

Difficulties in Sharing Data Sets

On this topic, the discussions identified a number of open issues without necessarily finding ways of solving them. The issues are:

Making data sets available is not a technical issue but a cultural one, related not only to legal issues, but also to the need for acknowledgement by peers and for a return on investment with respect to the effort and cost of creating data sets.
Although data sets may be of significant interest, and not be limited by intellectual property or copyright, their publication may be restricted because of privacy issues. In that case, anonymization processes may not necessarily be possible or appropriate, and are always very costly. Creating synthetic data sets may offer a solution in some cases.
DMCA protection is probably the most convenient framework for academia to reduce the risk of distributing data sets whose origin cannot be fully guaranteed to be free of copyright infringement.
On the other hand, companies are often reluctant to release data, either because of overly cautious legal departments and zero-risk policies, or because of the significant competitive advantage particular data sets may yield. With respect to the previously mentioned issue of open and verifiable access to reported results, it may be possible to conceive non-disclosure-bound access to datasets while still giving reasonable possibilities to verify reported results (e.g. open access to provenance data, and not the original data, in systems like the DAE platform).
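
The idea of sharing provenance data rather than the original data can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how the DAE platform works: the record layout, the function names and the sample bytes are all hypothetical. The sketch publishes a SHA-256 fingerprint of the data set alongside the reported score, so that anyone granted access to the data under a non-disclosure agreement can confirm they are evaluating on the same samples.

```python
import hashlib
import json


def fingerprint(samples):
    """SHA-256 digest over an ordered list of byte strings."""
    h = hashlib.sha256()
    for s in samples:
        # Length-prefix each sample so that sample boundaries affect the digest.
        h.update(len(s).to_bytes(8, "big"))
        h.update(s)
    return h.hexdigest()


def provenance_record(dataset_name, samples, method, score):
    """Build a shareable record that contains none of the original data."""
    return {
        "dataset": dataset_name,
        "n_samples": len(samples),
        "sha256": fingerprint(samples),
        "method": method,
        "score": score,
    }


def same_data(record, samples):
    """Check that `samples` match the data set described by `record`."""
    return record["sha256"] == fingerprint(samples)


# Placeholder bytes standing in for restricted document images.
private_samples = [b"page-1-bytes", b"page-2-bytes"]
record = provenance_record("restricted-set", private_samples, "method A", 0.9)
print(json.dumps(record, indent=2))
```

A fingerprint of this kind only establishes that two parties hold identical data; verifying the reported score itself would still require running the method on that data.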

Changing Acceptance Criteria

Requiring that results be compared against a previously agreed-upon benchmark prior to acceptance for publication may prove to be a double-edged sword. The discussions tried to identify pros and cons.

As already hinted, this would require a shift in the way some events are publicized and organized, since the CFP would necessarily include all the required information (evaluation procedures, data sets, benchmark infrastructure, ...).
Imposing stringent benchmarking criteria may not be a good idea for smaller events, which face basic economic constraints: it may affect the number of submissions and the acceptance rate too strongly.
On the other hand, some mature topics should very strongly impose the use of standard benchmarks.
This would also require a shift in the review/acceptance process:

 * In order to preserve the possibility of publishing innovative, non-mainstream research, evaluation should integrate some level of weighting, setting a cursor between, for instance, “correctly benchmarked and conforming to criteria” and “out of scope with respect to criteria, but potentially groundbreaking new topic”.
 * There could be a possibility of conditional acceptance, with a response phase after review.
 * There would be an extra load on reviewers, who would need to be able to correctly verify claimed results.

It would be interesting to get the broader community's feeling about this.

Data Sets Direct Research

Before data sets become commonly accepted and agreed-upon bases for benchmarking, they should undergo some community approval. This raises a number of potentially controversial issues:

Data-driven research evaluation may have a very good impact if the data is good, but may be harmful if the data is bad.
Data sets progressively get out of date as knowledge evolves. (One might consider the problem underpinning a data set solved when the set is considered obsolete.)
Special interest groups can try to dominate or influence decisions.
Some datasets may not be considered of interest in specific cases of third-party supported research.
