ICDAR2017

Special Workshop Speaker

Title

Context-Aware Document Analysis


Name

Apostolos Antonacopoulos

Abstract

While our community has made major advances in image pre-processing and OCR over the years, the billions of documents digitised around the world remain unusable by the standards most people expect when accessing information. The individual technologies cannot be improved much further in isolation – their performance is already almost as good as it can get. The latest OCR engines perform at the limit of what is possible, given just pixels as input. We even have free apps on our phones that produce good quality scans of everyday documents. However, for each digitised document we do not have much more than a bunch of (almost) correct words – no meaning. Apart from simple keyword search, this is not useful for much else.

To break through the current performance bottleneck and achieve a higher-level description of the documents (as demanded by the public) we need to place the methods we develop in context. Instead of just converting the (perhaps pre-processed) image pixels by OCR to words in textlines, we first need to understand (i) the layout, accurately, and then (ii) the application domain. The former is the natural next step in enhancing the usefulness of the extracted text (basic semantics). The latter is important both for achieving better recognition performance and for increasing the utility of the information to answer the more natural questions that people want to ask. Once (physical and logical) layout analysis has been improved, the obvious following step is to encode and use domain knowledge. We need to extract the relations encoded in tables, recognise whether text is part of the body of an article or a figure caption, etc., and understand terms and concepts. Ultimately we want to link information across documents. The possibilities are endless and our community is the best placed to act on this!


Short Bio

Apostolos Antonacopoulos leads the Pattern Recognition and Image Analysis (PRImA) Research Lab at the School of Computing, Science and Engineering at the University of Salford, UK, where he currently holds the post of Professor of Pattern Recognition. He received his PhD from the University of Manchester Institute of Science and Technology (UMIST), UK, in 1995. From 1995 to 2004 he worked as Lecturer in the Department of Computer Science at the University of Liverpool, where he founded the PRImA Lab. In 2005, he joined the University of Salford as Senior Lecturer and the PRImA Lab was established and strengthened at Salford. In the same year, he received the IAPR/ICDAR Young Investigator Award for "Outstanding service to the ICDAR community and his innovative research in historical document processing applications."

Professor Antonacopoulos has worked and published extensively on various problems in Document Analysis and Understanding (Image Enhancement, Segmentation, Recognition, Performance Evaluation) as well as on other applications of Pattern Recognition and Image Analysis. He has co-edited the first Special Issue (IJDAR) on Historical Document Analysis as well as the first book on Web Document Analysis. He is currently serving on the Executive Committee of the International Association for Pattern Recognition (IAPR) as Treasurer, having also held the posts of 1st and 2nd Vice President. He has also chaired or served as a member of a number of IAPR and other professional committees.

Professor Antonacopoulos has given a number of invited talks and tutorials and has held engagements as a technical advisor to libraries and archives, including the British Library and the Wellcome Library. He has significant experience in leading and participating in national, European (FP7 and earlier), international and industry-sponsored projects. Recent significant project involvement includes the €4M Europeana Newspapers EU-funded project (extraction and recognition of text in newspapers for the European Digital Library), the €1.8M SUCCEED EU-funded support action for the Centre of Competence in Digitisation and the US$734K Early Modern OCR Project (eMOP) funded by the Andrew W. Mellon Foundation.