DAS-Discussion: Information Extraction (2014)

From TC11

Latest revision as of 22:54, 2 January 2015

Last updated: 2015-01-02

DAS Working Subgroup Meeting: Information Extraction

Authors:

  • Nibal Nayef

Participants:

  • Yoshinori AKAO (Japanese police)
  • Saddok KEBAIRI (Itesoft)
  • Manaba OHTA
  • Xin TAO
  • Ronaldo MESSINA (a2ia)
  • Nibal NAYEF (France)
  • Bao

Introduction

The participants have very different views of what information extraction (IE) means. The different tasks mentioned were:

  • Entity spotting (numbers, words, etc.)
  • Graphics spotting (logos, symbols, tables, etc.)
  • Semantic analysis after text recognition
  • Logical structure analysis
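As a small illustration of the first task, the sketch below spots number entities in text that has already been recognized. It is only one possible reading of "entity spotting": the regular expression and the function name `spot_numbers` are assumptions made for this example, and spotting directly in the image (without recognition) would need a different technique.

```python
import re
from typing import List, Tuple

# Assumed pattern: digit groups optionally joined by "." or "," (e.g. 1,250.50).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def spot_numbers(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every number found in `text`."""
    return [(m.start(), m.end(), m.group()) for m in NUMBER.finditer(text)]


if __name__ == "__main__":
    for start, end, value in spot_numbers("Total: 1,250.50 EUR, ref 42"):
        print(start, end, value)
```

The character offsets are returned so that a spotted entity can be mapped back to its position in the recognized text line.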

What is a document?

There are many types of documents, and the number is increasing:

  • Digitally born documents
  • Camera / mobile captured
  • Scanned

  • ...

To extract any kind of information from any type of document, we need some kind of “prerequisite” module in front of the IE modules, so that they can work on all document types.
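One way to picture such a “prerequisite” module is as a set of per-type front ends that all produce the same representation. The sketch below is only an assumption about how this could look: `NormalizedDocument`, the front-end functions, and `prerequisite_module` are hypothetical names, and the front ends are placeholders that split text into lines where a real system would run OCR, dewarping, and similar steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NormalizedDocument:
    """Hypothetical common representation handed to every IE module."""
    source_type: str       # "born_digital", "camera" or "scanned"
    text_lines: List[str]  # one entry per text line


def _clean_lines(raw: str) -> List[str]:
    # Strip surrounding whitespace and drop empty lines.
    return [line.strip() for line in raw.splitlines() if line.strip()]


def from_born_digital(raw: str) -> NormalizedDocument:
    # Placeholder: embedded text is taken as-is, line by line.
    return NormalizedDocument("born_digital", raw.splitlines())


def from_camera(raw: str) -> NormalizedDocument:
    # Placeholder for perspective correction followed by OCR.
    return NormalizedDocument("camera", _clean_lines(raw))


def from_scanned(raw: str) -> NormalizedDocument:
    # Placeholder for binarization and deskewing followed by OCR.
    return NormalizedDocument("scanned", _clean_lines(raw))


FRONT_ENDS: Dict[str, Callable[[str], NormalizedDocument]] = {
    "born_digital": from_born_digital,
    "camera": from_camera,
    "scanned": from_scanned,
}


def prerequisite_module(raw: str, source_type: str) -> NormalizedDocument:
    """Dispatch to the front end for `source_type` and return the common form."""
    if source_type not in FRONT_ENDS:
        raise ValueError(f"unsupported document type: {source_type}")
    return FRONT_ENDS[source_type](raw)


def count_lines(doc: NormalizedDocument) -> int:
    """Toy IE module: it only sees NormalizedDocument, never the raw input."""
    return len(doc.text_lines)
```

The point of the design is that an IE module such as `count_lines` depends only on the common representation, so supporting a new document type means adding one front end rather than changing every IE module.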

Problems of IE

  • What kind of semantic information should we extract? For example, technical terms, ...
  • How should the logical structure of a document be defined?
  • The same information can appear in different representations, e.g. the same name in different languages.
  • What are the ground truth (GT) data, and how large should the training data be? One suggestion: use human voting to build the GT.
  • Ultimate goal: automatic and complete understanding of document contents.
  • Application: enriching data mining.
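The human-voting idea for building ground truth can be sketched as a simple majority vote over annotator labels. This is a minimal sketch under assumptions: the function name `vote_label` and the `min_agreement` threshold are hypothetical, and the notes do not say which voting scheme was intended.

```python
from collections import Counter
from typing import List, Optional


def vote_label(labels: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the label chosen by more than `min_agreement` of the annotators.

    Returns None when there are no labels or no label reaches the threshold,
    so that undecided items can be sent back for further annotation.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None


if __name__ == "__main__":
    print(vote_label(["name", "name", "date"]))
    print(vote_label(["name", "date"]))
```

With the default threshold a strict majority is required, so a tie between two annotators yields no ground truth label.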

Approaches

Conditional random fields (CRF), natural language processing (NLP), and all methods for word/graphics spotting.

Future Directions

Combine methods from different fields:

  • Image processing
  • Natural language processing

Take into account that documents are changing drastically.