ChemInfty - Chemical Structure GT

From TC11

Created: 2010-05-13
Last updated: 2011-01-27


Chemical Structure Recognition, Diagram Recognition, Character Recognition


First, the images were processed by our initial recognition engine, and the results were corrected manually. The output of this step is called the 'ChemInfty Graphical Structure GT'; it records the positions of characters and lines.

Then, using separate software, the 'ChemInfty Graphical Structure GT' was further processed to extract the chemical structure representation. The results of this step were also corrected manually, yielding the 'ChemInfty Chemical Structure GT'.

The ground truth is stored in the widely used MDL SDF format, which is one of the CTfile formats. The specification of the SDF format can be downloaded from here.
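As a rough illustration of what reading such a file involves, here is a minimal sketch in Python. It assumes the records use the fixed-width V2000 molfile layout and are separated by '$$$$' lines, which has not been checked against the ChemInfty files themselves. The function names and the sample record are hypothetical; a real chemistry toolkit would be the more robust choice.

```python
def split_sdf_records(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):  # tolerate a missing final '$$$$'
        records.append(current)
    return records


def parse_v2000_molblock(lines):
    """Return (name, atom symbols, bonds) from the V2000 molfile part of a record."""
    counts = lines[3]            # the fourth line is the counts line
    n_atoms = int(counts[0:3])   # columns 1-3: number of atoms
    n_bonds = int(counts[3:6])   # columns 4-6: number of bonds
    # In each atom line the element symbol occupies columns 32-34.
    atoms = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    bonds = []
    for i in range(n_bonds):
        row = lines[4 + n_atoms + i]
        # (first atom index, second atom index, bond type), 1-based indices
        bonds.append((int(row[0:3]), int(row[3:6]), int(row[6:9])))
    return lines[0].strip(), atoms, bonds


# Hypothetical one-molecule SDF (ethanol, hydrogens implicit), for illustration only.
SAMPLE_SDF = "\n".join([
    "ethanol",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "$$$$",
])

for record in split_sdf_records(SAMPLE_SDF):
    name, atoms, bonds = parse_v2000_molblock(record)
    print(name, atoms, bonds)
```

The sketch ignores charges, stereo flags, property lines and the data items that may follow 'M  END' in a record.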

In addition to the complete dataset ground truth, a more focused subset is supplied, restricted to organic molecules with at least 5 heavy (non-hydrogen) atoms and a molecular weight below 1,000. This subset represents chemical structures that are potentially of interest to medicinal chemists and the pharmaceutical industry. The subset was proposed and prepared by Igor Filippov. Many thanks go to Igor Filippov (
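The selection criteria can be sketched as a simple filter. This is not the procedure actually used to build the subset. In particular, the page does not define "organic", so the sketch assumes it means "contains carbon and only elements from a small table". The function names are hypothetical, and the input is assumed to be a full molecular formula (hydrogens included) given as element counts.

```python
# Standard atomic weights (rounded) for elements common in organic molecules.
# Restricting "organic" to this element set is an assumption of this sketch.
ATOMIC_WEIGHT = {
    "H": 1.008, "B": 10.81, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "Si": 28.085, "P": 30.974, "S": 32.06,
    "Cl": 35.45, "Br": 79.904, "I": 126.904,
}


def heavy_atom_count(formula):
    """Count all atoms except hydrogen."""
    return sum(n for element, n in formula.items() if element != "H")


def molecular_weight(formula):
    """Sum of atomic weights over the full formula, hydrogens included."""
    return sum(ATOMIC_WEIGHT[element] * n for element, n in formula.items())


def in_focused_subset(formula, min_heavy_atoms=5, max_weight=1000.0):
    """Apply the three stated criteria to a dict of element counts."""
    # Assumed definition of "organic": has carbon, only elements from the table.
    if formula.get("C", 0) == 0:
        return False
    if any(element not in ATOMIC_WEIGHT for element in formula):
        return False
    return (heavy_atom_count(formula) >= min_heavy_atoms
            and molecular_weight(formula) < max_weight)


print(in_focused_subset({"C": 6, "H": 6}))          # benzene
print(in_focused_subset({"C": 2, "H": 6, "O": 1}))  # ethanol, only 3 heavy atoms
print(in_focused_subset({"Na": 1, "Cl": 1}))        # no carbon
```

Because SDF records often leave hydrogens implicit, a weight computed only from the atoms listed in a record would underestimate the true molecular weight, so the formula passed in should already include hydrogens.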

Related Dataset

Related Tasks

None defined

Submitted Files
