Automated extraction of chemical information from documents: Recent advances in the CLiDE project

Aniko T. Valko, A. Peter Johnson
Poster presented at:
9th International Conference on Chemical Structures
June 2011, Noordwijkerhout, The Netherlands
ABSTRACT

CLiDE, an acronym for Chemical Literature Data Extraction [1], is a software application that interprets molecular images found in a variety of sources and extracts meaningful chemical information such as structure connection tables that can be stored in standard electronic formats. The tool is commonly used to extract chemical structures from scientific literature, patents, legacy corporate documents and miscellaneous image files.
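To illustrate what a "structure connection table" holds, here is a minimal Python sketch of one for ethanol (heavy atoms only). It is not CLiDE's internal representation or output format; the class and method names (`Atom`, `Bond`, `ConnectionTable`, `neighbors`, `element_counts`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Atom:
    element: str  # element symbol, e.g. "C" or "O"


@dataclass(frozen=True)
class Bond:
    begin: int  # index of the first atom in the atom list
    end: int    # index of the second atom in the atom list
    order: int  # 1 = single, 2 = double, 3 = triple


@dataclass
class ConnectionTable:
    """Hypothetical minimal connection table: a list of atoms plus a list of bonds."""
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Bond] = field(default_factory=list)

    def neighbors(self, index: int) -> List[int]:
        """Return the sorted indices of atoms bonded to the atom at `index`."""
        found = []
        for bond in self.bonds:
            if bond.begin == index:
                found.append(bond.end)
            elif bond.end == index:
                found.append(bond.begin)
        return sorted(found)

    def element_counts(self) -> Dict[str, int]:
        """Count how many atoms of each element the table contains."""
        return dict(Counter(atom.element for atom in self.atoms))


# Ethanol, heavy atoms only: C-C-O
ethanol = ConnectionTable(
    atoms=[Atom("C"), Atom("C"), Atom("O")],
    bonds=[Bond(0, 1, 1), Bond(1, 2, 1)],
)
```

A table like this is what chemical file formats ultimately serialize: the atom list and the bond list together define the molecular graph, independent of how the structure was drawn in the source image.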

Over the last few years, CLiDE has undergone extensive development aimed at making it an easy-to-use aid for experimental chemists. The system can be used either interactively or in batch mode for unsupervised extraction. The interactive versions are now equipped with a new document-browser-style user interface, akin to Adobe Acrobat Reader. The structures extracted by CLiDE can be transferred directly to drawing software such as ChemDraw, ISIS/Draw or Accelrys Draw, edited where necessary and saved in a variety of chemical formats.

The presentation will focus on a number of difficult problems faced in this ongoing work and the techniques used to solve some of them. These include (a) automated detection of regions of a document that contain chemical structures; (b) automated identification and storage of possible interpretation errors, in order to flag the need for manual editing (especially important for batch processing); (c) handling of structural motifs that frequently cause errors; and (d) capture of data associated with structures.

CLiDE has been tested on a large number of images and documents originating from various sources. The results of these tests will be summarized to show the level of accuracy of recognition that is achievable with CLiDE.

[1]: Valko, A. T.; Johnson, A. P. CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. J. Chem. Inf. Model., 2009, 49(4), 780-787.