Aniko T. Valko, A. Peter Johnson
Poster presented at:
9th International Conference on Chemical Structures
June 2011, Noordwijkerhout, The Netherlands
ABSTRACT
CLiDE, an acronym for Chemical Literature Data Extraction [1],
is a software application that interprets
molecular images found in a variety of sources and extracts
meaningful chemical information such as
structure connection tables that can be stored in
standard electronic formats. The tool is commonly used to
extract chemical structures from scientific literature,
patents, legacy corporate documents and miscellaneous image files.
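As an illustration of what a structure connection table holds, the following minimal Python sketch represents a molecule as a list of atoms plus a list of bonds between them. The names used here (`Atom`, `Bond`, `ConnectionTable`, `neighbors`) and the ethanol example are illustrative assumptions, not CLiDE's internal data model or its output format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Atom:
    element: str  # element symbol, e.g. "C" or "O"


@dataclass(frozen=True)
class Bond:
    begin: int  # 1-based index of the first atom
    end: int    # 1-based index of the second atom
    order: int  # 1 = single, 2 = double, 3 = triple


@dataclass
class ConnectionTable:
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    def neighbors(self, index):
        """Return the sorted 1-based indices of atoms bonded to `index`."""
        found = set()
        for bond in self.bonds:
            if bond.begin == index:
                found.add(bond.end)
            elif bond.end == index:
                found.add(bond.begin)
        return sorted(found)


# Hypothetical example: ethanol (CH3-CH2-OH) with implicit hydrogens.
ethanol = ConnectionTable(
    atoms=[Atom("C"), Atom("C"), Atom("O")],
    bonds=[Bond(1, 2, 1), Bond(2, 3, 1)],
)

# Print each atom with the indices of the atoms it is bonded to.
for i, atom in enumerate(ethanol.atoms, start=1):
    print(i, atom.element, ethanol.neighbors(i))
```

Standard exchange formats such as MDL molfiles store essentially this atom and bond information, together with atom coordinates.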
During the last few years, CLiDE has seen extensive development
aimed at making it an easy-to-use aid for experimental chemists.
The system can be used either interactively or in batch mode
for unsupervised extraction. The interactive versions are now
equipped with a new document-browser-style user interface,
akin to Adobe Acrobat Reader. The structures extracted by
CLiDE can be seamlessly transferred to drawing software such as
ChemDraw, ISIS/Draw or Accelrys Draw, edited where necessary and
saved in a variety of chemical formats.
The presentation will focus on a number of difficult problems
faced in this ongoing work and the techniques used to solve
some of them. These include (a) automated detection of regions
of a document that contain chemical structures;
(b) automated identification and storage of possible errors in
interpretation, to flag the need for manual editing
(especially important for batch processing);
(c) structural motifs that frequently cause errors; and
(d) capture of data associated with structures.
CLiDE has been tested on a large number of images and documents
originating from various sources. The results of these tests will
be summarized to show the level of accuracy of recognition
that is achievable with CLiDE.
[1]: Valko, A. T.; Johnson, A. P. CLiDE Pro:
The Latest Generation of CLiDE, a Tool for Optical
Chemical Structure Recognition.
J. Chem. Inf. Model. 2009, 49 (4), 780-787.