Aniko T. Valko, A. Peter Johnson
Poster presented at:
8th International Conference on Chemical Structures
June 2008, Noordwijkerhout, The Netherlands
ABSTRACT
Depictions of two-dimensional chemical structures published in the literature are stored as bitmap images in most electronic sources of chemical information, such as reports, journals and patents. Although the original structures are usually created with chemical drawing programs that generate complete structural information, this information is lost during the publication process. When it is required, it is normally regenerated by redrawing the structure in a drawing program, which is time-consuming and error-prone.
CLiDE Pro is a chemical OCR software tool for the automatic extraction of chemical information from either the printed chemistry literature or the equivalent electronic PDF version. CLiDE Pro is the latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3]. Chemical OCR involves three main problems: (a) identification of chemical images within a document, (b) compilation of chemical graphs of individual molecules from chemical images, and (c) interpretation of complex objects such as generic molecules and reaction schemes using the retrieved chemical graphs. The structure recognition methods implemented in CLiDE Pro will be presented. Structural features that frequently cause problems will be discussed together with our solutions to them. These include crossing bonds and lines that can belong to several kinds of chemical entity, such as single bonds attached to triple bonds, dashed bonds, and the parts of atom labels that are commonly misclassified as lines (e.g. I and Cl). A key component of the presentation will be CLiDE Pro's approach to the interpretation of generic structures.
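To make problem (b) concrete, the following is a minimal illustrative sketch of compiling a chemical graph from already-vectorised primitives. It is not CLiDE Pro's algorithm, which this abstract does not describe. The function name `build_graph`, its input format, and the tolerances `merge_tol` and `label_tol` are all assumptions made for illustration. The sketch merges nearby line endpoints into atoms, treats unlabelled vertices as carbon, and attaches each recognised atom label to the nearest vertex.

```python
from math import hypot


def build_graph(segments, labels, merge_tol=3.0, label_tol=8.0):
    """Hypothetical helper: turn line segments and atom labels into a graph.

    segments: list of ((x1, y1), (x2, y2)) line segments taken to be bonds.
    labels:   list of (symbol, (x, y)) recognised atom labels.
    Returns (atoms, bonds): atom symbols by node index, and sorted index pairs.
    """
    nodes = []  # one representative (x, y) per atom position

    def node_index(point):
        # Reuse an existing node if the endpoint lies within merge_tol of it.
        for i, (nx, ny) in enumerate(nodes):
            if hypot(point[0] - nx, point[1] - ny) <= merge_tol:
                return i
        nodes.append(point)
        return len(nodes) - 1

    bonds = set()
    for start, end in segments:
        a, b = node_index(start), node_index(end)
        if a != b:
            bonds.add((min(a, b), max(a, b)))

    # Unlabelled vertices are assumed to be carbon, as in skeletal formulae.
    atoms = ["C"] * len(nodes)
    for symbol, (lx, ly) in labels:
        if not nodes:
            break
        dists = [hypot(nx - lx, ny - ly) for nx, ny in nodes]
        nearest = dists.index(min(dists))
        # Bonds usually stop short of a label, so allow a larger tolerance.
        if dists[nearest] <= label_tol:
            atoms[nearest] = symbol
    return atoms, sorted(bonds)


# Two bonds whose shared endpoint differs slightly, plus an "O" label.
atoms, bonds = build_graph(
    [((0, 0), (10, 5)), ((10.5, 5.2), (20, 0))],
    [("O", (24, 0))],
)
print(atoms, bonds)  # ['C', 'C', 'O'] [(0, 1), (1, 2)]
```

A sketch this simple ignores exactly the difficult cases named above: crossing bonds would create spurious atoms at the intersection, and a stroke belonging to an "I" or "Cl" label would be read as a bond unless it were classified beforehand.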
A chemical OCR tool that achieves 100% accuracy in all situations has yet to be developed, and is unlikely to be developed in the foreseeable future. This closely parallels the situation for text OCR, where, despite decades of research, recognition accuracy still falls a little short of 100% and some manual editing is required, yet the technology remains very useful. If chemical OCR can reach similar levels of accuracy, automated mining of the chemical literature will become a powerful and cost-effective process.
[1]: Ibison, P.; Jacquot, M.; Kam, F.; Neville, A. G.; Simpson, R. W.; Tonnelier, C.; Venczel, T.; Johnson, A. P. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.
[2]: Ibison, P.; Kam, F.; Simpson, R. W.; Tonnelier, C.; Venczel, T.; Johnson, A. P. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92; London, England, 1992.
[3]: Simon, A.; Johnson, A. P. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.