
Recent enhancements in the accuracy of CLiDE tool for extracting chemical structure data from patents and other documents

Aniko T. Valko, A. Peter Johnson
Presentation held at:
248th ACS National Meeting
CINF, Hunting for Hidden Treasures: Chemistry Text Mining in Patents and Other Documents
August 2014, Philadelphia, PA, USA

We present an enhanced version of CLiDE, a long-term project aimed at detecting chemical structure diagrams rendered in images and converting them into chemical connection tables. The enhancement was achieved by introducing a feedback mechanism into CLiDE's interpretation process. This mechanism uses a series of domain-specific and spatial rules to identify drawing features that convey a complex or ambiguous meaning. Once such a feature is found, CLiDE automatically corrects the structural information being compiled and passes the corrected information on to subsequent interpretation steps.
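
To illustrate the general idea of a rule-driven feedback pass, here is a minimal Python sketch. It is not CLiDE's implementation: the `Bond` and `Structure` types, the two toy rules, and the `interpret_with_feedback` function are all hypothetical names invented for this example. The sketch assumes that each rule inspects a partially interpreted structure, corrects or flags it, and reports whether it changed anything, and that the rules are re-applied until no rule fires.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Bond:
    a: int                # index of the first atom
    b: int                # index of the second atom
    order: int = 1
    style: str = "plain"  # hypothetical drawing styles: "plain", "crossed"
    stereo: str = "none"  # "none" or "unspecified"


@dataclass
class Structure:
    atoms: List[str]
    bonds: List[Bond]
    flags: List[str] = field(default_factory=list)  # notes on corrections/errors


# A rule inspects the structure, may modify it, and returns True if it did.
Rule = Callable[[Structure], bool]


def rule_crossed_double_bond(s: Structure) -> bool:
    """Toy spatial rule: a double bond drawn with crossed lines is read as
    having unspecified double-bond stereochemistry."""
    changed = False
    for bond in s.bonds:
        if bond.order == 2 and bond.style == "crossed":
            bond.style = "plain"
            bond.stereo = "unspecified"
            s.flags.append(f"bond {bond.a}-{bond.b}: crossed double bond reinterpreted")
            changed = True
    return changed


def rule_flag_overvalent_carbon(s: Structure) -> bool:
    """Toy domain rule: flag a carbon whose bond orders sum to more than 4."""
    changed = False
    for idx, symbol in enumerate(s.atoms):
        if symbol != "C":
            continue
        valence = sum(b.order for b in s.bonds if idx in (b.a, b.b))
        message = f"carbon {idx} has valence {valence} > 4"
        if valence > 4 and message not in s.flags:
            s.flags.append(message)
            changed = True
    return changed


RULES: List[Rule] = [rule_crossed_double_bond, rule_flag_overvalent_carbon]


def interpret_with_feedback(s: Structure, rules: List[Rule], max_passes: int = 10) -> Structure:
    """Re-apply every rule until a full pass changes nothing (or the pass limit is hit)."""
    for _ in range(max_passes):
        results = [rule(s) for rule in rules]  # run all rules on every pass
        if not any(results):
            break
    return s


if __name__ == "__main__":
    demo = Structure(atoms=["C", "C"], bonds=[Bond(0, 1, order=2, style="crossed")])
    interpret_with_feedback(demo, RULES)
    print(demo.bonds[0].stereo, demo.flags)
```

In this sketch the corrected structure and its list of flags are what later interpretation steps would receive; the pass limit is only a guard against rules that keep triggering one another.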

This enhancement has a considerable effect on CLiDE's accuracy in reconstructing chemical structures and auto-detecting interpretation errors. A detailed study of CLiDE's performance on a large validation corpus will be presented. The validation corpus will include benchmark sets created by other projects and a set of non-Markush structures collected from patent documents.