Recent developments in the CLiDE tool for extraction of chemical structure data from patents and other documents

Aniko T. Valko, A. Peter Johnson, Vilmos A. Valko
Presentation held at:
244th ACS National Meeting
CINF 35, Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents
August 2012, Philadelphia, PA, USA

Chemists routinely communicate information about structures and reactions in the form of 2D structure diagrams. These diagrams are easily understood by readers but are not directly accessible to chemical information systems, which require chemical structures in a connection table format.
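
To illustrate what a connection table holds, here is a minimal sketch in Python. It is not CLiDE's internal representation or any standard file format; the names `atoms`, `bonds`, `neighbors` and `heavy_atom_formula` are hypothetical and chosen only for this example. It encodes the heavy atoms of ethanol as a list of atoms plus a list of bonds, which is the kind of machine-readable form a drawn diagram lacks.

```python
from collections import Counter

# Hypothetical minimal connection table for ethanol (CH3-CH2-OH).
# Only heavy atoms are listed; hydrogens are left implicit.
atoms = ["C", "C", "O"]          # atom index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]   # (atom index, atom index, bond order)


def neighbors(atom_index, bonds):
    """Return the indices of the atoms bonded to atom_index."""
    result = []
    for a, b, _order in bonds:
        if a == atom_index:
            result.append(b)
        elif b == atom_index:
            result.append(a)
    return result


def heavy_atom_formula(atoms):
    """Build a simple heavy-atom formula, elements in alphabetical order."""
    counts = Counter(atoms)
    return "".join(
        f"{element}{count if count > 1 else ''}"
        for element, count in sorted(counts.items())
    )


print(neighbors(1, bonds))        # atoms attached to the central carbon
print(heavy_atom_formula(atoms))  # heavy-atom composition
```

Once a structure is in this form, questions such as "which atoms are attached to atom 1" become simple lookups, whereas a scanned diagram offers only pixels.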

CLiDE is an established optical chemical structure recognition (OCSR) tool that aims to address this problem. Recent improvements to the CLiDE system will be presented, including the way CLiDE processes different types of documents, such as patents, journal articles and internal reports. New methods for tackling problematic scenarios that arise from degraded document quality and difficult drawing features will be discussed, as will improvements in the chemical intelligence of CLiDE's structure checker and structure fixer modules.

These improvements have a considerable effect on CLiDE's accuracy and speed. A detailed study of CLiDE's performance on some widely available datasets will be presented alongside that of some publicly available OCSR systems.