At last! Recognition for chemists

David Bradley on how Clide will make extracting chemical information from research papers that much easier

Chemistry is all about structure and, as readers of these pages well know, there are countless packages for drawing and editing chemical structures. But what if you have a parcel of scanned reprints of research papers, or a pile of downloads, from which you would like to extract the chemical information? One way would be to work through each paper, redrawing its structures in one of those drawing packages. The problem, of course, is that printed pictures, embedded gifs and the like carry none of the underlying chemical information essential for creating an accurate chemical structure: one with the atoms and bonds in the right places, which can be manipulated as a 'real' chemical model on the screen.

Clide - Chemical Literature Data Extraction project - does for chemists and drug discovery researchers what OCR (optical character recognition) does for wordsmiths. It takes flat, two-dimensional, semantics-free representations and turns them into the 'real' thing. Take a bitmap image or a PDF file of that molecular structure, run it through Clide, and the product is the more familiar, but crucially manipulable, chemical structure.

Growth potential
Although the application of chemoinformatics in the drug discovery process has grown over the past two decades, to the point that it is now applied at all stages of the process, there is still huge potential for growth. Integrated data storage and retrieval solutions that interconnect and provide feedback to the various links in the chain are increasingly desirable. Chemically intelligent solutions like Clide will help produce successful drugs faster, and at lower cost.

The program started life as a project in the Department of Chemistry at the University of Leeds, UK, and is now a commercially available entity that comes in a full and a Lite version.

Clide extracts information from chemical literature, then stores it in a database. Its internal representation holds all the information included in a structure in such a way that it can be converted to other formats. The input to the program is either a bitmap image (BMP) of a chemical document or a PDF file.

The program outputs information as an ASCII file containing the recognised structures, the reactions, and any text associated with the image. This is stored in the Clide database. By default, information is held in Clide-file format, which contains all the information that has been processed by Clide. However, the information can be exported in other formats, such as the commonly used MDL mol file and CambridgeSoft's ChemDraw v 6.0 format. Mac ChemDraw output is also available, as are two formatting and printing options, PostScript and text only.

These formats can be handled by more sophisticated chemical spreadsheets and databases. They also allow the extracted structures to be easily edited, displayed and converted in a variety of useful ways. A mol file, for instance, can be rendered in three dimensions on screen using MDL's associated Chime program, or its precursor system Rasmol, in a web browser. However, these proprietary formats contain only the raw molecular structure information that their design allows. The mol file contains the connection tables of the structures, for example, so page numbers, titles and associated text and images are lost in converting to this format.
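To make the point about connection tables concrete, here is a minimal sketch of reading the atom and bond blocks of a mol file. It assumes the fixed-column V2000 layout of the MDL format; the function name `parse_molfile` and the sample methanol record are illustrative only and have nothing to do with Clide's own code.

```python
def parse_molfile(text):
    """Read atoms and bonds from a V2000 mol file (fixed-column layout assumed)."""
    lines = text.splitlines()
    counts = lines[3]                      # line 4 holds the atom and bond counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        symbol = line[31:34].strip()       # element symbol follows the coordinates
        atoms.append((symbol, float(line[0:10]), float(line[10:20])))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # 1-based atom numbers of the two ends, then the bond type
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Hypothetical sample record: methanol drawn as C-O with implicit hydrogens.
METHANOL = "\n".join([
    "methanol",
    "  hypothetical example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
])

atoms, bonds = parse_molfile(METHANOL)
print(atoms)
print(bonds)
```

Everything the parser returns is atoms, coordinates and bonds; there is simply no slot in these blocks for a page number or a figure caption.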

Clide's document and image processing takes place in three parts. First is the physical document-structure analysis. Once Clide has loaded an image, it identifies the connected components of that image and divides the page into its graphic and textual regions. Bibliographical information is extracted automatically.
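The article does not say how Clide finds connected components, but the general idea can be sketched with a standard breadth-first flood fill over a black-and-white bitmap. The function names, the 8-connectivity rule and the list-of-lists bitmap are all assumptions made for illustration.

```python
from collections import deque


def connected_components(bitmap):
    """Group 8-connected foreground pixels (value 1) into components.

    Returns a list of components, each a list of (row, col) pixels,
    in the order their first pixel is met when scanning row by row.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) for a list of (row, col) pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
parts = connected_components(page)
print(len(parts), [bounding_box(p) for p in parts])
```

Each component's bounding box is the natural handle for the next step, deciding whether a blob of ink is text or drawing.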

Second is the recognition of the primitives. This starts with classification of the connected components into characters, lines, and graphics. This is followed by the application of the recognition process to these elements by Clide's sophisticated algorithms. The characters are recognised by the program's OCR module, while the graphic lines are recognised by the graphic recognition module.
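Clide's own classification rules are not published here, so the following is only a toy sketch of how components might be sorted into characters, lines and graphics from their bounding boxes. The thresholds `max_char_size` and `min_line_aspect` are invented for illustration, and a real system would need further tests, for example to catch diagonal bonds whose bounding boxes are nearly square.

```python
def classify_component(box, max_char_size=20, min_line_aspect=4.0):
    """Classify a component from its (top, left, bottom, right) bounding box.

    Hypothetical heuristic: long thin boxes are lines, small boxes are
    characters, and everything else is treated as a graphic.
    """
    top, left, bottom, right = box
    height = bottom - top + 1
    width = right - left + 1
    longer, shorter = max(height, width), min(height, width)
    if longer / shorter >= min_line_aspect:
        return "line"
    if longer <= max_char_size:
        return "character"
    return "graphic"


print(classify_component((0, 0, 9, 7)))     # small, roughly square box
print(classify_component((0, 0, 1, 59)))    # long, thin box
print(classify_component((0, 0, 99, 79)))   # large box
```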

The final piece
Finally, there is the logical document-structure recognition. The page is analysed and chemical structures are recognised as logical units. This involves building the connection table of the structure, which relates each point and line to an atom and a bond. At this stage, reactions too are recognised so that reactant and product can be differentiated. Generic text is also identified and parsed. Lastly, any reaction text is identified and parsed so that it is retained within its reaction scheme. The system can also extract generic structure information from graphics and text, and either expand or condense such structures before they are entered into a database. The whole process takes seconds per 'printed' page.
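As a rough illustration of what building a connection table involves, the sketch below turns recognised atom labels and line segments into lists of atoms and bonds. It is not Clide's algorithm: the function name, the `tolerance` value, and the rules that an unlabelled line end is a carbon and that parallel lines between the same two atoms raise the bond order are all simplifying assumptions.

```python
import math


def build_connection_table(labels, segments, tolerance=5.0):
    """Relate points and lines to atoms and bonds.

    labels:   list of (symbol, x, y) for recognised atom labels
    segments: list of ((x1, y1), (x2, y2)) for recognised lines
    Returns (atoms, bonds), where bonds are (atom_a, atom_b, order)
    tuples using 0-based indices into atoms.
    """
    atoms = [{"symbol": s, "x": x, "y": y} for s, x, y in labels]

    def atom_index_at(point):
        # Reuse the first existing atom lying within the tolerance radius.
        for i, atom in enumerate(atoms):
            if math.hypot(atom["x"] - point[0], atom["y"] - point[1]) <= tolerance:
                return i
        # Assumption: an unlabelled line end is an implicit carbon.
        atoms.append({"symbol": "C", "x": point[0], "y": point[1]})
        return len(atoms) - 1

    bond_orders = {}
    for start, end in segments:
        a, b = atom_index_at(start), atom_index_at(end)
        if a == b:
            continue  # both ends snapped to the same atom, so ignore the line
        key = (min(a, b), max(a, b))
        bond_orders[key] = bond_orders.get(key, 0) + 1

    bonds = [(a, b, order) for (a, b), order in sorted(bond_orders.items())]
    return atoms, bonds


# Two lines and an "O" label: a C-C-O chain, as in a skeletal drawing of ethanol.
atoms, bonds = build_connection_table(
    labels=[("O", 100, 0)],
    segments=[((0, 0), (50, 0)), ((50, 0), (97, 0))],
)
print([a["symbol"] for a in atoms], bonds)
```

The hard part in practice is everything this sketch waves away: wedged and dashed bonds, labels such as "OMe" that stand for several atoms, and lines that cross without meeting.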

It is a fantastic program in concept, and I can see it being developed much further for large pharmaceutical company databases and abstracting services, once the programmers have ironed out the residual bugs. Clide could also dovetail nicely with Simbiosys' simulated biomolecular systems, in which the company provides a software system for rational drug design and the optimisation of small, organic therapeutic molecules. One additional area in which Clide could lend a helping hand is in the creation of a truly chemically literate web search engine.

Clide is available for the MS Windows 2000/NT 4.0/XP/Me/98/95 platforms.
