(Chem)DeTeX: automatic generation of a markup language description of (chemical) documents from bitmap images

Aniko Simon, Jean-Christophe Pret, A. Peter Johnson
School of Chemistry
University of Leeds
Leeds, LS2 9JT
United Kingdom
Presentation held at:
Third International Conference on Document Analysis and Recognition
August 1995, Montreal, Canada

This paper presents a novel view of document processing as the reverse of the TeX process. This concept simplifies the analysis of the physical structure of documents and suggests the use of a style file for layout recognition. An algorithm is given for each of the two phases, layout analysis and layout recognition. The bottom-up layout analysis method is based on Kruskal's algorithm and uses the distances between components to construct the physical page structure. The algorithm is linear in the number of connected components. For layout recognition, a document style description language (DSDL) is introduced, which enables a fault-tolerant, recursive parsing algorithm to label the blocks of the document. The presented methods were designed for scientific publications (papers, reports, books), but could be applied to a broader range of documents.
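
To make the idea of a Kruskal-style, distance-driven bottom-up grouping more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each connected component is given as an axis-aligned bounding box (x0, y0, x1, y1) and merges components in order of increasing gap until a threshold is exceeded. The names box_gap, group_components and max_gap are hypothetical, and the fixed-threshold stopping rule is an assumption made for illustration. The sketch also builds all pairwise edges, so it runs in quadratic time and does not reproduce the linear behaviour claimed for the method in the paper.

```python
from itertools import combinations
from math import hypot


def box_gap(a, b):
    """Euclidean gap between two boxes (x0, y0, x1, y1); 0 if they overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return hypot(dx, dy)


def group_components(boxes, max_gap):
    """Kruskal-style clustering: join the closest boxes first, stop at max_gap.

    Returns a sorted list of groups, each group a list of box indices.
    The max_gap threshold is an illustrative assumption.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by increasing gap (quadratic; sketch only).
    edges = sorted(
        (box_gap(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for gap, i, j in edges:
        if gap > max_gap:
            break  # remaining edges are longer still
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri  # the edge joins two separate groups

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two pairs of nearby character boxes, separated by a wide gap.
boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (100, 0, 110, 10), (113, 0, 123, 10)]
print(group_components(boxes, max_gap=5))  # → [[0, 1], [2, 3]]
```

Applying the same step again to the bounding boxes of the resulting groups, with a larger threshold, would give a hierarchy of words, lines and blocks. This is one plausible way to build a physical page structure from component distances, stated here as an assumption.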