Aniko Simon, Jean-Christophe Pret, A. Peter Johnson
School of Chemistry
University of Leeds
Leeds, LS2 9JT
United Kingdom
Presentation held at:
Third International Conference on Document Analysis and Recognition
August 1995, Montreal, Canada
ABSTRACT
This paper presents a novel view of document processing as the
reverse of the TeX process. This concept simplifies the analysis
of the physical structure of documents, and also suggests the use of
a style file for layout recognition. An algorithm is given for both
phases, layout analysis and layout recognition. The bottom-up layout
analysis method employed is based on Kruskal's algorithm and
uses the distances between the components to construct the physical
page structure. The algorithm is linear with respect to
the number of connected components. For layout recognition,
a document style description language (DSDL) is introduced.
This language guides a fault-tolerant, recursive parsing algorithm
that labels the blocks of the document. The methods presented
were designed for scientific publications (papers,
reports, books), but could be applied to a broader range of documents.
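
The bottom-up idea mentioned in the abstract can be illustrated with a
minimal sketch. It is not the authors' implementation: it assumes each
connected component is given as an axis-aligned bounding box
(x0, y0, x1, y1), and the names box_distance, cluster_components and
max_gap are hypothetical. Edges between components are visited in
order of increasing distance, as in Kruskal's algorithm, and merged
with a union-find structure until the distance exceeds a threshold.
For brevity the sketch compares all pairs of boxes, so it is quadratic
in the number of components and does not reproduce the linear
behaviour claimed above.

```python
from itertools import combinations


def box_distance(a, b):
    """Euclidean gap between two boxes (x0, y0, x1, y1); 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def cluster_components(boxes, max_gap):
    """Group box indices by Kruskal-style merging of the closest pairs.

    max_gap is an assumed stopping threshold, not a parameter from the paper.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by distance (the quadratic step).
    edges = sorted(
        (box_distance(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for dist, i, j in edges:
        if dist > max_gap:
            break  # every remaining edge is longer still
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri  # merge the two groups

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With four boxes forming two rows that lie far apart, a small max_gap
groups the boxes of each row, and a large max_gap merges the rows into
a single block. Recording the order of the merges, rather than only
the final groups, would give a hierarchy of the kind a physical page
structure needs.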