A document recovery and improvement system takes a set of files representing images of text (and possibly other items, such as pictures and diagrams) and produces output in various forms, such as new image files with higher quality and/or resolution, information about the positions of glyphs in the document, and various forms of textual interpretation of the contents.
We imagine a document recovery and improvement system that supports a scenario very different from the one supported by most existing programs of a similar kind, such as optical character recognition (OCR) systems.
We imagine a large document, such as an entire book with possibly hundreds of pages, that has been scanned into a large number of image files.
The system should be an interactive assistant to an operator who might spend hours or even days with a single book.
The document may contain scripts that are unknown to the system, and it may contain different fonts and sizes of these scripts. It is assumed, however, that the number of distinct glyphs is not too large: at most a few thousand, and more often a few hundred. The system should not depend on prior knowledge of any scripts or fonts in order to work.
The operator may want output from the system at various stages of the processing. One such type of output might be a set of image files with improved resolution and repaired characters. Another type of output might be a set of distinct glyphs with a position (input image file + x/y coordinates) for each occurrence of the glyph. Yet another type of output might be some linear representation of the text in the form of Unicode characters.
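As an illustration of the second type of output, the following is a minimal sketch of how glyph positions might be represented. The names GlyphOccurrence and build_position_index are hypothetical and not part of any existing system; the only thing taken from the description above is that a position consists of an input image file plus x/y coordinates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphOccurrence:
    """One occurrence of a glyph: input image file plus x/y coordinates."""
    image_file: str
    x: int
    y: int


def build_position_index(associations):
    """Group (glyph_id, occurrence) pairs into a dict: glyph_id -> occurrences.

    Occurrences are sorted by file name, then top to bottom, then left to
    right. That ordering is an assumption made for readability only.
    """
    index = defaultdict(list)
    for glyph_id, occurrence in associations:
        index[glyph_id].append(occurrence)
    return {
        glyph_id: sorted(occs, key=lambda o: (o.image_file, o.y, o.x))
        for glyph_id, occs in index.items()
    }
```

The glyph identifiers are left as opaque keys here, since the system is not supposed to know which script or font a glyph belongs to.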
One might, for instance, imagine working with the software like this:
Inform the software of a directory containing the image files.
Allow the user to apply various image-processing algorithms such as skew correction, projection corrections, etc.
Create a set of glyph categories, where a glyph category represents a set of glyphs belonging to the same font and having the same face and size. Glyph categories can be created either manually or automatically. Manually creating a set of glyph categories involves inspecting the document for existing glyphs and creating a category for each distinct set of glyphs. The system should be able to create an initial set of glyph categories by identifying areas that represent glyphs and comparing such areas for similarity according to some distance metric. At any time, it should be possible to merge two or more glyph categories, or to split a glyph category by using a finer distance metric. It should be possible to select some area of any image file and associate that area with a glyph category. For scripts with relatively few glyphs (such as Latin-based scripts) and documents with relatively few distinct fonts, it is entirely reasonable for the user to create the categories manually. In fact, it is very hard to automate this process, because some heuristic would have to decide when an area contains a single glyph instance, and that heuristic might produce undesirable results (such as turning diacritical marks into separate glyphs).
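The automatic variant of the step above could be sketched as follows. This is only one possible approach, under strong assumptions: glyph areas have already been cut out and scaled to equal-sized binary bitmaps (lists of rows of 0/1), the distance metric is simply the fraction of differing pixels, and clustering is greedy. The function names and the threshold parameter are hypothetical.

```python
def pixel_distance(a, b):
    """Fraction of pixels that differ between two equal-sized binary bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def cluster_glyphs(bitmaps, threshold):
    """Greedily group bitmaps into categories.

    Each bitmap joins the first category whose first member lies within
    `threshold` of it; otherwise it starts a new category. Returns a list
    of categories, each a list of indices into `bitmaps`.
    """
    categories = []
    for i, bitmap in enumerate(bitmaps):
        for members in categories:
            if pixel_distance(bitmaps[members[0]], bitmap) <= threshold:
                members.append(i)
                break
        else:
            # No existing category is close enough: start a new one.
            categories.append([i])
    return categories
```

In this sketch, re-running cluster_glyphs on the members of one category with a smaller threshold plays the role of splitting a category with a finer distance metric, and merging amounts to concatenating two index lists. The sketch does not address the hard part named above, namely deciding which image areas contain a single glyph instance.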
For each glyph category (presumably with several glyph instances in it), it should be possible to create a best representative of the glyphs inside it. This can be done automatically by a process of overlapping and averaging the image areas. It should also be possible to produce best representatives with higher resolution than the existing glyph instances.
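The overlapping-and-averaging idea can be illustrated with a minimal sketch. It assumes that the glyph instances of a category are already aligned and have identical dimensions, and that each is a list of rows of pixel intensities between 0 (background) and 1 (ink). The function names are hypothetical, and producing a representative with higher resolution than the instances is not covered here.

```python
def average_representative(instances):
    """Pixel-wise mean of aligned, equal-sized glyph bitmaps."""
    if not instances:
        raise ValueError("a category needs at least one glyph instance")
    height = len(instances[0])
    width = len(instances[0][0]) if height else 0
    for bitmap in instances:
        if len(bitmap) != height or any(len(row) != width for row in bitmap):
            raise ValueError("all instances must have the same dimensions")
    n = len(instances)
    return [
        [sum(bitmap[y][x] for bitmap in instances) / n for x in range(width)]
        for y in range(height)
    ]


def binarize(image, level=0.5):
    """Turn an averaged image back into a 0/1 bitmap."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]
```

With binary inputs and a level of 0.5, averaging followed by binarize behaves like a per-pixel majority vote, so a speck or a dropout that appears in only a minority of the instances disappears from the representative.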
Ask the system to scan image files for areas that are very likely to contain instances of existing glyph categories (according to some distance metric) and associate those areas with the glyph categories in question.
Ask the system to display glyph instances that cannot be associated with any existing glyph category (because the distance metric gives too large a distance from every existing glyph category). Allow the user to manually associate such unassociated glyph instances with an existing category. At the end of this step, each glyph instance has an associated category, and each category has a best representative.
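The two steps above amount to a nearest-category search with a rejection threshold. The following sketch assumes that candidate areas have already been located and scaled to the size of the best representatives, and it uses the mean absolute pixel difference as the distance metric. Both the metric and the names (mean_abs_difference, associate, max_distance) are illustrative assumptions.

```python
def mean_abs_difference(a, b):
    """Mean absolute pixel difference between two equal-sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must not be empty")
    diff = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def associate(candidates, representatives, max_distance):
    """Assign each candidate to its nearest category, or leave it unassigned.

    `representatives` maps a category name to its best representative.
    Returns (associated, unassociated): a dict from candidate index to
    category name, and a list of candidate indices for the user to review.
    """
    associated, unassociated = {}, []
    for i, candidate in enumerate(candidates):
        best_name, best_distance = None, None
        for name, representative in representatives.items():
            d = mean_abs_difference(candidate, representative)
            if best_distance is None or d < best_distance:
                best_name, best_distance = name, d
        if best_name is not None and best_distance <= max_distance:
            associated[i] = best_name
        else:
            unassociated.append(i)
    return associated, unassociated
```

The second return value corresponds to the glyph instances that the system would display to the user for manual association.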
From the glyph categories, produce new image files containing the best representative (possibly with a better resolution than the original) of each glyph category.
Ask the system to produce a dictionary, i.e., a set of words, where a word is just a sequence of glyph categories whose instances are close together. Have the system inform the user of words that occur very few times (which might indicate an error in the association of a glyph instance with a category). It might be possible to abandon the idea of a word as a group of instances that are close together, which might make the system more robust. On the other hand, this concept is very useful for Latin-based scripts and would allow some OCR-like functionality.
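A minimal sketch of this step, assuming a horizontal script in which the glyph instances of one text line are given as (category, left, right) tuples with pixel coordinates. "Close together" is interpreted here as a horizontal gap of at most max_gap pixels; this interpretation and all names are assumptions made for illustration.

```python
from collections import Counter


def split_into_words(line, max_gap):
    """Split one text line into words, each a tuple of glyph categories.

    `line` is a list of (category, left, right) tuples. A new word starts
    whenever the gap to the previous glyph exceeds `max_gap`.
    """
    words, current, previous_right = [], [], None
    for category, left, right in sorted(line, key=lambda glyph: glyph[1]):
        if previous_right is not None and left - previous_right > max_gap:
            words.append(tuple(current))
            current = []
        current.append(category)
        previous_right = right
    if current:
        words.append(tuple(current))
    return words


def rare_words(lines, max_gap, max_count=1):
    """Return the words that occur at most `max_count` times in all lines."""
    counts = Counter(
        word for line in lines for word in split_into_words(line, max_gap)
    )
    return sorted(word for word, n in counts.items() if n <= max_count)
```

In a real book many legitimate words occur only once, so the result is better treated as a list of candidates for the operator to inspect than as a list of errors.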
Associate a sequence of Unicode characters with each glyph category, thus making it possible to translate words (according to the definition above) to sequences of Unicode characters. Use an existing dictionary to check such words for spelling. Words that do not occur in the existing dictionary might indicate an error in an association of a glyph instance with a category. Notice that, because of the existence of ligatures, a glyph category might represent several consecutive characters.
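The association and the spelling check described above might look like the following sketch. The category identifiers (such as "g_fi" for a ligature category) and the function names are hypothetical, and the existing dictionary is modeled as a plain set of lowercase strings.

```python
def transcribe(word, unicode_for):
    """Translate a word (a sequence of glyph categories) to a Unicode string.

    `unicode_for` maps each category to a string of one or more characters,
    so a ligature category can expand to several consecutive characters.
    """
    return "".join(unicode_for[category] for category in word)


def suspicious_words(words, unicode_for, known_words):
    """Return transcriptions that are missing from the existing dictionary."""
    transcriptions = (transcribe(word, unicode_for) for word in words)
    return [text for text in transcriptions if text.lower() not in known_words]
```

For example, with unicode_for = {"g_fi": "fi", "g_n": "n", "g_d": "d"}, the three-category word ("g_fi", "g_n", "g_d") is transcribed as the four-character string "find".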
Produce output from the system in the form of Unicode, given the association of Unicode characters to glyph categories.
Allow the user to associate font, face, and size information with each glyph category, and produce output in the form of Unicode text with markup for different fonts, faces, and sizes.
(This is vague) Allow the user to indicate areas of different pages that contain different kinds of information (such as tab stops) and have the system produce information about what text belongs to different areas.