ISOMORPHIC PATTERN RECOGINITION

Title:

ISOMORPHIC PATTERN RECOGINITION

Document Type and Number:

WIPO Patent Application WO/1998/016897

Kind Code:

A1

Abstract:

The present invention is a system and method for character recognition using pattern recognition. In one embodiment of the invention, a plurality of representations of a plurality of characters is received by a system (306). A plurality of cipher labels are assigned to representations in the plurality of representations. The plaintext identity of cipher labels in the plurality of cipher labels is determined using isomorphic pattern recognition (316). The present invention includes a plaintext isomorphic dictionary (314) containing plaintext words organized by their root isomorphic patterns. Additionally, the present invention includes a Raster-Plaintext Library (318) which contains extensive data on raster letters, and their plaintext identities, from an exhaustive collection of fonts.

Inventors:

HALL FLOYD STEVEN JR (US)
HOWIE CAMERON TELFER (US)

Application Number:

PCT/US1997/019130

Publication Date:

April 23, 1998

Filing Date:

October 15, 1997

Export Citation:

Click for automatic bibliography generation Help

Assignee:

CONVEY CORP (US)
HALL FLOYD STEVEN JR (US)
HOWIE CAMERON TELFER (US)

International Classes:

G06V30/262; G06V30/10; (IPC1-7): G06K9/00; G06K9/18; G06K9/46

Foreign References:

US5384863A	1995-01-24
US5410611A	1995-04-25
US5539841A	1996-07-23
US5455871A	1995-10-03
US5557689A	1996-09-17
US5642435A	1997-06-24
US5689620A	1997-11-18

Attorney, Agent or Firm:

Haynes, Mark A. (650 Page Mill Road Palo Alto, CA, US)

Download PDF:

View/Download PDF PDF Help

Claims:

WHAT IS CLAIMED IS:

1.

A system for character recognition, comprising: a resource that receives a plurality of representations of a plurality characters; a resource that assigns a plurality of cipher labels to representations in the plurality of representations; and a resource that identifies a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition.

2.	The system of claim 1, wherein: the pattern recognition includes Isomorphic Pattern Recognition.

3.	The system of claim 1, wherein: the resource that assigns the plurality of cipher labels to representations in the plurality of representations assigns a different cipher label to each different representation in the plurality of representations.

4.	The system of claim 1, wherein: the resource that assigns the plurality of cipher labels to representations in the plurality of representations produces a plurality of cipher words.

5.	The system of claim 1, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates sets of cipher labels selected from the plurality of cipher labels.

6.	The system of claim 5, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates a Document Word List which contains cipher words in the plurality of cipher words.

7.	The system of claim 6, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Selected Word from the Document Word List, the Selected Word containing at least one cipher label to be identified.

8.

The system of claim 7, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates a candidate plaintext word list of plaintext words in a plurality of plaintext words that have a same isomorphic pattern as a Selected Word.

9.	The system of claim 8, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Trial Word from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word.

10.	The system of claim 8, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words that cannot also form a valid plaintext Trial Word.

11.

The system of claim 8, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using a RasterPlaintext Library wherein the RasterPlaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.

12.	The system of claim 6, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that substitutes a plaintext identity for identified cipher labels in cipher words in the plurality of cipher words in the Document Word List.

13.	The system of claim 1, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that attempts to identify cipher labels in the plurality of cipher labels until all cipher labels have been identified.

14.

The system of claim 1, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that assigns a Pattern Tag to a word in a Document Word List which contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.

15.	The system of claim 1, wherein: the resource that receives the plurality of representations receives representations in the plurality of representations from a touch sensitive input device.

16.	The system of claim 15, wherein: the touch sensitive input device is capable of being hand held..

17.

A computer system for identifying plaintext words having a same pattern, the computer system comprising: a memory; a resource for receiving a received cipher pattern; and a processor, wherein the processor identifies plaintext words in a plurality of plaintext words stored in the memory which have a same cipher pattern as the received cipher pattern.

18.	The computer system of claim 17, wherein: the plaintext words in the plurality of plaintext words identified by the processor are isomorphic.

19.	The computer system of claim 17, including: an imaging device which generates an image of a document, and the processor includes a resource that translates the image into a plurality of cipher patterns, and supplies the received cipher patterns to the processor.

20.	The computer system of claim 17, wherein: the plurality of plaintext words includes words in more than one language.

21.

A process for pattern recognition, comprising: receiving a plurality of representations representing a document, the representations containing a plurality of characters; assigning cipher labels to characters in the plurality of character in representations in the plurality of representations; and identifying a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition.

22.	The process of claim 21, wherein: pattern recognition includes Isomorphic Pattern Recognition.

23.	The process of claim 21, wherein: assigning cipher labels includes assigning a different cipher label to each different representation in the plurality of representations.

24.	The process of claim 23, wherein: assigning cipher labels to characters in the plurality of characters in representations in the plurality of representations includes producing a plurality cipher words.

25.	The process of claim 24, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes creating a Document Word List which contains cipher words in the plurality of cipher words.

26.	The process of claim 25, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes selecting a Selected Word from the Document Word List, the Selected Word containing at least one cipher label to be identified.

27.	The process of claim 26, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes creating a candidate plaintext word list of plaintext words in a plurality of plaintext words that have a same isomorphic pattern as the Selected Word.

28.	The process of claim 27, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes selecting a Trial Word from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word.

29.	The process of claim 27, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes eliminating candidate plaintext words that cannot also form a valid plaintext Trial Word.

30.

The process of claim 27, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes eliminating candidate plaintext words using a RasterPlaintext Library wherein the RasterPlaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.

31.	The process of claim 25, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes substituting a plaintext identity for the identified cipher labels in cipher words in the plurality of cipher words in the Document Word List.

32.	The process of claim 21, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes attempting to identify cipher labels in the plurality of cipher labels until all cipher labels have been identified.

33.

The process of claim 25, wherein: identifying the plaintext identity of cipher labels in the plurality of cipher labels includes assigning a Pattern Tag to a cipher word in the Document Word List which contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.

34.	The process of claim 21, wherein: receiving the plurality of representations includes receiving representations in the plurality of representations from a touch sensitive input device.

35.	The system of claim 34, wherein: the touch sensitive input device is capable of being hand held.

36.

A character recognition system, comprising: an imaging device which generates a representation of a document, the representation including a plurality of images of characters in the document; a processor coupled to the imaging device capable of receiving the representation of the document, the processor assigning a plurality of cipher labels to images in the plurality of images; and the processor using pattern recognition to assign plaintext characters to cipher labels in the plurality of cipher labels.

37.	The system of claim 36, wherein: the pattern recognition includes Isomorphic Pattern Recognition.

38.	The system of claim 30, wherein: the processor includes a resource that constructs a plurality of cipher words composed of cipher labels in the plurality of cipher labels.

39.	The system of claim 38, wherein: the processor includes a resource that constructs a Document Word List which contains cipher words in the plurality of cipher words.

40.	The system of claim 39, wherein: the processor includes a resource that selects a Selected Word from the Document Word List, wherein the Selected Word contains a cipher label to be identified.

41.	The system of claim 40, wherein: the processor includes a resource which selects a list of candidate plaintext words in a plurality of plaintext words which have a same isomorphic pattern as the Selected Word.

42.	The system of claim 41, wherein: the processor includes a resource that selects a Trial Word from the Document Word List wherein the Trial Word contains at least one cipher label in common with the Selected Word.

43.	The system of claim 42, wherein: the processor includes a resource which selects a list of Trial Word candidate plaintext words in a plurality of plaintext words which have a same isomorphic pattern as the Trial Word.

44.	The system of claim 42, wherein: the processor includes a resource which eliminates candidate plaintext words from the list of candidate plaintext words which cannot form a plaintext word which is isomorphic with the Trial Word.

45.	The system of claim 39, wherein: the processor includes a resource that selects a plurality of Trial Words from the Document Word List wherein Trial Words in the plurality of Trial Words contain at least one cipher label in common with the Selected Word.

46.	The system of claim 45, wherein: the processor includes a resource which selects a plurality of lists of candidate plaintext words in a plurality of plaintext words which have a same isomorphic pattern as the plurality of Trial Words.

47.

The system of claim 46, wherein: the processor includes a resource which eliminates candidate plaintext words in the plurality of candidate plaintext words from the plurality of lists of candidate plaintext words which cannot form a plaintext word in the plurality of plaintext words in the plurality of lists of candidate plaintext words.

48.

The system of claim 47, wherein: the processor includes a resource which substitutes a single remaining candidate plaintext word, from the list of candidate plaintext words, for the Selected Word in the Document Word List when only one candidate plaintext word remains in the list of candidate plaintext words.

49.	The system of claim 48, wherein: the processor includes a resource which substitutes the plaintext identity for identified cipher labels in cipher words in the Document Word List.

50.	The system of claim 49, wherein: the processor includes a resource that continues to select Trial Words from the Document Word List until all cipher labels that can be identified are identified.

51.	The system of claim 50, wherein: the processor includes a resource which assigns a Pattern Tag to a cipher word in the Document Word List which contains unidentified cipher labels.

52.	The system of claim 51, wherein: the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.

53.

The system of claim 41, wherein: the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using a RasterPlaintext Library wherein the RasterPlaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.

54.	The process of claim 36, wherein: the imaging device includes a touch sensitive device.

55.	The system of claim 54, wherein: the touch sensitive input device is capable of being hand held.

Description:

Isomorphic Pattern Recognition Related Applications This application claims priority to the provisional U. S. patent application entitled "Isomorphic Pattern Recognition," by inventor Floyd S. Hall, having serial number 60/028,649, and having filing date October 16, 1996. This application also claims priority to the provisional U.S. patent application entitled "OCR Metafile Format," by inventor Floyd S. Hall, having serial number 60/028,578, and having filing date October 16, 1996. This application also claims priority to the provisional U.S. patent application entitled "Using Pattern Tags to Provide Full-Text Searching of Raster Words that are difficult to Recognize," by inventor Floyd S. Hall, having serial number 60/028,575, and having filing date October 16, 1996. Each of the aforementioned three provisional applications is hereby incorporated by reference. This application also claims priority to U.S. patent application entitled "A File Structure for Scanned Documents," by inventors Floyd Steven Hall, Jr. and Cameron Telfer Howie, having filing date October 8, 1997 and a serial number which has not yet been assigned, which is incorporated herein by reference.

Field of the Invention The present invention relates to systems and methods for pattern recognition, and more specifically to systems and methods for character recognition using Isomorphic Pattern Recognition.

Description of Related Art Traditional recognition technologies, such as Optical Character Recognition (OCR) technologies, attempt to determine the mapping between raster (image) letters within a raster document, and letters within a Character Encoding Scheme, such as ASCII, for example. Products based on these traditional technologies, however, have not been as successful as hoped in the mainstream marketplace. The reason is due to the extremely high degree of discontinuity inherent in their adoption by organizations, and the inability of companies to lower this barrier to the early majority of mainstream customers.

This discontinuity is caused by bottlenecks including the following: 1. Recognition speed is too slow to be practical in paper-intensive organizations.

2. The tremendous time required to recognize vast collections of differing fonts and type sizes.

3. Recognition accuracy is too sensitive to skewed, faded, and noisy raster (image) documents.

4. Page-by-page checking for recognition errors requires tremendous human resources. Cannot automate process.

5. The need for manual indexing to ensure that output files are full- text searchable.

All five of these bottlenecks can be traced to the same root cause: the need, by traditional recognition technologies, to use the optical features (i.e, font) of each raster letter in order to determine its proper mapping. This statement is true regardless of whether the recognition technology uses Matrix Matching, Feature Extraction, Structural Analysis, or Neural Networking technologies, for example. What is needed is a technology which can overcome the aforementioned deficiencies of traditional recognition technologies.

SUMMARY OF THE INVENTION The present invention is based upon the answer to an unintuitive question: Is it possible to determine the identity of a letter in a way that is independent of its optical features? For example, what if the letter "e" had its optical features altered within a raster document, and replaced everywhere with the symbol of a tiny rectangle having a silver dollar at its center. Would existing OCR technologies be able to determine that this unusual symbol represents plaintext "e" ? The answer is clearly no, since the optical properties of this symbol look nothing like what traditional OCR technologies would expect for the letter "e" The present invention, however, correctly identifies this randomly chosen symbol as plaintext "e", quickly and accurately. This subtle insight, along with many others, allows a truly automated approach to recognition that is invisible to users, and acceptable to mainstream customers.

One embodiment of the present invention works as follows. A resource receives a plurality of representations of a plurality characters. A resource assigns a plurality of cipher labels to representations in the plurality of representations, and a resource identifies a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition. In one aspect of the invention the pattern recognition includes Isomorphic Pattern Recognition. In another aspect of the invention the resource that assigns the plurality of cipher labels to representations in the plurality of representations assigns a different cipher label to each different representation in the plurality of representations. In yet another aspect of the invention the resource that assigns the plurality of cipher labels to representations in the plurality of representations produces a plurality of cipher words. In still another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates sets of cipher labels selected from the plurality of cipher labels.

In another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that

creates a Document Word List which contains cipher words in the plurality of cipher words. In still another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Selected Word from the Document Word List, the Selected Word containing at least one cipher label to be identified. In another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that creates a candidate plaintext word list of plaintext words in a plurality of plaintext words that have a same isomorphic pattern as a Selected Word. In still another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that selects a Trial Word from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word.

In still another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words that cannot also form a valid plaintext Trial Word. In still another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using a Raster- Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters. In another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that substitutes a plaintext identity for identified cipher labels in cipher words in the plurality of cipher words in the Document Word List. In yet another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that attempts to identity cipher labels in the plurality of cipher labels until all cipher labels have been identified. In yet another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that assigns a Pattern Tag to a word in a Document Word List which

contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.

Another embodiment of the invention comprises a memory, a resource for receiving a received cipher pattern, and a processor, wherein the processor identifies plaintext words in a plurality of plaintext words stored in the memory which have a same cipher pattern as the received cipher pattern. In another aspect of the invention the plaintext words in the plurality of plaintext words identified by the processor are isomorphic. Still another aspect of the invention comprises an imaging device which generates an image of a document, and the processor includes a resource that translates the image into a plurality of cipher patterns, and supplies the received cipher patterns to the processor. In yet another aspect of the invention the plurality of plaintext words includes words in more than one language.

Another embodiment of the invention comprises receiving a plurality of representations representing a document, the representations containing a plurality of characters, assigning cipher labels to characters in the plurality of character in representations in the plurality of representations, and identifying a plaintext identity of cipher labels in the plurality of cipher labels using pattern recognition. In another aspect of the invention pattern recognition includes Isomorphic Pattern Recognition. In yet another aspect of the invention assigning cipher labels includes assigning a different cipher label to each different representation in the plurality of representations. In still another aspect of the invention assigning cipher labels to characters in the plurality of characters in representations in the plurality of representations includes producing a plurality cipher words. In another embodiment of the invention identifying the plaintext identity of cipher labels in the plurality of cipher labels includes creating a Document Word List which contains cipher words in the plurality of cipher words. A Selected Word is selected from the Document Word List, the Selected Word containing at least one cipher label to be identified. A candidate plaintext word list is created containing plaintext words in a plurality of plaintext words

that have a same isomorphic pattern as the Selected Word. A Trial Word is selected from the Document Word List, the Trial Word containing at least one cipher label in common with the Selected Word. Candidate plaintext words that cannot also form valid plaintext Trial Words are eliminated. In another aspect of the invention identifying the plaintext identity of cipher labels in the plurality of cipher labels includes eliminating candidate plaintext words using a Raster- Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters. When the invention determines the plaintext identity of cipher labels in cipher words in the plurality of cipher words, these plaintext identities are substituted into the cipher words in the Document Word List. In another aspect of the invention identifying the plaintext identity of cipher labels in the plurality of cipher labels includes assigning a Pattern Tag to a cipher word in the Document Word List which contains unrecognized cipher labels, wherein the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag.

Still another embodiment of the invention comprises an imaging device which generates a representation of a document, the representation including a plurality of images of characters in the document, a processor coupled to the imaging device capable of receiving the representation of the document, the processor assigning a plurality of cipher labels to images in the plurality of images, and the processor using pattern recognition to assign plaintext characters to cipher labels in the plurality of cipher labels. In this embodiment, the pattern recognition may include Isomorphic Pattern Recognition. In another aspect of this invention the processor includes a resource which assigns a Pattern Tag to a cipher word in the Document Word List which contains unidentified cipher labels. In still another aspect of the invention the Pattern Tag represents a plaintext word which has a same cipher pattern as the word in the Document Word List being assigned the Pattern Tag. In yet another aspect of the invention the resource that identifies the plaintext identity of cipher labels in the plurality of cipher labels includes a resource that eliminates candidate plaintext words using

a Raster-Plaintext Library wherein the Raster-Plaintext Library contains a plurality of raster letters and a plurality of plaintext identities for raster letters in the plurality of raster letters.

Another embodiment of the present invention works as follows. A raster document is converted into a ciphertext document wherein each raster character is related to a different ciphertext label. This converts words comprised of raster characters into cipher words comprised of ciphertext labels. A Document Word List is then created from the ciphertext document. The Document Word List contains one copy of each unique cipher word in the ciphertext document.

A word is chosen from the Document Word List which contains at least one cipher label which is desired to be identified. This word will be referred to as the Selected Word. The plaintext Isomorphic Dictionary is then used to identity candidate plaintext words that have the same isomorphic pattern as the Selected Word. Trial Words are then chosen from the Document Word List to eliminate unacceptable candidates from the list of candidate plaintext words.

The Trial Word should have at least one cipher label in common with the Selected Word in order to help eliminate candidate plaintext words. For each Selected Word, Trial Words continue to be selected in an attempt to eliminate unacceptable candidates from the list of candidate plaintext words until only one candidate plaintext word remains.

In one aspect of the present invention, if all candidate plaintext words are eliminated or there are not enough Trial Words to eliminate all but one candidate plaintext word then the Selected Word is skipped and a new Selected Word is chosen.

If Trial Words are able to be used to eliminate all but one candidate plaintext word, then the one remaining candidate plaintext word is substituted back into the Document Word List in the place of the Selected Word.

Additionally, the plaintext identities of the cipher labels identified in the Selected Word are then substituted throughout the ciphertext document and the Document Word List. A new Selected Word is then chosen and the process is repeated until all the cipher labels that can be recognized have been recognized.

In one aspect of the present invention, the longer words in the Document Word List are chosen first because they usually contain the greatest number of unique cipher labels to be identified . Also, longer Selected Words generally result in fewer candidate plaintext words, requiring fewer Trial Words to be selected and resulting in more cipher labels being identified. This speeds up the character recognition process.

In another aspect of the present invention, words in the Document Word List which still contain unrecognized cipher labels after all of the cipher labels that can be identified have been identified, are assigned a Pattern Tag. The Pattern Tag presents possible plaintext answers for the word and is useful, for example, when a full text search of the recognized document is performed.

Brief Description of Figures Figure 1 is a schematic diagram representing an embodiment ofthe present invention depicting various possible input devices.

Figure 2 depicts an embodiment of the system showing a document to be feed into a scanner attached to the computer system.

Figure 3 shows an embodiment of the system in which the computer system has an external connection with another computer system.

Figure 4 shows a sample multi-language ciphertext document and a document word list generated from the document.

Figure 5 shows relevant excerpts from a sample English German plaintext dictionary.

Figure 6 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word I and Trial Word #I(1).

Figure 7 depicts selected word #I and Trial Word #I(1) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.

Figure 8 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #II and Trial Word #II(1).

Figure 9 depicts selected word #II and Trial Word #II(1) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.

Figure 10 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #III and Trial Word #III(1).

Figure 11 depicts selected word #III and Trial Word #III(1) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.

Figure 12 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word #IV, Trial Word #IV(1), and Trial Word #IV(2).

Figure 13 depicts selected word #IV, Trial Word HIV( 1), and Trial Word #IV(2) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.

Figure 14 shows cipher label-plaintext mappings and a copy of the document word list depicting selected word ftV, and Trial Word IV(1).

Figure 15 depicts selected word #V, and Trial Word #V(1) and their respective Inherited isomorphic patterns along with Candidate plaintext words from the dictionary.

Figure 16 depicts the solution mappings found as a result of the recognition and the plaintext identity of cipher words in the document word list.

Figure 17 shows the "English" and "German" plaintext solutions.

Figure 18 depicts an example of the use of the Library function to help identify the plaintext identity of cipher labels.

Figure 19 depicts an embodiment of the invention in which a search engine recreates and searches a list of candidate plaintext words.

Figure 20 is a blob showing the letter "c." Figure 21 shows two blobs and the results of the XOR operation.

Figure 22 shows an image of a "t" touching an "n." Figure 23 shows an image of a non-composite "n." Figure 24 shows a sample Answer-Array.

Figure 25 shows another sample Answer-Array.

Figure 26 shows an array rearranged so that words having the most repeated letters appear first.

Figure 27 depicts a Trial Word array.

Figure 28 depicts an Isomorphic Array.

Figure 29 shows that the Isomorphic-Array matches when A=B.

Figure 30 shows that the Isomorphic-Array matches when A=B.

Figure 31 shows the matrices used to generate the candidate plaintext answers for the Pattern Tag.

Figure 32 shows a sample TW-Array.

Figure 33 shows a sample Touching~Array for "managem( t." Figure 34 shows the @-Array.

Figure 35 shows an example of an array grouping of common word both non- pattern and pattern words.

Detailed Description The following description is presented to enable a person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is not intended to be limited to the embodiments disclosed, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Figure 1 is a diagram illustrating recognition system 100. In this embodiment ofthe present invention, computer system 102 is connected through cable 104 to one or more input devices. Input devices may include, but are not limited to traditional scanner 106, multifunction product with scanner 108, storage device 110, network connection 112, digital photocopier 114, or touch sensitive device 116, or any other device which can provide input to computer

system 102. Without deviating from the present invention, any of theses devices could be incorporated into computer system 102. Computer system 102 contains plaintext Isomorphic Dictionary 118 and Isomorphic Recognition resource 120. Optionally, computer system 102 may include a Raster-Plaintext Library. In one aspect of the invention recognition system 100 may include a hand-held screen input device. Such a device may create a raster document or representations directly from the images input on the screen.

Figure 2 depicts another embodiment of the present invention, recognition system 200. In this embodiment computer system 202 is connected through cable 204 to scanner 206. Document 208 contains characters in a plurality of characters which it is desired to recognized. Document 208 is fed into scanner 206. Scanner 206 converts document 208 into a plurality of representations which corresponds to characters in the plurality of characters in document 208. Representations in the plurality of representations are then transferred to computer system 202 through cable 204.

Representations in the plurality of representations are stored in memory 210 in computer system 202. Then, as discussed in detail below, processor 212 in computer system 202 then processes representations in the plurality of representations stored in memory 210 and uses plaintext Isomorphic Dictionary 214 and Isomorphic Recognition resource 216 to help identify characters in the plurality of characters represented by representations in the plurality of representations. Additionally, computer system 202 may include a Raster- Plaintext Library.

Yet another embodiment of the present invention is shown in Figure 3.

Figure 3 depicts networked recognition system 300. Networked recognition system 300 includes computer system 302 connected through cable 304 to digital photocopier system 306. Document 308 contains characters in a plurality of characters which it is desired to recognized. Document 308 is fed into digital photocopier 306. Digital photocopier 306 converts document 308 into representations in a plurality of representations which corresponds to characters

in a plurality of characters in document 308. Representations in the plurality of representations are then transferred to computer system 302 through cable 304.

Representations in the plurality of representations are stored in memory 310 in computer system 302. Then, as discussed in detail below, processor 312 in computer system 302 processes representations in the plurality of representations stored in memory 310 and uses plaintext Isomorphic Dictionary 314 and Isomorphic Recognition resource 316 to help identify characters in the plurality of characters represented by representations in the plurality of representations. Additionally, computer system 202 may include a Raster- Plaintext Library, and the Raster-Plaintext Library may also be used to help identify characters in the plurality of characters represented by representations in the plurality of representation.

In one aspect of the present invention, the identified characters in the plurality of characters are used to construct electronic document 318 corresponding to document 308. Electronic document 318 can be sent over network connection 320 to other computers 322. Network connection 320 can include local-area networks, wide-area networks, the Internet, or any other connection to a system external to computer system 302.

In one embodiment of the present invention, the mapping between cipher labels and plaintext letters in order to perform character recognition using the present invention proceeds as follows. A comprehensive plaintext Isomorphic Dictionary is constructed. This dictionary should be arranged so that all plaintext words in the dictionary having the same Root isomorphic pattern are grouped together. This plaintext Isomorphic Dictionary may also be multi- language, i.e. it may contain a superset of the words from many different languages, where each language is alphabetical in construction. This will allow multi-language ciphertext documents to be solved as easily as ciphertext documents in one language only.

A document is scanned into a recognition system, such as that shown in Figure 2. The representations of the characters in the document are stored in memory 210. As discussed in detail below, processor 212 then uses the

representations to construct a ciphertext document containing cipher words. A sample ciphertext document is shown in Figure 4. A Document Word List, as shown in Figure 4, is created which contains one copy of each unique cipher word in the ciphertext document. This embodiment of the invention only has the information about the document contained in the representations, thus the cipher labels are unique, assigned to the representations based only on information contained within the representations.

A word is chosen from the Document Word List that contains at least one cipher label that we wish to recognize. This chosen word will be referred to here as the Selected Word. The comprehensive plaintext Isomorphic Dictionary is used to find all plaintext words that have the same Root isomorphic pattern as the Selected Word. Figure 5 shows a sample of relevant excepts from an embodiment of a plaintext Isomorphic Dictionary containing English and German words. In Figure 5, Root isomorphic patterns are represented by capital letters and plaintext words are represented by lower case letters. This list of plaintext words will be referred to here as the candidate plaintext words.

The goal is to attempt to determine which of these candidate plaintext words is the one correct answer being sought, i.e. the correct plaintext substitute for the Selected Word. To accomplish this goal Trial Words are used. The Trial Words are chosen to help eliminate unacceptable candidates from among the list of candidate plaintext words. Each Trial Word is chosen from the Document Word List as depicted in Figure 6. In addition, in multi-language ciphertext documents, each Trial Word is allowed to come from a language different from the language of the Selected Word. The only requirement is that each Trial Word must have at least one cipher label in common with the Selected Word. In the present embodiment, the solution methodology consists of examining each candidate plaintext word, one at a time, and eliminating those candidates that cannot also form a valid plaintext Trial Word. A plaintext word is considered valid if that word is contained within the comprehensive plaintext Isomorphic Dictionary.

This process is depicted in detail in Figure 7. The Selected Word #I on the top line above is converted into its Inherited Isomorphic pattern. In the present embodiment of the invention, the Inherited isomorphic pattern is equal to Root isomorphic pattern for this Selected Word. Using Group #1 in Figure 5 all of the candidate plaintext words are listed for the Inherited isomorphic pattern.

Each cipher label in Trial Word #I(1) that occurs also in the Selected Word #I (shown in bold), inherits its identities from the Selected Word HI. Since "F"' occurs in the Original patterns for both words, all entries in the vertical column below "F" in the Trial Word HE(1) are copied identically (i.e., inherited) from the Selected Word HI. Similarly for the other labels in "bold" In the Inherited isomorphic pattern for Trial Word HE(1), "K"is chosen to represent "S" from the Original pattern, to distinguish it from the inherited labels ("L,M, etc." could also be used). Using Trial Word HE(1), we now see if its possible to eliminate any unacceptable candidate plaintext words. In the present embodiment of the invention, a candidate (for Selected Word) is eliminated whenever no valid Trial Word exists which fit the pattern located at the intersection of the row containing this specific candidate plaintext word, and the vertical Trial Word column. We stop choosing Trial Words when one candidate plaintext word remains (for the Selected Word). In one aspect of the invention Trial Words continue to be selected even after only one candidate plaintext word remains in order to verify that the remaining candidate plaintext word is the correct one.

Unlike the Selected Word HI, the Inherited isomorphic pattern for the Trial Word must first be converted into its equivalent Root isomorphic pattern before using the dictionary.

Inherited isomorphic pattern J C D B A C K F C F Root isomorphic pattern A B C D E B F G B G This Root isomorphic pattern implies that Group H2 in Figure 5 (the dictionary) should be used. Therefore, Group H2 of Figure 5 shows that the second (shaded) entry in Figure 7 does not represent a valid Trial Word, so candidate H2 must therefore be eliminated. Thus, candidate #1 is the one correct answer for

the Original pattern of the Selected Word HI. The mappings for the solution of the Selected Word HI are shown in Figure 8.

For each Selected Word, Trial Words continue to be chosen in an attempt to eliminate additional unacceptable candidate plaintext words until one and only one candidate plaintext word remains. If only one plaintext word remains, then it is assumed to be the one correct answer being sought, and it is substituted in place of the Selected Word. This establishes a mapping between the cipher label(s) in this Selected Word and their plaintext identities (in the candidate word), i.e., we have performed character recognition on these cipher labels.

The plaintext identities of these cipher labels just recognized are then substituted throughout the ciphertext document (and the Document Word List), into all cipher words containing these cipher labels. This type of substitution process is equivalent to performing character recognition instantaneously, instead of sequentially, as is done in alternative recognition technologies. This approach helps to skirt the slow processing speed limitation of traditional recognition technologies.

If one of the scenarios in (a) or (b) below occurs, for a difficult-to- recognize Selected Word, then this difficult word is skipped, and a new Selected Word chosen, if one exists, and the recognition process described above is repeated. The plaintext identities of the cipher labels contained within this difficult-to-recognize Selected Word may be found when other Selected Words, containing these same cipher labels, are recognized. The scenarios are: a. all candidate plaintext words are eliminated (perhaps because the one correct answer for the Selected Word is not in the comprehensive plaintext Isomorphic Dictionary) b. there do not exist enough Trial Words to eliminate all but one candidate plaintext word.

The process of selecting Selected Words and Trial Words from the Document Word List is repeated until all cipher labels that can be recognized have been recognized. This is depicted in Figures 9-16. In Figure 9 the Selected Word HII on the top line above is converted into its Inherited isomorphic pattern.

Then, using Group H3 of the dictionary, list all appropriate candidate plaintext words for the equivalent Root isomorphic pattern of this Inherited isomorphic pattern. Then, using Trial Word HII(1), see if its possible to eliminate any unacceptable candidate plaintext words. Group H2 of the dictionary in Figure 5 shows that the second (shaded) entry above does not represent a valid Trial Word, so candidate H2 must therefore be eliminated. Thus, candidate H1 is the one correct answer for the Original pattern of the Selected Word #II. The solution mappings are shown in Figure 10.

In Figure 11, the Selected Word HIll on the top line above is converted into its Inherited isomorphic pattern. Using Group H4 of the dictionary, all appropriate candidate plaintext words for the equivalent Root isomorphic pattern ofthis Inherited isomorphic pattern are listed. Then, using Trial Word HIII(1), we now see if its possible to eliminate any unacceptable candidate plaintext words. Group H4 in the dictionary shows that the second and third (shaded) entries do not represent valid Trial Words, so candidates H2 and H3 must therefore be eliminated. Thus, candidate #1 is the one correct answer for the Original pattern of the Selected Word HIll. The solution mappings are shown in Figure 12.

In Figure 13 the Selected Word HIV on the top line above is converted into its Inherited isomorphic pattern. Then, using Group H4 of the dictionary in Figure 5, all appropriate candidate plaintext words are listed. Using Trial Word HIV(1) and Trial Word IV(2), we now see if its possible to eliminate any unacceptable candidate plaintext words. Group H7 of the dictionary shows that both entries above represent valid Trial Words for Trial Word H IV(1), so no candidates are eliminated by this Trial Word. Thus, we must choose another Trial Word and continue. Group H6 of the dictionary shows that the first (shaded) entry in Figure 13 does not represent a valid Trial Word for Trial Word

HIV(2), so candidate H1 must therefore be eliminated. Thus, candidate H2 is the one correct answer for the Original pattern of the Selected Word HIV. The solution mappings are shown in Figure 14.

In Figure 15, the Selected Word HV on the top line above is converted into its Inherited isomorphic pattern. Then, using Group H6 of the dictionary, all appropriate candidate plaintext words are listed. Using Trial Word HE(1), we now see if its possible to eliminate any unacceptable candidate plaintext words.

Group H5 of the dictionary shows that the first (shaded) entry in Figure 15 does not represent a valid Trial Word, so candidate Hl must therefore be eliminated.

Thus, candidate H2 is the one correct answer for the Original pattern of the Selected Word HV. The solution mappings are shown in Figure 16.

Finally, each word in the Document Word List containing (unrecognized) cipher labels is assigned a Pattern Tag, which present possible plaintext answers for that word. Pattern Tags allow these words to be found during a full-text search of the recognized document. For example if a search is conducted in the document for a plaintext word. The use of Pattern Tags allows a search engine to return the appropriate plaintext result. If any cipher word in the Document Word List still has unidentified cipher labels, but is isomorphic with the plaintext word being searched for and the plaintext letters in the cipher word are the same as the plaintext letters in the search word then the Pattern Tag will return the appropriate plaintext word.

In Figures 7, 9, 11 an "English" Selected Word was recognized (i.e., solved) using "German" Trial Words. In Figure 13, a "German" Selected Word was recognized using "English" Trial Words. In Figure 15, an "English" Selected Word was recognized using "English" Trial Words. This demonstrates that the present invention can recognize multi-language documents as easily as documents in one language only, for languages that are alphabetical in construction. Though not shown in the present example, it is possible, in certain circumstances, to extract plaintext identities from Trial Word results as well.

Figure 17 shows the results of character recognition, both the "English" plaintext solution and the "German" plaintext solution.

For performance reasons, the process described above involves choosing longer Selected Words first, which usually contain the greatest number of unique cipher labels. Then a greater number of plaintext identities can be recognized in less time. This is because solving the Selected Word reveals the plaintext behind a greater number of cipher labels, and longer Selected Words generally result in fewer candidate plaintext words, hence less Trial Word processing. However, without deviating from the present invention, the Selected Words could be chosen based on any suitable criteria, such as frequency, number of plaintext entries in the dictionary fitting the root isomorphic pattern, etc.

Another embodiment of the present invention uses a comprehensive Raster-Plaintext Library in order to attempt to identify which of the candidate plaintext words is the correct one. In one aspect of this embodiment, the Raster- Plaintext Library is used in addition to the use of Trial Words. In another aspect of the invention the Raster-Plaintext Library is used instead of Trial Words.

In this embodiment of the invention, the Raster-Plaintext Library contains extensive data on raster letters, and their plaintext identities, from an exhaustive collection of fonts. In one embodiment of the invention, a "Library Function" L(R,P) can be defined as follows: 1 = 1 if raster letter 'R' has plaintext identity 'P' in the Raster-Plaintext Library. 0 = 0 if raster letter 'R' does not have plaintext identity 'P' in the Raster-Plaintext Library.

The raster letter 'R' and its equivalent cipher label 'C' will be used interchangeably in the "Library Function", i.e. L(C,P) implies L(R,P) and L(R,P) implies L(C,P), with the understanding that cipher label 'C' is just a convenient placeholder for the actual raster letter 'R'.

In this embodiment, when using the "Library Function" to analyze candidate plaintext word 'i', the following notation will be used:

L(Cj, Pjj), where 1 s i < Ncandidates, and 1 < j < Nsw C - one instance of a cipher label in the Selected Word under consideration.

Pi j = candidate plaintext letter, within candidate plaintext word 'i', for cipher label C.

Ncandidates = total number of candidate plaintext words for the Selected Word.

Nsw = total number of instances of cipher labels in the Selected Word.

This "Library Function" is used to help eliminate unacceptable candidates from among the list of candidate plaintext words generated as discussed above.

Whenever a candidate plaintext word 'its' (1 < i* <Ncandidates) produces the value L(Cj, Pi*j) = 0 for any j' (1 <j < Nsw), it will be eliminated as a possible plaintext answer for the Selected Word under consideration. The one surviving candidate plaintext word 'i+', for which L(Cj, Pi+j) + 0 for all 'j' (1 <j < Now), is assumed to be the correct plaintext substitute for the Selected Word under consideration, and it establishes the mappings between the cipher labels contained within this Selected Word and their plaintext identities, i.e. character recognition has been performed just as is done using the Trial Words and Isomorphic Pattern Recognition.

In this embodiment, the plaintext identities of the cipher labels recognized are then substituted throughout the Document Word List, into all cipher words containing these cipher labels. This type of substitution process is equivalent to performing character recognition instantaneously, instead of sequentially (as is done in alternative recognition technologies). Selected Words continue to be chosen from the Document Word List until all cipher labels that can be recognized have been recognized.

Figure 18 is an example showing the use of the "Library Function". In this example there are 3 cipher labels yet to be identified remaining in the Selected Word, cipher labels 04, 09, and 08, so Nsw = 3. There are 3 Candidate plaintext words: "business", "cosiness" and "rosiness", so Ncandidates = 3. From the sample library result shown in Figure 18, the one surviving candidate plaintext

word is i=1. This establishes the following mappings for the cipher labels: 04orb, 09ou and 08e.

In the case in which the Raster-Plaintext Library does not contain one or more raster letters represented by the cipher labels in the Selected Word under consideration, then the "Library Function" will eliminate all candidate plaintext words. In this case, the "Library Function" is not able to isolate the one correct answer from among the list of candidate plaintext words. In one aspect of the invention, however, to ensure that this one correct answer can be found during a full-text search of the document, a search engine will dynamically recreate and search the list of candidate plaintext words by using the Unrecognized Selected Word's cipher pattern, which will be stored in the scanned document's output file. This Unrecognized Selected Word's cipher pattern will be referred to here as its Pattern Tag.

Figure 19 depicts one embodiment of the invention in which a search engine would recreate and search a list of candidate plaintext words for the Pattern Tag under consideration in accordance with the present invention. In this embodiment, if the Root isomorphic pattern of the Pattern Tag is identical to the Root isomorphic pattern of the Input Search Word and the plaintext letters in the Pattern Tag are identical to the plaintext letters in the Input Search Word then the search engine of this embodiment of the present invention will recreate and search the list of candidate plaintext words generated.

As shown in the example in Figure 19, since the criteria for the search engine to create the list are met, the search engine will use the plaintext Isomorphic Dictionary to recreate the list of candidate plaintext words fitting the criteria. When this list is then searched, the Input Search Word is found even though cipher labels remained unidentified in the document. Of course, in certain situations, the Raster-Plaintext Library could be used to eliminate some of the spurious candidate Plaintext words recreated from Pattern Tags stored in the scanned documents output file.

Cipher Label Construction and Character Recognition Below is a detailed description of another embodiment of the invention.

First we will discuss the pre-processing stage in which the pixel images are isolated.

Pre-processing In the discussion that follows, a 'blob' (or connected-component) is defined as a group of non-white pixels (assuming a white image background) that touch one another. For example, all of the black pixels in the shape of the raster letter 'A'. Also, an 'image' is defined as a raster image, perhaps generated by scanning a paper document, or it may be a pre-existing raster image.

Objectives: 1) Strip out figures and other graphical elements (e.g., borders of a table) from a raster image.

2) Extract 'blobs' from a raster image.

3) For each blob pattern, identify its instances across the raster images of a raster document file.

4) Identify which blobs form words.

5) Identify which blobs are punctuation symbols.

6) Segment blobs which comprise more than one character.

These obiectives are met using the following methods.

1) Stripping Graphical Elements from the Image: Several techniques have been published in the Image Processing literature for locating the different elements (e.g., tables, figures, paragraphs) with a raster image of a document. Some or all of these techniques will be implemented.

2) Extracting Character Blobs from the Image: The public-domain algorithm and supporting software provided by National Institutes of Standards and Technology (NIST) will be used.

3) Identifying Instances of a Blob: Each blob is returned by the NIST algorithm as a rectangular raster image.

The dimensions of the blob image depend on the size of the printed character and the resolution at which the page was scanned. For example, a blob for the letter 'c' might have the image, as shown in Figure 20 where the '0' values are white (background) pixels and the '1 's are black pixels defining the letter. Given the raster image of a blob, it is possible to tell whether the image of a second blob is either: (a) another instance of the first blob, or (b) an instance of a different blob.

This matching of blobs is accomplished using several fundamental image processing algorithms: i.) 'Exclusive-OR' Matching: Exclusive OR (written 'XOR') is a mathematical operator that is used to compare pixels in binary images (i.e., images where pixels are either black or white). By overlapping the images of two blobs, the XOR operator returns a 'result' image that shows which pixels in the two images matched and which mismatched. In the XOR image, a '1' shows where pixels in the original images did not match and a '0' where they did. For example, consider the following two blob images shown in Figure 21.

By examining the XOR result of pairs of blobs extracted from the image, it is possible to determine how closely they match one another. If a pair of blobs matches after the XOR operation within a defined tolerance, the pair is deemed lo be instances of the same blob pattern. If the blobs mismatch, then further analyses is performed (see below) to determine whether or not they match.

ii.) Histogram Matching: Compute 'vertical' and 'horizontal' histograms of blob images and then compare these histograms for two blobs to determine whether they match or not. The histograms are simply measures of the number of black pixels in the image rows (in the vertical histogram) and columns (in the horizontal histogram). For example, the vertical and horizontal histograms for the two sample blobs shown in i.) above would be: blob 1: vertical histogram: 5 11 5 horizontal histogram: 22422 blob 2: vertical histogram: 111 3 horizontal histogram: 01410 If the histograms show significant similarity, the pair of blob images is considered to be instances of the same blob, otherwise a potential mismatch is declared and further tests are applied to be certain.

iii.) Scanline Counts: A scanline count is defined to be the number of transitions from a black pixel to a white pixel while moving from left to right (vertical

scanline measure) and from top to bottom (horizontal scanline measure) along a given row/column of pixels within a blob image.

When reaching the end of a scanline at the edge of the image, the next pixel, were it there, is considered to be white. For example, the scanline counts for the two sample images above are: blob 1: vertical scanline counts: 1 1 1 1 horizontal scanline counts: 22122 blob 2: vertical scanline counts: 1 1 1 1 horizontal scanline counts: 01110 Scanline information is very useful, for example, in distinguishing between the letters 'o' and 'c' since the vertical scanline measure will highlight the gap on the right side of the image for 'c', the feature that distinguishes it from an 'o' Similarly, the horizontal scanline can show the differences between an 'h' and a iv.) Scatter Measure of the XOR Image: If the previous tests have not indicated a clear (mis)match between two blob images, then the scatter is measured in the XOR image returned in i.) above. 'Scatter' is a measure of how spread out the black pixels are in the XOR image. In image processing terminology, a convolution of the filter is used to measure this 111 101 111

over the XOR image. This filter returns the number of black pixels within the eight neighboring positions of the pixel. Applying it to the XOR image in i.) above would yield the following matrix: 11211 .. 22222... 11011... 01010 By summing the elements in the matrix, (in this case 22), a measure of the scatter is obtained for the XOR image. Blob pairs which mismatch will record higher scatter values than those found in the XOR images of matching blobs, permitting us to use a threshold in deciding upon a (mis)match classification for the blob pair.

4) Building Words from Blobs: 'Baselines' are located for the characters within a paragraph using histograms that measure the locations of blob images relative to one another. This is standard practice in document recognition.

By examining the inter-blob spacing on each baseline, it is trivial to determine which blobs constitute the words on that line. The word structure information is stored in the Metafile (output file) so that the cipher words of the document are solved the moment the plaintext identities for the individual blobs are recognized.

5) Identifying Blobs that Represent Punctuation: All punctuation symbols exhibit distinct typesetting behavior. For example, any blob with an image that looks like a black circle, and where that blob lies on or close to a baseline (determined in 4 above), is considered to be a 'period' or 'decimal point'. Another example is a 'hyphen' which lies roughly midway between successive baselines and has a rectangular shape

with a horizontal dimension considerably greater than its vertical one.

Therefore by examining the relationships between blobs and baselines, it can be determined which blobs are punctuation symbols and which are not.

6) Segmenting Composite Blobs: Due to the typesetting layout and insufficient resolution in the document scanner, it is frequently the case that a blob image comprises more than one touching character. For example, the image in Figure 22 shows a 't' touching an 'n' By using our standard set of algorithms for finding matching blobs, it is possible to separate such touching characters using the blob image of one of the characters as it appears in individual form. For example, as shown in Figure 23, the non-composite image for the above 'n' would be similar to the image shown in Figure 23.

Matching this image over the right-hand side of the composite image above leaves the remaining image of the 't' which will can be verified by matching it against its non-composite form.

Answer-Array 1.) Answer-Array contains exactly 1 copy of each disguise (or cipher label).

2.) The contents of Answer-Array should remain available, even after recognition subroutine has ended, to ease analysis of future text in document. The contents are deleted only after the entire document has been processed.

3.) The moment that a "plaintext identity" is entered into the "Plaintext" column, this identity information is made available throughout the

software instantly (globally): nP-Array, nNP-Array, LA-Array, etc., everywhere.

A sample Answer-Array is shown in Figure 24.

No entry in any of the "lower case" or "upper case" possibilities columns can duplicate an answer in the "Plaintext" column in Answer-Array.

Hence, there should be only a few "lower case" possibilities in this array, since there will be so many in the "Plaintext" column.

Box is a description of the physical attributes of the "Bounding Box" surrounding the raster connected-component character (disguise).

1 = character in bounding box has an ascender. Ex: [ b d h k 1 t "f' (sometimes) ] 2 = character in bounding box has a descender. Ex: [g j p q y "f & z" (sometimes)] 3 = character in bounding box has neither an ascender nor a descender. Ex: [ a c e i m n o r s u v w x "z" (sometimes)] 4 = character in bounding box has both an ascender and a descender. Ex: [f (times italic) ly (touching characters) ] 5 = character in bounding box is most probably an upper case (Capital) letter. Ex: [A B C D E ...] (but could also represent [ 1 234567...]) 6 = character in bounding box is most probably a punctuation mark, or a "special" symbol easily identified from box. Ex: [() ( } [] . , : ; ' " ' " ! ? ---= % division sign (2 dots and a horizontal bar) multiplication dot ] 7 = character "mav" be a Capital letter not found by preprocessing analysis, or it may be a low-frequency lower case letter with an ascender.

Plaintext is that specific plaintext identity that is either 1.) determined uniquelv from TW-Array analysis, or 2.) selected from "lower case" or "upper case" possibilities, using traditional methods. This traditional method is not used, however until after the last line of the last page of the entire document has been processed. We strategically delav using this traditional method until the end because our hope is that, by finding future words containing the disguise in question, we can use the TW-Array analysis to more "confidently" uncover the plaintext identity of this disguise. So, we leave this "Plaintext" entry blank, until we encounter a future word containing the disguise in question, or until we reach the end of the document and use the traditional method. If, after the entire document has been analyzed, any "Plaintext" entry is left "blank" in the Answer- Array, then this implies that the disguise should be left as a raster Image.

If, for a given Disguise, the entire Row of Answer-Array is emptv, then it is overwhelmingly probable that the disguise is a Symbol! Plaintext identities using traditional techniques: To determine "font metric" characteristics, use the Bounding Box, histogram, and pixel densities for the letter "e" By using this information, we can select the closest matching font, and we can synthesize it as well, i.e., we can select the 1 font that most closely approximates the font of the present text. Now that the specific font has been chosen, synthesize the entire alphabet associated with this font: This includes lower case letters, upper case letters, numbers, punctuation marks, and symbols. Using this alphabet, we can now complete the verification using traditional techniques, as required in the above Answer-Array analysis.

Unlike the approaches by other vendors, we know in advance what the only acceptable plaintext possibilities are for each disguise!! This further increases the accuracy of results, and also the computational speed ofthe software! If for any disguise, the entries for "lower case" and "upper case" possibilities are empty, this implies that the disguise is most probably a Symbol, since punctuation marks are already known.

Example: If LA-Array has the word "Qt", and Q appears nowhere else in document, then we first try to limit plaintext possibilities by using the physical attributes of the Bounding Box of this character. If this gives a unique answer we stop. Otherwise this word is verified at the end of the document.

The sample answer array for this example is shown in Figure 25.

Separating Fonts la.) Create a Triliteral Frequency Distribution for ALL the disguised text, repeated words as well.

lib.) Access the Preprocessing-Array. This array contains ALL "assumed" identities of punctuation marks and "some" special characters in incoming text.

1c.) Select 1 sot the highest frequency disguise in distribution (if it's not a punctuation mark). This disguise should be a central disguise, with both left and right contacts.

lid.) Put this 1st disguise in Answer-Array, and note that it is in font #1.

le.) Analyze all contacts of all occurrences of 1st disguise, excluding contacts with itself (this would create an infinite loop). Put 1 copy of each new disguise (from contacts, if any exist) in Answer-Array. Contacts are all in font Hi.

(Note: Punctuation disguises are allowed in Answer-Array only when they are entered as contacts of non-Punctuation-Array disguises, not as central disguises! The only way that fonts can be accidentally intermixed is when punctuation marks are used as central disguises. This is because punctuation marks, like periods and commas, are often indistinguishable between fonts, and each can have contacts from more than one font simultaneously.) lf.) These new disguises, from contacts of 1st disguise (if any exist), am referred to here as 2nd, 3rd,..., kth.

1g.) Choose 2nd disguise as central disguise (if it's not a punctuation mark).

Put 1 copy of each new disguise, from its contacts, in Answer-Array.

They are font #1.

1h.) These new disguises, from contacts of 2nd disguise (if any exist), are referred to here as k+lth, k+2th, . . ., jth (k<j). All of which are in font # 1.

li.) Continue sequentially choosing central disguises, 3rd, 4th ..., and analyzing the contacts of each, until ALL contacts for font #1 have been evaluated.

lj.) Now, onlv at this point, do we consider punctuation marks as central disguises. But only in the very specific scenario listed below! ! 1k.) We consider onlv those punctuation marks as central disguises which have, as their left contact, a disguise previously classified as being in font #1 and which is not itself a punctuation mark. These very specific central disguises are referred to here as P 1, P2, .., Pi. From this collection of disguises, select only those disguises (if anv exist) that also have a right contact. Refer to this sub-collection of disguises as P1 P3. P3. and P6 (6=<i). Analvze the right contact of each. If a new disguise is encountered, tag it as a member of font #1 in Answer-Array. Now, consider the contiguous right contact of each of the previous right contacts (if any exist) ofP1, P3, and P6. If any new disguises are encountered, tag them also as members of font #1. Continue considering

contiguous right contacts of right contacts of right contacts... ofP1, P3, and P6, until no further right contacts exist.

11.) This concludes the process of collecting ALL the disguises which are members of font H 1.

ism.) Continue repeating steps la - 11 until ALL fonts are identified.

ion.) Now, for each font, rearrange all disguises, with highest frequency disguises occurring first and lowest frequency disguises occurring last in Answer-Array.

lo.) Be aware that single-letter Capital-letter words (that appears nowhere else in text), single symbols, and punctuation with no contacts, ALL appear as new fonts! But these new fonts may, in fact, be "identical" to neighboring fonts. We must determine if this is indeed the case at the end of the document, before any font metric information is calculated.

Note: There may exist "Conflicts Among Fonts", created for example when punctuation marks, like parentheses, contact both letters and digits. The letters are in font Hi, say, and the digits are in font #2. So the parentheses, therefore, provides <BR> <BR> <BR> <BR> a bridge between these two fonts, and have both fonts #1 Hi and H2 listed beside them.

This clearly implies that fonts #1 and H2 should, in fact, be the same font (font #1 = font #2). We will not, however, unify these fonts until the conclusion of the analysis of the entire documents, when font metric information and document formatting is being determined. The reason is because these conflicts of fonts help us to identify which fonts (representing "letters") will be sent to the Recognition software, and which fonts (representing "digits") will not.

Separating Letters. Numbers & Symbols A tiny array called a "Letter-Array" will contain each unique "number" of the font that will be sent to the Recognition software. Ex: Letter-Array = [ 1 2 4 7 ]. This

implies that onlv words having disguises in fonts H 1, 2, 4, and 7, will be sent to the Recognition software.

Note that it is only necessary to check the initial letter of each word, since it is assumed that entire word is in the same font! Now, sequentially analyze each font in the Answer-Array. In ALL of the steps that follow (steps 1-4), we will be excluding from consideration ALL punctuation marks and symbols.

1.) Count the number of unique disguises in font Hj (1=<j=<k), by counting the unique labels (unique "histogram" identities). Here, "k" is the last font in the Answer-Array.

2.) If for font Hj, unique disguises = 1, then skip it. Pattern software cannot handle only one disguise.

(Svmbols occur most often where unique disguises = 1. Must uncover identities of symbols using traditional techniques. Ex: & G' + @ + # < > \ /1 etc.) 3.) If for font Hj, unique disguises > 13, then this font contains "letters" Put this specific font H in the Letter-Array.

(The threshold 13 is chosen because digits and associated symbols will almost never exceed 13 disguises, excluding punctuation [ - , ( ) . ] and symbol[% ]).

4.) If for font Hj, 2=< unique disguises =< 13, then disguises can represent letters or digits (and perhaps symbols as well). To help decide which of the two choices we are dealing with, we ask (and answer) the following question, "Are all Bounding Boxes = 1 or 5 for all disguises in font Hj, Yes or No?".

If the answer is No. then this font most likely contains "letters" Put this specific font H in the Letter-Array.

If the answer is Yes, then this font can contain either "digits" or ALL Capital letter disguises. We must analyze this scenario further before deciding.

Check the following clues that point towards digits. If at least 1 match is made (from the list a-g below), then we "assume" the disguises are digits.

Otherwise, we assume they are letters.

a. Is there a oercent sign % contacting at least 1 of these disguises? Ex: 5.2% b. Is there a decimal point (period) whose "left contact" and "right contact" are both disguises within the group we are presently considering? Ex: 6.4 c. Is there a comma whose "left contact" and 3 contiguous "right contacts" are ALL disguises within the group we are presently considering? And whose 4th contiguous "right contact" is not in this group? Remember that this group contains no punctuation marks or "space" characters.

Ex: 1,945 $13,078,929 7,451.57 -16,001 (98,702) d. Is there a hyphen whose 3 contiguous "left contacts" and whose 4 contiguous "right contacts" are ALL disguises within the group we are presently considering? Ex: 743-1000 1-800- 324-8663 (415) 733-9674 (This approach won't find (800) 552-IDEA, for example). We must also require that the 4th contiguous "left contact" and the 5th contiguous "right contact"

Not; be disguises within the group we are presently considering. Ex: Hence, the phrase "connected- components" will be excluded as a possible choice.

However, this extra requirement will not exclude phrases like "NEVER-THE-LESS" But phrases like "Never-the-less" and "never-the-less" will never be considered, since these phrases do not contain ALL Capital letters! e. Is there a hyphen with No "left contact" and whose contiguous "right contact" is a disguise within the group we are presently considering? Ex: -743,100 -83.2 -.00415E76 f. Is there a colon whose 2 contiguous "right contacts" are both disguises within the group we are presently considering? Ex: 2:30 9:16:22.

(We can't find digits in the following format very easily, because we can't isolate the "backslash" from Bounding Boxes. Ex: 4/8/96 11/15/96).

g. Are there 2 hyphens, separated by exactly 2 disguises, whose 3 contiguous "left contacts" of the left hyphen, and whose 4 contiguous "right contacts" of the right hyphen, are ALL disguises within the group we are presently considering? Ex: 489-84-9956 (Social Security #). We must also require that the 4th contiguous "left contact" of the left hyphen, and the 5th contiguous "right contact" of the right hyphen Not be disguises within the group we are presently considering.

If the answer to any of the above questions is Yes, i.e., if at least 1 match was made from the above a-g clues, then put this specific

font # in the Digit-Array described below. If no match was made (ALL answers are No), then put this specific font # in the ALLCap-Array also described below.

A tiny array called a "Digit-Array" contains each unique "number" of the font that contains only digits. Font #s in this Digit-Array will not be sent to Recognition software. Rather, we must uncover the identities of disguises representing digits using the "traditional techniques" Ex: Digit-Array = [ 3 5 8 ].

A tiny array called an "ALLCap-Array" contains each unique "number" of the font that contains only Capital letters. Words in this font will be sent to Recognition software, but we must be careful to record the "upper case" plaintext answers! Ex: ALLCap-Array = [ 6 9 10 ].

Capital Letters The preprocessing analysis has already identified most, if not all, of the disguises representing upper case (Capital) letters, by using the physical attributes of bounding boxes (box = 5). These " assumed" Capital letter disguises are contained in the Preprocessing-Array. The preprocessing analysis, however, will find onlv those Capital letter disguises located a) at the beginning of a paragraph b) immediately following a period and c) following a period and other punctuation marks. The preprocessing analysis will not find other Capital letter disguises. We must determine these other Capital letter disguises separately (if any exist), before sending text stream to the Recognition subroutine. This special handling of disguised Capital-letter words arises because these words create an opportunity for 1 "plaintext" letter to be represented by 2 totally different disguises, perhaps within the same word. And our pattern word software prohibits this scenario. Once the assumed disguises representing Capital-letters are isolated, a 7 will be placed in the

"Box" column in the Answer-Array (replacing the pre-existing 1 in the Box), to notify us that this disguise may be a Capital letter.

To isolate the "assumed" disguises for Capital letters, which are not already in Preprocessing-Array (if any exist), the following steps are performed: 1.) Analyze all "right contacts" of hyphens (if any exist) for possible Capital letters.

2.) Access the Preprocessing-Array, which contains attributes of Bounding Boxes of individual characters.

3.) Analyze all of the initial disguises of each word in text stream.

Preprocessing analyzed only some initial disguises of words (like those occurring at the beginning of paragraphs, or after a period, etc.) 4.) If the Bounding Box for this initial disguise is not equal to 5, but it is = 1 or 3, then put it in Collection I, and record which font it belongs to. Onlv 1 copy of each valid disguise is allowed in Collection I.

5.) If the Bounding Box of this initial disguise is equal to 5, then skip it. Move on to the next word in text stream, if one exist.

6.) If the Bounding Box of this initial disguise is equal to 6, which means that it is an "assumed" punctuation mark, then skip over this and all other contiguous "assumed" punctuation mark disguises, until a non-punctuation mark disguise is encountered, within this same word (if one exists). If the Bounding Box for this non- punctuation disguise is equal to 5, then skip to next word in text stream. Otherwise, if it is = 1 or 3, then put this disguise in Collection I as well, and record which font it belongs to.

7.) Rearrange ALL disguises in Collection I: For each font separately, place highest frequency disguises first, and lowest frequency disguises last. Refer to this new arrangement as Collection II.

8.) For each disguise in Collection II, analyze its left contacts from triliteral frequency distribution (excluding any contacts with itself which would create an infinite loop). If the Bounding Box of at least 1 ofthese "left contacts" is not equal to 5 or 6, then eliminate immediately from consideration this original disguise from Collection II (i.e., an acceptable disguise (for a Capital letter) can only be preceded by another Capital letter or punctuation mark.

Nothing else).

9.) Sequentially analyze all disguises in Collection II (which were the "central" disguises in step 8), eliminating unacceptable disguises as we proceed.

10.) Any remaining disguises in Collection II, if any exist, "mav" be Capital-letters that were not captured in Preprocessing stage, or they may be low-frequency lower case letters with ascenders. For each of these disguises in Collection II, change the Bounding Box information so it's equal to 7.

Note: Words made up "entirelv" of ALL Capital-letters can be analyzed in the same way as words made up "entirelv" of ALL lower case letters (except that all their Bounding Boxes equal to 1 or 5.) There can exist anywhere from 26-52 total disguises (representing "letters"), per font. However, since the pattern word software supports only 26 total disguises, we must convert ALL words containing these disguises into their isomorphs, to reduce the # of disguises which need to be supported by pattern word software.

The following example text will be used to motivate the above methodology:

By examining the perceptions and attitudes of individual managers, we can also expose inconsistencies and conflicts among company executives, and we can get a feel for the company's expectations. After finishing the internal audit, we typically conduct an "external audit" of key industry observers and potential customers to determine whether the company's goals and expectations are, in fact, realistic.

(External audits will be discussed in greater detail later in this chapter.) Preprocessing-Array = [ B,'.A'() ] Only showing the relevant Bounding Boxes, which = 5 for A and B, and = 6 for the others in array.

Collection I = [ t f k d E 1] Collection II (Initially) = [ t d 1 f k E ] Unacceptable Left contacts = [ s o a n ~ ~ ] Collection II (Final) = [ k E ] Each disguise in the final form of Collection II "mav" represent an additional disguise for a possible Capital-letter.

For [ k E ], replace the 1 in each of their Bounding Boxes so that Bounding Boxes = 7.

Input Text Stream 1.) Do not read into Recognition software any word that is entirely plaintext.

2.) Strip off ALL punctuation marks from the "overall" text stream.

3.) Remove all hyphens from the "overall" text stream in the following way: a) if hyphen occurs at the end of the sentence, "delete" it, and "combine" its contiguous left contact with its contiguous right contact, to form a single

word. b.) otherwise, "delete" the hyphen and replace it with a "space character", thereby breaking all compound words and contractions into separate words. If any single letter words result from this process, then they should be eliminated (not read in), since the pattern software cannot handle them.

4.) The "overall" text stream can contain words from multiple fonts. Separate the "overall" text stream into "j" new text streams. There will be one text stream for each font # listed in the Letter-Array and the ALLCap-Array.

Text Stream #1, Text Stream #2, . ., Text Stream #j. Here, Text Stream #1 contains "only" words having all of their letters in font If 1. Text Stream #2 contains "only" words having all of their letters in font #2, etc.

5.) Each Text Stream will be analyzed separately in the Recognition software.

After each Text Stream is finished being processed by the Recognition software, sequentially input the next Text Stream into the Recognition software until ALL text in the document has been processed.

6.) For Text Stream Ifq (1=<q=<j), check to ensure that each word is not too long for the pattern software. If so, skip it. Do not read it in! 7.) Separate Text Stream Ifq into pattern words and non-pattern words: Pattern words go into nP-Arrays, where n = 3-Pmax: 3P-Array contains 3-letter pattern words, 4P-Array contains 4-letter pattern words, etc.

Any 2-letter pattern words must be an Acronym, since no 2-letter intelligible pattern words exits in English.

Skip ALL 2-letter pattern words. Do not read them in! Non-Pattern words go into nNP-Arrays, where n = 2-NPmax: 3NP- Array contains 3-letter non-pattern words, etc.

Non-Pattern 1-letter words are excluded, since they cannot be analyzed with pattern word software.

Skip ALL 1-letter non-pattern words. Do not read them in! Only 1 copy of each pattern or non-pattern word goes into each of the nP-Arrays or nNP-Arrays.

8.) If there exist more than 1 pattern word in any one of the nP-Arrays, then rearrange each such array so that words having that most repeated letters appear first, as shown in Figure 26.

All pattern words "tagged" as containing at least 1 "segmented XL of characters" are stored as the last entries in each nP-Array.

9.) If there exist more than 1 non-pattern word in any one of the nNP-Arrays, then rearrange each such array so words having the most "plaintext" letters appear first.

All non-pattern words "tagged" as containing at least 1 "segmented pair of characters" are stored as the last entries in each nNP-Array.

10.) Anv pattern or non-pattern word that is entirelv "plaintext". however. is simplv deleted from the array. Words appearing first in arrav must have at least 1 disguised letter! Solution Methodology For each Text Stream Ifq (1=<q=<j), we seek to perform our analysis by selecting first only "pattern" words, if possible. Of these pattern words being considered, the longest pattern words are selected first and the shortest pattern words are selected last. We proceed in this fashion because the number of words which fit a longer pattern will be limited when compared to those which fit a shorter pattern word.

In addition, a longer word contains more letters, making it easier to check the validity of answers.

The analysis begins by selecting the longest available pattern word, sending it through our pattern word software, and generating all the possible plaintext answers which fit this pattern. Our goal, essentially, it to find which of these possible plaintext answers is in fact the correct one. To prove or disprove the validity of an answer, we look for other words (focusing first on pattern words) which contain the same disguised-letters as those in Selected Word. The most desirable case, of course, would be to find a word made up entirely of disguised letters contained in the Selected Word. However, we proceed by looking first for the next longest pattern word which contains at least 1 disguised-letter in Selected Word. This, again, is done to increase computational speed, since longer pattern words have fewer answers which fit its pattern. These additional words, chosen strictly for verification purposes, are called Trial Words.

Our overall analysis will be performed within a Trial Word-Array (TW-Array).

Using this array, we will examine each of the possible plaintext answers for the Selected Word, and eliminate those answers which cannot also form an acceptable Trial Word. We continue selecting additional Trial Words, and eliminating possible answers, until only one of the answers remains for the Selected Word. This is the one correct answer that we have been seeking. If, however, ALL possible answers are eliminated in this process, then something went wrong. Perhaps one of the words (the Selected Word or Trial-Words) was an acronym. Since we don't know which of these words caused the problem, we put them all into the LA-Array to be reconsidered later.

When considering non-pattern words as the Selected Word in TW-Array (disguised below), we do the following:

1.) choose the "longest" non-pattern word which has its initial letter already in "Plaintext" column of Answer-Array. In the event that no such non-pattern words exist, then 2.) reverse our earlier procedure for this moment only, and choose the "shortest" non-pattern word as the Selected Word.

When considering non-pattern words as Trial-Words in TW-Array (disguised below), we do the following: Choose the "longest" non-pattern word that has either 1.) its initial letter already in "Plaintext" column and it has at least 1 disguise that is the same as a disguise in the Selected Word, or 2.) its initial letter (disguise) is not already in "Plaintext" column, but it does appear in the Selected Word, or 3.) if the first 2 options are not available, then we simply choose the "shortest" non-pattern word that contains at least 1 disguise that is the same as a disguise in the Selected Word. Note that steps 1-3 apply separately to each new Trial-Word being selected, for a given Selected Word. All this is done strictly to increase computational speed of the software.

Also, non-pattern words are selected only after ALL pattern words, containing at least 1 disguise that is the same as a disguise in the Selected Word, have been exhausted. See Figure 27.

0 = pattern word software found no acceptable plaintext words for this pattern.

plaintext TW = pattern word software found 1 -and-onlv- 1 acceptable plaintext word for this pattern.

2= pattern word software found two-or-more acceptable plaintext words for this pattern.

Isomorph = Isomorphic Word, if initial letter of word is "lower- case" (Box not equal to 5), or if initial letter of word is Capital (Box = 5) and word is exactly 2-letters in length.

Isomorph = Isomorphic-Array, if initial letter of word is "Capital" (Box = 5), and word is > 2-letters.

Important: Isomorphs (Words or Arrays), for each Trial-Words must be calculated for each Row, separately, of TW-Array! ! ! ! Note: Whenever 0 occurs, that entire Row of TW-Array is eliminated immediately from further consideration.

The correct plaintext answer, for the Selected Word presently under consideration, is in the 1-and-only-i surviving Row that did not get eliminated by any of the Trial- Words. It is the only Row containing no 0 entries. When we have isolated this one correct answer, look further along this same Row to see if any "plaintext TW" Trial-Words exist. If so, and if this Trial Word contains additional disguised-letters not in the Selected Word, then we have uncovered additional identities at no extra computational cost. If, however, we use ALL available Trial-Words (if any exist) and still have not reduced the TW-Array to one unique plaintext answer, then this specific Selected Word, presently under consideration, is eliminated from the TW- Array and placed in the LA-Array for further analysis later. The associated Trial- Words are not however, put in the LA-Array. At this point, we choose another Selected Word (if one exists) and continue our analysis. Whenever a word is put into the LA-Array, it must be deleted from its original nP- or nNP-Array.

If ALL possible plaintext answers for Selected Word are eliminated (i.e., all Rows contain a "0" entry), then something went wrong. There are six likely possibilities: (1) Incorrect Segmentation of characters contained in TW-Array (Selected Word or Trial-Words). (2) Touching Characters (usually involving low frequency letters).

(3) Broken Characters. (4) Acronyms. (5) Words with Apostrophes (which must be "tagged" also). (6) Other (for example-preprocessing errors, such as assigning two different histogram labels (disguises) to the same lower-case (or upper-case) letter, for the same font).

Possibility (1) can arise only if there exist at least 1 "tagged" word in the TW-Array, and All Rows contain a "0" entry. When this scenario is encountered, we move immediately to the Resegmentation Algorithm discussed below, for further analysis,

with the pre-existing words in TW-Array intacted. If, however, no "tagged" word exist, then we are dealing with scenario (2), (3), (4), (5), or (6). In this case, we put the Selected Word and its associated Trial-Words into the LA-Array. Within the LA-Array, the 1 of 5 specific scenarios that we are presently faced with will be determined, and referred to the appropriate algorithm for further analysis.

Note: In (1), the incorrectly segmented disguises, after they are resegmented correctly, may now have identities in "Plaintext" column, since those correct disguises may have appeared in other words. In (2) the disguises will always remain disguises, until their plaintext identities are uncovered with a Broken-Array analysis.

All plaintext identities uncovered are put immediately into the "Plaintext" column, at the successful conclusion of each TW-Array analysis, for each Selected Word.

The contents of Answer-Arrav should remain available. even after recognition subroutine has ended, to ease analysis of future text in document.

Isomorphic-Array The following explains how the pattern word software can be used to analyze disguised variant words, where the variant-cipher letter may create 2 disguises for the same plaintext letter within the same word, even though the software was not designed to handle this scenario. Create an Isomorphic-Array for each variant-letter word greater than 2-letters, whether it be a Capital letter word, in the "Selected Word" or "Trial-Word", or a Pattern Tag, with 1 or 2 possible variants represented as raster letters, or a composite Touching Character representing two individual characters.

We use the notation: Isomorphic Array (Variant word, Number of variant letters (1 or 2), location of variant #1 (from left), location of variant #2 (from left)).

Example: Consider Isomorphic Array (A B C D ED F G H I B, 2, 8, 10). We create the Isomorphic-Array shown in Figure 28 in the following way. Create an outer loop for the first variant label at location 8: "Outer Loop-G". For this outer loop, we need 1 matrix for each unique letter [A B C D E F G H], i.e., G=A, G=B, G=C, G=D, G=E, G=F, G=G, G=H. Notice that in the "outer loop", "G is not equal to F', since this would only duplicate information in the matrix G=G. For the "Inner Loop', we need the column vector (for each matrix) to have 1 entry for each unique letter [A B C D E F G H I], i.e., I=A, I=B, I=C, I=D, I=E, I=F, I=G, I=H, I=I. Notice that in the "Inner Loop", "I is equal to G"! ALL of the "possible" plaintext answers for "G" and "I" are generated from the pattern words in the matrices shown below. Note: "# of matrices = # of unique letters in variant word (excluding 2nd variant letter)" "# of rows in each matrix = Total # of unique letters" In the special case when only 1 variant exists, as with Capital Letters, the entry for the 'location of 1st variant' if left blank, implying only 1 matrix is considered!" Example: Capital Letter If for "Z", the Bounding Box = 5, then we assume it may be a Capital Letter. So we consider the Isomorphic Array I(ZntHmHdatHon, 1 , , 1). When the location for the first variant is let blank, this implies that there will be only 1 outer loop matrix (by default), and hence only 1 matrix total. Also, only 1 unique variant exists (for the Inner Loop), and it occurs at location 1 (first location in array).

Note: In the special case of Capital letter word, made up of 2-letters only, Isomorph = Isomorphic Word, in TW-Array (not Isomorphic-Array!) The reason no array is used is because there are no 2 letter pattern words in English.

I(ZntHmHdatHon) ZntHmHdatHon = AntAmAdatBon

The Bounding Box = 5 for A. Thus I(ZntHmHdatHon, 1 ,~, 1) is equivalent to the fully expanded array!! As seen in Figure 29, the only possible answer to this pattern occurs when A=B. So, this initial letter Z must necessarily be the disguise for the plaintext Capital-letter represented by the disguise H! Using our pattern word software on the rows ofthis array, we find that: A=B=i, which is the same as Z=H=i. Here, Bounding Box=3 for i.

Note: Without this Isomorphic-Array, the above pattern will be viewed as AntBmBdatBon. With this array, we can also consider the additional pattern AntAmAdatAon (and others).

Example: Pattern Tag (1 variant) If"Z" is the 1 cipher label used in the Pattern Tag, then we consider the Isomorphic Array I(HntHmHdatZon, 1 ,~, 1). When the location for the first variant is let blank, this implies that there will be only 1 outer loop matrix (by default), and hence only 1 matrix total. Also, only 1 unique variant exists (for the Inner Loop), and it occurs at location 9 (9th location in array).

I(HntHidatZon) HntHmHdatZon = AntAmAdatBon The I(HntHmHdatZon, 1 , 1) is equivalent to the fully expanded array! ! As seen in Figure 30, the only possible answer to this pattern occurs when A=B.

Using our pattern word software on the rows of this array, we find that: A=B=i, which is the same as Z=H=i.

Note: Without this Isomorphic-Array, the above pattern will be viewed as AntAmAdatBon. With this array, we can also consider the additional pattern AntAmAdatAon (and others).

Example: Pattern Tag (2 variants) Assume we have a Pattern Tag that contains 2 cipher labels (Pattern Tags containing 1 cipher label is treated just like a Capital Letter). The search engine will put this Pattern Tag into the Pattern word software using the Isomorphic Array I(managemABt,2,8,9). The list of candidate plaintext answers for this Pattern Tag are generated from the matrices shown in Figure 31. We seek to uncover the identities of A and B. For the "outer loop-A", we need 1 matrix for each unique letter [mangeAt], i.e., A=m, A=a, A=n, A=g, A=e, A=A, A=t. Notice that in the "outer loop", "A is not equal to B", since this would only duplicate information in the matrix A=A. For the "inner loop-B", we need the column vector (for each matrix) to have 1 entry for each unique letter [mangeABt], i.e., B=m, B=a, B=n, B=g, B=e, B=A, B=B, B=t. Notice that in the "inner loop", "B is equal to A"! Restrictions Placed Upon Plaintext Entries in TW-Arrav The most important points to be made in this section are the following: i. Only disguises representing lower-case letters can place restrictions on the plaintext identities of other disguises representing lower-case letters.

ii. Only disguises representing Capital letters can place restrictions on the plaintext identities of other disguises representing Capital letters.

iii. Only disguised letters within the Selected-Word can place restrictions on disguised letters within any Trial-Words that follow it.

iv. Disguised letters in any one Trial-Word cannot place restrictions on the disguised letters within any other Trial- Word.

6. Assume a disguise, say "L", represents a lower-case letter (Box not equal to 5) in font Ifj where 1=<j=<k. Assume further that "L" appears in at least 1 word in the TW-Array. Now consider the following: a. If"L" appears in the Selected Word, then it is not allowed to have the same plaintext identity as a lower-case letter (Box not equal to 5) appearing in "Plaintext" column, for font Ifj.

Note: Pattern software automatically prevents different disguises, within the same word and each representing lower-case letters (Boxes not equal to 5), from having the same plaintext identities.

b. If"L" appears in a Trial-Word, then it is not allowed to have the same plaintext identity as a different disguise, say "B" (Box not equal to 5), where "B" has appeared in the Selected Word. In addition, "L" is not allowed to have the same plaintext identity as a lower-case letter (Box not equal to 5) appearing in "Plaintext" column, for font Ifj.

Note: The "Plaintext" column is updated onlv after the successful conclusion of a TW-Array analysis for a "specific" Selected Word. That is to say, onlv when there exist exactly 1 and only 1 surviving Row in TW-Array do we then update "Plaintext" column.

7. Assume a disguise, say "C", represents a Capital letter (Box = 5) in font Ifj, where 1=<j=<k. Assume further that "C" appears as the initial letter in at least 1 word in the TW-Array. Now consider the following: a. If"C" appears as the initial letter in the Selected Word, then it is not allowed to have the same plaintext identity as a Capital letter (Box = 5) appearing in "Plaintext" column, for font Ifj.

Note: Isomorphic-Array, which uses pattern word software, does not prevent the Capital letter and a lower-case letter, within the same word, from having the same plaintext identities.

b. If "C" appears in a Trial-Word, then it is not allowed to have the same plaintext identity as a different Capital letter disguise, say "B" (Box = 5), where "B" has appeared as the initial letter in the Selected Word. In addition, "C" is not allowed to have the same plaintext identity as a Capital letter (Box = 5) appearing in "Plaintext" column, for font Ifj.

Consider the Following Example: A Sample TW-Array is shown in Figure 32.

Note: It is always true that pattern word software automatically prevents different disguises representing lower-case letters, within the same word, from having the same plaintext identities.

Selected Word: A B i. Neither A nor B is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. (pattern word software automatically enforces the restriction: A cannot equal B)

Trial-Word #1: BAC i. "A" and "B" in Trial-Word #1 have the same plaintext identities as "A" and "B" in Selected-Word, for each Row separately.

ii. C is not allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. (pattern word software automatically enforces the restriction: A cannot equal B cannot equal C) iii. C cannot equal A or B, since C cannot have the same plaintext identity as a different disguise, also representing a lower-case letter (Box not equal to 5), in the Selected-Word.

Trial-Word #2: C a D B i. "B" in Trial-Word #2 has the same plaintext identity as "B" in Selected-Word, for each Row separately.

ii. Neither C nor D is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. Note that lower-case "a" is alreadv in "Plaintext" column.

(pattern word software automatically enforces the restriction: B cannot equal C cannot equal D cannot equal a) iii. Neither C nor D can equal A, since C and D cannot have the same plaintext identity as a different disguise, also representing a lower-case letter (Box not equal to 5), in the Selected-Word.

Trial-Word #3: I(C AD E e) i. "A" in Trial-Word #3 has the same plaintext identity as "A" in Selected-Word, for each Row separately.

ii. C is not allowed to have the same plaintext identity as a Capital letter (Box = 5) alreadv in the "Plaintext" column. for font Ifi. Note that lower-case "e" is alreadv in "Plaintext" column.

iii. C cannot equal the Capital letter in Selected-Word (if it contains one), since C cannot have the same plaintext identity as a different disguise, also representing a Capital letter (Box = 5). In the present case, however, no Capital letter occurs in the Selected-Word.

iv. Neither D nor E is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. (pattern word software automatically enforces the restriction: A cannot equal D cannot equal E cannot equal e). Note that C is missing from this list.

v. D and E cannot equal B, since D and E cannot have the same plaintext identity as a different disguise, also representing a lower-case letter (Box not equal to 5), in the Selected-Word.

Trial-Word #4: C D A E F i. "A" in Trial-Word #4 has the same plaintext identity as "A" in Selected-Word, for each Row separately.

ii. Neither C, D, E nor F is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. (pattern word software automatically enforces the restriction: A cannot equal C cannot equal D cannot equal E cannot equal F) iii. Neither C, D, E, nor F can equal B, since C, D, E, and F cannot have the same plaintext identity as a different disguise, also representing a lower-case letter (Box not equal to 5), in the Selected-Word.

Trial-Word #5: I(C A D E F) i. "A" in Trial-Word #5 has the same plaintext identity as "A" in Selected-Word, for each Row separately.

ii. C is not allowed to have the same plaintext identity as a Capital letter (Box = 5) already in the "Plaintext" column, for font Ifj.

iii. C cannot equal the Capital letter in Selected-Word (if it contains one), since C cannot have the same plaintext identity as a different disguise, also representing a Capital letter (Box = 5). In the present case, however, no Capital letter occurs in Selected-Word.

iv. Neither A, D, E nor F is allowed to have the same plaintext identity as a lower-case letter (Boxes not equal to 5) already in the "Plaintext" column, for font Ifj. (pattern word software automatically enforces the restriction: A cannot equal D cannot equal E cannot equal F). Note that C is missing from this list.

v. Neither D. E. nor F can equal B. since D. E. and F cannot have the same plaintext identitv as a different disguise. also representing a lower-case letter (Box not equal to 5), in the Selected-Word.

Resegmentation Algorithm Resegmentation is required when there exist at least 1 "tagged" word in the TW- Array and ALL Rows in contain a "0" entry. When the Resegmentation algorithm is called, it must go to work on a TW-Array with "pre-existing" Selected Word and/or Trial-Words. There are two cases that we must consider:

Note: The instant when Re segmentation of a pair of letters occurs, transmit this information immediately throughout the entire document: All Algorithms and All Arrays! All "tagged" words containing "j" incorrectly segmented pairs will have their "word-lengths" reduced by 'j" letters as well, after Resegmentation has occurred. Therefore, these words must be "shifted" to new nP-Arrays or nNP-Arrays (assuming that Resegmentation does not render word entirely plaintext, in which case we can eliminate it from Arrays).

After Resegmentation, remove the "tags" from the Resegmented words.

Note: The "tag", which is attached to each word containing segmented pairs of letters, must specify the number # and the location of each segmented pair within each tagged word. It is "assumed" in (a) and (b) below, that the Resegmented plaintext letters [m w b W] are not already in "Plaintext" column! If this "assumption" is not satisfied, then immediately put the associated "tagged" word into the LA-Array.

1. ) The Selected Word is the 1 sot "tagged" word encountered in TW-Array.

When Selected Word is a "taged" word, we eliminate immediately All pre- existing Trial-Words from TW-Array, and choose new Trial-Words after Resegmentation of Selected Word is completed. This is done since all Trial- Words must contain at least 1 disguise that is "identical" to a disguise in the Selected Word. However, once Resegmentation of the Selected Word occurs, original disguises within Selected Word may be eliminated. This could render earlier Trial-Words as inappropriate for this newly Resegmented Selected Word.

(a) If 1 or both of the letters are plaintext in the incorrectly segmented pair, then we perform Resegmentation in the following way: If Bounding Boxes Pattern = 3 3 and segmented pair has plaintext r ~, n n, or r n then Resegment this pair as the single plaintext letter "m"

If Bounding Boxes Pattern = 3 3 and segmented pair has plaintext y y, then Resegment this pair as the single plaintext letter "w" If Bounding Boxes Pattern = 3 1 and segmented pair has plaintext c ~ 1, or g l, then Resegment this pair as the single plaintext letter "d" Bounding Boxes Pattern = 5 5 and segmented pair has plaintext V V then Resegment this pair as the single plaintext letter "W,'.

(b) If neither of the letters are plaintext in the incorrectly segmented pair, then we perform Resegmentation in the following way: If Bounding Boxes Pattern = 3 3 and segmented pair has identical disguises A A, then Resegment this pair as the single plaintext letter "w" If Bounding Boxes Pattern = 3 3 and segmented pair has different disguises A B then Resegment this pair as the single plaintext letter "m".

If Bounding Boxes Pattern = 3 1 and segmented pair has disguises A B then Resegment this pair as the single plaintext letter "d" If Bounding Boxes Pattern = 5 5 and segmented pair has identical disguises A A, then Resegment this pair as the single plaintext letter "W" 2.) Trial-Word Ifi is the 1st "tagged" word encountered in TW-Array, where j=<l. (Selected Word is not "tagged") When Trial-Word Ifi is a "tagged" word. we eliminate immediatelv All Trial-Words that follow Trial-Words Ifi in the TW-Array. if any exist (i.e.. we eliminate Trial-Words Ifi+1, Trial-Words ifs+2, .). Now, perform Resegmentation on Trial-Words Ifi itself. If, after Resegmentation, Trial-Words Ifi still contains at least 1 disguise identical to a disguise in Selected Word, then proceed to find all acceptable plaintext answers as usual, subject to the appropriate restrictions. If, after Resegmentation, Trial-Words jIf no longer contains any disguises identical to a disguise in Selected Word, then remove this word from the TW-Array (through not from nP- or nNP-Arrays), choose a new Trial-Word, and proceed as usual. We continue choosing Trial-Words until a unique plaintext answer emerges for Selected

Word, or the TW-Array analysis fails again, etc. Resegmentation is performed as described in l(a) and (b).

Touching-Array 1.) Delete all words that are entirelv plaintext! 2.) The only words allowed in this Touching algorithm are those words containing exactly 1 and only 1 disguise. Words containing 2 disguises are "assumed" to represent a single character which has been "Broken" into exactly 2-pieces only (i.e., 1 character is represented by 2 disguises). In words having exactly 1 disguise, this disguise may represent a "single letter", or it may represent a composite (touching) "pair of letters" In addition, this "Touching-Array" algorithm works onlv with "All Lower Case" or "All Capital Letters" not with "mixed" letters! 3.) Prioritize all words to be considered for "Touching-Array" analysis. We consider the longest words first.

4.) A word "suspected" of containing 1 "pair" of touching letters may, in fact, just contain a "single letter", where the word containing it was tossed out of the TW-Array because there was no Trial-Word to verify its identity (perhaps because disguise is a low frequency letter). We have to allow for both possibilities.

5.) We proceed by creating an @-Array, which contains 2 vertical columns "A" and "B".

Put this "suspect" word in the pattern word software without modification, i.e., we assume here that @ is a "single letter" disguise. For those plaintext answers which emerge, if any exist (excluding those plaintext answers already

in "Plaintext" column), put these identities in the "B" column of the (0-Array.

Column "A" is intentionally left "blank" in this scenario. Next, create a "Touching-Array" (discussed below) for this word, by replacing ( by by the pair "AB" i.e., we assume here that ( ) is a "pair of touching letters" For those plaintext answers which emerge if any exist. put exactly 1 copv of these identities in both the "A" and "B" columns of the @-Array. It is important to note here that both "A" and "B" are allowed to have the same identities as letters already in "Plaintext" column! We also use Bounding Boxes information to restrict the allowable plaintext answers for these "pairs of letters" If the "Touching-Array" analysis produces more than 1 answer, then using the "Postprocessing" analysis (discussed below), we can either identify the 1 correct "pair" of answers, or even further restrict the number of acceptable plaintext choices. These choices are then put into either the "lower case possibilities" or "upper case possibilities" columns for even further analysis later using traditional techniques.

Ex: If pattern word software is used on the "suspect" word "managemi).t", no answer emerges if @ # is is viewed as a "single letter" Now we must see if there exist answers when ( is viewed as a "pair of letters" So we perform the "Touching-Array" analysis for this word, which is shown below.

6.) If at the conclusion of the Touching-Array analysis, the -Array contains exactly 1 and only 1 answer, then this 1 unique plaintext answer is put in "Plaintext" column. If the (B-Array contains more than 1 answer, even after "Postprocessin,g", then put these choices into either the "lower case possibilities" or "upper case possibilities" column.

7.) To distinauish between the "single letter" possibilities and the "pairs of letters" possibilities. coexisting in the same columns in Answer-Arrav the "pairs of letters" will be surrounded bv parentheses!

Example: Consider "managem )t". Assuming that @ represents 1 pair of touching characters, we create the Touching-Arrav for the Alternative word "manaemABt", and set Touching Array=I{managemABt, 2,8,9). We seek to uncover the identities of A and B. For the "outerloop-A", we need 1 matrix for each unique letter [mangeAt], .i.e., A=m, A=a. A=n, A=g, A=e, A=A, A=t. Notice that in the "outer loop", "A is not equal to B", since this would only duplicate information in the matrix A=A. For the "innerloop-B", we need the column vector (for each matrix) to have 1 entry for each unique letter [mangeABt], i.e., B=m, B=a, B=n, B=g, B=e, B=A, B=B, B=t. Notice that in the "inner loop", "B is equal to A"! ALL ofthe "possible" plaintext answers for "A" and "B" are shown in the @-Array (including "single letter" answers, if any exist). Since there is exactly 1 unique answer in this array, put "(=en" in "Plaintext" column. If @- Array had had more entries for "pairs of letters", then we would have used the "Postprocessing" analysis to either choose the 1 correct "pair" representing the answer, or further limit the number of acceptable answers. A sample Touching- Array for "managemgt" is shown in Figure 33.

Note: Use the Bounding Box information on (0 to limit the plaintext choices for "A" and "B" (be aware that "A" and "B" can equal letters already in "Plaintext" column).

Example: Consider the phrase "the whig paper" We suspect that whi( may contain 1 pair of touching characters. Create the Touching-Arrav by considering 1.) (0 as a "single letter" disguise in whi@, and 2.) @=AB as 1 pair of touching characters, with Touching Array=I(whiAB,2,4,5). We seek to uncover the identities of A and B. In step 2, for the "outer loop-A", we need 1 matrix for each unique letter [whiA], i.e., A=w, A=h, A=i, and A=A. Notice that in the "outer loop", "A is not equal to B", since this would only duplicate information in the matrix A=A. Also in step 2, for the "inner loop-B", we need the column vector (for each matrix) to have 1 entry for each unique letter [whiAB], i.e., B=w, B=h, B=i, B=A, and B=B. Notice that in the "inner loop", "B is equal to A"! ALL of the

"possible" plaintext answers for "A" and "B" are shown in the @-Array (including "single letter" answers, if any exist).

Touching-Arrav for "whi(" 1.) # as a "single letter" in whi# =[dgmnprtz]. However, if we assume for the sake of discussion that [dgmnrt] are already in the "Plaintext" column of Answer-Array, then they must be excluded as possible answers for (0. The remaining 2 acceptable "single letter" answers are put into the "B" column of the #-Array.

2.) z = = AB in whi( ).. "whiAB" See Figure 34 AB=[ch sh ffrr zz le ms ne ns ny ps rl rs sk st te ts ty]. For @, the Bounding Box = 1.

Using this information, we can further reduce the number of acceptable plaintext answers for the AB pair. We eliminate immediately the single-letter plaintext answers [p z]. This reduced set of choices is placed in @-Array. Indeed, we can reduce this list of choices even further by performing a "Postprocessing" analysis.

If, after Postprocessing, there is still not 1 unique answer, then put these remaining choices into either the "lower case possibilities" or "upper case possibilities" column of the Answer-Array.

Note: When # is considered a "single letter", the plaintext answers are not allowed to be the same as letters already in "Plaintext" column in the Answer-Array. This

restriction is removed, however, when (0 is considered as a "pair of letters". Each letter of pairs is allowed to be the same as letters already in "Plaintext" column! ! "Broken-Arrav" 1.) Delete all words that are entirely plaintext! 2.) The onlv words allowed in this Broken algorithm are those words containing exactlv 2 and onlv 2 contiguous disguises. Thus. words containing 1 pair of continuous disguises are "assumed" to represent a "single" character which has been "Broken" into 2-pieces only (i.e., 1 character is represented by 1 pair of disguises). In words having exactly 1 pair of contiguous disguises, these disguises may represent "2 single letters", or they may represent "1 single letter", once this pair of disguises is "recombined". In addition, this "Broken- Array" algorithm works onlv with "All Lower Case" or "All Capital Letters", not with "mixed" letters! 3.) Prioritize all words to be considered for "Broken-Array" analysis. We consider the longest words first.

4.) A word "suspected" of containing the 2 pieces of a broken letter may, in fact, just contain "2 single letters, where the word containing these letters were tossed out of the TW-Array because their was no Trial-Word to verify their identities (perhaps because these disguises are low frequency letters). We have to allow for both possibilities.

5.) We proceed by creating an AB-Array, which contains 2 vertical columns "A" and "B" Put this "suspect" word in the pattern word software without modification, i.e., we assume here that "A" and "B" are "2 single letter disguises". For those plaintext answers which emerge, if any exist (excluding those answers having letters already in "Plaintext" column), put these identities in the respective "A" and "B" columns of the AB-Array.

Next, create a "Broken-Array" (disguised below) for this word, by replacing the "pair" of contiguous letters "AB" by the "single" letter disguise ( , i.e., we assume here that "AB" is a "single" letter, which has been broken in half.

For those plaintext answers which emerge, if any exist, put exactly 1 copy of these identities in the "B" column of the AB-Array. Column "A" is intentionally left "blank" in this scenario. It is important to note here that the "recombined" single letter disguise @ fts allowed to have the same identity as letters already in "Plaintext" column! We also use Bounding Boxes information to restrict the allowable plaintext answers for this "single letter" disguise @. If the "Broken-Array" analysis produced more than 1 answer, then by performing a "Postprocessing" analysis, we can either identify the 1 correct "single" letter answer, or even further restrict the number of acceptable plaintext choices. These choices are then put into either the "lower case possibilities" or "upper case possibilities" columns for even further analysis later using traditional techniques.

6.) If at the conclusion of the Broken-Array analysis, the AB-Array contains exactly 1 and only 1 answer, then this 1 unique plaintext answer is put in "Plaintext" column. If the AB-Array contains more than 1 answer, even after "Postprocessing", then put these choices into either the "lower case possibilities" or "upper case possibilities" column.

7.) If the AB-Array contains exactly 1 answer, and this answer is a "single letter" answer, then the original 2 Rows containing the "broken pieces" must be "collapsed" into 1 Row, to accommodate this correct "single letter" answer! Example: We "suspect" that the word "haABe" may contain a pair of contiguous disguises "AB" representing 1 broken character. Create the "Broken-Arrav" by considering 1.) "A" and "B" as 2 separate "single letter" disguises, and 2.) AB=@ as 1 recombined (originally broken) character in "ha(â)e", and we seek to uncover the identities of A, B, and (0. Notice that we need the column vector-B in step 2

to have an entry for each unique letter onlv [haBe], i.e., B=h, B=a, B=B, B=e. The plaintext answers for "A", "B", and (0 are shown in the AB-Array.

Broken-Arrav for "haABe" 1.) "A" and "B" as 2 separate "since letter" disguises in "haABe" AB = [gu ls lv nc ns rl rp st ut vr ws]. However, if we assume for the sake of discussion that [usncrptw] are already in the "Plaintext" column of Answer-Array, then they must be excluded as possible answers for both A and B (i.e., A cannot equal u, s, n, c, r, p, t, or w, and B cannot equal u, s, n, c, r, p, t, or w).

The only remaining "2 separate single letter" answers are AB=[Iv].

However, using Bounding Boxes = 3 ("ad hoc" guess), all of the "2 separate single letter" choices get eliminated! 2.) AB = ( ) as 1 singe letter in Broken-Arrav for: "ha( )e" (or "haBe") B AB-Arrav ha he ha ae ha Be ha e e m r V z B=[dklmrtvz]. For the recombined "AB", the Bounding Box = 3 ("ad hoc" guess).

Using this information, B=[mrvz]. Here, the plaintext choices for "B" are allowed to be the same as letters in the "Plaintext" column, because the connected-

components of this character is "damaged", and may not have been matched properly by the Preprocessing algorithm! ! We can reduce this list of choices even further by performing a "Postprocessing" analysis. If, after Postprocessing, there is still not 1 unique, then put the remaining choices into either the "lower case possibilities" or "upper case possibilities" column of the Answer-Array.

"Postprocessing Analvsis" Touching Characters When a plaintext letter is placed in the "Plaintext" column of the Answer-Array, it is immediately "linked" to its equivalent and unique-x-y histogram label. We can use this information to eliminate unacceptable answers when analyzing touching characters. In the Touching-Array analysis, we assume that 1 single composite disguise, once separated, can represent a collection of "pairs of letters" (@=AB).

Our goal in Postprocessing is to eliminate the unacceptable pairs from this collection, leaving the 1 correct par as the answer. It is possible that many of the plaintext identities of these "newly" separated letters (AB) may already be known, and sitting in the "Plaintext" column. All that is required to eliminate unacceptable pairs is to compare the known histogram label for the 1 single composite disguise @, with the "combined" known histogram labels for each letter appearing in both the "Plaintext" column, and in the "pair of letters" under consideration. Indeed, it may very well be possible to eliminate unacceptable pairs by comparing the known histogram label for the 1 composite disguise with only 1 of the 2 letters in any pair under consideration. We can continue to enhance our confidence in our ultimate choice even further by analyzing where it and all other alternatives occur in the prioritized list of the top 100 digraphs appearing later in this section. This frequency digraph is the one most likely to be the correct answer.

Broken Characters In the Broken-Array analysis, we assume that 2 contiguous disguises, once recombined, can represent a collection of"single letter" choices (AB=@). It is important to note that these recombined single letters are allowed to be the same as letters in the "Plaintext" column of the Answer-Array. No single letter can be excluded just because it has already appeared in Plaintext" column! Out goal in Postprocessing is to eliminate the unacceptable single letters from this collection, leaving the 1 correct letter as the answer. All that is required to eliminate unacceptable single letters it to compare the "combined" known histogram labels for the 2 contiguous disguises (AB), with the "known" histogram label for each letter (@) appearing in both the "Plaintext" column, and in collection of"single letter" choices under consideration. Indeed, it may very well be possible to eliminate unacceptable single letters by comparing the known histogram label for only 1 of the 2 contiguous disguises with the histogram for the single letter under consideration. Note, however, that if the single letter is not in the "Plaintext" column, then we do not know the histogram label for this single letter under consideration. Therefore, we cannot analyze this single letter by comparing histograms. The only remaining option available is to use traditional techniques.

Validating: Plaintext Words Throughout the document, each plaintext word (in font #j), which contains a Resegmented problem letter [m, w, d, or W], must be verified as a legitimate English word, by the pattern word software.

Example: For font Ifj, assume Resegmentation occurred for the problem letters = [md]. Therefore all words in the document (in font #j), containing "m" and "d" must be verified. There are two ways to proceed: (1) Check each word separately in pattern word software (even each repeated words), or (2) Create a cache of already verified words containing "m" and "d", arranged according to word-length.

A new m or "d" word is compared to words in the cache. If a match is made, the new word is verified. If no match is made, put this new word through the pattern software for verification. If verified, put this word in cache as well. Otherwise, skip it! It is important to note that we must be careful when using the pattern word software for verification. If there exist words containing the "touching" letters "rn", the Postprocessing software could incorrectly substitute the letter "m" in all words in the document containing this touching pair of letters. For example, suppose that the "touching" letters "rn" occur in the word "churn", and the Postprocessing software incorrectly substitutes "m" to product the incorrect word "chum" Since both words are valid English words, we cannot use the pattern word software alone to verify which of the two is correct. Therefore, in all verifications, we must consider the following two scenarios: 1) If no verified answer emerges, once plaintext word is put through pattern software, then skip this word. It could be an acronym.

2) If 1 verified answer does emerge, once plaintext word is put through pattern software, then we must check to see if a 2nd verified answer exists as well.

(a) If the letter "m" was Resegmented, and if the plaintext word contains an "m", then substitute an "rn" to see if a 2nd answer emerges.

If no 2nd answer emerges, then we have verified the original plaintext word containing "m" If a 2nd answer does emerge, then we have no way to choose which of the two plaintext words is correct: the one containing "m", or "rn".

Thus, we keep the plaintext word with the "m", and link to this word the 2nd choice containing the "rn", so a database search will hit either word! This is similar to our approach in handling "a raster word that is difficult to recognize", where a Pattern Tag, which provides full-text searching of this difficult raster word, is linked to the Mixed-Plaintext word (which contains both raster and plaintext letters) generated from this difficult raster word, and preserves page typography.

(b) If the letter "w" was Resegmented, and if the plaintext word contains an "w", then substitute a "w" to see if a 2nd answer emerges.

If no 2nd answer emerges, then we have verified the original plaintext word containing "w" If a 2nd answer does emerge, then we have no way to choose which of the two plaintext words is correct: the one containing "w", or "w" Thus, we keep the plaintext word with the "w", and link to this word the 2nd choice containing the "w", so a database search will hit either word! (c) If the letter "d" was Resegmented, and if the plaintext word contains a "d", then substitute a "cl" to see if a 2nd answer emerges.

If no 2nd answer emerges, then we have verified the original plaintext word containing "d" If a 2nd answer does emerge, then we have no way to choose which of the two plaintext words is correct: the one containing "d", or Thus, we keep the plaintext word with the "d", and link to this word the 2nd choice containing the "cl", so a database search will hit either word! (d) If the letter "W,' was Resegmented, and if the plaintext word contains an "my', then substitute a "VV" to see if a 2nd answer emerges.

If no 2nd answer emerges, then we have verified the original plaintext word containing "W,'.

If a 2nd answer does emerge, then we have no way to choose which of the two plaintext words is correct: the one containing "W", or "V V" Thus, we keep the plaintext word with the "W", and link to this word the 2nd choice containing the "VV", so a database search will hit either word!

Automatic Indexing & The Most Common Words in English The Metafile should be able to discriminate between the relative importance of words on each page of a document. An efficient database search engine will not allow users to search for common words. These words, when entered on the command line, will be ignored. Using this insight, the Metafile can be designed with an extremely efficient data structure for indexing words on each page of a document, with common words given the lowest priority. This will increase tremendously the speed with which database searches can be accomplished. The 100 most common words in English are listed below. The figures give occurrences in 242,432 words.

1. THE . 15,568 26. OR . 1,101 51. WHEN 603 76. ONLY. 309 2. OF 9,767 27. HER 1,093 52. WHAT 570 77. ANY. 302 3. AND. 7,638 28. HAD. 1,062 53. YOUR 533 78. THEN. 298 4. TO . 5,739 29. AT 1,053 54. MORE . 523 79. ABOUT 294 5. A 5,074 30. FROM 1,039 55. WOULD 516 80. THOSE . 288 6. IN 4,312 31. THIS 1,021 56. THEM 498 81. CAN. 285 7. THAT . 3,017 32. MY 963 57. SOME 478 82. MADE 284 8. IS . 2,50933. THEY. 95958. THAN. 44583. WELL. 283 9. I. 2,292 34. ALL 881 59. MAY 441 84. OLD . 282 10. IT. . 2,255 35. THEIR 824 60. UPON. 430 85. MUST 280 11. FOR . 1,869 36. AN . . 789 61. ITS . 425 86. US 279 12. AS 1,853 37. SHE . 775 62. OUT . 387 87. SATD . 276 13. WITH . 1,849 38. HAS 753 63. INTO 387 88. TIME 273 14. WAS 1,761 39. WERE 752 64. OUR. . 386 89. EVEN. 272 15. HIS. 1,732 40. ME. 745 65. THESE . 385 90. NEW 265 16. HE . 1,727 41. BEEN . 720 66. MAN 383 91. COULD 264 17. BE . 1,535 42. HIM 708 67. UP 369 92. VERY. . 259 18. NOT. 1,496 43. ONE. 700 68. DO. 360 93. MUCH . 252 19. BY . 1,392 44. SO . 696 69. LIKE . 354 94. OWN 251 20. BUT 1,379 45. IF . . 684 70. SHALL 351 95. MOST 251 21. HAVE. 1,344 46. WILL 680 71. GREAT. 340 96. MIGHT . 250 22. YOU. 1,336 47. THERE . . 668 72. NOW 331 97. FIRST . 249 23. WHICH. 1,29148. WHO . 66473. SUCH. 32898. AFTER . 247 24. ARE . 1,222 49. NO . 658 74. SHOULD 327 99. YET . 247 25. ON . 1,155 50. WE. 638 75. OTHER. 320 100. TWO . 244

Sugoestions on Metafile Format At the very beginning of Metafile, store exactly 1 and only 1 copy of each word occurring throughout the document. Words should be separated into 4 groups: (1) Non-Common Pattern Words (2) Non-Common Non-Pattern Words (3) Common Pattern Words (4) Common Non-Pattern Words. The words in each group are ordered alphabetically, and according to word-length. Each word has pointers to every x-y location, and line, on each page where it occurs within the document.

Each word may also be expressed in many different fonts depending on its x-y location. The main benefit to this ordering is that the database engine is only required to uncompress and search the very beginning of the Metafile, and not the entire document, to know everything contained in that document. Figure 35 shows an example of an array grouping of common word (non-pattern and pattern).

Procedure to Accelerate the Substitution of Plaintext Characters Throughout the Document After the Recognition Process has been Completed Substituting single-letter plaintext characters is straightforward. The acceleration in the time required to process a document, however, occurs when substituting for plaintext touching characters (composite blobs). If the composite blobs, assumed here to be digraphs (representing 2 touching characters), within the cache are prioritized based on the frequency that they occur in the English language, then blobs located at the beginning of the cache will appear in far more words throughout the document than blobs located at the end of the cache. Some of the digraphs below may not represent touching characters (composite blobs) in the specific font, within the document, presently under consideration. If this is indeed the case, simply ignore irrelevant digraphs below, and prioritize only the touching blobs, based on where they do occur in the list below.

1. TH 1,582 26. RO . 275 51. WI . 188 76. SA. 146 2. IN. 784 27. LI. 273 52. HO. 184 77. NI. 142 3. ER 667 28. RI 271 53. TR 183 78. RT. . 142 4. RE . 625 29. IO. 270 54. BE . 181 79. NA 141 5. AN. 542 30. LE . 263 55. CE 177 80. OL. 141 6. HE . 542 31. ND 263 56. WH 177 81. EV. 131 7. AR . 511 32. MA 260 57. LL 176 82. IE . 129 8. EN 511 33. SE 259 58. FI . 175 83. MI 128 9. TI 510 34. AL . 246 59. NO. 175 84. NG . 128 10. TE 492 35. IC 244 60. TO . 175 85. PL. 128 11. AT . 440 36. FO 239 61. PE 174 86. IV 127 12. ON. 420 37. IL 232 62. AS 172 87. PO 125 13. HA. . 420 38. NE 232 63. WA 171 88. CH 122 14. OU. 361 39. LA . 229 64. UR . 169 89. EI 122 15. IT 356 40. TA 225 65. LO 166 90. AD 120 16. ES 343 41. EL . 216 66. PA 165 91. SS 120 17. ST . 340 42. ME . 216 67. US 165 92. IL . 118 18. OR . 339 43. EC 214 68. MO 164 93. OS 117 19. NT 337 44. IS 211 69. OM 163 94. UL 115 20. HI.. 330 45. DI 210 70. AI . 162 95. EM 114 21. EA . 321 46. SI . 210 71. PR 161 96. NS 113 22. VE 321 47. CA 202 72. WE. 158 97. OT 113 23. CO . 296 48. UN . 201 73. AC 152 98. GE 112 24. DE . 275 49. UT . 189 74. EE 148 99. IR 112 25. RA . 275 50. NC . 188 75. ET 146 100. AV 111 LA-Array Analvsis 1.) Delete all plaintext words.

2.) There must exist at least 2 words containing any specific disguise for it to be considered for analysis.

If there exist only 1 word, use pattern word software to limit choices for disguised letters. If in addition this word < 2-letters, create a Self~Id- Array to check for additional plaintext possibilities. Put results in Array for verification.

If there exist only 1 word, and if it is a 2-letter word, then the disguise could represent a lower case plaintext letter (not in Answer-Array) or upper case plaintext letter (not in Answer-Array). Put results in Answer-Array for verification.

7.) Words not assumed to begin with Capital-letters are probably acronyms. Put in Answer-Array.

8.) If there exist any unidentified disguises, check to see whether any of these disguises also appear in lNP-Array. If so, that assume disguise is "i" or "a" If a unique plaintext answer emerges, put it in Answer-Array. If not, put disguise in Answer-Array: Ex. Q and Qt. Q=i and Q=a are both possibilities, but no unique answer emerges.

9.) When a plaintext identity is uncovered, it is substituted simultaneously into ALL Arrays: nP-, nNP-, Answer-, etc.

Use Bounding. Boxes information on each character of each word in LA-Arrav to eliminate possible plaintext choices. BEFORE anv other analvsis is performed! If, at the conclusion of LA-Array analysis, we must leave a word or character as a raster image, then we should also attach both the Isomorph Pattern" and the At this point, all Font Conflicts must be resolved! Access Digit-Array now. Use it limited the number of comparisons required, with synthesized alphabet, to uncover the identities of digits.

Before synthesizing any fonts, either for verification using traditional techniques or for document formatting, check to see if any "known" plaintext letters (in the "Plaintext" column) also appear in any "lower case" or "upper case" columns anvwhere in the Answer-Array (not necessarily on the "same" Row as that of "plaintext letter in the "Plaintext" column, presently under consideration"). If so, "delete" all copies of this letter appearing in "lower case" or "upper case" columns.

If in any Row where deletion occurs, there now exists only 1 surviving plaintext choice in the "combined" lower case and upper case columns, then this 1 letter is put into "Plaintext" slot, since it has now been "verified" If (total number of disguises (excluding punctuation and symbols)) - (total number of disguises in Collection II and with Bounding Boxes=5) > 26, then this implies that digits (and perhaps symbols) are contacting letters!

Ex: "Bounding Boxes Pattern" (for all lower case letters) information for the phrase "the good guys" = 113 2331 2323", allows us the eliminate possibly unacceptable plaintext characters.

Ex: "the" = 113 : plaintext possibilities = [the thp] Note: Mayhavetouse traditional technique to choose between "e" and "o" Ex: "good" =2331: plaintext possibilities = [good goof gook jaal peed peek peel pooh pook pool psst] Ex: "guys" = 2323 : plaintext possibilities = [gape gays gips goya guys jags jape jays jupe pega pego pegs pegu page pigs pago puyo pugo] We proceed as follows to try and isolate a "unique" answer: a.) Create a "Bounding Boxes Pattern" for this Selected Word (This type of pattern is discussed in more detail in the LA-Array section). Combine this Bounding Boxes Pattern information with the "surviving" Selected Word possibilities (in those Rows containing no "0" entries). If a "unique" answer results, then we are finished analyzing this specific Selected Word. If no "unique" answer emerges, then b.) Check to see if any of the "disguises" in this Selected Word have pre-existing entries in the "lower case possibilities" and "upper case possibilities" columns the in Answer-Array (This information would have been put in those columns when the disguise in question were encountered in an earlier paragraph of text). This "lower case" and "upper case" information, when combined with the information in part a. above, will hopefully either isolate a "unique" answer, or "further" reduce the acceptable plaintext choices for these disguises in question (by eliminating some of the choices in the pre-existing "lower case" and "upper case" possibilities columns).

If ALL possible plaintext answers for Selected Word are eliminated, then something went wrong. Since we don't know which word caused the problem, they all get put into LA-Array.

Previous Patent: MAIL NOTES INTERFACE WITH AN AUTO-HIDEABLE TRAY FUNCTIONING WITH VARIOUS E-MAIL SYSTEMS

Next Patent: LOSS-LESS COMPRESSION AND DECOMPRESSION OF BITMAPS