Title:
METHOD AND APPARATUS FOR CONVERTING TEXT OF A DOCUMENT INTO COMPUTER-INTELLIGIBLE TEXT
Document Type and Number:
WIPO Patent Application WO/2007/016974
Kind Code:
A1
Abstract:
A computer-implemented method of converting a plurality of portions of a document into a computer-intelligible digital form, the document having a first portion of text on one page and a second portion of text on another page of the document, and in which a copy of the document is held in at least one electronic image, the method comprising performing the steps of: i. analysing the portions of the document to ascertain whether text content of the first portion should be connected to content of the second portion; ii. upon determining that the first and second portions should be connected, converting the text content of the first and second portions together so as to form a logical combined computer-intelligible text equivalent to the text content of the first and second portions of the document.

Inventors:
SANCHEZ JOSE ANTONIO (ES)
ABAD PEIRO JOSE (ES)
YACOUB SHERIF (ES)
Application Number:
PCT/EP2005/053699
Publication Date:
February 15, 2007
Filing Date:
July 28, 2005
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
SANCHEZ JOSE ANTONIO (ES)
ABAD PEIRO JOSE (ES)
YACOUB SHERIF (ES)
International Classes:
G06F17/27
Foreign References:
US5848184A1998-12-08
Other References:
KIM J ET AL: "Automated Labeling in Document Images", PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 4307, 2001, pages 111 - 122, XP002355493, ISSN: 0277-786X
SONG MAO ET AL: "Document structure analysis algorithms: a literature survey", PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 5010, January 2003 (2003-01-01), pages 197 - 207, XP002355496, ISSN: 0277-786X
Attorney, Agent or Firm:
LEADBETTER, Benedict (S.L. Legal Dept., Avda Graell, 501 Sant Cugat del Valles, ES)
Claims:

CLAIMS

1. A computer-implemented method of converting a plurality of portions of a document into a computer-intelligible digital form, the document having a first portion of text on one page and a second portion of text on another page of the document, and in which a copy of the document is held in at least one electronic image, the method comprising performing the steps of: i. analysing the portions of the document to ascertain whether text content of the first portion should be connected to content of the second portion; ii. upon determining that the first and second portions should be connected, converting the text content of the first and second portions together so as to form a logical combined computer-intelligible text equivalent to the text content of the first and second portions of the document.

2. A method according to claim 1 in which the portions are converted into computer-intelligible digital form before they are logically connected to another portion.

3. A method according to claim 2 in which the computer-intelligible digital form is modified after a portion of text has been logically connected to another portion of text.

4. A method according to any preceding claim in which the steps are performed until a degree of certainty of the accuracy with which the document has been converted into an intelligible digital form meets predetermined criteria.

5. A method according to any preceding claim in which the method comprises joining electronic computer-intelligible text versions of images of pages to form a further, single combined, electronic computer-intelligible text version of the images of the text content of the pages that have been connected.

6. A method according to any preceding claim which builds a list of candidate portions which may be suitable for connecting to one or more other portions and which then analyses portions in that list in order to determine which portion should be connected to which other portion.

7. A method according to any preceding claim in which parameters of at least one portion are used in the determination of to which other portions that portion should be connected.

8. A method according to claim 7 in which the parameters comprise at least one of any of the following: the position of portions within a document; the colour of the portion (whether text or image); semantics of a portion; font size; font style; point size; other typographic properties of the portion; content of the page (e.g. whether image or text); orientation of content; and relationships of the portions of the documents, perhaps other pages.

9. A method according to any preceding claim in which the portions of a document are obtained by scanning or otherwise capturing an electronic image of a physical document.

10. A method according to any preceding claim in which the content of any one portion is analysed to see if it has been split between that portion and another portion and the result of this analysis is used to determine the certainty with which those portions should be connected.

11. A method according to any preceding claim in which the location of a portion is used to determine the certainty with which that portion should be connected to another portion.

12. A processing circuitry arranged to convert a first portion of text on a first page of a document and a second portion of text on a second page of a document into a computer-intelligible digital form, the processing circuitry comprising an analyser arranged to analyse portions of the document to determine whether text content of the first portion should be connected to text content of the second portion and a converter arranged to convert an image of the content of the first portion together with the connected second portion into a computer-intelligible digital form.

13. A processing circuitry according to claim 12 which further comprises a page joiner arranged to join a computer-intelligible record of an electronic image corresponding to a first page of a document to a computer-intelligible record of an electronic image corresponding to a second page of a document in order to form a joined computer-intelligible record corresponding to a joined electronic image of the first and second pages.

14. A processing circuitry according to claim 13 in which the page analyser and/or the converter are arranged to process the joined computer-intelligible record so as to produce a modified, processed, computer-intelligible record corresponding to the images of the first and second pages.

15. A system arranged to convert a plurality of portions of a document into an intelligible digital form comprising the processing circuitry according to any of claims 12 to 14 and which further comprises a scanning means arranged to generate one or more electronic images of at least one physical document.

16. A computer program product encoded with instructions arranged to convert a plurality of portions of text from different pages of a document into a computer-intelligible digital form, the instructions being arranged to cause the following steps to be performed: i. analysis of the portions of the document to ascertain whether the content of a first portion should be connected to any other portion; and ii. conversion of the content of a portion together with any portion to which that portion has been connected into a computer-intelligible digital form.

17. A machine readable medium containing instructions which when loaded onto a machine cause that machine to perform the method of any of claims 1 to 11.

18. A method of improving the accuracy of a character recognition process performed on a document having a plurality of regions of text disposed on different pages of the document, the method comprising performing a character recognition process on images relating to the plurality of regions to produce machine-intelligible electronic text versions of the plurality of regions of text, and performing a document understanding process on the machine-intelligible electronic text such that machine-intelligible text relating to one region of text and the machine-intelligible electronic text produced by the character recognition process of an image relating to another region of text influence the production of a modified machine-intelligible electronic text that is equivalent to the document text relating to the one region and the other region of the document.

19. A method of manufacturing a computer-searchable digital electric archive equivalent to a paper archive comprising scanning paper pages into computer memory to create digital images of the paper pages and operating on the digital pages in accordance with the method of any of claims 1 to 11 so as to produce a digital archive of computer-intelligible text corresponding to text on the paper pages.

Description:

METHOD AND APPARATUS FOR CONVERTING TEXT OF A DOCUMENT INTO COMPUTER-INTELLIGIBLE TEXT

Field of the invention

This invention relates to converting text of a document into a computer-intelligible digital form of that text, and also to apparatus arranged to perform such conversions.

Background to the invention

The automatic, machine-performed conversion of physical documents, typically paper documents, into a computer-intelligible digital form that is suitable for electronic archival purposes and digital libraries is becoming more of a possibility. However, a number of technical problems exist that make the machine-performed conversion of such physical documents problematic. It is desirable to increase the accuracy with which text of a physical document can be automatically machine-converted into machine-intelligible text in order that the speed at which the process can be performed can be increased, thereby increasing the throughput and the rate at which physical documents can be converted.

In general, the conversion process is a multi-step process which has as a first step the scanning of the physical document, using an image capture device such as a scanner, camera, copier, or the like, to generate an electronic image representing the document. Although the documents will generally be paper, they may be any physical medium such as paper, card, plastic and the like. The electronic image representing the or each physical document is then converted, in a second conversion process, into another electronic version where text of the document is meaningful to machines and to human beings and which may be thought of as a computer-intelligible digital form. In such a second conversion process, a set of analysis and recognition processes are performed on the image. Moreover, it is desirable if the second conversion process is able to accurately reproduce the contents of the or each physical document since this will reduce the amount of human checking that is required, and the amount of human correction of errors that is required. It will be appreciated that, if large volumes of physical documents are to be converted into digital form, it may not be possible for a human to check the digital form of each physical document due to time constraints.

Techniques such as OCR (Optical Character Recognition) and ICR (Intelligent Character Recognition) are well known as second conversion processes that allow electronic images of a physical document to be converted into digital form. However, the accuracy of such systems may only approach a 90% to 95% level, which still leaves a significant number of manual corrections to text to be made, following manual checking of the computer-intelligible converted text. There is therefore a technical drive to increase the accuracy of these processes.

Summary of the invention

According to a first aspect of the invention there is provided a method of converting a plurality of portions of a document into an intelligible digital form, the document comprising a plurality of pages, wherein a portion may occur on any one of the pages making up the document, and in which a copy of the document is held in at least one electronic image, the method comprising performing the following steps: i. analysing the portions of the document to ascertain whether content of a first portion should be connected to content of any other portion; and ii. converting the content of a portion together with any portion to which that portion has been connected into an intelligible digital form.

By "computer-intelligible form" is meant a form of representation of text where the image of text is represented as characters from a character set (e.g. ASCII) which represents alphanumeric characters in coded bits/bytes. This often, or practically always, makes a computer-intelligible form of text computer-searchable to look for specified words or phrases. "Computer-intelligible form" includes representing text as a language (e.g. HTML or XML). "Computer-intelligible form" does not include storing the text as an image (e.g. bitmap or jpeg or TIF).

Embodiments of the invention are useful in obtaining, in computer-intelligible digital form, information that is split across pages in a document.

The conversion of the text regions to digital machine-intelligible text versions of the text may take place before or after the determination is made that the text relating to the two regions should be linked.

According to another aspect the invention comprises a method of improving the accuracy of a character recognition process performed on a document having a plurality of regions of text disposed on different pages of the document, the method comprising performing a character recognition process on images relating to the plurality of regions to produce machine-intelligible electronic text versions of the plurality of regions of text, and performing a document understanding process on the machine-intelligible electronic text such that machine-intelligible text relating to one region of text and the machine-intelligible electronic text produced by the character recognition process of an image relating to another region of text influence the production of a modified machine-intelligible electronic text that is equivalent to the document text relating to the one region and the other region of the document.

Optionally the computer-intelligible (language) version of the text is stored in computer memory and/or telecommunicated to a remote location. The character recognition process effectively operates on a logical larger image relating to a plurality of smaller spatially separated regions of text, as opposed to effectively operating solely separately on each image relating to each separate region of text, with no interaction between them. This enables logical rules that operate in the character recognition process to operate with words in the text in a better contextual framework, especially at the end, start, or border of a region (e.g. a block) of text, which can improve the efficiency and accuracy of the character recognition process.

Of course, "effectively operates" means that in many embodiments a system does not really OCR a larger body of text, but really OCRs two smaller bodies of text and combines the two computer-intelligible texts so produced to form a logical whole computer-intelligible text, with or without modification by document-understanding software. OCR software may be thought of as recognising letters or words (or a probability of letters or words) from images, and document understanding software as choosing between possible options in the light of a lexicon of allowable words and a set of grammar and/or syntax rules.

Suitable computer-intelligible language versions of the text may be held in any of the following formats: Word, WordPerfect, text files, rich text format. The information provided by the language version may be stored in any suitable form such as HTML, XML, etc.

The machine-intelligible language version will usually be searchable by an appropriate search command. The language version of a piece of text will typically be significantly fewer bits of information than the image version of the same text stored as an image (e.g. a bitmap, or gif or tif, or pdf). Storing a code representing alphanumeric characters uses less memory than storing an image, and there is less data in a data record of the text to transmit in telecommunication situations. The character recognition process may comprise OCR or ICR, or the like.
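By way of illustration only, the difference in size can be sketched with simple arithmetic. The figures used are assumptions that do not come from this description: an A4 page scanned at 300 dpi as an uncompressed 1-bit bitmap of 2480 x 3508 pixels, about 3000 characters of text on the page, and one byte per character.

```python
def bitmap_bytes(width_px, height_px, bits_per_pixel=1):
    """Size in bytes of an uncompressed bitmap image."""
    return width_px * height_px * bits_per_pixel // 8


def text_bytes(n_chars, bytes_per_char=1):
    """Size in bytes of text stored as character codes (e.g. ASCII)."""
    return n_chars * bytes_per_char


# Assumed figures, for illustration only.
image_size = bitmap_bytes(2480, 3508)  # 1-bit A4 scan at 300 dpi
text_size = text_bytes(3000)           # roughly one page of text
print(image_size, text_size, image_size // text_size)
```

Under these assumptions the image record is roughly 360 times larger than the character-coded record. Image compression would narrow the gap, so the ratio is indicative only.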

Optionally the method may comprise performing a character recognition process on the, or at least some of the, images relating to discrete separate regions of text disposed on different pages of the document separately, without treating the images relating to the regions as a logical whole, putatively linking a first processed machine-intelligible language version of a first region with one or more other processed machine-intelligible language versions of other regions and evaluating a degree of fit to determine a best way of joining the text regions, and then performing a re-evaluation document understanding process on the combined machine-intelligible text versions from the two pages, joined as previously determined, so as to produce a revised machine-intelligible language version equivalent to the selected two images joined in the selected best way.
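The putative linking and degree-of-fit evaluation described above might be sketched as follows. This is a minimal illustration and not the method of any particular embodiment: the tiny lexicon, the window of words examined around the join, and the names join_regions, degree_of_fit and best_continuation are all hypothetical.

```python
# Hypothetical lexicon of allowable words; a real system would use a full dictionary.
LEXICON = {"the", "document", "was", "scanned", "buy", "now", "advertisement"}


def join_regions(first, second):
    """Join two recognised text regions, undoing an end-of-region hyphenation."""
    first, second = first.rstrip(), second.lstrip()
    if first.endswith("-"):
        return first[:-1] + second
    return first + " " + second


def degree_of_fit(first, second, lexicon=LEXICON, window=2):
    """Share of the words around the join that are allowable words."""
    n_before = len(first.split())
    words = join_regions(first, second).split()
    seam = words[max(0, n_before - window):n_before + window]
    if not seam:
        return 0.0
    known = sum(1 for w in seam if w.strip(".,;:!?").lower() in lexicon)
    return known / len(seam)


def best_continuation(first, candidates):
    """Pick the candidate region whose join with `first` fits best."""
    return max(candidates, key=lambda c: degree_of_fit(first, c))


end_of_page = "The docu-"
candidates = ["Advertisement: buy now", "ment was scanned"]
print(best_continuation(end_of_page, candidates))
```

Here the second candidate is preferred because joining it completes the word "document", whereas joining the first leaves a non-word at the seam. A practical system would presumably combine such a score with grammar and layout rules rather than rely on a lexicon alone.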

The selection of which two images relating to which regions of text should be joined is controlled by a set of image sequence and/or text sequence or alignment rules.

It will be appreciated that the selection and sequence and/or alignment of the images relating to discrete separate text areas to be logically combined for character recognition processing as a whole may take place with or without first character-recognising the text of the regions separately: rules can determine the way in which images relating to separate text regions will be combined prior to character recognition, or after character recognition. After character recognition the rules for linking parts of text may use the machine-intelligible versions of portions of text in order to determine how to link them.

According to another aspect of the invention there is provided a document processing system comprising processing circuitry arranged to process a multi-page document having text portions on different pages of the document, at least one electronic image of that document being held on computer memory, the processing circuitry comprising a converter arranged to convert the content of text portions into a computer intelligible digital form, and a page analyser arranged to analyse images of a plurality of text portions to determine which text portions should be connected together, including determining which text portion from one page should be connected to which text portion on another page to form a logical combined page of those text portions, the converter being arranged to output the intelligible digital form to the page analyser and the page analyser being further arranged to make an input to the converter in order that the accuracy with which the text portions are converted into the intelligible digital form is increased.

According to another aspect of the invention there is provided a computer program product encoded with instructions arranged to convert a plurality of portions of text from different pages of a document into a computer-intelligible digital form, the instructions being arranged to cause the following steps to be performed: i. analysis of the portions of the document to ascertain whether the content of a first portion should be connected to any other portion; and ii. conversion of the content of a portion together with any portion to which that portion has been connected into a computer intelligible digital form.

According to another aspect of the invention there is provided a processing circuitry arranged to convert a first portion of text on a first page of a document and a second portion of text on a second page of a document into a computer-intelligible digital form, the processing circuitry comprising an analyser arranged to analyse portions of the document to determine whether text content of the first portion should be connected to text content of the second portion and a converter arranged to convert an image of the content of the first portion together with the connected second portion into a computer-intelligible digital form.

According to a further aspect of the invention there is provided a machine readable medium containing instructions which when loaded onto a machine cause that machine to perform any of the method aspects of the invention.

According to a further aspect of the invention there is provided a machine readable medium containing instructions which when loaded onto a machine cause that machine to function as the system of the system aspect of the invention or as the processing circuitry according to the processing circuitry aspect of the invention.

The machine readable medium of any of the above aspects of the invention may be any one or more of the following: a floppy disk; a CDROM/RAM; a DVD ROM/RAM (including +R/RW,-R/RW); any form of magneto optical disk; a hard drive; a memory; a transmitted signal (including an internet download, file transfer, or the like); a wire; or any other form of medium.

Brief description of the drawings

There now follows by way of example only a detailed description of embodiments of the current invention of which:

Figure 1 schematically shows a computer programmed to provide an embodiment of the present invention;

Figure 2 shows a flow chart outlining one embodiment of the present invention;

Figure 3 shows one arrangement of portions of a physical document in an electronic image for processing by an embodiment of the present invention;

Figure 4 shows a further arrangement of portions of a physical document in a plurality of electronic images for processing by an embodiment of the present invention;

Figure 5 shows an example of a physical document, which may be processed by embodiments of the present invention, containing articles spanning two pages and having multiple reading flows therethrough; and

Figures 6 to 8 show further examples of a physical document, which may be processed by embodiments of the present invention, containing portions spanning two physical pages.

Detailed description of the drawings

Some embodiments of the invention are used to convert physical documents having human discernible information thereon, such as text, or the like, into an intelligible digital form in which the human discernible information becomes processable by a processing circuitry. The term physical document is intended to cover any document that may be handled by a user and includes mediums such as paper, card, plastics, glass and the like, although the medium may generally be paper.

Other embodiments of the invention may be used to convert electronic documents into a machine-intelligible form. Electronic documents may be held in a format such that they are represented as a bit map, vector representation, or the like, in which the content is not machine-intelligible although they will contain human discernible information. Examples of such formats include any of the following: JPEG, TIF, non-editable PDF, and the like.

Figure 1 shows a computer 100 arranged to accept data and to process that data. The computer 100 comprises a display means 102, in this case an LCD (Liquid Crystal Display) monitor, a keyboard 104, a mouse 106 and processing circuitry 108. It will be appreciated that other display means such as LEP (Light Emitting Polymer), CRT (Cathode Ray Tube) displays, projectors, televisions and the like may be equally possible.

The processing circuitry 108 comprises a processing means 110, a hard drive 112, memory 114 (RAM and/or ROM, for example), an I/O subsystem 116 and a display driver 117 which all communicate with one another, as is known in the art, via a system bus 118. The processing means 110 (often referred to as a processor) typically comprises at least one INTEL PENTIUM series processor (although it is of course possible for other processors to be used) and performs calculations on data. Other processors may include processors such as the AMD ATHLON, POWERPC™, DIGITAL™ ALPHA™, and the like.

The hard drive 112 is used as mass storage for programs and other data and may be used as a memory. Use of the memory 114 is described in greater detail below.

The keyboard 104 and the mouse 106 provide input means to the processing means 110. Other devices such as CD ROMS, DVD ROMS, scanners, etc. could be coupled to the system bus 118 and allow for storage of data, communication with other computers over a network, etc.

The I/O (Input/Output) subsystem 116 is arranged to receive inputs from the keyboard 104, mouse 106, printer 119 and from the processing means 110 and may allow communication from other external and/or internal devices. The display driver 117 allows the processing means 110 to display information on the display means 102.

The processing circuitry 108 further comprises a transmitting/receiving means 120, which is arranged to allow the processing circuitry 108 to communicate with a network. The transmitting/receiving means 120 also communicates with the processing circuitry 108 via the bus 118.

The processing circuitry 108 could have the architecture known as a PC, originally based on the IBM™ specification, but could equally have other architectures. The processing circuitry 108 may be an APPLE™, or may be a RISC system, and may run a variety of operating systems (perhaps HP-UX, LINUX, UNIX, MICROSOFT™ NT, AIX™, or the like). The processing circuitry 108 may also be provided by devices such as Personal Digital Assistants (PDAs), mainframes, telephones, televisions, watches or the like.

The computer 100 also is linked to a printer 119 which may be thought of as a printing means which connects to the I/O subsystem 116. The printer 119 provides a printing means and is arranged to print documents 180 therefrom.

Figure 1 shows a scanner 122, which may be referred to as a scanning means, which is well known in the art, and which, in this embodiment, has been daisy chained through the printer to connect to the I/O subsystem 116. In the Figure, the scanner 122 is shown as being a flat bed scanner in which a physical document placed on the glass 124 is illuminated and the reflected light measured such that an electronic image representing the physical document is generated. Although the scanner is shown as a flat bed scanner it is likely that, as the volumes of physical documents increase, a scanner having bulk medium handling facilities will be used (e.g. not necessarily flat bed).

It will be appreciated that although reference is made to a memory 114 it is possible that the memory could be provided by a variety of devices. For example, the memory may be provided by a cache memory, a RAM memory, a local mass storage device such as the hard disk 112 (i.e. with the hard drive providing a virtual memory), or any of these connected to the processing circuitry 108 over a network connection such as via the transmitting/receiving means 120. However, the processing means 110 can access the memory via the system bus 118, accessing program code to instruct it what steps to perform and also to access the data. The processing means 110 then processes the data as outlined by the program code.

The memory 114 is used to hold instructions that are being executed, such as program code, etc., and contains a program storage portion 150 allocated to program storage. The program storage portion 150 is used to hold program code that can be used to cause the processing means 110 to perform predetermined actions and in embodiments of the present invention in particular provides a document analysis and understanding system 151, a page analyser 154 (which may be referred to as an analyser), a next page analyser 160, a converter 156, and a page joiner 158.

The memory 114 also comprises a data storage portion 152 allocated to holding data and in embodiments of the present invention in particular provides a document store 155 which is used to hold electronic images and also a computer-intelligible digital form of text portions that have been converted into computer-intelligible form from images of text.

Figure 2 shows one possible method for providing an embodiment of the invention. In a first stage 200 of the method physical documents, e.g. sheets of paper (say 10, 50, 100 or 200 sheets), are loaded into a medium handler of a scanner 122 and in stage 202 each of the physical documents is scanned, in a first conversion process, to generate an electronic image of that document which is stored in the document store 155 of the data storage portion 152 of the memory 114. For example one image per sheet of paper may be stored. The electronic image may be in any suitable format such as any of the following non-exhaustive list: GIF (Graphics Interchange Format), TIFF (Tagged Image File Format), JPEG (Joint Photographic Expert Group), PNG (Portable Network Graphics), or the like.

If the embodiment of the invention is being used to process electronic documents, rather than physical ones, then this first conversion process that generates a non-machine intelligible copy of the physical document may be omitted.

In this embodiment the electronic image held in the document store 155 is then processed with some pre-processing steps to enhance its quality and remove defects such as scanning artefacts and the like. An advantage of such pre-processing steps is that they can enhance the robustness of the subsequent conversion from the electronic image to the intelligible digital form. Once the pre-processing has been performed, each digital image is made available to a document analysis and understanding system 151 (which is provided by code held in the program storage portion 150) to convert the electronic image or images held in the document store 155 into an intelligible digital form. One particular suitable document analysis and understanding system is described in the applicant's US patent application number 10/964094 (HP Ref. 200402381) filed on 13 October 2004, entitled "System And Methods For Articles Extraction And Text Reading Order Identification".

In other embodiments the document analysis and understanding system may be provided by a processing circuitry other than the processing circuitry which caused the electronic image to be created. The electronic image may be transferred between processing circuitries by a network connection, via transport of a machine readable medium, or by any suitable means. Of course, images may be loaded to the document store 155 remotely, e.g. from a remote scanner, or from another memory which already has images of the document pages, for example pre-scanned images.

The OCR (Optical Character Recognition) or ICR (Intelligent Character Recognition) of images of text of the pages of the document is performed in the document analysis and understanding system 151. The context in which letters and words are placed is used to help increase the accuracy of the conversion from image format into the computer-intelligible document form. For example, if initial analysis of a letter gave an equal probability of it being a 'c' or an 'e' but the word in which it occurred existed if it were an 'e' but not if it were a 'c', it would generally be determined that the letter was an 'e'. Similar determinations can be used for words. In this embodiment the conversion from the electronic image to the intelligible digital form is performed by the converter 156. OCR/ICR software systems are known which have sets of grammatical rules, dictionaries of allowable words, and appropriate contextual algorithms to convert an image into computer-intelligible text.
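The 'c' or 'e' example above can be sketched in a few lines. This is an illustration under stated assumptions rather than the behaviour of any particular OCR/ICR product: the per-letter probabilities, the small lexicon and the function name resolve_word are hypothetical.

```python
from itertools import product

# Hypothetical dictionary of allowable words.
LEXICON = {"the", "cat", "sat", "on", "mat", "bed"}


def resolve_word(candidates, lexicon=LEXICON):
    """Choose a spelling from per-position letter probabilities.

    `candidates` is a list with one dict per character position, mapping
    each possible letter to its recognition probability. The most probable
    spelling found in the lexicon wins; if no spelling is in the lexicon,
    the most probable raw spelling is returned instead.
    """
    best_word, best_score = None, -1.0
    fallback_word, fallback_score = None, -1.0
    for combo in product(*(pos.items() for pos in candidates)):
        word = "".join(letter for letter, _ in combo)
        score = 1.0
        for _, probability in combo:
            score *= probability
        if score > fallback_score:
            fallback_word, fallback_score = word, score
        if word in lexicon and score > best_score:
            best_word, best_score = word, score
    return best_word if best_word is not None else fallback_word


# The middle letter is equally likely to be 'c' or 'e'.
ambiguous = [{"b": 1.0}, {"c": 0.5, "e": 0.5}, {"d": 1.0}]
print(resolve_word(ambiguous))  # bed
```

Because "bed" is an allowable word and "bcd" is not, the tie between 'c' and 'e' is broken in favour of 'e'. Enumerating every combination is only practical for short words with few alternatives per letter.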

In certain documents, such as any of the following non-exhaustive list: magazines, newspapers, books, and the like, articles may be spread across multiple pages, or across multiple columns within a page. It is possible for portions of an article to be separated from one another by images, or the like, being inserted into columns, pages of other matter, or by other interruptions. In order to increase the accuracy of the conversion from the electronic image to the intelligible digital form it is useful to convert an entire article at once, or a larger part of an article, rather than treating the article in too-small sections so the context of the whole article can be used to aid the conversion. Grammar syntax/logical language processing operates better for longer, less broken up, portions of text. It helps to know what is before and after a word to put it in context. Performing OCR on a few words is prone to error. A particular example is the end of a page of text. The last few words, or even just the last single word, could be the start of a new sentence. Without the context of the start of the next page the OCR/ICR software has little to work with in terms of syntax/grammar/logical context rules and there can be a higher probability of error in converting a short disjoined bit of text than if the same words were converted in context with the text before and after them. The same issue applies to the start of a page that can have just the last word, or few words, of a sentence from the previous page, with similar problems. The different pages that should be read together do not have to be consecutive: sometimes an advert, or insert article, breaks up a larger article. Sometimes an article is carried on a few pages later, rather than on the next page.

Similar issues exist in horizontal cross-page articles which span a centre fold/spine of a magazine or book if this is scanned in as two separate pages (e.g. two A4 pages from an A3 spread). A sentence that begins on the left hand page (which is scanned as one image) and ends on the right hand page (which is scanned as a second image) may be difficult to OCR/ICR due to lack of contextual information.

In some embodiments of the invention further pre-processing is performed in order to connect converted text corresponding to logically connected portions of the physical document such that the decision process deciding how to convert articles that span multiple pages takes into account information from text on other pages, as if the conversion process were performed on linked pages as a whole, or at least with portions of the pages linked. The term "connected" may not mean that the images are actually joined together and may simply mean that links are noted between portions of converted text, or that the conversion of an image to computer-intelligible text is performed on a logical image which includes at least part of a text image from one page and at least part of a text image from another page. Determination of the portions occurs in step 204 of Figure 2.

It will be appreciated that in the art of OCR and ICR it is known to determine the presence of zones within a document. A zone will typically correspond to a block of text, an image or the like within a document. The portions discussed herein may or may not correspond to such zones.

Each portion provides a portion of the original document and as such provides information that is held in the original document. This information may be referred to as content and includes text (including words and/or numbers), and may include images, or any other graphical element.

Determination of whether information spans portions 206 and machine-conversion into an intelligible digital form 208 may be an iterative process. The certainty with which the conversion has been made may increase as iterations occur. For example, it may be difficult to determine whether a portion is connected to a further portion until the content of each portion has been converted into a computer-intelligible digital form. Once it has been determined that the two portions should be linked then the assessment of what the computer-intelligible form should be can be re-performed using the two separately converted electronic computer-intelligible versions of the image, using syntax and/or grammar rules on the text that extends over the join between the two pages. This can cause a re-evaluation of the electronic text, and some words may be changed in the computer-intelligible form to create a revised, combined computer-intelligible document. The images do not need re-processing as such: only the decisions regarding what characters are allocated to the images are re-evaluated in the light of wider contextual information. Another possibility is to perform OCR again on logically connected portions of the images in order to increase the accuracy of the conversion.

In some embodiments, information need not be converted into an intelligible form before it can be determined that it spans more than one column, more than one page, etc. This would be the case if the portion comprised an image in which case it would not be possible to convert the portion into an intelligible form. Further, this would be the case if the portion that was to be converted were a title, heading, or the like, in which cases it may be apparent that it has been split between portions (such as across two pages) without having to convert those portions into an intelligible form.

This may also be the case when page numbers, or other portions that occur in known positions within a document, are considered. For example it will generally be known that a page number will occur on roughly the same place on each page of a document and therefore, it may be safe to assume that a block of information that occurs in roughly the same place on each page is a page number without converting that information into an intelligible digital form.
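As an illustrative sketch only, the position-based inference described above might look like the following Python, which finds blocks that recur in roughly the same place on every page without converting their content. The `(x, y, width, height)` box representation, the function names and the pixel tolerance are assumptions made for the example.

```python
def same_place(box_a, box_b, tolerance):
    """True if two (x, y, width, height) boxes roughly coincide."""
    return all(abs(a - b) <= tolerance for a, b in zip(box_a, box_b))


def find_recurring_blocks(pages, tolerance=10):
    """Return blocks that occur in roughly the same place on every page.

    pages: list of pages, each a list of (x, y, width, height) boxes.
    Each result is a list holding the matching box from each page, in
    page order. No text recognition is performed on the blocks.
    """
    if not pages:
        return []
    recurring = []
    for box in pages[0]:
        matches = [box]
        for page in pages[1:]:
            hit = next((b for b in page if same_place(box, b, tolerance)), None)
            if hit is None:
                break
            matches.append(hit)
        else:
            # The inner loop did not break, so every page had a match.
            recurring.append(matches)
    return recurring


pages = [
    [(50, 40, 400, 600), (280, 760, 20, 12)],
    [(50, 40, 400, 300), (282, 758, 22, 12)],
    [(60, 300, 200, 200), (279, 761, 18, 12)],
]
print(find_recurring_blocks(pages))
```

In this made-up data only the small block near the foot of each page recurs, so it would be treated as a candidate page number.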

The use of position of a portion may be thought of as using the semantics of the portion as opposed to the content of the portion in order to determine whether it should be connected to another portion.

A portion of a document may be provided by a section of the electronic image generated from the physical document as represented by Figure 3 wherein the portions are represented by the sections a to g and the image of the physical document is represented by the rectangle 300. Alternatively, or additionally, a portion may be represented by separate electronic images of physical documents as shown in Figure 4 in which each of the rectangles 400 to 412 represents a separate electronic image of a physical document.

To aid this contextual understanding it may be advantageous for pages to be detected as consecutive. This may entail reshuffling pages or connecting portions which are provided on different pages. Semantic analysis algorithms can be applied to a logical page composed of two or more different physical pages. For example, an article starting on a page may sometimes, particularly in magazines, comprise a large photograph with the title centred across the two pages. It is therefore possible that a page which may, on its own, be thought of as an advertisement due to its high image and low text content becomes thought of as being part of an article when it is combined with a subsequent page. The semantics of the portions contained in the pages aid the determination of how the portions should be connected to one another.

Further, text strings (an example of human discernible information) that do not have any meaning per se can appear as complete when combined with other remaining text strings from another page. For example, a sentence may be split across two pages, two columns, be presented as randomly oriented phrases on a page, etc. Should the sentence be split across two pages it may have a portion touching the rightmost border of a first page and a second portion that touches the leftmost border of a second page. It can therefore be advantageous to locate sentences, or other strings of text such as titles and the like, that have been split across two or more pages or otherwise split (e.g. in different columns). The conversion of the electronic image into an intelligible form is statistically more accurate when the sentence is considered as a whole than when its portions are considered independently.

In order to join portions in this manner various analysers may be used. A page analyser 154 may be used which is arranged to detect article flow across pages of the physical document. In perhaps its simplest form the page analyser 154 creates a new logical page by joining, by using the page joiner 158, the combination of every two consecutive physical pages from an issue. Therefore, a new electronic image may be created that contains all portions contained in the first page plus all portions of the second page shifted a distance equivalent to the page width in their coordinates. In other embodiments the page analyser 154 may combine more than two pages and may perhaps join 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 100 or more pages. The skilled person will appreciate that the number of pages that may be joined may be limited by the processing power available to process the joined image. Joining of the images may not physically combine two or more images and could simply generate a look up table, database or other memory based reference as to what would be contained within the joined electronic image.
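The coordinate shift described above can be sketched as follows. This is a minimal illustration under assumed names (`Portion`, `join_pages`); it builds a memory based reference to the joined page rather than combining any pixel data, and is not a description of the page joiner 158 itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Portion:
    """A rectangular portion of a page (hypothetical representation)."""
    page: int
    x: float
    y: float
    width: float
    height: float
    kind: str = "text"


def join_pages(first, second, page_width):
    """Describe a logical double page as a list of portions.

    Portions of the first page keep their coordinates; portions of the
    second page are shifted right by the page width. The inputs are not
    modified and no image data is combined.
    """
    shifted = [replace(p, x=p.x + page_width) for p in second]
    return list(first) + shifted


left = [Portion(page=1, x=10, y=20, width=100, height=50)]
right = [Portion(page=2, x=10, y=20, width=100, height=50)]
for portion in join_pages(left, right, page_width=595):
    print(portion)
```

Joining more than two pages would follow the same pattern, with each further page shifted by a further multiple of the page width.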

Algorithms may be run which sequence and cluster portions on the joined pages within the electronic image created for the joined pages in order to determine how articles, or indeed other portions, flow across multiple pages. The page analyser may be thought of as a pre-process component that prepares the double page to be analyzed by the conversion to the intelligible form.

In order to facilitate the joining of pages the page analyser 154 may be arranged to detect and utilise page numbers on each page. It is generally the case that if a page number is provided on pages of a document it will appear in the same position on each page of that document. Therefore, the page analyser 154 may be arranged to utilise order information, such as numbers, letters, etc., that appears in substantially the same place on pages as page number information.

In some embodiments, the page analyser 154 may be arranged to discard pages that it determines are entirely advertisements. In other embodiments the page analyser 154 may be arranged to process advertisements. In some embodiments, the page analyser 154 may be arranged to move the order of pages so that pages of advertisements do not split articles that span pages. In other embodiments the order of the pages may not be altered and the page analyser 154 could simply generate a look up table, database or other memory based reference as to what the structure of the document would be if the pages were re-ordered.
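As a hedged illustration of the look up table approach, the sketch below leaves the physical page order untouched and simply records which pages would be adjacent if advertisement pages were skipped. The page labels and the function name `logical_page_pairs` are assumptions for the example.

```python
def logical_page_pairs(page_kinds):
    """Build a look up table of logically consecutive pages.

    page_kinds: list of labels indexed by physical page position, where
    the label "advertisement" marks a page that is entirely advertising.
    Returns (i, j) pairs of physical page indices that would be adjacent
    if the advertisement pages were removed. The input is not re-ordered.
    """
    content = [i for i, kind in enumerate(page_kinds) if kind != "advertisement"]
    return list(zip(content, content[1:]))


kinds = ["content", "advertisement", "content", "content"]
print(logical_page_pairs(kinds))  # [(0, 2), (2, 3)]
```

Here pages 0 and 2 are treated as a logical double page even though an advertisement physically separates them.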

A further analyser, the next page analyser 160, may be arranged to take the output of the page analyser 154 and find articles which span double, or any other number, of pages. The electronic image generated in the page analyser 154 by joining, or otherwise linking, two pages is processed to cluster information to allow searching for candidate portions to be connected as a result of reading flows across pages. It is possible that there may be more than one single flow as can be seen in Figure 5 which shows an electronic image 500 which has been generated by the page analyser 154 comprising a first page 501 and a second page 502. Both of the pages are divided into columns but two articles are present with each article being spread across both pages 501 and 502. The lines 504 and 506 represent the article flow across the pages 501, 502. Thus, these pages contain two flows. It is of course possible that further flows could be provided - for example any of the following non-exhaustive list may be possible: 3, 4, 5, 6, 7, 8, 9, 10 or more.

Each column of the pages 501, 502 in the electronic image 500 may be thought of as a series of portions. For example the article represented by the second flow 506 comprises five portions (508 to 516); four of these portions 508, 510, 512, 516 are sections of human discernible information (e.g. sentences, words and numbers) held in a column and the fifth portion 514 is an image.

The next page analyser 160 may be arranged to determine the content of portions within the or each electronic image which may have been output by the page analyser 154. The next page analyser 160 may be arranged to try and connect portions within the same article flow and may also be arranged to discard any portions which comprise an advertisement.

Further, the next page analyser 160 may be arranged to identify the last and/or first (or delete) portions from each flow of article within the electronic image 500. Once a list of candidate consecutive portions is built, the next page analyser 160 may be arranged to try and purge portions from both of the pages 501, 502 based on reading order analysis of the portions. This may involve any of the following non-exhaustive list of strategies:

Discarding portions placed at the bottom left of any other candidate on the first page 501.

Discarding portions placed at the bottom right of any other candidate on the second page 502.

Discarding portions with a section portion above.

Other strategies will be readily apparent and these are provided by way of example only.
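One possible reading of the three example strategies is sketched below in Python. The geometric interpretation is an assumption: coordinates are taken to have their origin at the top left of a page with y increasing downwards, "bottom left" is taken to mean further left and lower, and the section rule is applied to both pages. The names `Candidate` and `purge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate portion, located by its top-left corner (assumed)."""
    name: str
    x: float
    y: float
    section_above: bool = False


def purge(candidates, page):
    """Apply the example purge strategies to one page's candidates.

    page: "first" or "second". On the first page a candidate is dropped
    if it lies to the bottom left of another candidate; on the second
    page it is dropped if it lies to the bottom right of another one.
    Candidates with a section portion above them are always dropped.
    """
    kept = []
    for c in candidates:
        if c.section_above:
            continue
        others = [o for o in candidates if o is not c]
        if page == "first":
            dominated = any(c.x < o.x and c.y > o.y for o in others)
        else:
            dominated = any(c.x > o.x and c.y > o.y for o in others)
        if not dominated:
            kept.append(c)
    return kept


first_page = [Candidate("a", x=50, y=700), Candidate("b", x=300, y=400)]
second_page = [Candidate("c", x=50, y=100), Candidate("d", x=300, y=500)]
print([c.name for c in purge(first_page, "first")])    # ['b']
print([c.name for c in purge(second_page, "second")])  # ['c']
```

The intuition under this reading is that a portion with further candidates up and to its right on the first page is unlikely to be the last portion of a flow, and correspondingly for the second page.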

In this embodiment the next page analyser 160 assesses the possibility of connecting candidate portions from the first page 501 with candidate portions from the second page 502 based on any of the following non-exhaustive list: page positioning (such as nearest portions being connected first, etc.); connecting insert portions only with insert portions; and connecting portions with a neighbouring column.
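By way of example only, two of the listed criteria (nearest portions connected first, and insert portions connected only with insert portions) might be combined as in the following sketch. The dictionary keys, the use of straight-line distance in joined-image coordinates and the greedy pairing are all assumptions made for illustration.

```python
import math


def connect(first_candidates, second_candidates):
    """Greedily link candidates across two pages, nearest pairs first.

    Each candidate is a dict with keys "name", "x", "y" and "is_insert",
    with coordinates given in the joined electronic image. Insert
    portions are only ever linked to insert portions. Each candidate is
    used in at most one link.
    """
    pairs = []
    for a in first_candidates:
        for b in second_candidates:
            if a["is_insert"] == b["is_insert"]:
                distance = math.hypot(a["x"] - b["x"], a["y"] - b["y"])
                pairs.append((distance, a["name"], b["name"]))
    pairs.sort()

    used_first, used_second, links = set(), set(), []
    for _, name_a, name_b in pairs:
        if name_a not in used_first and name_b not in used_second:
            links.append((name_a, name_b))
            used_first.add(name_a)
            used_second.add(name_b)
    return links


first = [
    {"name": "A", "x": 500, "y": 700, "is_insert": False},
    {"name": "I1", "x": 520, "y": 300, "is_insert": True},
]
second = [
    {"name": "B", "x": 620, "y": 100, "is_insert": False},
    {"name": "I2", "x": 640, "y": 320, "is_insert": True},
]
print(connect(first, second))  # [('I1', 'I2'), ('A', 'B')]
```

A column-based criterion could be added as a further filter on the candidate pairs in the same way as the insert check.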

Again, in this embodiment, the next page analyser 160 is arranged to determine headings within the or each electronic image. For example, headings may have a larger font, different font, or the like and may therefore be discernible from the body of the article. As such the next page analyser 160 may be arranged to isolate headings to aid conversion of the electronic image. As discussed previously the headings may or may not be converted to intelligible information in order that they may be used to infer information about connection of the portions. In some embodiments the semantics of the heading and/or portions of the heading may be utilised.

Again, in this embodiment, the next page analyser 160 is additionally arranged to consider the section of the document in which the flow occurs. Sections may be identified by a heading located within the electronic image, or by use of a table of contents, or the like. For example, it is known that a magazine may be split into sections such as the International section and the People section. It is perhaps unlikely that an article flow will span multiple sections and as such the next page analyser 160 may be arranged not to try and combine portions within different sections. A section of a document may be provided by any of the following non-exhaustive list: a chapter, a sub-document (for example it is known for documents such as papers to be provided as a collection of separate documents; i.e. sub-documents), a section, or the like.

Embodiments may be arranged to determine text and/or images which are split across pages, which may or may not be consecutive. For example, it is known in some publications that the text of a first fragment of an article is separated from the text of a second fragment of the article by a plurality of intermediate pages. In order to make this determination any of the following non-exhaustive list of parameters of a page may be used: the position of portions, the colour of the portion (whether text or image), semantics, font size, font style, point size, content of the page (e.g. whether image or text), orientation of content, or the like. Embodiments of the invention may be arranged to convert information into intelligible form in a portion to aid this determination. For example, if the text within a portion contains the phrase 'continued on page XX' it may be inferred that a second fragment of an article will appear later on in the document and that the portions representing the two fragments should be connected to one another.
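The 'continued on page XX' inference might, purely as an illustration, be implemented with a regular expression over the converted text of a portion. The pattern and the function name `continuation_target` are assumptions; a practical system would probably also tolerate recognition errors and other wordings.

```python
import re

# Matches e.g. "continued on page 47", ignoring case and extra spacing.
CONTINUATION = re.compile(r"continued\s+on\s+page\s+(\d+)", re.IGNORECASE)


def continuation_target(text):
    """Return the page number an article is continued on, or None."""
    match = CONTINUATION.search(text)
    return int(match.group(1)) if match else None


print(continuation_target("...the minister said. Continued on page 47"))  # 47
print(continuation_target("An ordinary closing sentence."))               # None
```

A portion yielding a page number in this way could then be linked to the first candidate portion found on the indicated page.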

In a further example, where a portion containing a single line of text or an image lies close to the right hand border of a first page and a further such portion lies close to the left hand border of a second page, it may be determined that the portions should be combined. Such an example is highlighted in Figure 6 which shows an electronic image of a document 600 having a first page 601 and a second page 602. Each of the first page 601 and the second page 602 contains a portion 604, 606 of an image. The portion 604 is on a rightmost border of the first page 601 and the portion 606 is on a leftmost border of the second page 602 and thus embodiments may infer that the two portions 604, 606 should in fact be joined to one another.
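A minimal sketch of such a border test is given below. The boxes are assumed to be `(x, y, width, height)` tuples in each page's own coordinates, and the margin and the additional check that the two portions are vertically aligned are assumptions of the example, not requirements stated above.

```python
def may_span_spine(left_box, right_box, page_width, margin=15, y_tolerance=10):
    """Guess whether two portions are halves of one item split at the spine.

    left_box lies on the first page and right_box on the second page.
    Returns True when left_box ends near the right border of its page,
    right_box starts near the left border of its page, and the two boxes
    have roughly the same vertical position and height.
    """
    lx, ly, lw, lh = left_box
    rx, ry, rw, rh = right_box
    touches_right_border = page_width - (lx + lw) <= margin
    touches_left_border = rx <= margin
    aligned = abs(ly - ry) <= y_tolerance and abs(lh - rh) <= y_tolerance
    return touches_right_border and touches_left_border and aligned


print(may_span_spine((400, 100, 195, 300), (3, 102, 250, 298), page_width=600))  # True
print(may_span_spine((100, 100, 200, 300), (3, 102, 250, 298), page_width=600))  # False
```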

Further, embodiments of the invention may be arranged to determine that both of the pages 601 and 602 belong to the 'people' section. This may be inferred from the presence of the portion, which provides a heading 608, at the top of the first page. Some embodiments of the invention may be arranged to use information from elsewhere in the document to increase the certainty of the location of portions of the document. For example, a table of contents page may be used to identify the pages on which sections start and/or finish.

Embodiments of the invention may determine that the electronic image of Figure 6 comprises seventeen portions of text - a to q - each of which is highlighted by a rectangular border in the Figure. Further portions may provide the images shown on the pages 601, 602. Embodiments of the invention may try to connect each of these portions, perhaps using rules (such as the order in which the portions appear on the page and the like) to facilitate this connection process. The two pages 601 and 602 of Figure 6 share substantially the same layout, both in terms of font and arrangement. This similar layout may be used to reinforce the certainty that text on the two pages should be connected.

As a further example another document 700 is shown in Figure 7 which again comprises a first page 701 and a second page 702 which have been joined by the page analyser 154 into a single electronic image. The two pages 701, 702 that appear in Figure 7 do not have the same semantics and as such the certainty that the pages should be linked may be decreased.

Embodiments of the invention may be arranged to use the parameters of a page to determine the nature of a page. For example, the parameters of a page may cause embodiments of the invention to determine that certain pages are advertisements.

In the example shown in Figure 7 certain embodiments of the invention may determine that the first page 701 is an advertisement, based on parameters of the page. It can be seen for example that the page primarily comprises images with little text occurring on the page. Based on the parameters of the page embodiments of the invention may determine that the second page 702 is a table of contents page.
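As a rough illustration of how page parameters might lead to such a determination, the sketch below labels a page from the share of its area covered by images and by text. The thresholds, the labels and the `(kind, width, height)` representation are all hypothetical, and a practical classifier would draw on many more of the parameters listed earlier.

```python
def classify_page(portions, page_area, image_threshold=0.6, text_threshold=0.1):
    """Label a page from the share of its area covered by images and text.

    portions: list of (kind, width, height) tuples, kind being "image"
    or "text". A page that is mostly image with very little text is
    labelled "advertisement"; anything else is labelled "editorial".
    The thresholds are illustrative values only.
    """
    image_area = sum(w * h for kind, w, h in portions if kind == "image")
    text_area = sum(w * h for kind, w, h in portions if kind == "text")
    mostly_image = image_area / page_area >= image_threshold
    little_text = text_area / page_area <= text_threshold
    return "advertisement" if mostly_image and little_text else "editorial"


page_area = 600 * 800
print(classify_page([("image", 600, 600), ("text", 200, 50)], page_area))   # advertisement
print(classify_page([("text", 500, 700), ("image", 100, 100)], page_area))  # editorial
```

Such a per-page label is only a first estimate, which may later be revised when the page is considered together with its neighbour.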

In some embodiments each of the images on the pages 701, 702 may be considered individual portions. However, in other embodiments, the images bearing parts of the page may be considered as a portion and the human discernible portions may be considered as other portions (such as the portions 704 to 720 highlighted by the borders on pages 701 and 702).

However, due to the similar arrangements occurring on the two pages 701, 702 embodiments of the invention may infer that the two pages should in fact be linked despite the determination that one is an advertisement page and the other is a table of contents page. Therefore, the determination that the first page is an advertisement may be over-ruled and both pages may be determined to be table of contents pages. As discussed above it is advantageous to correctly determine the table of contents page since information obtained therefrom can be used to ascertain information about other parts of the document.

As a further example another document 800 is shown in Figure 8 which again comprises a first page 801 and a second page 802 which have been joined by the page analyser 154 into a single electronic image. In the example of Figure 8 the first page 801 has what is determined to be a portion providing a title 804 which may be used to increase the certainty that the two pages should be considered together. This may be inferred from parameters of the page and/or from analysis of a table of contents page. The fact that the parameters of the page are similar for both of the pages 801, 802 may be used to increase the certainty that the two pages should be considered together and that text provided in the portions containing human discernible information on the first page 801 should flow into text provided in the portions containing human discernible information on the second page 802.

Both pages contain a portion providing a section of an advertisement 806, each of which provides half of an image. The fact that a similar portion, in this case an image occurs on a rightmost side of the first page 801 and on a leftmost side of the second page 802 can be used to increase the certainty that the image should be combined. Thus, the semantics of the portions are used to increase the certainty that the portions should be connected.

As represented at 210 in Figure 2 it is regularly determined whether enough iterations of the conversion and connecting of portions have been performed. Once it has been ascertained that this is the case then further processing is stopped 212.

Various mechanisms may be used to determine whether enough iterations have been performed in order that the intelligible digital form of the document meets predetermined criteria. For example, if the conversion from the document to the intelligible digital form is being performed iteratively a suitable predetermined criterion may be that no changes were made to the conversion in the last iteration. Alternatively, the predetermined criterion may be that fewer than a predetermined number of changes were made in an iteration, which may be judged on a percentage basis, absolute basis or the like.

In yet a further embodiment, the predetermined criterion may be that the number of changes made between iterations has reached a steady state. This may be advantageous since some conversions may oscillate between two possible conversions and so may change each time the conversion process is run.
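The stopping criteria discussed above might be combined as in the following sketch, offered as an assumption-laden illustration rather than as the test applied at 210 in Figure 2. The function name, the absolute change limit and the three-iteration steady-state window are hypothetical choices.

```python
def should_stop(change_counts, max_changes=0, steady_window=3):
    """Decide whether further conversion iterations are worthwhile.

    change_counts: number of changes made in each iteration so far, the
    most recent last. Stops when the latest iteration made no more than
    max_changes changes, or when the count has been identical for the
    last steady_window iterations (a steady state, as may happen when a
    conversion oscillates between two alternatives).
    """
    if not change_counts:
        return False
    if change_counts[-1] <= max_changes:
        return True
    recent = change_counts[-steady_window:]
    return len(recent) == steady_window and len(set(recent)) == 1


print(should_stop([12, 5, 0]))     # True: no changes in the last iteration
print(should_stop([12, 5, 3]))     # False: still changing
print(should_stop([12, 2, 2, 2]))  # True: steady state reached
```

A percentage-based limit could be obtained by dividing each count by the number of characters or words in the document before comparison.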

Thus, a document processing system may be provided by embodiments of the invention which use several recognition processes, which are generally automatic, and which may be used to extract the document structure and content. These processes include for example any of the following non-exhaustive list: identification of portions; text recognition (whether OCR and/or ICR); structure analysis; logical and semantic analysis; extraction of articles and advertisements; and the like. It will be appreciated that in many embodiments the intention is to create a digital text record of images of text that represents the text images as alphanumeric characters from a character set, that is searchable (e.g. by searching for matches of key words or combinations of key words). The machine-intelligible, language based, digital version of the text may or may not have advertisements/extraneous material omitted from it.

In many embodiments the aim is to detect items that are consecutive, such as text flows, across separated pages of a document to increase the overall accuracy of document analysis systems and tools. Double pages may be cut in half at some stage during the data input process either manually before the scanning or automatically after the scanning by using cropping techniques. This cutting operation usually happens before the Optical Character Recognition (OCR) stages take place. The reason is that actual scanning processes as well as text recognition operations are faster when smaller images are used. In addition, the cutting operation helps increase the accuracy of scanning and OCR. For example, if a double page scan contains two single pages that come from different parts of the document (e.g. out of order pages, as is the case with e-stapled documents) the OCR engine could wrongly link sentences from a left page with sentences from its right page, assigning them both to the same zone.

Embodiments of the invention operating in the above scenario combine the two pages that were cut during scanning or during pre-OCR analysis into a single logical page. We combine two consecutive pages (in the semantic space) excluding those in the middle that are not of value to the document structure and semantic analysis (e.g. advertisements and inserts). For that, pages need to be detected as consecutive, as they could be out of order or contain intermediary appendixes and inserts. Many existing one-page semantic analysis algorithms can then be applied to a logical page composed of two different pages (or parts of two different pages); e.g., starting article pages can sometimes consist of a large photograph with the title centred across the two pages. Many pages that would be detected as advertisements, e.g. pages containing large photographs without significant text, make sense when placed in combination with their consecutive page. Algorithms related to Table-of-Contents detection can corroborate some of these findings. Text strings that do not have any meaning per se can appear as complete when combined with other remaining text strings from another page. For article flow extraction this stage in the flow detection is of importance, since there may be more than one text flow that crosses from one page to the next page.

One aim of at least some embodiments of the invention is to automatically and accurately obtain information that is split across pages in a document. An example is the extraction of articles that extend across multiple pages.

A large volume of content is to be scanned e.g. a back catalogue, say 20 or 50 years' worth, of a magazine. Scanning usually happens in batch-scanning facilities where high-speed scanners are located and used. The output from the scanning process may be produced as a single raster image for every page (using a standard imaging format such as TIFF or JPEG). The page images may be processed with some preprocessing image analysis techniques to enhance their quality and remove some of the scanning artefacts. The scanned page images are then analysed using a fully automated document analysis and understanding system in accordance with the invention. Because there are fewer errors in the automatic conversion of text images to machine intelligible language text, the step of checking the converted text is easier and faster, and fewer corrections (manual corrections) need to be made.

An important part in a magazine automatic recognition process is the cross page article extraction and cross page advertisement detection. We have developed a method to manage that detection. That method, as implemented in one specific embodiment of our invention, has three main components:

From the previous text it will be appreciated that the detection of text on different pages that is to be linked has three main components (which are usually used together, but could be provided separately, or only two of them may be used).

A.) Double page analyzer

This component's role is to support an article flow analysis across pages that are side by side in the document. The component, in one embodiment, creates a new logical page from the combination of every two consecutive pages from an issue (page numbering is also detected), discarding advertisement pages. For each page file there will be a new page file that contains all zones contained in the first page, plus all zones of the second page (e.g. shifted a distance equivalent to the page width in their coordinates). The idea is to run algorithms for sequencing and clustering of zones in these special pages to see how articles flow across multiple pages. The double pager component is a preprocess component that prepares the double page to be analyzed. A rule of discarding advertisement pages that are in the middle of text may be used. However, for other applications it may not be necessary. For example, it may be desirable to link advertisements that span multiple pages.

B.) Next page analyzer

This component has a role of finding cross-page article flows from the output of the double page analyzer, i.e., the new logical double page containing two pages combined in one. The next page analyzer uses cluster information from previous algorithms to search for candidate zones to be connected in reading flows across pages. There could be more than one single flow as we can see in Figure 5.

The algorithm assumes that within the same page the different flows have been identified. Hence each page has a set of flow clusters. The algorithm to find the candidate zones cross-page discards any advertisement cluster, then identifies the last zones from the first page flow clusters, and the first zones from the second page flow clusters. Once a complete list of candidate consecutive zones is built, the algorithm tries to purge zones from both pages based on reading order analysis:

Discarding zones placed at the bottom left of any other candidate on the first page;

Discarding zones placed at the bottom right of any other candidate on the second page;

Discarding zones with a section zone above.

Within the result list of candidate zones the component analyses the possibility of connecting candidate zones from the first page with candidate zones in the second page based on page positioning (nearest zones first connected), connecting insert zones only with insert zones and connecting zones with the same column width, as can be seen in Figure 5.

Additionally, the component considers section information to discard or discount connections across pages; for example, if zones in the first page belong to the International section and second page begins with People section, none of the algorithms will be applied.

C.) Page Joiner

This component finds continuous pages that may be considered as a single page for article extraction purposes. For instance, if one of them is an advertisement and the other one is magazine text, the system will choose to allocate the scanned material as magazine text or advertisement based on the confidence level of detections.

The set of algorithms to perform this will detect images or zone texts (e.g., titles) that are split over the two pages. The algorithms use zone position, colour, size and semantics. For example, single line text zones or images close to the right border on the first page and close to the left border on the second page will make the component decide that the two pages are part of a single message. Some layouts used in the magazine industry are clearly double pages, i.e., a single logical page spread over two physical ones.

The proposed system is used during the conversion of paper based legacy documents such as magazines and books into electronic searchable digital repositories. Due to the high volume of the material, it is essential that the process be as automated as possible. Perhaps of the order of 100,000, 500,000 pages, or more, may be scanned in.

One option is to perform character recognition processing on a logical whole image that does not contain image relating to the whole text of two discrete separate text blocks, but which does contain image relating to the text of parts of the text blocks which are recombined to be logically adjacent each other. For example, if a first block of text had 20 lines of words, and a second block of text, say on the next page, had 30 lines, it may be desirable to perform a character recognition operation on a logical image comprising the last line, or few lines, of the first page and the first line, or few lines, of the second page. It is at the interface between blocks of text that errors in character recognition may be removed. The influence of what is at the top of the second page on how to interpret an image relating to the text at the top of the first page may be negligible. Thus instead of performing a character recognition process on a 20 + 30 = 50 line block of text we could character recognise 20 lines, character recognise 30 lines (each page separately), and character recognise as a single image the last 5 lines of one page and the first 5 lines of the next page (10 lines), with the character recognition decisions of the cross-over logical image taking precedence over decisions made by the character recognition process, for those lines, in connection with each page separately. This may avoid having to character recognise too large images as a single logical image (which might be slow computationally).
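The 20 + 30 line example can be sketched as follows. The `recognise` callable stands in for whatever character recognition process is used and is assumed, for the purpose of this illustration, to return exactly one text string per line image; the function name and the default overlap of five lines are likewise assumptions.

```python
def recognise_with_crossover(page1_lines, page2_lines, recognise, overlap=5):
    """Recognise two pages separately, then re-recognise the cross-over.

    page1_lines, page2_lines: lists of line images, in reading order.
    recognise: callable taking a list of line images and returning a
    list with one text string per line (an assumed interface).
    The last `overlap` lines of the first page and the first `overlap`
    lines of the second page are recognised again as one logical image,
    and that result takes precedence for those lines.
    """
    text1 = recognise(page1_lines)
    text2 = recognise(page2_lines)
    n1 = min(overlap, len(page1_lines))
    n2 = min(overlap, len(page2_lines))
    crossover_image = page1_lines[len(page1_lines) - n1:] + page2_lines[:n2]
    crossover_text = recognise(crossover_image)
    return text1[:len(text1) - n1] + crossover_text + text2[n2:]


# Stand-in recogniser: upper-cases only when given the 10-line cross-over,
# so that the lines taken from the cross-over pass are visible in the output.
def fake_recognise(lines):
    return [s.upper() for s in lines] if len(lines) == 10 else list(lines)


page1 = [f"a{i}" for i in range(20)]
page2 = [f"b{i}" for i in range(30)]
combined = recognise_with_crossover(page1, page2, fake_recognise)
print(len(combined), combined[14:26])
```

With real line images the three recognition passes cover 20, 30 and 10 lines respectively, so no single pass has to handle the full 50 line block.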

Using similar concepts, instead of re-performing OCR on a combined image we can take the machine-intelligible text of the first page, and the machine-intelligible text of the second page and perform a document understanding processing operation on the electronic texts, or parts of them, so as to enable logically adjacent parts of the two separate electronic texts to influence each other and produce a revised, combined, electronic text that is equivalent to the combined images of the two pages.

The technique of ICR/OCR (or other processing technique such as a syntax/grammar based document understanding process) performed on a logical image, or electronic text equivalent to a logical image, made up of at least areas of image relating to logically adjacent or consecutive areas of text allows for more accurate automatic machine conversion of printed documents to electronic text that is searchable. Having fewer errors to correct when the electronic document/text is reviewed in a manual checking stage (if it is even reviewed at all in a manual checking stage) speeds up the manufacturing process of making the electronic document/text.

Many embodiments have a processor which uses images stored (possibly temporarily) in computer memory to produce a machine-intelligible language version of equivalent text and/or pictures and which stores the language version in a computer memory, the language version taking up less memory space than the equivalent image version.