

Title:
DOCUMENT IMAGE PROCESSING
Document Type and Number:
WIPO Patent Application WO/2015/044187
Kind Code:
A1
Abstract:
A system for processing an image of a document comprises means for providing an image frame of the document and means for pre-processing the image frame to identify regions of interest to produce image samples. The effect of reflective highlights within image samples is then reduced by applying a non-linear process that reduces the difference in luminance values between pixels representing highlights and pixels that do not represent highlights. The system then has means for classifying the image samples using machine learning techniques. The non-linear process may be a threshold algorithm or other non-linear process such as a process that has greater effect on high pixel values than low pixel values.

Inventors:
DE VEGA RUIZ JAVIER (GB)
HEGARTY DANIEL (GB)
Application Number:
PCT/EP2014/070344
Publication Date:
April 02, 2015
Filing Date:
September 24, 2014
Assignee:
WONGA TECHNOLOGY LTD (IE)
International Classes:
G06V30/10
Foreign References:
EP2383970 A1, 2011-11-02
SG191434 A1, 2013-07-31
Other References:
FENG BO-YUAN ET AL: "Automatic recognition of serial numbers in bank notes", PATTERN RECOGNITION, vol. 47, no. 8, 28 February 2014 (2014-02-28), pages 2621 - 2634, XP028648917, ISSN: 0031-3203, DOI: 10.1016/J.PATCOG.2014.02.011
JUNG K ET AL: "Text information extraction in images and video: a survey", PATTERN RECOGNITION, ELSEVIER, GB, vol. 37, no. 5, 1 May 2004 (2004-05-01), pages 977 - 997, XP004496837, ISSN: 0031-3203, DOI: 10.1016/J.PATCOG.2003.10.012
SHAN DU ET AL: "Automatic License Plate Recognition (ALPR): A State-of-the-Art Review", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 23, no. 2, 1 February 2013 (2013-02-01), pages 311 - 325, XP011492786, ISSN: 1051-8215, DOI: 10.1109/TCSVT.2012.2203741
"Handbook of character recognition and document image analysis, Chapter 1: Image Processing Methods for Document Image Analysis; Chapter 2: Pattern classification techniques based on function approximation; Chapter 11: Machine printed Chinese character recognition; Chapter 23: Algorithms for Automati", 1 January 1997, HANDBOOK OF CHARACTER RECOGNITION AND DOCUMENT IMAGE ANALYSIS, WORLD SCIENTIFIC, SINGAPORE [U.A.], PAGE(S) 17 - 24,49, ISBN: 978-981-02-2270-3, XP002717249
"The Image Processing Handbook", 4 July 2011, CRC PRESS, ISBN: 978-1-43-984045-0, article JOHN RUSS: "Image Enhancement in the Spatial Domain", pages: 269 - 336, XP055155775, DOI: 10.1201/b10720-6
Attorney, Agent or Firm:
REDDIE & GROSE LLP (London WC1X 8PL, GB)
Claims:
CLAIMS

1. A system for processing an image of a document, comprising:

means for providing an image frame of the document;

means for pre-processing the image frame to identify regions of interest to produce image samples;

means for processing the image samples to reduce the effect of reflective highlights by applying a non-linear process that reduces the difference in luminance values between pixels representing highlights and pixels that do not represent highlights; and

means for classifying the image samples using machine learning techniques.

2. A system according to claim 1, wherein the means for processing the image samples is arranged to operate a threshold algorithm to reduce values of pixels above a threshold to a lower value.

3. A system according to claim 2, wherein the lower value is a variable defined for the system.

4. A system according to claim 2, wherein the lower value is derived in relation to the average luminance of pixels in the image frame.

5. A system according to any preceding claim, wherein the means for processing the image samples is arranged to operate a non-linear process on all pixels in the image samples that has a greater effect on high pixel values than low pixel values.

6. A system according to claim 5, wherein the non-linear process comprises a root mean square process.

7. A system according to any preceding claim, wherein the means for providing an image of a document comprises a client device having a camera, and the means for pre-processing, processing and classifying comprises a server remote from the client device.

8. A system according to any preceding claim, wherein the means for processing is arranged to filter with each of multiple filters of a filter bank to produce filtered image samples, wherein the filter bank comprises filters that emphasise structures in each of multiple orientations.

9. A system according to claim 8, wherein the means for processing is arranged to convert the filtered image samples for an image frame to a vector and classify the vector using machine learning.

10. A system according to claim 8, wherein the filters comprise filter pairs, each filter in each pair being offset relative to the other filter in the pair.

11. A system according to claim 8, wherein each filter is elongate.

12. A system according to claim 8, wherein each filter is wider at a central portion and narrower at each end.

13. A system according to any of claims 8 to 12, wherein each filter comprises alternating high and low portions.

14. A system according to claim 13, wherein each filter comprises two low portions and a central high portion.

15. A method of processing an image of a document, comprising:

receiving an image frame;

pre-processing the image frame to identify regions of interest to produce image samples;

processing the image samples to reduce the effect of reflective highlights by applying a non-linear process that reduces the difference in luminance values between pixels representing highlights and pixels that do not represent highlights; and

classifying the image samples using machine learning techniques.

16. A method according to claim 15, wherein processing the image samples comprises operating a threshold algorithm to reduce values of pixels above a threshold to a lower value.

17. A method according to claim 16, wherein the lower value is a variable defined for the system.

18. A method according to claim 16, wherein the lower value is derived in relation to the average luminance of pixels in the image frame.

19. A method according to any of claims 15 to 18, wherein processing the image samples comprises operating a non-linear process on all pixels in the image samples that has a greater effect on high pixel values than low pixel values.

20. A method according to claim 19, wherein the non-linear process comprises a root mean square process.

21. A method according to any preceding claim, wherein the step of receiving an image frame comprises receiving the image frame from a remote client device having a camera viewing the document.

22. A method according to any of claims 15 to 21, further comprising processing each image sample by filtering with each of multiple filters of a filter bank to produce filtered image samples, wherein the filter bank comprises filters that emphasise structures in each of multiple orientations.

23. A method according to claim 22, further comprising converting the filtered image samples for an image frame to a vector and classifying the vector using machine learning.

24. A method according to claim 23, wherein the filters comprise filter pairs, each filter in each pair being offset relative to the other filter in the pair.

25. A method according to claim 23, wherein each filter is elongate.

26. A method according to claim 24, wherein each filter is wider at a central portion and narrower at each end.

27. A method according to any of claims 15 to 22, wherein each filter comprises alternating high and low portions.

28. A method according to claim 27, wherein each filter comprises two low portions and a central high portion.

29. A method according to any of claims 15 to 22, wherein receiving an image of a document comprises receiving the image from a client device having a camera.

30. A computer program comprising code which when executed on a processor undertakes the method of any of claims 15 to 29.

Description:
DOCUMENT IMAGE PROCESSING

This invention relates to methods and systems for efficient processing of captured images, such as images of cards received from mobile devices. In particular, embodiments of the invention relate to optical character recognition of data on a card.

Optical character recognition techniques are known for the automated reading of characters. For example, scanners for the automated reading of text on A4 pages and for scanning text on business cards and the like are known. However, such devices and techniques typically operate in controlled lighting conditions and capture plain, non-reflective surfaces.

SUMMARY OF THE INVENTION

We have appreciated the need for improved methods, systems and devices for processing images of cards and other regular-shaped items bearing alphanumeric data.

The invention is defined in the claims to which reference is now directed.

In broad terms, the invention resides in the processing of images of documents such as cards to enhance subsequent processing such as to produce OCR data.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described in more detail by way of example with reference to the drawings, in which:

Figure 1: is a functional diagram of the key components of a system embodying the invention;

Figure 2: shows the framing of a card image;

Figure 3: is a flow diagram showing the main process for processing a card image;

Figure 4: shows an image of a portion of a card before and after processing;

Figure 5: shows the processing steps implemented in an embodiment;

Figure 6: shows the processing stages of an embodiment in greater detail;

Figure 7: shows a first approach to glint reduction;

Figure 8: shows a second approach to glint reduction;

Figure 9: shows a preferred set of filters for use with feature extraction; and

Figure 10: shows use of the filters for feature extraction.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention may be embodied in methods of operating servers or client devices, methods of using a system involving a client device, client devices, modules within client devices and computer instructions for controlling operation of client devices. Client devices include personal computers, smart phones, tablet devices, wearable computing devices and other devices useable to access remote services.

The embodiment of the invention will be described as providing image capture at a client device and transmission to a server for processing, because the main implementation is efficient image processing for character recognition using image capture at a client device. However, it is noted that the capture and processing may equally both be done at a client device or at a server. In the embodiment, a client device is arranged to capture an image of a document such as a card. Such a card may be a credit card, debit card, store card, driving licence, ID card or any of a number of credit-card-sized items on which text and other details are printed. For ease of description, such cards will simply be referred to hereafter as "cards", and include printed and embossed cards, with or without a background image. Other objects with which the embodying device and methods may be used include cheques, printed forms and other such documents. In general, the embodying device and processes are arranged for capture of images of rectangular documents, in particular cards, which are one type of document.

The embodiment is particularly beneficial for processing images containing embossed characters on cards in situations in which the characters produce glint. A glint is a reflection of light that produces a reflective highlight seen as a particularly bright spot in the image. Such a glint can be fleeting due to the varying angle of a card in relation to a light source, or can persist if the card is held at a constant angle in relation to the light source. The level of reflection may cause clipping of pixels in the image, or may be within the tolerance of the imaging device, but nonetheless be very bright in comparison to surrounding pixels. We will refer generally to the bright spots caused by such reflection as reflective highlights.

The embodiment operates processes on such images of cards to assist optical character recognition (OCR), which may be undertaken either at the client device or at a remote system or server. For speed of processing, the preferred embodiment involves transmitting the image to the server and undertaking the image processing steps at the server. The source of images for processing at the server may be client devices such as smart phones or the like, and so details of an overall system involving smart phones will be described first.

A system embodying the invention is shown in Figure 1. The system shown in Figure 1 comprises a mobile client device 2, such as a smart phone, tablet device or the like, and a server system 20 to which the client device connects via any known wired or wireless network such as the Internet. The client device 2 comprises a processor, memory, power source, screen and input devices such as a keyboard or touch screen. Such hardware items are known and will not be described further. The device is arranged to have a number of separate functional modules, each of which may be operable under the command of executable code. As such, the functional components may be considered as either hardware modules or as software components.

A video capture module 10 is arranged to produce a video stream of images comprising a sequence of frames. The video capture module 10 will therefore include imaging optics, sensors, executable code and memory for producing a video stream. The video capture module provides the sequence of frames to a card detection module 12 and a focus detection module 14. The card detection module 12 provides the functionality for determining the edges of a card and then determining if the card is properly positioned. This module provides an edge detection algorithm and a Hough transform based card detection algorithm. The latter consumes the edge images, which are generated by the former, and determines whether the card is properly positioned in each frame of the video stream. The focus detection module 14 is arranged to determine which frames of a sequence of frames are in focus. This module features an adaptive threshold algorithm, which has been developed to determine the focus status of the card in each frame of the video stream. The adaptive threshold algorithm consumes focus values calculated by one of a number of focus metrics discussed later. A card framing module 16 is arranged to produce a final properly framed image of a card. This module combines a card detection process and card framing algorithm and produces a properly framed card image from a high-resolution still image, as shown in Figure 2. An image upload module 18 is arranged to upload the card image to the server 20.
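By way of illustration, a minimal sketch of the edge detection and Hough transform based card detection described above might look as follows, assuming OpenCV. All numeric parameters here are illustrative assumptions rather than values from the embodiment.

```python
# A sketch of the card detection step: Canny edge detection feeding a
# probabilistic Hough line transform that collects long, straight segments
# which may correspond to the four edges of a card.
import numpy as np
import cv2

def detect_card_lines(frame_bgr):
    """Return candidate straight segments that may form the card's border."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)            # edge image for the detector
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=100, maxLineGap=10)
    return [] if lines is None else [tuple(l[0]) for l in lines]
```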

The image processing of an embodiment undertaken at the server is shown in Figure 3. The diagram is in the manner of a functional diagram and, as such, each component can be considered as a functional module or as a step in a process. The overall process receives an image transmitted from the smartphone over the Internet (via 3G / 4G / WiFi etc.), and processes this into structured text data along with associated confidence information and makes the results available to the client device for further operations. An image storage module 30 provides a file repository which stores the images (still or video) as they are received from the front-end client device. As the data may be sensitive, the storage is preferably a secure store. In addition, as low latency will be crucial for a good user experience, the storage also needs to provide very fast access to multiple backend processes.

A pre-processing step 32 takes the image and undertakes various operations prior to an OCR step. This pre-processing step uses specific information about the document being processed. One of the problems addressed by this step is that lighting conditions could mislead the OCR engine, and this problem can be reduced using this pre-processing stage. In addition, ID cards often contain intricate background patterns that are hard to remove and would fool known OCR engines.

The pre-processing filter chain may include denoising, custom frequency transform and brightness equalisation. An example output of these steps is shown in Figure 4 showing how the image is improved by the pre-processing.

An OCR engine 34 may be one of a number of third-party software packages, such as the ABBYY FlexiCapture Engine. However, we have noted that the performance of known OCR engines is not sufficient for OCR processing of images from cards in uncontrolled lighting arrangements. The operation of the OCR engine will be described further later. The document returned from the OCR stage is still very raw; it just contains a labelled set of characters with associated confidences. It is the purpose of the extraction stage 36 to convert these characters into usable data. These information sets differ to a great extent among different ID cards and also for bank cards.

The processes at the server side of the embodiment will now be described in greater detail. The architecture of the system is built around three main stages, each dealing with different concerns within the context of reading the image. Their proper collaboration is orchestrated by a fourth component that handles some of the more cross-cutting activities such as simple transformations of data between stages, error handling, logging, etc. This architecture can be seen in Figure 5, with each component now described in turn.

The purpose of the segmentation stage 40 is to locate the areas of the image where the characters are located. This stage is fed the particular area in the image that is known to contain the elements that need to be read. Processing just that area of interest in the image improves the speed of the system while reducing the likelihood of false positives. The output of the stage is a set of segments. The main piece of information per segment is the rectangle that contains an embossed character. The recognition stage 42 operates processes to recognise characters. The assembly stage 44 takes the recognised characters and assembles them into a document for use by other systems. The card reader 46 is the component that coordinates these functions. The segmentation stage 40 and recognition stage 42 will now be described in greater detail with reference to Figure 6.

The segmentation stage 40 has an image pre-processor that is a composite filter. It takes a raw image and simply returns an image that is better suited for recognition. Typical preprocessing of this sort includes spatial resampling, cropping of the regions of interest, intensity normalisation, denoising, histogram equalisation, grayscaling and such. It is relatively independent of the other steps within recognition, and it is included in some form or other in most image processing pipelines.

Preprocessing of the image is broken down into global (image) pre-processing 41 and local pre-processing 45 (segment preprocessing taking place after cropping 43) steps because they deal with different contexts of the image information. Image preprocessing 41 changes properties of the image as a whole, such as dynamic range. Also, because it affects every pixel in the image, it is provided as a fast process.

The global pre-processing of the embodiment performs: Grayscaling, Fast glint reduction, Normalisation and Precision conversion (from 8-bit integer to 32-bit floating point values).
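The four global steps might be sketched as follows, assuming NumPy and OpenCV. The percentile used for fast glint reduction is an assumption (the thresholding approach is described in more detail below), and the exact ordering of normalisation and precision conversion is illustrative.

```python
# A minimal sketch of the global pre-processing steps listed above.
import cv2
import numpy as np

def global_preprocess(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)    # grayscaling
    gray = np.minimum(gray, np.percentile(gray, 90))      # fast glint reduction
    img = gray.astype(np.float32)                         # 8-bit int -> 32-bit float
    span = float(img.max() - img.min())
    return (img - img.min()) / span if span > 0 else img  # normalisation
```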

Glint reduction may be described as a reduction of the impact of bright spots due to reflections on subsequent processing, such as character recognition or similar techniques. Embossed characters typically comprise rather shiny raised surfaces which, when illuminated by a non-diffused light source, reflect at specific locations on a character, providing a strong reflective highlight or glint. When subject to techniques such as the filter processing described later, such highlights or glints can cause problems for the recognition engine because they can appear at random positions on a character depending upon the directionality of the illumination.

Glint reduction may be performed by a thresholding process in which pixel values within an image are analysed and those pixels having values above a threshold are reduced to the value of that threshold. The threshold itself may be a fixed level or a variable set within the system. The variable threshold may be determined by an analysis of the image so that it is set, for example, to a level near an average of the pixel value for an image or as a multiple of that average. Such glint reduction helps to remove some of the randomness from images.

Figure 7 shows this thresholding process and shows an image of a sample character along with the corresponding luminance histogram for that image. As can be seen, the characters have some glint or highlights in the "before" images, which are removed in the "after" images. This removal is by setting the values of pixels that are above a given threshold to the value of that threshold, as shown by the corresponding histogram. This first filter analyses the pixel intensities within an image patch and calculates its Xth percentile. This X value should be high enough to separate pixels which are full glint from pixels which are not. We have found values around 90 to work well. Once this value is found, all pixels above that value are set to it, i.e. dst(x, y) = Min(value, src(x, y)). Its effect is easy to see in the before-after samples along with the corresponding histograms: pixels with glint adopt the maximum value of pixels without glint. The reduction of the effect of highlights may also be achieved by other techniques that reduce the difference between the luminance values of non-specular reflective parts of a character and the reflective highlights. This can be achieved by a variety of non-linear processes (often referred to as curves in image processing). Such non-linear changes include reducing the luminance values of pixels above a threshold, increasing the values of pixels in a mid-range of luminance values, or boosting shadows by increasing the values of pixels below a threshold. The preferred option is a non-linear process affecting all pixels by a curve.
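As a sketch, this first filter reduces to a few lines of NumPy; X = 90 follows the value reported above, though the value used in any deployment may differ.

```python
# Percentile-based glint clipping: compute the patch's Xth-percentile
# intensity and clip everything above it, i.e.
# dst(x, y) = min(value, src(x, y)).
import numpy as np

def clip_glint(patch, percentile=90):
    value = np.percentile(patch, percentile)  # separates full-glint pixels
    return np.minimum(patch, value)           # glint pixels adopt that value
```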

Figure 8 shows the effect of non-linear processing on all pixels. This second filter applies a non-linear transformation to all the pixels in an image patch. The transformation starts by moving intensity values to the [0, 1] range by linearly squashing them. After that, values go through a square root. Finally, values are restored to their initial range. By applying a square root in the [0, 1] range, values on the lower end (dark tones) are pushed towards the high end (bright tones). This mitigates the relative relevance of pixels with glint by pushing other pixels towards their end while maintaining the structures present and the ratio between minimum and maximum. It attempts to compensate for the exponential nature of the specular highlight. Pixel intensity values are more evenly distributed, without going as far as image equalisation.
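A sketch of this second filter, following the three steps just described (no free parameters beyond the square-root curve itself):

```python
# Square-root curve: linearly squash intensities into [0, 1], apply a
# square root to lift darker tones relative to glint, then restore the
# original range.
import numpy as np

def sqrt_curve(patch):
    patch = patch.astype(np.float32)
    lo, hi = float(patch.min()), float(patch.max())
    if hi <= lo:
        return patch                            # flat patch: nothing to do
    unit = (patch - lo) / (hi - lo)             # squash to [0, 1]
    return np.sqrt(unit) * (hi - lo) + lo       # curve, then restore range
```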

The task of the cropper 43 is very simple: for every segment candidate coming out of the segmentation algorithm, the containing rectangle is cropped from the preprocessed image and stored in a segment copy. Because the size of the rectangles coming out of segmentation may not be ideal for recognition, the size of the rectangle actually copied is adapted. More precisely, the source rectangle is grown to fit the preferences of the classifier, while the centre of the source rectangle is preserved. Within the recognition stage, steps taking place after the cropper can be divided into independent tasks per candidate, and thus easily processed in parallel.
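A minimal sketch of the cropper's rectangle adaptation follows. The classifier's preferred size is an assumption, and clamping to the image bounds is omitted for brevity.

```python
# Grow the source rectangle to the classifier's preferred size while
# preserving its centre, as described above.
def grow_rect(x, y, w, h, target_w, target_h):
    cx, cy = x + w / 2.0, y + h / 2.0           # centre of the source rectangle
    return (int(round(cx - target_w / 2.0)),    # new top-left corner
            int(round(cy - target_h / 2.0)),
            target_w, target_h)
```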

The segment preprocessor 45 aims to improve the performance of the feature extractor and is specifically tailored to it. As it affects individual segments within the image, it can afford to be of higher complexity than its global counterpart. At present, preprocessing includes just one filter, which performs a heavier glint reduction algorithm. The recognition stage 42 has a feature extractor 47 that processes the raw data (in this case grayscale pixel intensity values) to generate a feature vector. This vector usually has a different dimension than the original data while holding the maximum amount of useful information given by the data. The feature extractor used in this step implements a wavelet filterbank. The individual filters in the bank correspond to Gabor wavelets where the orientation and phase offset parameters change. An example set of filters is shown in Figure 9. This filterbank has been specially tuned for embossed character OCR.

Each of the filters is convolved with a preprocessed segment to form a filter output. Outputs for the same orientation are merged using Euclidean distance. These merged outputs are in turn downsampled to make the size of the resulting vector more manageable. Downsampled merged outputs are joined into a single vector which can be directly consumed by a classifier. The operation of the feature extractor 47 will be described in greater detail with respect to Figures 9 and 10. The purpose of the feature extractor is to generate feature vectors that are distinctive of alphanumeric characters, and so can be recognised using machine learning, in spite of the uncontrolled lighting conditions and the embossed nature of the characters. Consider first the filter bank shown in Figure 9. As can be seen, each filter has a generally elongate shape with an orientation to that shape. The top left filter, for example, has a generally vertical direction and a generally black/white/black configuration with a wider central portion and narrower top and bottom portions. The top and bottom portions generally curve from the central portion. Such a black/white/black arrangement will emphasise boundaries of generally vertical structures within an image, thereby improving depiction of vertical edges of characters. Consider the third filter down on the left-hand side of Figure 9: it can be seen that this has a similar arrangement to the top left filter, but this time with a generally horizontal orientation to the elongate arrangement. This filter will accordingly emphasise horizontal edges within an image. As can be seen reading from top left to bottom right, a variety of orientations of the filters are provided, rotating from vertical through horizontal and back to near vertical, giving a total of eight distinct orientations. Each orientation will emphasise structures that are generally oriented at the same angle as the filter being used.

The filter bank comprises different orientations, and also differing phase offsets of filter. For each filter orientation a pair of filters is provided (one shown in darker grey and one in lighter grey for ease of representation). Considering the first and second filters on the top row, it can be seen that the first filter has a generally black/white/black arrangement and is generally symmetrical about a vertical axis, whereas the second filter is asymmetric about a vertical axis, having a smaller black area followed by white then black. The phase offset provided by the symmetric and asymmetric pair of filters will emphasise transitions within the image that are low to high to low again. The asymmetric filter will emphasise transitions that are high to low, but will not emphasise transitions that are low to high to low again as much as the symmetric filter. Accordingly, these each emphasise slightly different features within an image. The asymmetric filter emphasises white to black transitions. The symmetric filter also emphasises white to black transitions, but emphasises even further transitions that are black to white to black that are of the same width as the filter itself. The pairs of filters are said to be offset in phase by 90° as they are constructed from sine waves, and using a 90° offset produces the differences in pairs of filters as shown. A total of 16 filters are thereby provided, comprising eight different orientations in each of two different phases.
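Such a bank can be sketched with OpenCV's Gabor kernel generator. The kernel size and the sigma/lambda/gamma tuning below are assumptions; the text says only that the bank is specially tuned for embossed character OCR.

```python
# A minimal sketch of the 16-filter bank: eight orientations, each as a
# symmetric (psi = 0) and asymmetric (psi = 90 degrees) pair of Gabor
# wavelets.
import numpy as np
import cv2

def build_filter_bank(ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
    bank = []
    for i in range(8):                          # eight distinct orientations
        theta = i * np.pi / 8
        for psi in (0.0, np.pi / 2):            # 90-degree phase-offset pair
            bank.append(cv2.getGaborKernel((ksize, ksize), sigma,
                                           theta, lambd, gamma, psi))
    return bank                                 # 16 kernels in total
```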

The imaged segments are each processed by each of the 16 filters in turn, providing 16 output images each having structures in different orientations emphasised as described above. These images are combined together to produce feature vectors, as will now be described in relation to Figure 10.

The example shown in Figure 10 is processing of an image segment containing an embossed character zero, as shown on the left-hand side of Figure 10. The image segment is processed by each of the filters in turn by convolving the filter and the image to produce a new image. For simplicity of representation, only four pairs of filters comprising the main orientations of vertical, 45°, horizontal and 45° down are shown in Figure 10, but in practice all of the filters shown in Figure 9 will be used. The intermediate images produced by the filters are merged together such that each pair of images produced by each pair of filters is merged by a merging process. The preferred merging process is to square, sum and square-root pixel by pixel. This produces a total of 8 merged images (only 4 are shown in Figure 10). In order to simplify processing, these are preferably downscaled, which could be by any downscaling process, but the preferred process is linear interpolation between pixels, producing a total of 8 merged and downscaled images.

The filtered, merged and downscaled images are then converted to a vector for subsequent classification by machine learning. The preferred approach to producing the vector is to read the value of each pixel in turn from each of the resultant images and concatenate the numbers together. Accordingly, the dimension M of the vector will be the number of pixels in each filtered, merged and downsampled image multiplied by the number of images (8 such images in the example embodiment).
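Putting the steps together, a sketch of the feature extraction path might look as follows; the 8x8 downscaled size is an assumption.

```python
# Convolve the segment with each filter, merge each phase pair by squaring,
# summing and square-rooting pixel by pixel, downscale by linear
# interpolation, and concatenate the pixels into one vector.
import numpy as np
import cv2

def extract_features(segment, bank, out_size=(8, 8)):
    merged = []
    for k in range(0, len(bank), 2):            # bank holds phase pairs
        a = cv2.filter2D(segment, cv2.CV_32F, bank[k])
        b = cv2.filter2D(segment, cv2.CV_32F, bank[k + 1])
        e = np.sqrt(a * a + b * b)              # square, sum, square-root
        merged.append(cv2.resize(e, out_size, interpolation=cv2.INTER_LINEAR))
    return np.concatenate([m.ravel() for m in merged])  # dimension M vector
```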

For the classification of the segments into labels, the classifier 49 uses a machine learning based approach. The resulting vectors described above are provided to the machine learning tool. The initial model that proved to surpass the 90% accuracy mark relied on Support Vector Machines. The biggest deficiencies of this classifier are that, in its more common form, it is hard to obtain confidence values for the predictions and it cannot be trained in an iterative way. In order to overcome those deficiencies, the model was updated to a Multi-Layer Perceptron with a single hidden layer. This allowed the easy extraction of good confidence values, and training can now be done iteratively. However, this also meant losing around 1% accuracy. The model can be trained with a larger training set than memory (RAM) would otherwise allow, by dividing the training set into batches that are processed sequentially.
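A minimal sketch of such a classifier, here using scikit-learn's iterative partial_fit interface; the hidden-layer size and the exact label set are assumptions (the labels reflect the digits, dashes, slashes and whitespace class mentioned in the text).

```python
# Single-hidden-layer Multi-Layer Perceptron trained in mini-batches,
# giving both iterative training and per-class confidence values.
from sklearn.neural_network import MLPClassifier

LABELS = list("0123456789-/") + ["whitespace"]

clf = MLPClassifier(hidden_layer_sizes=(256,))

def train_in_batches(batches):
    """batches: iterable of (X, y) mini-batches of feature vectors and labels."""
    for X, y in batches:
        clf.partial_fit(X, y, classes=LABELS)   # iterative / mini-batch training
    # clf.predict_proba(...) then yields per-class confidence values
```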

The mini-batch training capability was successfully used to train the classifier with randomly translated segments following the error distribution coming out of the segmentation algorithm, so as to build in translation invariance. In this sense the classifier is specially tuned to adapt itself to the properties of the segmenter. In order to build better generalisation properties into the network, a similar scheme can be used to introduce other affine distortions, elastic distortions, noise, etc. Because the segmentation algorithm returns many false positives (around 50%), an extra capability is built into the classifier to tell whitespace apart. Apart from digits, the classifier also recognises dashes and slashes, which enables the system to more easily identify items such as dates and sort codes. Various further modifications to the processing described above are possible to further improve character recognition. In particular, the various global pre-processing steps mentioned earlier may be operated prior to the recognition stage.