

Title:
METHOD AND DEVICE FOR DETECTING TEXT
Document Type and Number:
WIPO Patent Application WO/2006/129261
Kind Code:
A1
Abstract:
The invention relates to a method for detecting at least a text region model embedded in at least one image, and to a corresponding computer program and device. The method comprises the following steps: - a) considering (6, 42) a group of lines of the image including a predefined number of subsequent lines; - b) analyzing (12, 40, 44) the pixels of the considered group of lines with at least one classifier to identify, among the pixels of this group of lines, the pixels belonging to text regions and the pixels belonging to non-text regions; - c) segmenting (78, 80) the identified text regions of the considered group of lines (8); and - d) repeating (94) steps a) to c) for other groups of lines of the image until consideration of all lines of the image, to generate the text region model based on at least a segmented text region embedded in the whole image.

Inventors:
EKIN AHMET (FR)
JASINSCHI RADU (FR)
Application Number:
PCT/IB2006/051697
Publication Date:
December 07, 2006
Filing Date:
May 29, 2006
Assignee:
KONINKL PHILIPS ELECTRONICS NV (NL)
EKIN AHMET (FR)
JASINSCHI RADU (FR)
International Classes:
G06V30/10
Domestic Patent References:
WO2002095662A12002-11-28
Other References:
CHEN D ET AL: "Text detection and recognition in images and video frames", PATTERN RECOGNITION, ELSEVIER, KIDLINGTON, GB, vol. 37, no. 3, March 2004 (2004-03-01), pages 595 - 608, XP004479007, ISSN: 0031-3203
KIM K C ET AL: "Scene text extraction in natural scene images using hierarchical feature combining and verification", PATTERN RECOGNITION, 2004. ICPR 2004. PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON CAMBRIDGE, UK AUG. 23-26, 2004, PISCATAWAY, NJ, USA,IEEE, vol. 2, 23 August 2004 (2004-08-23), pages 679 - 682, XP010724483, ISBN: 0-7695-2128-2
LIENHART R ED - ROSENFELD A ET AL: "VIDEO OCR: A SURVEY AND PRACTITIONER'S GUIDE", VIDEO MINING, KLUWER INTERNATIONAL SERIES IN VIDEO COMPUTING, NORWELL, MA : KLUWER ACADEMIC PUBL, US, 2003, pages 155 - 184, XP009046500, ISBN: 1-4020-7549-9
LEE C W ET AL: "Automatic text detection and removal in video sequences", PATTERN RECOGNITION LETTERS, NORTH-HOLLAND PUBL. AMSTERDAM, NL, vol. 24, no. 15, November 2003 (2003-11-01), pages 2607 - 2623, XP004443627, ISSN: 0167-8655
JULINDA GLLAVATA RALPH EWERTH BERND FREISLEBEN: "A Text Detection, Localization and Segmentation System for OCR in Images", MULTIMEDIA SOFTWARE ENGINEERING, 2004. PROCEEDINGS. IEEE SIXTH INTERNATIONAL SYMPOSIUM ON MIAMI, FL, USA 13-15 DEC. 2004, PISCATAWAY, NJ, USA,IEEE, 13 December 2004 (2004-12-13), pages 310 - 317, XP010757265, ISBN: 0-7695-2217-3
SITA R ET AL: "A Single-chip Hdtv Video Decoder Design", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2 June 1998 (1998-06-02), pages 50 - 51, XP010283010
RICK A ET AL: "DIGITAL COLOR DECODER FOR PIP-APPLICATIONS", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 42, no. 3, August 1996 (1996-08-01), pages 716 - 719, XP000638559, ISSN: 0098-3063
Attorney, Agent or Firm:
Chaffraix, Jean (156 Boulevard Haussmann, Paris, FR)
Claims:
CLAIMS
1. A method of detection of at least a text region model embedded in at least one image (2) made of an array of pixels distributed over lines and columns, wherein the method comprises the following steps: a) considering (6, 42) a group of lines (8) of the image (2) including a predefined number of subsequent lines (10), the predefined number of lines (10) being the minimum of one half of the height of the smallest expected font of the image and twelve lines, and the predefined number of lines (8) being lower than the number of lines (8) of the image (4); b) analyzing (12, 40, 44) the pixels of the considered group of lines (8) with at least one classifier to identify, among the pixels of this group of lines (8), the pixels belonging to text regions (81, 86) and the pixels belonging to non-text regions (38, 74); c) segmenting (78, 80) the identified text regions (81, 86) of the considered group of lines (8) to refine the separation of the text regions (81, 86) from the background of the image; and d) repeating (94) steps a) to c) for other groups of lines (8) of the image (2) until consideration of all lines of the image (2) to generate the text region model based on at least a segmented text region (81, 86) embedded in the whole image (2).
2. A method according to claim 1, wherein the analyzing step (12, 40, 44) comprises the following steps: classifying (12) the pixels of the considered group of lines (8) as belonging to potential text regions (81, 86) or to potential non-text regions (38, 74) by the use of a first classifier and according to a primary threshold (Thr_SVM); classifying (44) the pixels of the considered group of lines (8) as belonging to hypothetical text regions (81, 86) or to hypothetical non-text regions (38, 74) by the use of a second classifier; and combining (40) the results of both classifying steps by the first classifier and by the second classifier to identify the pixels belonging to conceivable text regions (81, 86) and the pixels belonging to conceivable non-text regions (38, 74).
3. A method according to claim 2, wherein the first classifier is a Support Vector Machine (SVM) based classifier.
4. A method according to any of the claims 2 and 3, wherein the second classifier is a Connected Component (CC) based classifier.
5. A method according to any of the claims 1 to 4, wherein the analyzing step (12, 40, 44) comprises a step (12) of classifying the pixels by the use of a SVM based classifier, the classifying step (12) being preceded by a step of filtering (4) the pixels of at least a part of the image (2) to determine at least one strong edge, and wherein the group of lines (8) considered comprises at least one determined strong edge.
6. A method according to claim 5, wherein the filtering step (4) comprises the exclusion of the pixels having a horizontal derivative (Dx) or a vertical derivative (Dy) lower than a predefined number of times (C) the average of the sum of the absolute values of the horizontal derivative (Dx) and vertical derivative (Dy) of a defined array of pixels, the predefined number of times being comprised between 4 and 6.
7. A method according to any of the claims 1 to 6, wherein the analyzing step (12, 40, 44) comprises a step of classifying (12) the pixels by the use of a SVM based classifier which comprises a step of scaling down (22) only in the horizontal direction the group of lines (8) by a predefined ratio.
8. A method according to any of the claims 1 to 7, wherein the analyzing step (12, 40, 44) comprises a step of classifying (12) the pixels by the use of a SVM based classifier, the classifying step (12) comprising the following steps: a) considering (16) in the group of lines (8) a defined pixel located in a window (W) having a predefined size; b) computing (18, 20) a likelihood parameter (ℓ) of the defined pixel; c) scaling down (22) only in the horizontal direction the group of lines (8) by a predefined ratio; d) considering (24) the defined pixel located in the window (W) in the scaled group of lines (8); e) repeating steps b) to d) a predefined number of times and comparing the likelihood parameter (ℓ) computed during each repetition to a predefined number (Thr_SVM) to classify all the pixels of the window (W) as belonging to text regions (81, 86) or to non-text regions (38, 74).
9. A method according to any of the claims 1 to 8, wherein the analyzing step (12, 40, 44) comprises a step of classifying (12) the pixels by the use of a SVM based classifier trained with a linear kernel during a training step (14).
10. A method according to claim 2, wherein the combining step (40) comprises a step of searching the conceivable regions classified as potential and hypothetical text regions (81, 86) by the first classifier and the second classifier.
11. A method according to any of the claims 1 to 9 in combination with claim 10, wherein the segmenting step (78, 80) comprises the following steps: quantizing (78) the color and/or intensity of the pixels of the group of lines (8); determining (80) the jumps of the color and/or the intensity in consecutive pixels along at least one line and at least one column of the conceivable text regions (81, 86).
12. A method according to any of the claims 1 to 11, wherein, between the segmenting step (78, 80) and the repeating step (94), the method comprises an additional analyzing step (82, 84, 90) which comprises the following steps: searching (82) candidate regions classified as hypothetical text regions (81, 86) by the second classifier and as potential non-text regions (38, 74) by the first classifier; classifying (84) the pixels of the candidate regions as belonging to presumed text regions (81, 86) or presumed non-text regions by the use of the first classifier according to a threshold lower than the primary threshold (Thr_SVM); and wherein the additional analyzing step (82, 84, 90) is repeated for other groups of lines (8) of the image (2) until consideration of all lines of the image (2), the text region model being generated also based on the presumed text regions (81, 86).
13. A method according to claim 12, wherein the additional analyzing step (82, 84, 90) further comprises the following steps: testing (90) the brightness of presumed non-text regions; and classifying (90) as acceptable text regions the non-text regions having a brightness greater than a predefined value, the text region model being generated also based on the acceptable text regions.
14. A method according to any of the preceding claims, wherein the group of lines (8) comprises a number of lines less than the number of lines of the smallest expected font of the image (2).
15. A method for detecting at least a text region embedded in images (2) of a video frame, wherein the image (2) is a part of a sequence of images (2) comprising at least a previous image and a current image , comprising the following steps: a) receiving (96, 102) the current image (2) of the video frame ; b) detecting (98, 104) a text region model in the current image (2) using the method defined according to claims 1 to 14 ; c) generating (100, 106, 108) a bounding box associated to the current image (2), the bounding box comprising the coordinates of the pixels delimiting the text regions of the text region model of the current image (2) ; d) repeating steps a) to c) for the subsequent images (2) of the video frame by using at least one bounding box generated to detect the text region model in the at least one previous image (2) during the detecting step (98, 104).
16. A method according to claim 15, wherein the step of generating (100, 106, 108) a bounding box associated to the current image (2) comprises the following steps: comparing (106) the text region detected in the current image to the coordinates of the text bounding boxes associated to the previous image to accept or reject the text region detected in the current image according to the result of the comparing step; and updating (108) the bounding box generated with the coordinates of the pixels delimiting the text region of the text region model of the current image.
17. A method according to claim 16, wherein the comparing step (106) comprises a step of classifying pixels of the current image as belonging to the text region model when these pixels are classified as belonging to a non-text region and when the pixels located at the same coordinates in the previous image (2) were classified as belonging to the text region model.
18. A method according to claim 16, wherein the comparing step (106) comprises a step of classifying pixels of the current image as belonging to the text region model during the detecting step (98, 104) only when the pixels located at the same coordinates in the previous image (2) were classified as belonging to the text region model during the detecting steps (98, 104) previous to the detecting step of the current image.
19. A computer program for a processing unit comprising a set of instructions which, when loaded into said processing unit, causes the processing unit to carry out the steps of the method as claimed in any of the claims 1 to 18.
20. A device for detecting at least a text region model embedded in at least one image (2) made of an array of pixels distributed over lines and columns, wherein the device comprises: a) considering means adapted to consider a group of lines (8) of the image (2) including a predefined number of subsequent lines (10), the predefined number of lines (10) being comprised between one half and ten times the number of lines of the smallest expected font of the image (2), and the predefined number of lines (8) being lower than the number of lines (8) of the image (4); b) analyzing means able to analyze the pixels of the considered group of lines (8) with at least one classifier to identify, among the pixels of this group of lines (8), the pixels belonging to text regions (81, 86) and the pixels belonging to non-text regions (38, 74); c) segmenting means adapted to segment the identified text regions (81, 86) of the considered group of lines (8) to refine the separation of the text regions (81, 86) from the background of the image; and d) repeating means able to repeat steps a) to c) for other groups of lines (8) of the image (2) until consideration of all lines of the image (2) to generate the text region model based on at least a segmented text region (81, 86) embedded in the whole image (2).
Description:
"METHOD AND DEVICE FOR DETECTING TEXT"

TECHNICAL FIELD OF THE INVENTION

This invention relates to a method and a device for detecting a text embedded in at least one image.

BACKGROUND OF THE INVENTION

Many situations arise where it would be beneficial to be able to identify text within particular visual images. For example, it would be useful to remove indirect advertisement included in TV programs and movie scenes, such as logos or banners on clothes, on electrical home appliances or on furniture.

Identifying text is not an easy task due to the nature of the underlying image. Text may be overlaid on top of a wide range of different backgrounds and textures of the image, such that distinguishing the background from the text becomes very difficult. In the past two decades, there has been much work on overlay video text detection to be able to automatically process the growing amount of video for storage, data mining, and video indexing and retrieval applications. Although some algorithms achieved a certain level of maturity, these algorithms rely on large processing power and unlimited random access memory in the spatial and, in the case of video, temporal directions. However, many hardware architectures support access to only a group of data (image, video) lines rather than the whole frame at one instant. When the text height is greater than the number of available data lines for processing, the hardware constraint demands a text or non-text decision about a pixel or a block of pixels before the whole of the text has been seen.

In order to better evaluate the existing algorithms with respect to hardware constraints, they can be classified into two categories based on their methodologies: 1) Connected Component (CC) based algorithms, and 2) machine-learning based algorithms. Both types of algorithms consist of a first step, usually called the candidate detection step, and of a second step following the first step, called the segmentation step. During the candidate detection step, text candidates are identified by extracting and analyzing pre-defined features. Whereas CC-based algorithms can assign a pixel as a text candidate or not, machine-learning based algorithms have lower resolution in that they usually define candidates in terms of blocks.

During the segmentation step, the nearby text candidates are connected by employing morphological operations, the borders of text segments are computed to extract text lines and word boundaries, and finally these segments are accepted or rejected. Although the random access memory constraints for CC-based and machine-learning based algorithms differ, none of the CC-based and machine-learning based approaches can complete both steps with a number of lines that is smaller than the text height.

CC-based algorithms, such as the ones described in WO-02/095662 and in US-6,738,512, usually exploit texture by detecting edges in the candidate detection step and may use color to connect candidates. Some CC-based algorithms, such as "A new robust algorithm for video text extraction" in Pattern Recognition 36 (2003), pp. 1397-1406, by E.K. Wong and M. Chen, may complete the candidate detection step by analyzing only one line of data. However, such methods are so sensitive to the background complexity that they result in too many false alarms in textured areas. Furthermore, all of the CC-based algorithms need a large number of data lines in the segmentation step even though they may satisfy hardware constraints in the candidate detection step.

Machine-learning based methods, such as the one described in the paper entitled "Automatic text detection and removal in video sequences" by Chang Woo Lee, Kee Chul Jung and Hang Joon Kim, published in Pattern Recognition Letters 24 (2003), pp. 2607-2623, employ powerful but computationally demanding machine learning tools, such as Support Vector Machines (SVMs). Differently from CC-based algorithms, they analyze blocks of texture-based features that are extracted from gray-level or color image data for candidate detection. They need to match the height of the text with the block height to be able to detect the text. Because text can be of any size, in order to fit the text to the block height, they perform multi-resolution analysis by scaling the image in the vertical and horizontal directions. This type of approach requires the hardware to support as many lines as the text height and is not practical.

Therefore, it is necessary to develop a new and robust text detection method that is able to detect text even if the underlying hardware supports fewer data lines (smaller random access memory) than the height of the text.

SUMMARY OF THE INVENTION

Accordingly, it is an object of the invention to provide a new method for detecting an embedded text as recited in claim 1.

This method makes the algorithm attractive for hardware implementation because the processor needs less random access memory when it carries out the steps of the method to detect text of any font size. The invention provides robust text detection performance with very limited random access memory. Specifically, the invention can detect text with as few as 12 data lines. Differently from CC-based and machine-learning based approaches, both the candidate detection and segmentation stages conform to the hardware constraints. Text of any size is detectable thanks to a special type of scaling that scales the image only in the horizontal direction. Furthermore, the algorithm benefits from the accuracy of machine-learning based approaches by using an SVM-based classifier while circumventing the computation problem by employing only a linear SVM kernel. Other features of the method of the invention are further recited in the dependent claims. It is also an object of the invention to provide a corresponding program as recited in claim 18.

It is also an object of the invention to provide a corresponding device as recited in claim 19.

These and other aspects of the method for detecting text in an image will be apparent from the following description, drawings and from the claims.

BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 is a flow chart illustrating the main steps of the first embodiment of the method of the invention.

Fig. 2 is a representation of an image during the implementation of the method.

Fig. 3 is an enlarged view of some groups of lines of the image illustrated in Figure 2.

Fig. 4 is a flow chart illustrating the processing steps of the SVM-based classifier.

Fig. 5 is a representation of three views of the image illustrated in Figure 2, the center view and the right view being horizontally downscaled with ratios of 2:1 and 4:1 (height-preserving image scaling, according to which the spatial resolution decreases only in the horizontal direction).

Fig. 6 is a schematic view of an example of one group of lines where the regions classified as text regions by the SVM-based classifier, are shaded in.

Fig. 7 is a flow chart illustrating the processing steps of CC-based classifier.

Fig. 8 is a view similar to the view of Figure 6 where the regions classified as text regions by the CC-based classifier, are shaded in.

Fig. 9 is a view similar to the view of Figure 6 where the segmented text regions, are shaded in.

Fig. 10 is a view similar to the view of Figure 6 where the regions classified as text regions by the CC-based classifier and as non-text regions by the SVM-based classifier, are shaded in.

Fig. 11 is a view similar to the view of Figure 6 where the regions classified as text regions by the CC-based classifier and as text areas by the second application of the SVM-based classifier, are shaded in.

Fig. 12 is a view similar to the view of Figure 6 where the regions classified as text regions by the CC-based classifier and as non-text areas by the second application of the SVM-based classifier, are shaded in.

Fig. 13 is a view similar to the view of Figure 6 where all the text regions detected by the first embodiment of the method of the invention, are shaded in.

Fig. 14 is a flow chart illustrating one part of a second embodiment of the method of the invention.

DETAILED DESCRIPTION

Referring to Figure 1, a first embodiment of the method for detecting text regions according to the invention is illustrated. This first embodiment does not rely on temporal information, so the image to be analyzed can be either an image of a video frame or a picture. The image 2 to be analyzed, also called the original image, is illustrated in Figure 2. This image 2 is made of an array of pixels distributed along horizontal rows and vertical columns. For example, such an image comprises 400 lines and 530 columns.

The method of Figure 1 begins with a step 4 of filtering the original image 2 to detect its strong edges. During the filtering step 4, the edge strength of each pixel (x,y) from the original image 2 is computed according to the following equation:

E_s(x,y) = ABS(D_x(x,y)) + ABS(D_y(x,y))   (1)

where the function ABS refers to the absolute value and where D_i(x,y), with i = x,y, refers to the horizontal (D_x) and vertical (D_y) derivatives of each pixel (x,y). The horizontal and the vertical derivatives are computed according to the following equations:

D_x(x,y) = 0.5 * (I(x,y) + I(x,y+1) - I(x+1,y) - I(x+1,y+1))   (2)

D_y(x,y) = 0.5 * (I(x,y) - I(x,y+1) + I(x+1,y) - I(x+1,y+1))   (3)

where I(x,y) refers to the intensity of the pixel located at position (x,y).

Then, an average edge strength E_t and a strong edge threshold E_Thr are computed using the following equations:

E_t = (1 / (M * N)) * Σ_x Σ_y E_s(x,y)   (4)

E_Thr = C * E_t   (5)

where C is a constant, and M and N are the number of pixels in a line and the number of lines, respectively, of the image 2. Without limiting the invention, a sample value of C can be taken between 4 and 6. For each pixel (x,y) of the image 2, the horizontal and vertical derivatives are computed and compared to the strong edge threshold E_Thr.

As a result of the filtering step 4, the generated image comprises only pixels having an edge strength larger than the strong edge threshold E_Thr. Depending on the settings, the pixels having an edge strength smaller than the threshold E_Thr may not be considered for the step of classifying the pixels with a Support Vector Machine-based classifier, as explained in the remainder of the description.
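Purely as an illustration, a minimal sketch of this filtering step (Equations (1) to (5)) could look as follows in Python/NumPy; the function name and the default value of the constant C are assumptions made for the example, not part of the patent text.

```python
import numpy as np

def strong_edge_mask(img, C=5.0):
    """Sketch of filtering step 4: keep only the strong-edge pixels (Eqs. (1) to (5)).

    img: 2-D array of pixel intensities I(x, y); C: constant taken between 4 and 6.
    """
    I = img.astype(np.float64)
    # Horizontal and vertical derivatives over 2x2 neighborhoods (Eqs. (2) and (3)).
    Dx = 0.5 * (I[:-1, :-1] + I[:-1, 1:] - I[1:, :-1] - I[1:, 1:])
    Dy = 0.5 * (I[:-1, :-1] - I[:-1, 1:] + I[1:, :-1] - I[1:, 1:])
    # Edge strength as the sum of the absolute derivatives (Eq. (1)).
    Es = np.abs(Dx) + np.abs(Dy)
    # Average edge strength (Eq. (4)) and strong-edge threshold (Eq. (5)).
    Et = Es.mean()
    E_thr = C * Et
    # Pixels above the threshold are the only ones passed on to the SVM-based classifier.
    return Es > E_thr, E_thr
```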

At step 6, a group of lines 8 (see Figures 2 and 3) of the filtered image is chosen for the application of the SVM-based classifier. The group of lines 8 taken into consideration is the first group of lines having at least one strong edge. To this end, the first group is chosen by scanning the image from the bottom to the top or from the top to the bottom, searching for at least one strong edge. It is to be noted that a band of a predefined number of lines can be defined for both the top and bottom image borders without any difficulty. The size of the group of lines 8 is determined based on the hardware support. This number can be as low as 12 lines. Alternatively, it can be as low as one half of the height of the smallest expected font of the image 2. For example, Figures 2 and 3 show an example of a first group 8 of twelve (12) lines in which a line, referenced 10, is highlighted.
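As an illustration only, this band selection might be sketched as below; it assumes a boolean strong-edge mask such as the one produced by the filtering sketch above, and the band height of 12 lines is only the example value given in the text.

```python
import numpy as np

def first_band_with_strong_edge(edge_mask, band_height=12):
    """Sketch of step 6: find the first group of lines that contains a strong edge.

    edge_mask: boolean array, True where the edge strength exceeds E_Thr (step 4);
    band_height: number of lines supported by the hardware (12 in the example).
    """
    n_lines = edge_mask.shape[0]
    # Scan the image one band of lines at a time (here from top to bottom).
    for top in range(0, n_lines - band_height + 1, band_height):
        if edge_mask[top:top + band_height].any():
            return top
    return None  # no complete band contains a strong edge
```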

At step 12, the pixels of the considered group of lines 8 are analyzed with the SVM-based classifier in order to classify regions by the SVM method. Referring to Figure 4, the substeps of the analyzing step 12 will be illustrated. The SVM-based classifier is a statistical learning tool which is trained offline in a supervised training step 14 before being applied to the group of lines 8 to analyze. At the beginning of the training step 14, provided for building up a linear classifier, a number of text and non-text images with their text and non-text labels are introduced into the SVM to determine a feature matrix M_SVM and a single parameter β. These parameters are learned by minimizing a cost function defined in "The nature of statistical learning theory," Springer, 1995, by Vladimir Vapnik.

Because it is difficult to find representative hard-to-classify non-text examples, the popular bootstrapping approach introduced by K.K. Sung and T. Poggio in "Example-based learning for view-based human face detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, Jan. 1998, can be followed.

Bootstrap-based training is completed in several iterations and, in each iteration, the resulting classifier is tested on some images that do not contain text. False alarms over this data set represent difficult non-text examples that the current classifier cannot correctly classify. These non-text samples are added to the training set. Hence, the non-text training data set grows and the classifier is retrained with this enlarged data set. For example, the SVM-based classifier is trained with 1 000 text blocks and, at most, 3 000 non-text blocks for which edge orientation features are computed.

When a classifier is being trained, an important issue to decide upon is the size of the image blocks that are fed to the classifier, because the height of the block determines the smallest detectable font size whereas the width of the block determines the smallest detectable text width. A block size of 12 x 12 pixels is chosen, for example, for the training of the classifier because, in a typical frame with a height of 400 pixels, it is rare to find a font size smaller than 12 pixels, and 12 is also equal to the number of lines supported by the hardware in our system. Given a block size of 12 x 12 and two features per pixel, the size of the matrix M_SVM can be found as 12 x 24. The SVM-based classifier employs a linear kernel, which is very efficient in terms of computational requirements and accuracy.
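The training step 14 with its bootstrapping loop could be sketched as follows. This is only one possible implementation, using scikit-learn's LinearSVC as an off-the-shelf linear-kernel SVM; the patent does not prescribe any library, and the learned weight matrix and intercept simply play the roles of M_SVM and β.

```python
import numpy as np
from sklearn.svm import LinearSVC  # linear kernel, as in training step 14

def bootstrap_train(text_feats, nontext_feats, nontext_pool, iterations=3):
    """Sketch of the bootstrap training loop described above.

    text_feats / nontext_feats: arrays of flattened 12x24 feature blocks (one row per block);
    nontext_pool: feature blocks taken from text-free images, used to mine hard negatives.
    """
    neg = nontext_feats
    clf = None
    for _ in range(iterations):
        X = np.vstack([text_feats, neg])
        y = np.hstack([np.ones(len(text_feats)), np.zeros(len(neg))])
        clf = LinearSVC().fit(X, y)
        # False alarms on the text-free data are hard non-text examples:
        # add them to the training set and retrain.
        false_alarms = nontext_pool[clf.predict(nontext_pool) == 1]
        if len(false_alarms) == 0:
            break
        neg = np.vstack([neg, false_alarms])
    return clf
```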

Having computed the feature matrix M_SVM and the parameter β in the training stage, text region detection is performed in the group of lines 8. To this end, a step 16 is provided for considering a window W in a strong edge zone. A K x L pixel window W is considered in the group of lines 8, for example a 12 x 12 pixel window. The center of the window W is located on a defined pixel (x,y).

At step 18, a feature vector f_SVM is computed for this defined pixel (x,y). In the described embodiment, the feature vector f_SVM comprises two features per pixel, which are the horizontal derivative D_x and the vertical derivative D_y obtained, for example, from Equations (2) and (3). As a result, for 12 x 12 blocks, the size of f_SVM is 12 x 24, the same as M_SVM. Alternatively, the features of the feature vector f_SVM may be computed by a variety of popular methods used in the image processing literature. For example, these features may comprise weighted intensity derivatives and/or weighted color derivatives. Changing the feature definition also involves re-training to recompute the SVM parameters M_SVM and β.

At step 20, a first text likelihood parameter ℓ_0, characteristic of the pixels in the window W, is computed using the following equation:

ℓ = correlation(f_SVM, M_SVM) - β   (7)

where M_SVM and β are the result of the training step 14, and where the cross-correlation of two K x L matrices A and B is defined by:

correlation(A, B) = Σ_{i=1..K} Σ_{j=1..L} A(i,j) * B(i,j)

where i and j index the pixels in the window W. Because text should be detected independently of its font size, a step 22 is provided for scaling down the group of lines. The group of lines 8 considered at step 6 is downscaled by a predefined value, e.g., two. In contrast to the existing algorithms, this invention proposes to perform downscaling only in the horizontal direction so that the required number of image lines (or the amount of random access memory for the image content) can be kept the same as is needed for the original resolution. The downscaling can be implemented by first filtering the image lines to prevent aliasing and then sub-sampling.

At step 24, provided for reconsidering the window W, a window W of the same size as the window W considered at step 16 and centered at the same defined pixel (x,y) is reconsidered on the downscaled group of lines 8. Figure 5 shows, on its right side, as an example, a view of the image 2 downscaled in the horizontal direction by two and another view of the image 2 downscaled in the horizontal direction by four. Steps 18 and 20 are repeated on this reconsidered window W on the group of lines 8 downscaled a first time at step 22, to obtain a second text likelihood parameter ℓ_1 (step 20), characteristic of the pixels in the window W.

Steps 22, 24, 18 and 20 are repeated once more for a window W of the same size as the window W considered at step 16 and centered at the same defined pixel (x,y), in the group of lines 8 downscaled a second time by a predefined value during step 22, to obtain at step 20 a third text likelihood parameter ℓ_2, characteristic of the pixels in the window W. Alternatively, steps 22, 24, 18 and 20 are repeated more than three times and the downscaling factor can be set to any value.

At step 26, the first, the second and the third text likelihood parameters ℓ_0, ℓ_1, ℓ_2 are compared to a specified threshold Thr_SVM. For example, the threshold Thr_SVM is specified as a function of the spatial distribution of the edge strength of the image 2. The threshold Thr_SVM may be both spatially and temporally variant. For example, it may be equal to 0.5 in the bottom part of the image and to 0 in the other parts of the image.

If any one of the three text likelihood parameters ℓ_0, ℓ_1, ℓ_2 is greater than the specified threshold Thr_SVM, all the pixels within the window W considered at step 16 are classified at step 28 as belonging to a text region. As a result, even pixels that are not classified as strong edge pixels can be assigned as text pixels if they are close to strong edge text pixels. If none of the three text likelihood parameters ℓ_0, ℓ_1, ℓ_2 is greater than the specified threshold Thr_SVM, all the pixels within the window W considered at step 16 are classified at step 30 as belonging to a non-text region.
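Tying steps 16 to 30 together, a hedged sketch of the multi-scale classification of one window is given below. The variable names, the plain 2:1 column sub-sampling (without the anti-alias filtering mentioned above) and the handling of the window coordinates are illustrative assumptions.

```python
import numpy as np

def classify_window(Dx, Dy, M_svm, beta, x, y, thr, n_scales=3, win=12):
    """Sketch of steps 16 to 30: multi-scale, horizontal-only SVM classification of one window.

    Dx, Dy: per-pixel derivative maps of the group of lines (Eqs. (2) and (3));
    M_svm (12x24) and beta: parameters learned in training step 14; thr: threshold Thr_SVM.
    """
    half = win // 2
    for _ in range(n_scales):
        # Feature vector: horizontal and vertical derivatives of the pixels in the window.
        f_svm = np.hstack([Dx[y - half:y + half, x - half:x + half],
                           Dy[y - half:y + half, x - half:x + half]])
        # Text likelihood: cross-correlation with M_SVM minus the bias beta (Eq. (7)).
        likelihood = np.sum(f_svm * M_svm) - beta
        if likelihood > thr:
            return True   # all pixels of the window are labelled as text (step 28)
        # Height-preserving scaling (step 22): sub-sample the columns only,
        # keeping the number of lines unchanged.
        Dx, Dy = Dx[:, ::2], Dy[:, ::2]
        x //= 2           # keep the window centred on the same defined pixel
    return False          # no scale exceeded Thr_SVM: non-text window (step 30)
```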

A step 32 is then provided for moving the window W to another strong edge zone, in order to start again the process of Figure 4. The window W is moved to another area of pixels next to the already analyzed area of pixels. Thereafter, steps 18 to 26 are repeated for these new pixels included in the moved window W. Steps 18 to 32 are repeated during several iterations until all the pixels of the group of lines 8 are classified as text regions or non-text regions.

Figure 6 illustrates schematically an example of a result from the analyzing step 12 for the group of lines 8. In this figure, a shaded region 36 identifies a region classified as a text region and a non-shaded region 38 identifies a region classified as a non-text region by the SVM-based classifier. The result of the analyzing step 12 is notably employed in a searching step 40 (see Figure 1), as explained later in the description.

At step 42, a group of lines of the original (non-filtered) image is considered. This group of lines corresponds to the group of lines 8 (of the filtered image) considered at step 6, which means it comprises the same number of lines and is situated at the same location as the group of lines 8 considered during step 6. To simplify the description, the group of lines considered at step 42 is also referenced 8.

A step 44 is then provided for classifying regions by the CC method. Text and non-text regions are identified on this group of lines 8 using a connected-component based classifier, hereafter called the CC-based classifier. Referring to Figure 7, the substeps of the classifying step 44 are illustrated.

At step 46, a pixel (x,y) located at the left extremity of the first line of the group of lines 8 is considered as the current pixel.

At step 48, a gradient G_1 between the current pixel (x_c, y_c) and its neighboring pixel on the same line (x_c + 1, y_c) is computed by using the relation:

G(x,y) = 0.5 * (D_x(x,y) + D_y(x,y))   (8)

where D_x and D_y refer to the horizontal and vertical derivatives computed according to Equations (2) and (3).

At step 52, G_1 is assigned as the gradient G(x_c, y_c) and is compared to the strong edge threshold E_Thr computed according to Equations (4) and (5) and the instructions described at step 4.

If the magnitude of the gradient G_1 is lower than the strong edge threshold E_Thr, the neighboring pixel (x_c + 1, y_c) is assigned as the current pixel (x_c, y_c) at step 54, provided for incrementing (x,y) by going to the next line at the end of a line.

Steps 48 and 52 are repeated until the end of the line. At the end of the line, the subsequent line is considered for further processing of steps 48, 52 and eventually 54, until the end of the group of lines 8, to find a gradient magnitude satisfying equation (9). If, at step 52, the magnitude of the gradient G_1 is larger than the strong edge threshold E_Thr, a value X_1 is assigned to the x-coordinate of the current pixel (x_c, y_c) at step 56.

At step 58, provided for incrementing (x,y), the neighboring pixel of the pixel assigned X_1 is considered as the current pixel (x_c, y_c).

At step 60, the magnitude of the current gradient G_2 of the current pixel defined at step 58 is computed by applying Equation (8).

At step 62, the magnitude of the current gradient G_2 is compared to the magnitude of the gradient G_1 to check whether the current gradient magnitude G_2 has the same order of magnitude as the previous gradient magnitude G_1. Furthermore, the phase angles θ_2 of G_2(x_c, y_c) and θ_1 of G_1(X_1, y_c) must differ by 180 degrees and are therefore also compared. When the gradients are not of the same order of magnitude and of opposite polarity, steps 58, 60 and 62 are repeated until the corresponding transition is found. When they satisfy the magnitude and the phase conditions, the value X_2 is assigned (step 64) to the corresponding current pixel (x_c, y_c) determined at step 58.

At step 66, the length of the section between the values X_1 and X_2, computed by subtracting X_1 from X_2, is compared to a predefined number k to validate the region as a text candidate. The predefined number k is, for example, set to 1. When the length is lower than the value k, the process goes back to step 58 for further processing of steps 58, 60 and 62. Otherwise, all the pixels located between the pixel assigned X_1 and the pixel assigned X_2 are identified as belonging to a text region at step 68. After step 68, the pixel to consider is the pixel adjacent to the pixel defined by the value X_2, and steps 48 to 68 are repeated for the next pixel of the line and then, line by line, for all the pixels of the group of lines 8.
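A compact sketch of this gradient-pairing search along one line is given below; the tolerance used to decide that two gradient magnitudes are of the same order is an assumption, since the patent does not give a numeric criterion for it.

```python
import numpy as np

def cc_candidates_on_line(G, E_thr, k=1, mag_tol=0.5):
    """Sketch of steps 46 to 68 of the CC-based classifier for a single line.

    G: signed gradient values along the line (Eq. (8)); E_thr: strong-edge threshold;
    k: minimum candidate length; mag_tol: illustrative similar-magnitude tolerance.
    """
    text = np.zeros(len(G), dtype=bool)
    x = 0
    while x < len(G) - 1:
        if abs(G[x]) <= E_thr:            # step 52: not a strong edge, move on
            x += 1
            continue
        x1, g1 = x, G[x]                  # step 56: first strong transition found at X1
        x2 = None
        for xc in range(x1 + 1, len(G)):  # steps 58 to 66: look for the matching transition X2
            g2 = G[xc]
            same_order = abs(abs(g2) - abs(g1)) <= mag_tol * abs(g1)
            opposite = np.sign(g2) == -np.sign(g1)   # 180-degree phase difference
            if same_order and opposite and (xc - x1) > k:
                x2 = xc
                break
        if x2 is not None:
            text[x1:x2 + 1] = True        # step 68: pixels between X1 and X2 are text candidates
            x = x2 + 1
        else:
            x = x1 + 1
    return text
```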

Figure 8 illustrates the group of lines 8 in which regions 70 and 72 are classified as text regions and shown shaded, and a region 74 is classified as a non-text region and shown non-shaded.

Returning to Figure 1, at step 40, the segmenting step starts. First, the regions classified as text regions by both the CC-based classifier and the SVM-based classifier are searched (searching step 40).

At step 78, the group of lines 8 is quantized into N levels of intensity to form a spatial context of the group of lines 8. The number N of levels is determined automatically or is set to a predefined value such as, for example, 8. Alternatively, the group of lines 8 is quantized into levels of a weighted combination of intensity and/or color instead of intensity alone. This method is a fast and efficient color segmentation that can be implemented with simple bit shifts and additions/subtractions.

At step 80, provided for localizing the extremities of the identified regions, the text regions found during the searching step 40 are segmented using the spatial context formed during step 78. The confidence of these text areas is the highest because both classifiers classify them as text. To this end, the text regions are refined by determining the color and/or intensity jumps in the text regions. This determination is done by scanning the pixels of each region, looking for intensity level differences larger than a predefined value in consecutive pixels along a line or along a column. As a result of the segmenting step 80, the text region 81 is separated from the background of the group of lines 8, as shown in Figure 9.
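For illustration, the quantization and jump detection of steps 78 and 80 could be sketched as follows, assuming 8-bit intensities; the number of levels and the jump threshold are example values only.

```python
import numpy as np

def segment_text_band(band, n_levels=8, min_jump=1):
    """Sketch of steps 78 and 80: quantize the group of lines and locate intensity jumps.

    band: 2-D array of 8-bit intensities of the group of lines;
    n_levels: number of quantization levels N; min_jump: minimum level difference
    (in quantized levels) that counts as a jump between consecutive pixels.
    """
    # Step 78: quantize the intensities into N levels (a simple bit-shift-like quantization).
    q = (band.astype(np.uint16) * n_levels // 256).astype(np.int16)
    # Step 80: level jumps between consecutive pixels along the lines and along the columns.
    jump_h = np.abs(np.diff(q, axis=1)) >= min_jump
    jump_v = np.abs(np.diff(q, axis=0)) >= min_jump
    return q, jump_h, jump_v
```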

At step 82, the regions identified as text regions by the CC-based classifier and as non-text regions by the SVM-based classifier are searched using the results of the analyzing steps 12 and 44. Figure 10 illustrates the region 72 found during step 82.

At step 84, provided for classifying areas by applying the SVM method on the found regions, the SVM-based classifier is applied again on the region 72 found at step 82. However, a lower threshold Thr_SVM is used for this processing. The SVM-based classifier operates in the same way as it does in step 12. It identifies an area 86 as a text area and an area 88 as a non-text area, as visible in Figures 11 and 12. Those regions satisfying the lower SVM threshold are also defined as text. A lower confidence value can be associated with these regions.

At step 90, the brightness of the non-text area 88 is tested if the region does not satisfy the lower SVM threshold (i.e., the area is classified as non-text). Such regions are only assigned as text if their intensity values are brighter than the background. In order to verify this, it is checked whether the brightness is above a predefined value. When the brightness condition is satisfied, the classified non-text area 88 is accepted as a text area. The result of the spatial context formed at step 78 is used for testing the brightness of the classified non-text area 88. At step 92, the classified regions are binarized. Figure 13 shows the totality of the text regions 81, 86 classified at steps 80, 84 and 90.

At step 94, a model of the text regions of the group of lines 8 is stored in a memory and the method is repeated for a second group of lines defined at step 6 until consideration of all lines of the image.

According to a second embodiment of the invention, the image is part of a sequence of images of a video frame. In this embodiment, bounding box information about the text regions detected in the past frames is used for text region detection in the current frame. The bounding box information from the past frames is added to the final segmentation stage 78, 80, 82, 84, 90 and 92 of the algorithm of Figure 1. The direct use of image information from the past frame is not favored because it requires hardware (random access memory) support for the temporal information, which is not always available.

As illustrated in Figure 14, the method according to this second embodiment begins with a step 96 of receiving a first image 2 of a video frame. At step 98, a detecting step is performed on the received image 2 using the method according to steps 4 to 94 of the first embodiment of the invention. At step 100, a bounding box of the initial image is generated. This bounding box comprises the coordinates of the pixels delimiting the text regions of the first image as well as a temporal detection counter, that is, an index incremented (for detection) or decremented (for non-detection) depending on the detection of text at the corresponding (motion-compensated) location in each new image of the video to analyze. This bounding box is initially associated to the first image where text regions have been detected, but the bounding box features are updated later at step 108 for each processed image.

At step 102, a second image 2 of the video frame is received. At step 104, the text regions of this newly received image 2 are detected. According to this second embodiment of the invention, this spatial detection is done in the same way as during steps 4 to 94 of the previous embodiment. At step 106, a temporal refinement step uses the information contained in the bounding boxes built up for the past images to accept or reject the spatial (image-only) detection done during the detection step 104. This refinement step verifies whether there is a bounding box corresponding to the current detection.

According to one policy, only when the text has been detected in one or several past frame(s) is the current detection done during step 104 accepted as valid during the temporal refinement step 106. Otherwise, the current detection is only used in the update stage by creating a new box with its temporal detection counter set to one (1). According to another policy, if the bounding box comprises coordinates delimiting a text region at a certain location and no text is detected during step 104 in the current frame, the text region defined in the bounding box is regarded as a valid text region in the current frame as well, but the temporal detection counter of the corresponding box is decremented.

According to another policy, a bounding box is removed only after a number of consecutive non-text detections. This number could be found as a function of the video parameters, such as frames per second, or as a function of the time (in seconds).

It is also possible to define the groups of lines 8 of the second image at step 6 (Figure 1) according to the bounding box built up for the previous image. In that case, one group of lines is chosen as comprising the lines extending between the coordinates of the text regions as well as a predefined number of lines above and below these text regions.

At step 108, the bounding box coordinates and the counter are updated. The update stage can keep the existing bounding box, delete it, or create a new bounding box. If the bounding box is kept, its features are updated. The location of the bounding box is updated by considering the bounding box location in the current frame and the motion between the current and the past frame. The temporal detection counter is incremented for a detection and decremented (or set to a low value) for a non-detection. The update step deletes a bounding box when no text region is detected for a certain number of consecutive frames. The temporal refinement stage creates new bounding boxes for the detections in the current frame that do not have any corresponding bounding box.

The temporal detection counters of these new boxes are set to one. Depending on the active policy, these new detection results may be regarded as text regions in the current frame or may wait for several frames to be verified. Steps 102 to 108 are repeated for the subsequent images 2 of the video frame.
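The bounding-box bookkeeping of steps 106 and 108 might be sketched as follows; the overlap test stands in for the motion-compensated matching described above, and deleting a box once its counter reaches zero is only one of the policies discussed.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextBox:
    """Bounding box of a detected text region plus its temporal detection counter."""
    coords: Tuple[int, int, int, int]   # (top, left, bottom, right); layout is illustrative
    counter: int = 1                    # new boxes start with their counter set to one

def update_boxes(boxes: List[TextBox],
                 detections: List[Tuple[int, int, int, int]]) -> List[TextBox]:
    """Sketch of the temporal refinement (step 106) and update (step 108) stages."""
    def overlaps(a, b):
        # Axis-aligned overlap test, standing in for motion-compensated matching.
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    kept = []
    for box in boxes:
        matched = [d for d in detections if overlaps(box.coords, d)]
        if matched:
            box.coords = matched[0]     # keep the box and refresh its location
            box.counter += 1            # detection: increment the temporal counter
        else:
            box.counter -= 1            # non-detection: decrement the counter
        if box.counter > 0:             # delete the box after too many non-detections
            kept.append(box)
    # Detections with no corresponding box create new boxes (counter set to one).
    unmatched = [d for d in detections if not any(overlaps(b.coords, d) for b in kept)]
    kept.extend(TextBox(coords=d) for d in unmatched)
    return kept
```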

This method can be carried out in hardware by a device or in software as computer-executable instructions executed by a device. The device may be one or more conventional personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, or the like. The device comprises:

- considering means adapted to consider the group of lines 8 of the image 2 including a predefined number of subsequent lines 10 (Figure 3),

- analyzing means able to analyze the pixels of the considered group of lines 8 with at least one classifier to identify, among the pixels of this group of lines 8, the pixels belonging to text regions 81, 86 and the pixels belonging to non-text regions 38, 74;

- segmenting means adapted to segment the identified text regions 81, 86 of the considered group of lines 8 to refine the separation of the text regions 81, 86 from the background of the image; and

- repeating means able to introduce other groups of lines into the considering means until consideration of all lines of the image 2, to generate the text region model based on at least a segmented text region 81, 86 embedded in the whole image 2.

Alternatively, the step 6 of considering the group of lines 8 is done before the filtering step 4. In such a case, the filtering step 4 is not applied to the whole image 2 but only to the considered group of lines 8. Alternatively, the first embodiment of the method is performed without the filtering step 4. In such a case, the group of lines 8 considered at step 6 comprises all the pixel lines of the image.

Advantageously, both the SVM-based detection and the CC-based detection algorithms are applied only to a limited number of lines of the image, such as 12 lines, which makes the algorithm attractive for hardware implementation. This feature is a significant improvement over state-of-the-art text region detection algorithms that require the whole image information either in detection (as is the case for the algorithms that employ multi-resolution image analysis) and/or in segmentation (as is the case for some connected-component based algorithms).

Advantageously, the algorithm uses both connected-component based features and texture-based features for text detection. This approach of using two distinct types of features results in a robust text detector.

Advantageously, a Support Vector Machine that uses a linear kernel for texture-based text detection is employed. This use speeds up the processing.

Advantageously, the step of filtering the image before applying the Support Vector Machine-based classifier excludes 80 to 98% of the pixels of the original image from the SVM-based analysis.

Advantageously, the proposed algorithm can work with both color and grey-level images.

Advantageously, the first embodiment of the method does not rely on temporal information. Hence, it is applicable to both image and video-based applications.

Advantageously, this robust text detection method is able to detect text even if the underlying hardware supports fewer data lines (smaller random access memory) than the height of the text.

Advantageously, this algorithm performs robustly without requiring large computational resources, unlike the machine-learning based algorithms.