

Title:
IMAGE SEGMENTATION
Document Type and Number:
WIPO Patent Application WO/2013/144418
Kind Code:
A1
Abstract:
A method and an apparatus arranged to provide an image on a display; detect a user command on the display, the user command selecting a region of the image as a foreground object; update a mask image at a location of the user command; update a classification algorithm distinguishing between the foreground object and a background region of the image; repeat steps starting from said detecting until a predefined number of user commands has been detected; and start a segmentation process of the image.

Inventors:
UGUR KEMAL (FI)
ALATAN AYDIN (TR)
GUNDOGDU ERHAN (TR)
SENER OZAN (TR)
Application Number:
PCT/FI2012/050313
Publication Date:
October 03, 2013
Filing Date:
March 29, 2012
Assignee:
NOKIA CORP (FI)
UGUR KEMAL (FI)
ALATAN AYDIN (TR)
GUNDOGDU ERHAN (TR)
SENER OZAN (TR)
International Classes:
G06T7/00; G06V10/26; G06V10/426; H04N13/02
Foreign References:
US20110216976A12011-09-08
US20110243443A12011-10-06
US20080136820A12008-06-12
US20090304280A12009-12-10
Other References:
TASLI, H. ET AL.: "Interactive 2D 3D image conversion method for mobile devices.", 3DTV CONFERENCE: THE TRUE VISION - CAPTURE, TRANSMISSION AND DISPLAY OF 3D VIDEO (3DTV-CON), 16 May 2011 (2011-05-16)
ROTHER, C. ET AL.: "GrabCut - interactive foreground extraction using iterated graph cuts.", ACM TRANSACTIONS ON GRAPHICS (TOG) - PROCEEDINGS OF ACM SIGGRAPH 2004, vol. 23, no. 3, August 2004 (2004-08-01), pages 309-314, XP002340109, DOI: 10.1145/1015706.1015720
WANG, J. ET AL.: "Soft scissors: an interactive tool for realtime high quality matting.", ACM TRANSACTIONS ON GRAPHICS (TOG) - PROCEEDINGS OF ACM SIGGRAPH 2007, vol. 26, no. 3, July 2007 (2007-07-01)
Attorney, Agent or Firm:
TAMPEREEN PATENTTITOIMISTO OY (Tampere, FI)
Claims:

1. A method comprising:

a) providing an image on a display;

b) detecting a user command on the display, the user command selecting a region of the image as a foreground object;

c) updating a mask image at a location of the user command;

d) updating a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeating steps b - d until a predefined number of user commands has been detected; and

f) starting a segmentation process of the image.

2. The method according to claim 1, the method further comprising:

g) updating the display to show the updated mask image; and

h) repeating steps b - g until detecting a user command to end the segmentation process.

3. The method according to any preceding claim, the method further comprising:

segmenting, prior to step b, the image into over-segments preserving a graph structure and spatial neighbourhood of pixel data of the image; and

carrying out the segmentation process on an over-segment level.

4. The method according to any preceding claim, the method further comprising:

providing a grayscale version of the image on the display; and

displaying the foreground object in color during the segmentation process.

5. The method according to any preceding claim, the method further comprising:

carrying out the segmentation process within a spatial neighbourhood of the foreground object.

6. The method according to any preceding claim, the method further comprising:

displaying the mask image as a touch of a brush being centered at the location of the user command and having a predefined radius.

7. The method according to claim 6, the method further comprising:

in response to a location of the user command extending beyond boundaries of the foreground object, marking the location as an exit point;

in response to a location of the user command returning within the boundaries of the foreground object, marking the location as an entrance point; and

amending the path of the mask image to be between the exit point and the entrance point along the boundary of the foreground object.

8. The method according to claim 6, the method further comprising:

adjusting the radius of the brush according to local image characteristics around the location of the user command.

9. The method according to any preceding claim, the method further comprising:

in response to detecting that no further user commands are provided on the foreground object, completing the segmentation process of the image on the basis of the information available in the updated classification algorithm.

10. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least:

a) provide an image on a display;

b) detect a user command on the display, the user command selecting a region of the image as a foreground object;

c) update a mask image at a location of the user command;

d) update a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeat steps b - d until a predefined number of user commands has been detected; and

f) start a segmentation process of the image.

11. The apparatus according to claim 10, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

g) update the display to show the updated mask image; and

h) repeat steps b - g until detecting a user command to end the segmentation process.

12. The apparatus according to claim 10 or 11, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

segment, prior to step b, the image into over-segments preserving a graph structure and spatial neighbourhood of pixel data of the image; and

carry out the segmentation process on an over-segment level.

13. The apparatus according to any of the claims 10 - 12, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

provide a grayscale version of the image on the display; and

display the foreground object in color during the segmentation process.

14. The apparatus according to any of the claims 10 - 13, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

carry out the segmentation process within a spatial neighbourhood of the foreground object.

15. The apparatus according to any of the claims 10 - 14, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

display the mask image as a touch of a brush being centered at the location of the user command and having a predefined radius.

16. The apparatus according to claim 15, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

mark, in response to a location of the user command extending beyond boundaries of the foreground object, the location as an exit point;

mark, in response to a location of the user command returning within the boundaries of the foreground object, the location as an entrance point; and

amend the path of the mask image to be between the exit point and the entrance point along the boundary of the foreground object.

17. The apparatus according to claim 15, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

adjust the radius of the brush according to local image characteristics around the location of the user command.

18. The apparatus according to any of the claims 10 - 17, comprising computer program code configured to, with the at least one processor, cause the apparatus further to:

complete, in response to detecting that no further user commands are provided on the foreground object, the segmentation process of the image on the basis of the information available in the updated classification algorithm.

19. A computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

a) providing an image on a display;

b) detecting a user command on the display, the user command selecting a region of the image as a foreground object;

c) updating a mask image at a location of the user command;

d) updating a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeating steps b - d until a predefined number of user commands has been detected; and

f) starting a segmentation process of the image.

20. The computer readable storage medium according to claim 19, causing the apparatus to further perform:

g) updating the display to show the updated mask image; and

h) repeating steps b - g until detecting a user command to end the segmentation process.

21. The computer readable storage medium according to claim 19 or 20, causing the apparatus to further perform:

segmenting, prior to step b, the image into over-segments preserving a graph structure and spatial neighbourhood of pixel data of the image; and

carrying out the segmentation process on an over-segment level.

22. The computer readable storage medium according to any of the claims 19 - 21 , causing the apparatus to further perform:

providing a grayscale version of the image on the display; and

displaying the foreground object in color during the segmentation process.

23. The computer readable storage medium according to any of the claims 19 - 22, causing the apparatus to further perform:

carrying out the segmentation process within a spatial neighbourhood of the foreground object.

24. The computer readable storage medium according to any of the claims 19 - 23, causing the apparatus to further perform:

displaying the mask image as a touch of a brush being centered at the location of the user command and having a predefined radius.

25. The computer readable storage medium according to claim 24, causing the apparatus to further perform:

in response to a location of the user command extending beyond boundaries of the foreground object, marking the location as an exit point;

in response to a location of the user command returning within the boundaries of the foreground object, marking the location as an entrance point; and

amending the path of the mask image to be between the exit point and the entrance point along the boundary of the foreground object.

26. The computer readable storage medium according to claim 24, causing the apparatus to further perform:

adjusting the radius of the brush according to local image characteristics around the location of the user command.

27. The computer readable storage medium according to any of the claims 19 - 26, causing the apparatus to further perform:

in response to detecting that no further user commands are provided on the foreground object, completing the segmentation process of the image on the basis of the information available in the updated classification algorithm.

Description:
IMAGE SEGMENTATION

Field of the invention

The present invention relates to image processing, and more particularly to a process of image segmentation.

Background of the invention

Stereoscopy and displaying three-dimensional (3D) content have been a major research area for decades. Consequently, various stereoscopic displays of different sizes have been proposed and implemented. Large 3D displays have already gained popularity, and mobile 3D displays are expected to become popular soon. Although mobile 3D displays are ready for the market, there is very little 3D content available for them. While waiting for more 3D content to become available, conversion of 2D content to 3D content may partially solve the problem in the intermediate period.

It is, nevertheless, generally known that fully automatic 2D/3D conversion of image and video content is difficult and computationally costly. A common approach in 2D/3D conversion of image and video content is to segment the visual content into semantically meaningful regions, typically into foreground and background. While fully automatic image segmentation has turned out to be difficult to implement, it has been proposed to utilize user interaction through capable devices in order to improve the overall performance. The main types of user interaction used for interactive image segmentation are drawing scribbles on foreground objects, possibly also on background objects, and drawing a rectangle or an approximate contour around the object of interest. However, many of these methods are poorly applicable to small-sized touch screens. A further problem is that they typically require complete user interaction before the segmentation process, such as a graph cut, can be started. On the other hand, the methods unrealistically assume perfect user input when drawing around the object of interest. As a result, the segmentation process easily becomes slow, cumbersome to use and vulnerable to user interaction errors.

Summary of the invention

Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are at least alleviated. Various aspects of the invention include a method, an apparatus and a computer program, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, a method according to the invention is based on the idea of:

a) providing an image on a display;

b) detecting a user command on the display, the user command selecting a region of the image as a foreground object;

c) updating a mask image at a location of the user command;

d) updating a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeating steps b - d until a predefined number of user commands has been detected; and

f) starting a segmentation process of the image.

According to an embodiment, the method further comprises:

g) updating the display to show the updated mask image; and

h) repeating steps b - g until detecting a user command to end the segmentation process.

According to an embodiment, the method further comprises: segmenting, prior to step b, the image into over-segments preserving a graph structure and spatial neighbourhood of pixel data of the image; and carrying out the segmentation process on an over-segment level.

According to an embodiment, the method further comprises: providing a grayscale version of the image on the display; and displaying the foreground object in color during the segmentation process.

According to an embodiment, the segmentation process is carried out within a spatial neighbourhood of the foreground object.

According to an embodiment, the method further comprises: displaying the mask image as a touch of a brush being centered at the location of the user command and having a predefined radius.

According to an embodiment, the method further comprises: in response to a location of the user command extending beyond boundaries of the foreground object, marking the location as an exit point; in response to a location of the user command returning within the boundaries of the foreground object, marking the location as an entrance point; and amending the path of the mask image to be between the exit point and the entrance point along the boundary of the foreground object.

According to an embodiment, the method further comprises: adjusting the radius of the brush according to local image characteristics around the location of the user command.

According to an embodiment, the method further comprises: in response to detecting that no further user commands are provided on the foreground object, completing the segmentation process of the image on the basis of the information available in the updated classification algorithm.

According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least:

a) provide an image on a display;

b) detect a user command on the display, the user command selecting a region of the image as a foreground object;

c) update a mask image at a location of the user command;

d) update a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeat steps b - d until a predefined number of user commands has been detected; and

f) start a segmentation process of the image.

According to a third aspect, there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

a) providing an image on a display;

b) detecting a user command on the display, the user command selecting a region of the image as a foreground object;

c) updating a mask image at a location of the user command;

d) updating a classification algorithm distinguishing between the foreground object and a background region of the image;

e) repeating steps b - d until a predefined number of user commands has been detected; and

f) starting a segmentation process of the image.

These and other aspects of the invention and the embodiments related thereto will become apparent in view of the detailed disclosure of the embodiments further below.

List of drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Figs. 1a and 1b show a system and devices suitable to be used in image segmentation according to an embodiment;

Fig. 2 shows a flow chart of an image segmentation method according to an embodiment of the invention;

Figs. 3a, 3b and 3c illustrate an example of user interaction according to an embodiment of the invention;

Fig. 4 illustrates an example of user interaction according to another embodiment of the invention;

Fig. 5 illustrates an example of user interaction according to a further embodiment of the invention; and

Figs. 6a and 6b illustrate the propagation of the user interaction from a first frame of a video sequence to a second frame.

Description of embodiments

Figs. 1a and 1b show a system and devices suitable to be used in image segmentation according to an embodiment. In Fig. 1a, the different devices may be connected via a fixed network 210 such as the Internet or a local area network, or a mobile communication network 220 such as the Global System for Mobile communications (GSM) network, 3rd Generation (3G) network, 3.5th Generation (3.5G) network, 4th Generation (4G) network, Wireless Local Area Network (WLAN), Bluetooth®, or other contemporary and future networks. Different networks are connected to each other by means of a communication interface 280. The networks comprise network elements such as routers and switches to handle data, and communication interfaces such as the base stations 230 and 231 in order to provide the different devices with access to the network, and the base stations 230, 231 are themselves connected to the mobile network 220 via a fixed connection 276 or a wireless connection 277.

There may be a number of servers connected to the network; in the example of Fig. 1a, servers 240, 241 and 242 are shown, each connected to the mobile network 220. Some of the above devices, for example the computers 240, 241, 242, may be arranged to connect to the Internet via the communication elements residing in the fixed network 210.

There are also a number of end-user devices such as mobile phones and smart phones 251, Internet access devices, for example Internet tablet computers 250, personal computers 260 of various sizes and formats, televisions and other viewing devices 261, video decoders and players 262, as well as video cameras 263 and other encoders. These devices 250, 251, 260, 261, 262 and 263 can also be made of multiple parts. The various devices may be connected to the networks 210 and 220 via communication connections such as a fixed connection 270, 271, 272 and 280 to the Internet, a wireless connection 273 to the Internet 210, a fixed connection 275 to the mobile network 220, and a wireless connection 278, 279 and 282 to the mobile network 220. The connections 271-282 are implemented by means of communication interfaces at the respective ends of the communication connection.

Fig. 1b shows devices for image segmentation according to an example embodiment. As shown in Fig. 1b, the server 240 contains memory 245, one or more processors 246, 247, and computer program code 248 residing in the memory 245. The different servers 241, 242, 290 may contain at least these elements for employing functionality relevant to each server. Similarly, the end-user device 251 contains memory 252, at least one processor 253 and 256, and computer program code 254 residing in the memory 252 for implementing, for example, the image segmentation process. The end-user device may also have one or more cameras 255 and 259 for capturing image data, for example stereo video. The end-user device may also contain one, two or more microphones 257 and 258 for capturing sound.

The end-user devices may also comprise a screen for viewing single-view, stereoscopic (2-view), or multiview (more-than-2-view) images. The end-user devices may also be connected to video glasses 290 e.g. by means of a communication block 293 able to receive and/or transmit information. The glasses may contain separate eye elements 291 and 292 for the left and right eye. These eye elements may either show a picture for viewing, or they may comprise a shutter functionality e.g. to block every other picture in an alternating manner to provide the two views of a three-dimensional picture to the eyes, or they may comprise orthogonal polarization filters (compared to each other), which, when combined with similar polarization realized on the screen, provide the separate views to the eyes. Other arrangements for video glasses may also be used to provide stereoscopic viewing capability. Stereoscopic or multiview screens may also be autostereoscopic, i.e. the screen may comprise or may be overlaid by an optics arrangement which results in a different view being perceived by each eye. Single-view, stereoscopic, and multiview screens may also be operationally connected to viewer tracking in such a manner that the displayed views depend on the viewer's position, distance, and/or direction of gaze relative to the screen.

In addition to applications in converting 2D images to 3D images, the different embodiments could be used in different applications, such as image editing.

It needs to be understood that different embodiments allow different parts to be carried out in different elements. For example, various processes of image segmentation may be carried out in one or more processing devices; for example, entirely in one user device like 250, 251 or 260, or in one server device 240, 241, 242 or 290, or across multiple user devices 250, 251, 260 or across multiple network devices 240, 241, 242, 290, or across both user devices 250, 251, 260 and network devices 240, 241, 242, 290. The elements of the image segmentation process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.

An embodiment relates to a method for interactive, dynamic and real-time image segmentation usable in data processing devices, especially in touch screen devices, which enables a foreground object to be selected effectively from the background. In the proposed method, a user of the touch screen device is prompted to select the foreground object by providing scribbles on the desired object through the touch screen. The method is based on an algorithm which uses pixel data located in the neighbourhood of the input scribble and in the rest of the image, and performs a segmentation to separate the foreground object from the background. The algorithm iteratively updates the model parameters used in the segmentation as new stroke information is provided by the user, and updates the segmented object on the screen. The segmentation method may utilize iterative and dynamic usage of a color Gaussian Mixture Model (GMM) and graph cut.

The method according to the embodiment is illustrated in Figure 2. Representing an image with a limited number of pixel groups rather than with individual pixels, thus significantly decreasing the number of computation nodes within the image as well as the computational complexity, is generally called over-segmentation. Over-segments, also referred to as super pixels or turbo pixels in certain processes, are created by grouping similarly colored pixels via merging.

According to an embodiment, in order to decrease the computational burden, the algorithm underlying the method is executed at the over-segmentation level. Therefore, the method is started with segmenting (200) an image into over-segments such that the graph structure and the spatial neighbourhood of the pixel data of the image are preserved. A computationally efficient over-segmentation method is disclosed in "Efficient graph-based image segmentation via speeded-up turbo pixels", by C. Cigla and A.A. Alatan; In Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3013-3016, Sept. 2010. Therein, some pixels in originally rectangular over-segments are assigned to new convex-shaped segments (super pixels) by minimizing a cost function, which ensures that similarly colored pixels are merged and, at the same time, gives the super pixels convex shapes by preventing distant pixels from being merged. The super pixels still remain at least quasi-uniformly shaped, which is important for the computational efficiency. It is, however, noted that the implementation of the method is not limited to the above-described over-segmentation method, but any other over-segmentation method, such as methods based on mean-shift and k-means algorithms, can also be used. For example, another over-segmentation method that is suitable for this purpose is described by R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, in "SLIC Superpixels", EPFL Technical Report no. 149300, June 2010. It is further noted that the overall method is applicable on the pixel level as well; i.e. no over-segmentation needs to be applied, if the increased computational burden is tolerable.
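
As an illustration of this over-segmentation step, the following minimal sketch builds super pixels and their adjacency graph with the SLIC method mentioned above, using the scikit-image implementation. The parameter values and the function name over_segment are only illustrative assumptions, and the turbo-pixel method of the cited reference could be substituted.

# A minimal over-segmentation sketch (illustrative, not the cited turbo-pixel
# implementation): SLIC super pixels plus their spatial adjacency graph.
import numpy as np
from skimage.segmentation import slic
from skimage.util import img_as_float

def over_segment(image, n_segments=400, compactness=10.0):
    """Group pixels into roughly uniform super pixels and collect the edges
    between spatially neighbouring super pixels (the over-segment graph)."""
    labels = slic(img_as_float(image), n_segments=n_segments,
                  compactness=compactness, start_label=0)
    # Horizontally and vertically adjacent pixels with different labels
    # define an edge between the corresponding super pixels.
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.vstack([h, v])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    edges = set(map(tuple, np.sort(pairs, axis=1)))
    return labels, edges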

Provided that over-segmentation is applied, an over-segmentation graph is created in which every node represents an over-segment (i.e. a super pixel) and there is an edge link between each pair of connected over-segments. Thereafter, in order to allow the user to select a foreground object, the user interaction is started (202) by detecting a user command on the user interface. The user command may be, for example, a tap by a finger or a stylus on a touch screen display, or a click by a mouse or other pointing means on the display area, if a conventional display without a touch input feature is used. The algorithm is arranged to interpret the user command, such as a touch of the user, on a point of the screen as a touch of a brush being centered on said point of the screen and having a predefined radius. The interaction point, i.e. the circle-shaped region touched by the brush, is stored in a mask image called a brush map. The brush map is displayed on the original image, and the user interaction is preferably indicated on the display by changing the color of the interaction point. As the user continues to interact with the screen by brushing scribbles on the foreground object, the brush map is updated (204) on the screen at each interaction point. Similarly, a color GMM is updated (206) at each interaction point.
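
The brush-map update (204) and the colour GMM update (206) could be sketched as follows. This is only an assumed realisation in which the GMM is simply refitted to all brushed pixels; the names and the brush radius value are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

BRUSH_RADIUS = 15  # predefined brush radius in pixels (illustrative value)

def update_brush_map(brush_map, x, y, radius=BRUSH_RADIUS):
    """Mark a circular brush touch centred at the interaction point (x, y)."""
    h, w = brush_map.shape
    ys, xs = np.ogrid[:h, :w]
    brush_map[(xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2] = 1
    return brush_map

def update_foreground_gmm(image, brush_map, n_components=5):
    """Refit the foreground colour GMM from all pixels brushed so far."""
    fg_pixels = image[brush_map.astype(bool)]
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    return gmm.fit(fg_pixels)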

In order to achieve dynamic and real-time image segmentation, the segmentation process should be started as soon as possible. While the user continues the interaction by providing further user commands (208), the number of interaction points is continuously updated. If the number of interaction points exceeds a predefined threshold (210), the segmentation process (212) is also started. The value of the predefined threshold may vary depending on the properties of the device or the image to be segmented, for example. Generally, a value between 5 and 20 pixels/super pixels, such as 10 pixels/super pixels, can be used as the value of the predefined threshold, but naturally values beyond this range are applicable, as well.
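
The control flow of steps b - f, i.e. counting the interaction points and triggering the segmentation once the predefined threshold is exceeded, might then look as follows. The state object, the threshold value and start_segmentation are illustrative assumptions; the helper functions are the ones sketched above.

START_THRESHOLD = 10  # predefined number of interaction points (example value)

class InteractionState:
    def __init__(self, image, brush_map):
        self.image = image          # colour image being segmented
        self.brush_map = brush_map  # zero-initialised mask of the same size
        self.fg_gmm = None
        self.n_points = 0
        self.segmenting = False

def on_user_command(state, x, y):
    """Handle one interaction point (steps b-d) and start the segmentation
    process (step f) once enough points have been detected (step e)."""
    state.brush_map = update_brush_map(state.brush_map, x, y)
    state.fg_gmm = update_foreground_gmm(state.image, state.brush_map)
    state.n_points += 1
    if not state.segmenting and state.n_points >= START_THRESHOLD:
        state.segmenting = True
        start_segmentation(state)  # e.g. the iterative graph cut of step 212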

The segmentation may be carried out according to the so-called "Grabcut" method, which is disclosed more in detail in "Grabcut: Interactive foreground extraction using iterated graph cuts" by C. Rother, V. Kolmogorov, and A. Blake; ACM Transactions on Graphics, 23:309-314, 2004. Therein, a so-called Gibbs energy function E is defined so that its minimum should correspond to a good segmentation, in the sense that it is guided both by the observed foreground and background color GMMs, similarity of pixel-wise GMM components and that the opacity is coherent, reflecting a tendency to solidity of objects. The Grabcut process is carried out iteratively, each iteration step minimising the total energy E with respect to the variables describing pixel-wise GMM components and the opacity of either a foreground or a background object, in turn. Hence E decreases monotonically, thus guaranteeing convergence at least to a local minimum of E. When E ceases to decrease significantly, the iteration can be terminated automatically.
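
OpenCV ships a generic implementation of the GrabCut method cited above; one possible way to initialise it from the brush map, rather than from a rectangle, is sketched below. This is a plain mask-initialised GrabCut call, not the modified segmentation described in this application, and the label choices are assumptions.

import cv2
import numpy as np

def grabcut_from_brush_map(image_bgr, brush_map, iterations=5):
    """Run GrabCut initialised from the brush map: brushed pixels are marked
    as definite foreground, everything else as probable background."""
    mask = np.full(image_bgr.shape[:2], cv2.GC_PR_BGD, dtype=np.uint8)
    mask[brush_map > 0] = cv2.GC_FGD
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_MASK)
    # Foreground = pixels labelled definite or probable foreground after the cut.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)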

One of the benefits of applying the Grabcut method in the algorithm according to the embodiment is that the Grabcut method can be started with a very small amount of user interaction. Further, although the entire image is segmented at each iteration step, only the selected foreground object is masked with the brush map and only the foreground region around the brushed area is shown to the user. Accordingly, the segmentation may discard the background part of the brushed foreground object and focus only on finding the intersection of the brushed region and the object of interest, thus further simplifying the segmentation process.

From the usability point of view, it may be beneficial that the user may stop brushing the scribbles on the foreground object at any point, while the algorithm automatically continues the segmentation on the basis of the user interaction provided so far. This derives from the fact that only a minimum amount of user interaction is required to start the segmentation. Thus, when the user feels that he/she has provided enough interaction for the completion of a correct brush map, he/she can simply stop providing user commands, for example by removing the finger from the touch screen, which is detected (208) by the algorithm. In response to the user ceasing to provide user commands, the algorithm proceeds to complete (214) the on-going iteration step and discard the brush map, whereafter the result of the segmentation as carried out so far is shown (216) on the display. If the user is not satisfied with the resulting segmentation, he/she may continue the interaction by providing further user commands, for example by touching the screen again.

Even though many of the implementation details described above are selected due to their performance efficiency, it is noted that a plurality of alternatives are available for them. For example, instead of using a color GMM based background and foreground model, a spatio-color model can be used. Instead of the GMM, a color histogram or an intensity histogram can be used to model the foreground and the background. In fact, any clustering or classification algorithm can be used to define a model which distinguishes foreground and background.
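
As an example of such an alternative, a coarse colour-histogram model could replace the GMM; the sketch below is only an assumed illustration with an arbitrary bin count.

import numpy as np

def fit_color_histogram(fg_pixels, bins=16):
    """A colour-histogram foreground model: normalised counts over a coarse
    RGB grid, usable instead of the colour GMM."""
    hist, _ = np.histogramdd(fg_pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist / max(hist.sum(), 1.0)

def histogram_likelihood(hist, pixels, bins=16):
    """Look up the foreground likelihood of the given pixels in the model."""
    idx = np.clip(pixels // (256 // bins), 0, bins - 1).astype(int)
    return hist[idx[:, 0], idx[:, 1], idx[:, 2]]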

Similarly, instead of using the likelihood of the foreground and background GMMs for finding the edge costs of the graph, some other method can also be used, such as determining the distance to a mean color. Regarding the segmentation, instead of the Grabcut method, a method based on alpha/beta swapping may also be used to segment the image into background and foreground. Any other energy minimization method, such as simulated annealing or belief propagation, may also be used to minimize the defined cost function.

According to an embodiment, the user interaction with the image on the display may be arranged to simulate a coloring gesture to make the interaction entertaining and intuitive. Initially, a grayscale version of the image is provided on the display and the user starts to color an object of interest within the image. The result of the segmentation is shown to the user by coloring the foreground object; i.e. the pixels belonging to the foreground object are shown in color, whereas background pixels are shown in grayscale, as shown in Figure 3a. Using this feedback from the device, the user continues drawing scribbles and colors the remaining areas of the foreground object, as shown in Figures 3b and 3c.
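
The colouring-style feedback described above could be composed, for example, as follows; a simple sketch in which the segmented foreground mask is assumed to be available.

import cv2

def compose_coloring_view(image_bgr, fg_mask):
    """Show the segmented foreground in colour over a grayscale background,
    imitating the colouring-gesture feedback of Figures 3a-3c."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    view = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    fg = fg_mask.astype(bool)
    view[fg] = image_bgr[fg]
    return view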

While coloring the object, incautious users may easily brush over the boundaries of the object. It is also difficult to select small regions, especially when using a touch screen with finger interaction. According to an embodiment, when a user starts to interact with an image, a model is learned throughout the interaction. When the interaction continues and a new interaction point is not consistent with this model, thus suggesting that the point is outside the boundaries of the object of interest, that point is recorded as an exit point. Then, if a later interaction point provided by the user enters back into a region that is consistent with the previous model, that point is recorded as an entrance point. A correct path between the exit and the entrance points is then determined by local search.

This is illustrated in Figure 4, where the user has started drawing scribbles on the foreground object along path 400. As the user interaction continues, the scribbles are compared to the underlying object of interest, and a model of the presumed intention of the user is created on the basis of this comparison. At point 402 the user, perhaps accidentally, draws scribbles outside the boundaries of the object of interest. This is not consistent with the created model, and the point 402 is defined as an exit point. After making a turn, the user draws scribbles towards the foreground object and re-enters the boundaries of the object of interest at point 404. Thus, point 404 is defined as an entrance point. The algorithm then defines a correct path between the exit point 402 and the entrance point 404 along the boundary of the object of interest, and inserts the correct path in the brush map. Later along the path of scribbles, a similar exit occurs at point 406 and an entrance at point 408. Again, a correct path is defined between the exit point 406 and the entrance point 408 along the boundary of the object of interest, and the correct path is inserted in the brush map.
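
One possible way to detect exit and entrance points is to test each new interaction point against the learned foreground model, for example by thresholding its likelihood under the colour GMM. The sketch below makes that assumption, with an illustrative threshold, and leaves out the subsequent boundary-following local search.

LIKELIHOOD_THRESHOLD = -12.0  # illustrative consistency threshold (log-likelihood)

def classify_interaction_point(fg_gmm, image, x, y, was_inside):
    """Label an interaction point as an exit or entrance point depending on
    whether its colour is consistent with the learned foreground model."""
    color = image[y, x].reshape(1, -1).astype(float)
    is_inside = fg_gmm.score_samples(color)[0] > LIKELIHOOD_THRESHOLD
    if was_inside and not is_inside:
        return 'exit', is_inside       # e.g. points 402 and 406 in Figure 4
    if not was_inside and is_inside:
        return 'entrance', is_inside   # e.g. points 404 and 408 in Figure 4
    return None, is_inside

Between a detected exit point and the following entrance point, the brush map would then be amended along the object boundary found by the local search, as described above.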

According to an embodiment, in order to increase the accuracy of the segmentation result and the quality of the user interaction while coloring the object, only a small region around the interaction point is colored. Thus, stroke neighbourhoods (i.e. the brush diameter around the interaction point) are limited to avoid undesired foreground segmentation in unconnected regions.

According to an embodiment, in order to improve the user experience, the brush size around the user interaction point, i.e. the input scribble, is automatically adjusted based on the local image characteristics around the input scribble. Based on the obtained local texture, a thick brush size is assigned to homogeneous regions and a thin brush size to highly textured regions.
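
A simple realisation of this adaptive brush size could map the local intensity variation to a radius, thick for homogeneous patches and thin for textured ones; the window size, radius range and scaling below are purely illustrative assumptions.

import numpy as np

def adaptive_brush_radius(gray_image, x, y, r_min=4, r_max=20, window=15):
    """Choose a thick brush in homogeneous regions and a thin brush in highly
    textured regions, based on the local intensity variation."""
    h, w = gray_image.shape
    half = window // 2
    patch = gray_image[max(0, y - half):min(h, y + half + 1),
                       max(0, x - half):min(w, x + half + 1)].astype(float)
    texture = patch.std() / 255.0           # 0 = flat, larger = more textured
    weight = min(texture * 4.0, 1.0)        # arbitrary scaling to [0, 1]
    return int(round(r_max - (r_max - r_min) * weight))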

This is illustrated in Figure 5, where it is assumed that the image region near the shoulder of the human figure has high texture, so the scribbles in that region are drawn with a narrow brush. When drawing the scribbles towards the chest of the human figure, which is assumed to be a more homogeneous region, the brush is gradually made thicker.

If a touch-screen device is used in the method, it may become cumbersome to select foreground and background regions separately because of the limitations of the touch-screen device, such as the lack of a right mouse button. According to an embodiment, only the foreground is selected by the user, whereas the background model is estimated indirectly from the foreground model as the model of the regions which are farthest from the foreground model.

The above method is applicable to interactive video segmentation, as well. Therein, the interaction of the user in the first frame may be utilised in the video segmentation such that the interaction is propagated through the subsequent frames. Accordingly, the user interaction is started by drawing scribbles on a foreground object in the first frame. The user interaction is continued in the first frame, for example as described in Figure 3, whereafter the algorithm as described above solves the minimization problem and shows the segmented boundary on the screen.

Since the segmentation algorithm uses the over-segmented regions (super pixels), the previous interaction result is utilized in terms of those super pixels. Once the system has obtained an initial interaction from the first frame, the foreground and background samples of the first frame are assumed to be known. Starting from the second frame, the over-segmented regions are used to estimate the super pixels which are close to the previously selected super pixels in terms of a similarity metric.

The propagation of the interaction result through the video sequence to segment the subsequent frames is accomplished by using a search algorithm on the super pixels, without using any motion estimation algorithm.
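
A minimal sketch of such a similarity search is given below: super pixels of the next frame are selected when their mean colour is close enough to that of a previously selected super pixel. The distance threshold, the function names, and the use of mean colour as the similarity metric are assumptions made for illustration only.

import numpy as np

def propagate_selection(prev_image, prev_labels, prev_fg_ids,
                        next_image, next_labels, max_distance=30.0):
    """Propagate selected super pixels to the next frame by a simple
    similarity search on mean colours, without motion estimation."""
    def mean_colors(image, labels):
        ids = np.unique(labels)
        means = np.array([image[labels == i].mean(axis=0) for i in ids])
        return ids, means

    prev_ids, prev_means = mean_colors(prev_image, prev_labels)
    next_ids, next_means = mean_colors(next_image, next_labels)
    fg_means = prev_means[np.isin(prev_ids, list(prev_fg_ids))]

    selected = []
    for sp_id, mean in zip(next_ids, next_means):
        # Keep a super pixel if it is close enough to any previously selected
        # super pixel in the (assumed) colour similarity metric.
        if np.linalg.norm(fg_means - mean, axis=1).min() <= max_distance:
            selected.append(int(sp_id))
    return selected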

Segmentation of a foreground region in a video sequence is based on a color model obtained from the interaction result of the previous frames. Color models are extracted from the interaction with an initialization taken from the previous frames. Such an approach helps to make the color model accurate and fast to acquire.

After finding one or more regions in the current frame to be the interaction result, the regions are modeled with a Gaussian Mixture Model. Except for the first frame, the previous GMM information is provided as the initialization of the Expectation Maximization (EM) algorithm, which is carried out to find the GMM for the given interaction. By performing such an initialization, convergence becomes faster and accuracy increases. The calculated Gaussian Mixture Model is used in constructing the graph as in the still image case. Finally, the segmentation is performed using a graph cut algorithm.
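
The warm start of the EM fit described above can be illustrated with scikit-learn's GaussianMixture, which accepts the previous frame's parameters as its initialisation; this is only a sketch of the idea, with an assumed component count.

from sklearn.mixture import GaussianMixture

def fit_gmm_with_previous(fg_pixels, prev_gmm=None, n_components=5):
    """Fit the foreground colour GMM; from the second frame onwards the
    previous frame's parameters initialise the EM algorithm, speeding up
    convergence as described above."""
    if prev_gmm is None:
        gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    else:
        gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                              weights_init=prev_gmm.weights_,
                              means_init=prev_gmm.means_,
                              precisions_init=prev_gmm.precisions_)
    return gmm.fit(fg_pixels)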

Figures 6a and 6b illustrate the propagation of the user interaction from a first frame to a second frame. In Figure 6a, representing the first frame, the user draws a scribble on the human figure considered a foreground object. The super pixels of the scribble of Figure 6a are used as a basis for estimating the super pixels in the next frame (Figure 6b), which are sufficiently similar to the super pixels selected in the first frame. The resulting super pixels are shown as a lighter color scribble in Figure 6b. Then the super pixels of this scribble are used as a basis for estimating the super pixels in the next frame, and so on. Thus, no computationally heavy motion estimation is used for segmenting the foreground objects, but only a light-weight similarity search is carried out for the super pixels.

A skilled person appreciates that any of the embodiments described above may be implemented as a combination with one or more of the other embodiments, unless it is explicitly or implicitly stated that certain embodiments are only alternatives to each other.

The various embodiments may provide advantages over the state of the art. With a rather minimal amount of user interaction, an accurate and pleasant-looking segmentation and 3D perception may be achieved. The various embodiments provide real-time image segmentation which is remarkably robust to interaction errors. From the usability point of view, the overall process is intuitive and entertaining for the user. Furthermore, the process is adaptive to complicated textures of foreground objects.

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment. Yet further, a network device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The various devices may be or may comprise encoders, decoders and transcoders, packetizers and depacketizers, and transmitters and receivers.

It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.