


Title:
METHOD AND SYSTEM FOR RELIABLE REFLECTIVE OBJECT DETECTION USING DISPLAY LIGHT SCENE ILLUMINATION
Document Type and Number:
WIPO Patent Application WO/2012/175703
Kind Code:
A1
Abstract:
A method for generating control signals, particularly based on a movement or a pose of at least one viewer, comprising: detecting a plurality of images of a scene illuminated by ambient light and the light generated by the display device, wherein each of the plurality of images comprises a substantially constant amount of ambient light and a pre-determined amount of light generated by the display device, and wherein the pre-determined amount of light varies between at least two consecutive images of the plurality of images; separating display generated light from the ambient light by using known indicia of the display device generated light to extract only a portion of the illuminated scene; and generating a control signal in response to the extracted portion.

Inventors:
ZIVKOVIC ZORAN (NL)
GROOT HULZE HENDRIKUS WILLEM (NL)
Application Number:
PCT/EP2012/062131
Publication Date:
December 27, 2012
Filing Date:
June 22, 2012
Assignee:
TRIDENT MICROSYSTEMS INC (US)
ZIVKOVIC ZORAN (NL)
GROOT HULZE HENDRIKUS WILLEM (NL)
International Classes:
G06F3/01; G06F3/03
Foreign References:
US20080231564A12008-09-25
Other References:
D. COMANICU; P. MEER: "Mean shift: A robust approach toward feature space analysis", IEEE TRANS. PATTERN ANAL. MACHINE INTELL., May 2002 (2002-05-01)
M. K. HU: "Visual Pattern Recognition by Moment Invariants", IRE TRANS. INFO. THEORY, vol. IT-8, 1962, pages 179-187, XP011217262, DOI: 10.1109/TIT.1962.1057692
E. PRADOS; O. FAUGERAS: "Shape From Shading: a well-posed problem?", IEEE CONFERENCE CVPR, 2005
C. HERNANDEZ; G. VOGIATZIS; G. J. BROSTOW; B. STENGER; R. CIPOLLA, NON-RIGID PHOTOMETRIC STEREO WITH COLORED LIGHTS, 2007
Attorney, Agent or Firm:
ASSOCIATION OF REPRESENTATIVES NO. 175, EPPING HERMANN FISCHER PATENTANWALTSGESELLSCHAFT MBH (München, DE)
Claims:

1. A method for generating control signals, particularly based on a movement or a pose of at least one viewer, comprising:

- detecting a plurality of images of a scene illuminated by ambient light and the light generated by the display device, wherein each of the plurality of images comprises a substantially constant amount of ambient light and a pre-determined amount of light generated by the display device, and wherein the pre-determined amount of light varies between at least two consecutive images of the plurality of images;

- separating display generated light from the ambient light by using known indicia of the display device generated light to extract only a portion of the illuminated scene; and

- generating a control signal in response to the extracted portion.

2. The method according to claim 1, wherein the step of detecting a plurality of images comprises:

- Illuminating, using a first light generated by the display, close-by objects;

- Capturing a first image or a portion of a scene comprising the illuminated close-by objects and background objects;

- Illuminating, using a second light generated by the display, close-by objects;

- Capturing a second image or a portion of a scene comprising the illuminated close-by objects and background objects.

3. The method of claim 2, wherein illuminating comprises changing the luminosity, for a pre-determined period of time, of at least one of the following: displayed video material;

displayed picture;

displayed graphics;

back light generated by a backlight unit of the display device;

additional light source;

flashing content on an emissive display.

4. The method according to any of claims 2 to 3, wherein illuminating comprises at least one of the following:

- display light modulation by changes already present in the video material;

- display light modulation by increasing the sensitivity by "flashing light";

- display light modulation by including a reference measurement to be independent of the changes in the video material;

- display light modulation by reducing the fluctuation caused by the image on the display;

- changing camera sensor sensitivity between taking images;

- with a rolling shutter camera device, where parts of images are used and different frequencies are used for camera capture and display light modulation;

and/or wherein separating display generated light comprises at least one of:

- motion compensation of two exposures to reduce the false detection of ambient light changes because of motion;

- dual-mode operation, wherein one mode has high picture quality and the other mode high sensitivity for detecting viewer actions.

5. The method according to any of the preceding claims, wherein the step of separating display generated light comprises:

- Comparing at least two of the plurality of detected images;

- Removing background objects from the image by applying a first threshold;

- Segmenting the image to remove noisy results by detecting connected regions and removing small connected regions; wherein the image is one of the captured images or a combined image, particularly an image based on the difference of at least two captured images;

- Extracting contours of the detected segments, said contours describing a set of features adapted to distinguish a plurality of different predetermined shapes;

- Passing the extracted contours to the detector for detecting at least one of the plurality of predetermined shapes.

6. The method of any of the preceding claims, wherein the step of generating a control signal comprises:

Temporally filtering the extracted contours of the at least one detected predetermined shape.

7. The method of any of the preceding claims, further comprising at least one of the following:

selecting 2D or 3D viewing mode of the display in response to the control signal;

selecting or activating a touch-free user interface.

8. The method of any of the preceding claims wherein the plurality of predetermined shapes comprise at least one of:

a hand or portions of such hand, particularly an open hand, a pointing hand or a closed hand;

an arm;

a head;

glasses;

a predetermined marker corresponding to 3D viewing glasses;

a control device (e.g. a pointer);

a predetermined marker corresponding to the control device.

9. A processing device comprising:

a video processing system adapted to adjust an amount of light generated by a display device connected thereto;

a camera to capture at least two images of a scene under different illumination by said display, said scene comprising at least one close-by object and a background object;

an evaluation device coupled to the camera and adapted to determine at least one shape out of a plurality of predetermined shapes based on a difference between the at least two captured pictures.

10. The processing device of claim 9, wherein the video processing system is adapted to modulate the luminosity of the display device during at least a portion of a frame period, wherein said modulation is introduced in at least one of the following:

displayed video material;

displayed picture;

displayed graphics;

- back light generated by a backlight unit of the display device.

11. The processing device of claim 10, wherein the modulation of the luminosity is invisible to a viewer of the display.

12. The processing device of any of the preceding claims, wherein the evaluation device is adapted to perform:

Removing background objects from one of the captured images by applying a first threshold;

Segmenting the one of the captured images to remove noisy results by detecting connected regions and removing small connected regions;

Extracting contours of the detected segments, said contours describing a set of features adapted to distinguish a plurality of different predetermined shapes;

Passing the extracted contours to the detector for detecting at least one of the plurality of predetermined shapes.

13. The processing device of any of the preceding claims, wherein the evaluation device comprises

an edge detecting device for detecting cohesive regions based on at least one of the captured images or on a difference between at least two images captured under different illumination;

a storage for storing contours describing a set of features adapted to distinguish the plurality of different predetermined shapes.

14. The processing device of any of the preceding claims wherein the plurality of predetermined shapes comprise at least one of the following:

- a hand or portions of such hand, particularly an open hand, a pointing hand or a closed hand;

an arm;

a head;

glasses;

- a pointer;

a predetermined marker corresponding to 3D viewing glasses.

15. The processing device according to any preceding claim, wherein the evaluation device is constructed using a statistical pattern recognition classifier function, particularly the statistical pattern recognition technique AdaBoost.

Description:
Method and system for reliable reflective object detection using display light scene illumination

FIELD OF THE INVENTION

The present invention applies to the fields of user interfaces, image sensors, video processing, image analysis, display technology and display backlight control.

BACKGROUND OF THE INVENTION

Buttons on a remote control are usually used to send commands to a TV or a set-top box to adjust video and audio, change channels, etc. More natural and richer user control can be achieved using other sensors (e.g. a camera) to detect the users and the pose of certain objects used for device control (e.g. detect whether the users are wearing glasses used for viewing stereoscopic 3D data, and then automatically switch between 2D and 3D mode on displays that can show stereoscopic 3D data). Robust and reliable user/object detection using cameras is often difficult. The major reason is the complex algorithms required to detect the objects of interest in highly variable environments such as a typical living room. Furthermore, the highly variable light conditions that can be expected in the viewing environment make detection even more difficult.

DETAILED DESCRIPTION OF THE INVENTION

A solution is proposed to overcome the limitations explained above. The display is a light source. In one embodiment, the display light (and more specifically, the display backlight) and the information about the amount of light coming out of the display is used to increase the performance of a camera based object detection system by separating display generated light from other unknown (ambient) light sources.

In one embodiment, temporal modulation is introduced to the properties of the display light, e.g. intensity or other properties, such as wavelength. In one preferred embodiment, temporal modulation of light invisible to human eyes is used. The temporal modulation is detected in the camera (light sensor) signal and used to extract only the part of the signal corresponding to the display light. In this way the influence of the ambient light sources is eliminated. This "demodulated" signal (image) is then further analyzed to detect the users and/or objects and their pose, to control one or more devices. Close-by and reflective objects will provide strong reflection of the emitted display light and will be easier to detect since they are isolated from the background, which will not reflect the modulated display light as strongly.

Figure 1 illustrates an example embodiment of the whole system of the user aware display. Users observe the display. Light sensor (e.g. camera) data is used for analyzing the scene. Part of the light in the scene comes from the display itself. The camera data might be used to detect the user but also some other relevant objects, such as glasses that enable watching stereoscopic 3D, or a remote control or special device used for control using the camera. The objects that need to be detected might have parts of special reflective material to help accurate detection, or be close by, e.g. for hand gesture control. Figure 2 illustrates the basic principle. Light modulation is introduced by the display (for example by controlling the display backlight). Preferably, high-frequency light modulation invisible to human eyes is used. The camera captures images illuminated by the light. A number of camera images are used and combined to separate display generated light from the other (ambient) light sources.

Figure 3 illustrates an example of the base processing blocks and data flow. A number of camera images are used and combined to separate the display generated light from the other (ambient) light sources in the demodulation block. The result is an image (signal) that corresponds to the display generated light and where the unknown ambient light influence is removed. A scene analysis engine analyzes the demodulated data and detects users and/or other objects. The extracted information about the users and/or the objects may be used to control the TV (or the set-top box). The information about the light coming out of the display (available at the display control unit) is used to control the demodulation. The light sensor can be controlled as well, for example to synchronize its data capture time with the display illumination. The scene analysis engine can also use the information about the generated display light that depends on the displayed content; for example, the scene analysis engine can be made aware of time periods where the amount of light is low, such that detection could be difficult. Figure 4 illustrates an example of demodulation by subtracting two images. Two images are shown on the display and the camera is on top of the display, similar to Figure 1. One of the displayed images is darker and the other one brighter. As a result, the amount of light coming from the display is changing and this is captured by the camera. The close-by objects (for example, the user) and highly reflective objects in the two scenes (glasses in the top row and the object in the hand of the user in the bottom row) are clearly visible in the image on the right showing the difference between the two captured images. Information about the display light color and object reflectance properties can be used by the scene analysis engine to further improve detection of the specific objects of interest.

Extracting temporally modulated light:

The basic principle and exemplary methods to extract the temporally modulated light signal are described first. Other types of modulation are possible, but here we will focus on square pulses since they are the most common in practice.

Let I denote the amount of light observed by the camera (or other light sensor) at a certain pixel corresponding to an object in the scene. Part of the reflected light will be from the ambient light sources, I_ambient, and another part will be the reflected display light, I_display, giving the total observed amount:

I = I_ambient + I_display (1)

An example of how the different light sources are reflected is presented in Figure 2. If the amount of light coming from the display is changing, the reflected display light I_display from the object will also change in a synchronous way; see the examples in Figures 4 and 5a. The ambient light is assumed constant and the object is assumed to be still. Therefore the part corresponding to the ambient light, I_ambient, is constant. To increase the visibility of the objects in low light, some parts of the objects can be made of a special reflective material. If enough light comes from the display, such an object will be visible by the camera also in low light. The highly reflective object and the close-by object will exhibit a large change in I_display, and in this way they can be distinguished from other objects. See Figure 4 for an illustration of an implementation where the difference of the images captured at two time instances is used to extract the effect of the changing display light. The amount of light coming from the display will change due to the variation in the presented video material. It is also possible to introduce special high-frequency temporal light modulation invisible to the human eye into the displayed images. In this way the known light modulation will always be present, independent of the video material (except if the video is completely black, when no light is coming from the display). The modulation could be introduced within the displayed video material, displayed graphics, the display back light, or by using an additional light source.
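The subtraction of two captures described above and illustrated in Figure 4 can be sketched in a few lines. This is an illustrative NumPy sketch, not the patented implementation; the function name and the choice to clip negative differences to zero are our own assumptions.

```python
import numpy as np

def demodulate_pair(frame_bright, frame_dark):
    """Separate display-generated light from ambient light.

    Both frames contain the same (assumed constant) ambient term
    I_ambient, so subtracting the frame captured under low display
    output from the frame captured under high display output leaves
    approximately only the reflected display light I_display.
    """
    bright = frame_bright.astype(np.int32)
    dark = frame_dark.astype(np.int32)
    # Negative values are sensor noise; close-by reflective objects
    # remain bright in the difference image.
    return np.clip(bright - dark, 0, 255).astype(np.uint8)
```

Far-away background pixels, which reflect little display light, end up near zero in the result, which is exactly the isolation effect the method relies on.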

Increasing the sensitivity by "flashing" light:

A light sensor will have a certain dynamic range that will determine the minimum measurable reflected display light I_display relative to the ambient light I_ambient. In low light conditions the influence of the display light, I_display in Equation (1), will be significant. If there is a lot of ambient light, I_ambient will be the dominant part of Equation (1) and it might be difficult to measure the changes in I_display due to the limited precision that is used to capture the images. Assuming that the ambient light is continuous, the ratio of the display generated light with respect to the ambient light can be increased in the following way. Instead of continuously emitting light, the display light is emitted in shorter light pulses of higher amplitude. A practical realization of such a system is "flashing" the display back light, where the display backlight is controlled to emit light during short pulses with higher amplitude while the average brightness level is preserved. Therefore it will be assumed further that the "flashing" is realized by the "flashing" display backlight, but other possible implementations are not excluded. Figure 5b illustrates how shorter light pulses can be used to increase the I_display/I_ambient ratio with respect to the situation presented in Figure 5a, where the display is always emitting light. With the backlight duty cycle of 25% in Figure 5b, the ratio I_display/I_ambient during the light pulse will be 4 times larger than in Figure 5a. The camera (light sensor) exposure should be limited to the time interval of the light pulse. Longer camera exposure will reduce the improvement in the I_display/I_ambient ratio since the camera integrates the light.

Using flashing light and reference exposures to remove the influence of the ambient light:

A reference exposure when there is no light coming from the display can be used to measure I_ambient, and the ambient light level can be removed by subtracting it in Equation (1). In this way I_display is extracted from the measurement I. When using the flashing light, the reference exposure measurements can be performed between the light pulses. Figure 6b illustrates the use of pulse modulation and the reference exposure in between the light pulses. The camera is measuring during the light pulse and also between the pulses. The camera measurement between the pulses is used as the reference ambient light measurement. A longer duty cycle of the display backlight is preferred in practice because for most light sources more light can be generated by the display. In Figures 6a and 6b it is assumed that the duty cycle (period A1) is 50% of the frame period (period A1+A2). To enable a longer duty cycle of the backlight, the reference exposure period should be reduced. The same exposure period should be used for the measurement during the period when the display light is on. Figure 7 illustrates an example where the duty cycle is 75%. If the exposure periods when the light is on and off are not the same, the measured levels need to be corrected by the inverse ratio of the periods. For example, if time A2 is only 25% and A1 75%, the measured reference exposure should be multiplied by A1/A2 = 3. In this way, the areas corresponding to the two periods, I_ambient * A1 and I_ambient * A2, are made identical.
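The correction by the inverse ratio of the periods can be sketched directly. This is an assumed formulation (function and parameter names are ours), using idealized integrated sensor values rather than real camera readouts.

```python
def ambient_corrected_display_light(i_on, i_ref, a1, a2):
    """Recover the display term from a pulse and a reference exposure.

    i_on  : value integrated over the on-period A1
            (contains both I_display and I_ambient contributions)
    i_ref : value integrated over the off-period A2 (ambient only)

    When A1 != A2 the reference is scaled by A1/A2, so the ambient
    areas I_ambient*A1 and I_ambient*A2*(A1/A2) cancel exactly,
    leaving only the display light integrated over A1.
    """
    return i_on - i_ref * (a1 / a2)
```

With the 75%/25% split from Figure 7 and an ambient rate of 10 plus a display rate of 5 (arbitrary units), i_on = 15 * 0.75 = 11.25 and i_ref = 10 * 0.25 = 2.5; the subtraction leaves 5 * 0.75 = 3.75, the pure display contribution.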

Reducing the fluctuation caused by the image on the display:

If a brighter image is displayed, the measured I_display will be larger than when a darker image is on the display. Further processing, for example gesture control and people detection, might require reducing the fluctuation of I_display. This is possible since the displayed image is known. Figure 6c illustrates compensation of the fluctuation caused by the image on the display. In Figure 6b the ambient light level is removed and the extracted display light reflections for the two periods, I_display(A) and I_display(B), depend on the image on the display. The modulation is compensated by dividing I_display by the spatial average brightness of the displayed image.
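The content compensation amounts to a single normalization step. A hypothetical sketch, assuming the displayed frame is available as an array and that mean pixel value is an adequate proxy for emitted brightness:

```python
import numpy as np

def compensate_display_brightness(i_display, displayed_image):
    """Make the demodulated reflection roughly content-independent.

    The displayed image is known, so the extracted I_display is
    divided by the spatial average brightness of what was on screen,
    compensating the fluctuation caused by the video content.
    """
    avg = float(np.mean(displayed_image))
    if avg <= 0:
        raise ValueError("displayed image emits no light")
    return i_display / avg
```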

Larger distance illumination and sensor dynamic range:

Usually a limited amount of light will come from the display, so it is difficult to reliably measure the influence of the display on far away objects and objects with low reflectance. For detecting users close to the display this actually helps to distinguish them from the background.

However, for other applications the requirement could be to measure the display illumination at a larger distance. The light sensor sensitivity should be increased, and the signal to noise ratio can be increased as described above. Another problem is that the sensor analog to digital conversion (ADC) will usually have a limited dynamic range. If the scene is assumed static, then multiple measurements (exposures) can be used to extend the dynamic range.

Another approach to deal with the limited dynamic range of the light sensor is to focus on the range of signal that is of interest. For example, for making the hand gesture interface, detecting human skin is of interest. The sensor sensitivity could be set in such a way that the display light modulation reflected on the human skin does not exceed the dynamic range of the sensor. For other, brighter objects this might cause overexposure, but this is not important for the detection. The range of the skin reflection can be initialized and adapted by using some procedure for detecting skin colored regions. For example, a face detector can be used. Faces are detected and, based on the pixel information in the detected face region and the current camera settings, the dynamic range corresponding to skin can be estimated.

Synchronization:

Usually high-frequency light modulation invisible to human eyes is used. However, the light sensor does not need to measure at the same high frequencies but can operate at lower frequencies by skipping some frames. For example in Figure 6, the light sensor can measure during period A1 and then take the first reference ambient light measurement during period B2.

To be able to measure the highest amplitude of the modulated signal (see Figure 6), the light sensor needs to be synchronized with the display light illumination in some way. This could be done, for example, using a trigger signal to synchronize the light sensor and the light source. Another way is adaptive synchronization, where some automatic procedure is used that analyzes the measured signal, for example the phase-locked loop (PLL) approach.

Rolling shutter camera implementation:

A rolling shutter camera as the light sensor is an important use case because of the low cost of such cameras. In such a system each image line integrates over a different time period since the lines are reset, exposed and read out sequentially. As a result of sequentially reading the image lines, there will be image lines that integrate over transitions of the display light modulation and therefore do not measure the highest amplitude of the modulated signal. For larger exposure periods more image lines will be influenced. The lines that integrate over transitions of the display light modulation could also be used, but they will have a worse signal to noise ratio. A solution is to combine multiple captured camera images and select the image lines captured at proper time instances, to be able to measure the highest amplitude of the modulated signal for each line. To ensure that after a number of images all image lines get the highest and the lowest amplitude, the frequencies of the camera and the display light modulation should be different.

Figure 8 illustrates an example of demodulation by combining a number of images captured by a rolling shutter camera. The display backlight was emitting square pulses at 90Hz and the camera was capturing at 120Hz with an exposure time of 20%. Four captured images are shown. A white object was present close to the camera, visible at the left side of the images, to illustrate which image lines are captured when the display backlight was on (the brighter ones) and which ones when the light was off (the darker ones). The transitions are smooth due to the integration time. Using the four captured images it is possible to measure the highest amplitude of the modulated signal by selecting the proper lines to be compared. The arrows illustrate roughly which lines from which images are used to get the highest amplitude. The resulting image showing just the demodulated display light reflection is shown on the right. It can be observed that all lines are demodulated with the maximum amplitude.
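One way to "select the proper lines" from a stack of rolling-shutter frames can be sketched as follows. The selection rule below (take, per line, the brightest and darkest version across the captured frames and subtract) is our own assumption; the patent illustrates its actual selection only by the arrows in Figure 8.

```python
import numpy as np

def demodulate_rolling_shutter(frames):
    """Combine several rolling-shutter frames into one demodulated image.

    With camera and backlight running at different frequencies, each
    image line is eventually captured once fully inside a bright pulse
    and once fully inside a dark period. For each line we pick the
    frame where that line is brightest and the frame where it is
    darkest, and subtract, approximating the full modulation
    amplitude per line.
    """
    stack = np.stack([f.astype(np.float64) for f in frames])  # (F, H, W)
    line_means = stack.mean(axis=2)            # per-frame, per-line brightness
    hi = np.argmax(line_means, axis=0)         # frame index of brightest version
    lo = np.argmin(line_means, axis=0)         # frame index of darkest version
    rows = np.arange(stack.shape[1])
    return stack[hi, rows, :] - stack[lo, rows, :]
```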

Dual mode operation:

In practice, it is preferred to have the display backlight mostly turned on, for example 80%, to maximize the amount of light coming from the display. Introducing modulation requires dark periods that can reduce the amount of light coming from the display. If short dark periods are used, the system will need more images and a longer time to generate the demodulated image. The same holds for short bright periods, for example in backlight dimming applications. As a result the detection of moving objects will be more difficult. Furthermore, the detection of the user actions will be slower. A duty cycle between 20% and 80% is a typical useful range for dimming applications.

A dual mode operation can be used as a solution:

- user detection stand-by mode (video mode): when the user is not using the user interface, but is for example watching a video. Short dark periods are used to maximize the picture quality, for example a flashing backlight with an 80% duty cycle. As a result many images are used for demodulation and the user detection is slower, but the system only needs to detect that the user wants to switch to the interactive mode.

An alternative is not to increase the duty cycle but to generate a light level modulation between a high and a low level where the light is never turned off. As a result the amplitude of the modulated display light will be smaller and more difficult to detect.

- interactive mode (display menu mode): when switched to interaction mode it is assumed that the user wants to control the display and perform some selections, so the full brightness of the screen is not so important. Longer dark periods are used to increase the reaction speed of the user detection. As a result the brightness of the screen can get lower, but this can also serve as a natural indication that the display is in the interaction mode (display menu mode).

Scanning backlight implementation:

The principle can be applied in displays with a scanning backlight. In typical dimming applications the segments are refreshed per column in a scanning order locked to the video refresh rate. Every video frame can hold a different light level. Still, short dark periods can be introduced by switching off all columns at the same time for a few milliseconds. See the example in Figure 11. In practical solutions the light level is usually controlled by a PWM (pulse width modulation) signal at a higher frequency of e.g. 600Hz, as indicated in Figure 11c.

Another display backlight scanning technology is used for sequential crosstalk reduction and motion portrayal improvement. In this case the duty cycle of the PWM signals is already low (e.g. 50%) and the frequency is equal and locked to the frequency of the video (e.g. 60Hz), see Figure 10a. In this embodiment the "light on" period is split and/or shifted in time to insert a black period for all backlight columns. An example is presented in Figures 10b and 10c. This might have some consequences for the picture quality because the ideal scanning is interrupted. This embodiment can also be combined with systems that use dimming.

Frame sequential 3D TV implementation:

Combining the flashing light modulation with frame sequential 3D systems is often beneficial since they usually already have dark periods when switching between the left and right eye image to reduce the crosstalk between the two images. These dark periods can then be used. Depending on the 3D technology, the dark periods might need to be made longer. A similar combination can be used for other sequential techniques (for example, some color-sequential implementations) where dark periods are usually introduced to reduce the crosstalk between the different images that are sequentially displayed.

Further implementation remarks:

For the simple implementation illustrated in Figure 4, the assumption is that the objects are still. If the measurements are done at high frequencies, the movement between the captured frames will be small. For lower measurement frequencies the moving objects can be tracked first and aligned before they are checked for the appearance changes due to the display light changes.

In the embodiment described in Figure 2 and Figure 6, the difference between the reference exposures (A2 and B2 in Figure 6) can be used to detect the changes due to the movement. The changing areas in these frames correspond to the motion and they can be excluded from analysis. Another method is to use the reference exposures (A2 and B2) for object tracking and motion based compensation.

It is possible to apply the same principles if other properties of the emitted light are changing, e.g. color or polarization. Furthermore, infrared (IR) light is another way to provide illumination not visible to the user, in combination with an IR sensitive camera. The techniques described here can also be applied to such systems.

Scene and user analysis engine:

The proposed light modulation/demodulation method of the present invention is used to extract only the part of the signal corresponding to the known display light and in this way be robust against the influence of the unknown ambient light. The close-by and reflective objects will be strongly visible in the resulting images since they provide a strong reflection of the emitted display light. As used herein, the term "close-by" is intended to include objects within a range of zero to some meters, for example two to three meters. To distinguish between different reflective objects and/or extract their pose and other properties, further processing is needed.

It is expected that the present system and method based on demodulated images will be more robust and have much lower computation costs than a system that works on camera images directly without removing unknown ambient light effects.

Various applications based on the presented light modulation system can be realized. Here more details are given about a few such applications.

Hand detection user interface:

Detecting users and their hands is very important for realizing user interfaces. Hands usually reflect a lot of light, and it would be possible to detect them close to the display even if only a small amount of light is emitted by the display.

In this way a "touch-free" replacement for touch screen displays can be realized using a regular camera in combination with a regular display. Such a system could be used to completely replace the basic control buttons of a TV set (except the on/off button), leading to cost reduction.

An example of a processing implementation for user hand detection is described in more detail:

First. The display light will illuminate only the close-by objects, so first a threshold is applied to remove pixels that belong to the far away background. Usually this simple procedure will be good enough. In addition it is possible to make a model of the background to remove the static objects in the background, for example a long term temporal average image of the scene. The result is a segmentation where image pixels are labeled as foreground and background pixels.

Second. Further image segmentation is performed considering only the pixels labeled as foreground. Connected regions are detected, and small connected regions are removed. Connected regions are groups of interconnected foreground pixels, where "interconnected" means that at least one of the neighboring pixels is also labeled as foreground. Small connected regions are regions with fewer pixels than the object is expected to have. For example, the minimum size of a human hand at a maximum distance of 3 meters can be calculated for a given camera, and this number of pixels can be used to remove all regions with fewer pixels. Each connected region can be further split if needed, using a segmentation algorithm such as "Mean Shift" [2], as defined in D. Comanicu, P. Meer: "Mean shift: A robust approach toward feature space analysis", IEEE Trans. Pattern Anal. Machine Intell., May 2002. An example of a demodulated image and the result of the segmentation is presented in Figure 9. Two connected regions are detected, presented in the right image, and different gray values are used to mark the pixels belonging to each of the regions.
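The connected-region step above can be sketched with a breadth-first flood fill; this is a minimal illustration assuming 4-connectivity, with `min_size` standing in for the calculated minimum object size.

```python
import numpy as np
from collections import deque

def connected_regions(mask, min_size=2):
    """Label 4-connected foreground regions, discarding regions smaller
    than min_size pixels (the expected minimum object size).
    Returns an int label image: 0 = background/removed, 1..K = kept regions."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    visited = np.zeros((h, w), dtype=bool)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                # breadth-first flood fill of one candidate region
                region, q = [], deque([(sy, sx)])
                visited[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(region) >= min_size:   # keep only large enough regions
                    next_label += 1
                    for y, x in region:
                        labels[y, x] = next_label
    return labels

m = np.array([[1, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 1, 1]], dtype=bool)
lab = connected_regions(m, min_size=2)    # two regions survive
```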

Third. Contours of the detected connected segments are extracted. The contours of the segments are then described by a set of features that can be used to distinguish different shapes. In one preferred embodiment, Hu moments [1] are used, such as described in M. K. Hu, "Visual Pattern Recognition by Moment Invariants", IRE Trans. Info. Theory, vol. IT-8, pp. 179-187, 1962, and Wu, M.-F.; the disclosures of which are incorporated herein by reference.
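As an illustration of such moment features, the first two of Hu's seven invariants can be computed directly from scale-normalized central moments of a binary region; this is a sketch, not the full feature set of the embodiment.

```python
import numpy as np

def hu_moments(mask):
    """First two Hu moment invariants of a binary shape:
    phi1 = eta20 + eta02,  phi2 = (eta20 - eta02)^2 + 4*eta11^2,
    where eta_pq are scale-normalized central moments. Both are invariant
    to translation, scale and rotation of the shape."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))                  # area of the region
    cx, cy = xs.mean(), ys.mean()         # centroid (translation invariance)
    mu20 = ((xs - cx) ** 2).sum()
    mu02 = ((ys - cy) ** 2).sum()
    mu11 = ((xs - cx) * (ys - cy)).sum()
    # normalization exponent for order p+q = 2 is 1 + (p+q)/2 = 2
    eta20, eta02, eta11 = mu20 / m00**2, mu02 / m00**2, mu11 / m00**2
    return eta20 + eta02, (eta20 - eta02) ** 2 + 4.0 * eta11 ** 2

square = np.zeros((8, 8), dtype=bool); square[2:6, 2:6] = True
rect = np.zeros((8, 8), dtype=bool); rect[3:5, 1:7] = True
# rect and its transpose (a rotated/reflected copy) get identical invariants,
# while the square and the elongated rectangle are distinguishable by phi2.
```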

Fourth. The extracted features are passed to a detector for detecting specific shapes. For example, the specific shape of a reflective marker can be detected. If user hands need to be detected, detectors for, for example, an open hand shape, a pointing hand shape and a closed hand shape can be used.

Fifth. The output of the user detector is also temporally filtered. Detections that remain stable for some time are considered true detections. If the user's hands and their shape are detected, this information can be used to design a touch-free user interface. If a special marker corresponding to 3D viewing glasses is detected, this can be used to automatically select the 2D or 3D viewing mode of the display.
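The temporal filtering step can be sketched as a simple consecutive-frame counter; the `hold` parameter is an illustrative stand-in for "stable for some time".

```python
class StableDetector:
    """Temporal filter: a shape detection is reported as a true detection
    only after it has been seen in `hold` consecutive frames; any missed
    frame resets the count, suppressing spurious single-frame detections."""

    def __init__(self, hold=5):
        self.hold = hold
        self.streak = 0

    def update(self, detected_now):
        self.streak = self.streak + 1 if detected_now else 0
        return self.streak >= self.hold

# A one-frame dropout resets the filter; only a sustained run is reported.
f = StableDetector(hold=3)
results = [f.update(d) for d in [True, True, False, True, True, True, True]]
```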

The mentioned shape detector is constructed using a statistical pattern recognition classifier function. The classifier function is constructed automatically using the statistical pattern recognition technique AdaBoost. The technique constructs the function automatically from two large sets of images: one dataset containing examples of the shape that needs to be detected (for example, an open hand) and the other containing random shapes.
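A minimal AdaBoost sketch, in the spirit of the technique named above, is shown below on toy one-dimensional feature vectors. It uses single-feature threshold stumps as weak classifiers; the data, round count, and function names are illustrative assumptions, not the patent's training procedure.

```python
import numpy as np

def train_adaboost(X, y, rounds=10):
    """Minimal AdaBoost with one-feature threshold stumps.
    X: (n, d) feature vectors (e.g. contour moment features)
    y: labels in {-1, +1} (target shape vs. random shape)
    Returns a list of (feature, threshold, polarity, alpha) weak classifiers."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)               # example weights, re-focused each round
    model = []
    for _ in range(rounds):
        best = None
        for f in range(d):                # exhaustive stump search
            for t in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, f] - t) >= 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, pol, pred)
        err, f, t, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # up-weight misclassified examples
        w /= w.sum()
        model.append((f, t, pol, alpha))
    return model

def predict(model, X):
    score = sum(a * np.where(p * (X[:, f] - t) >= 0, 1, -1)
                for f, t, p, a in model)
    return np.where(score >= 0, 1, -1)

# Toy data: "target shapes" have a large first feature value.
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = train_adaboost(X, y, rounds=3)
```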

Furthermore, in addition to the camera, other sensors can also be used to improve detection and tracking.

3D glasses detection:

Detecting 3D viewing glasses for automatic switching between the 2D and 3D modes of the display is another application. The detection can be made robust to lighting conditions if the glasses are made of a highly light-reflective material. The example in Figure 4 shows that highly reflective objects are clearly visible in the demodulated images and are thus relatively easy to detect.

Graphics for the light modulation:

Instead of using the display backlight, the light modulation can also be introduced directly by the presented graphics. Currently many displays allow fast updates of the graphics (for example 240 Hz screens), which makes it possible to include the modulation directly in the presented graphics. The advantage of such a system is that there is then no need for special control of the backlight.

Another advantage is that by modulating the graphics in only a part of the screen, it is possible to generate "localized light modulation" for that part of the screen, even if the display backlight does not support this directly. Modulating the light from just a part of the screen can serve different purposes. For example, if only a "menu object" generates modulated light, it will illuminate a human hand or finger only when it gets close to the object, and in this way make the detection more robust.

3D depth reconstruction using the localized display illumination:

Photometric stereo, such as is described in E. Prados, O. Faugeras, "Shape From Shading: a well-posed problem?", IEEE Conference CVPR, 2005, and C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, R. Cipolla, "Non-rigid Photometric Stereo with Colored Lights", 2007, the disclosures of which are incorporated herein by reference [3] [4], is a technique in which 3D structure is reconstructed by analyzing how known light is reflected from objects and by analyzing the introduced shadows. For the 3D reconstruction, multiple images are usually needed of an object illuminated by known light sources from different directions. 3D reconstruction using similar processing principles can be realized using the localized display light illumination (either by the display backlight or by the graphics, as described above) to generate the set of images of an object illuminated by known light sources from different directions. A scanning backlight is an example case where display light is emitted from different parts of the screen at different times. In one preferred embodiment, an example implementation of such a system performs the following:

Generate modulated light from a small area in one corner of the display and demodulate. Repeat this for all four corners. Alternatively, this could be done simultaneously if a different modulation/demodulation is used for each corner of the display (different frequencies or phases, or different light colors).

The result is four images of the object illuminated from four different known light sources.
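Given such a set of images, the classic Lambertian photometric stereo reconstruction can be sketched as a per-pixel least-squares solve; this is an illustrative sketch under a Lambertian assumption, with the four corner light directions made up for the example.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Lambertian photometric stereo: with >= 3 images of the same object
    lit from known directions (here, the four display corners), solve
    I = L @ (rho * n) per pixel in the least-squares sense.

    images     : (K, H, W) demodulated intensity images
    light_dirs : (K, 3) unit light direction for each image
    Returns (H, W, 3) unit surface normals and an (H, W) albedo map."""
    K, H, W = images.shape
    I = images.reshape(K, -1)                            # (K, H*W)
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W) = rho * n
    albedo = np.linalg.norm(G, axis=0)
    n = np.divide(G, albedo, out=np.zeros_like(G), where=albedo > 0)
    return n.T.reshape(H, W, 3), albedo.reshape(H, W)

# Toy check: a flat surface with normal (0, 0, 1) and albedo 0.5,
# seen under four hypothetical corner light directions.
L = np.array([[0.0, 0.0, 1.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.6, 0.8],
              [-0.6, 0.0, 0.8]])
true_n = np.array([0.0, 0.0, 1.0])
imgs = (0.5 * (L @ true_n)).reshape(4, 1, 1) * np.ones((4, 2, 2))
normals, alb = photometric_stereo(imgs, L)
```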

Apply a processing technique such as [3] or [4] to estimate the 3D structure of the object illuminated by the display. Alternate implementations may also be included within the scope of the disclosure. In these alternate implementations, functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved. The foregoing description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Obvious modifications or variations are possible in light of the above teachings. The implementations discussed, however, were chosen and described to illustrate the principles of the disclosure and its practical application, to thereby enable one of ordinary skill in the art to utilize the disclosure in various implementations and with various modifications as are suited to the particular use contemplated. All such modifications and variations are within the scope of the disclosure as determined by the appended claims when interpreted in accordance with the breadth to which they are fairly and legally entitled.