Title:
SYSTEM AND METHOD FOR HUMAN COMPUTER INTERACTION
Document Type and Number:
WIPO Patent Application WO/2017/157433
Kind Code:
A1
Abstract:
A method for detecting a change in the position of an entity relative to its surroundings, wherein the method uses at least two smart devices, each smart device having one camera, the cameras of the smart devices being set up to obtain a respective picture frame of an entity present in the field of view of the cameras and the smart devices being connected to each other via a wireless network and/or a direct communication link; and the method comprises the steps of capturing picture frames of the entity with each camera of the at least two smart devices, and processing the captured picture frames to obtain information about the change in the position of the entity.

Inventors:
HUI PAN (CN)
TZANETOS IOANNA (CN)
PEYLO CHRISTOPH (DE)
Application Number:
PCT/EP2016/055707
Publication Date:
September 21, 2017
Filing Date:
March 16, 2016
Assignee:
DEUTSCHE TELEKOM AG (DE)
International Classes:
G06F3/01; G06F3/03; G06K9/00
Foreign References:
US20150241987A1 (2015-08-27)
KR20130070034A (2013-06-27)
Attorney, Agent or Firm:
VOSSIUS & PARTNER (DE)
Claims:
Claims

1. A method for detecting a change in the position of an entity relative to its surroundings, wherein the method uses at least two smart devices, each smart device having one camera, the cameras of the smart devices being set up to obtain a respective picture frame of an entity present in the field of view of the cameras and the smart devices being connected to each other via a wireless network and/or a direct communication link; and the method comprises the steps of

a) capturing picture frames of the entity with each camera of the at least two smart devices, and

b) processing the captured picture frames to obtain information about the change in the position of the entity.

2. A method according to claim 1, wherein the method further comprises a step of setting up the cameras of the smart devices such that the respective picture frames are in a predetermined spatial relationship to each other,

which preferably comprises placing the smart devices relative to a surface, preferably in a close position to each other, more preferably in a parallel position to each other, and which more preferably comprises placing the smart devices facing downwards with the respective camera facing upwards on a stable surface in a parallel position to each other.

3. A method according to claim 1 or 2, wherein the method further comprises the step of setting up the cameras of the smart devices to capture the picture frames, wherein preferably the cameras have at least one of the following characteristics: a same resolution, a same rate and a same shutter speed.

4. The method according to any one of claims 1 to 3, wherein the method further comprises a step of calibrating each of the cameras of the smart devices to establish a predetermined spatial relationship between the three dimensional world and the two dimensional image plane; and/or training a motion detection algorithm and/or motion tracking algorithm.

5. The method according to any one of claims 1 to 4, wherein the two smart devices are connected to a network and the network comprises at least one further computing system.

6. The method according to any one of claims 1 to 5, wherein computational effort is distributed within the network, and data is transmitted within the network to complete the computational effort, wherein image capturing and computationally inexpensive steps, preferably basic image processing steps, are performed in at least one of the at least two smart devices, and computationally expensive steps, preferably processing the captured picture frames to obtain information about the entity, are performed in the further computing system; and data transmission is performed between the at least two smart devices directly and/or the network.

7. The method according to any one of claims 1 to 6, wherein the step of processing the captured image frames comprises at least one of the following further steps: a step of motion detection, a step of motion tracking, and/or a step of gesture recognition, preferably by using the Hidden Markov Model.

8. The method according to claim 7, wherein the entity is at least one hand region, preferably at least one finger, and wherein the step of motion tracking comprises a step of tracking at least one hand region; a step of determining the 3D orientation of at least one hand; a step of determining the motion of at least one finger of at least one hand; and/or wherein the step of gesture recognition comprises a step of recognizing a gesture based on the tracked motion of at least one finger.

9. The method according to any one of claims 1 to 8, wherein the computational effort is implemented as a framework which is located at an abstraction layer close to the operating system kernel layer of the performing computing system.

10. The method according to any one of claims 1 to 9, wherein at least one of the following gestures is recognized: a swipe, a circle, a tap; and/or a hand orientation is recognized, preferably a roll, a pitch and a yaw.

11. The method according to any one of claims 1 to 10, wherein, when no motion is detected, the two smart devices are operated in an energy-saving idling mode until motion is detected again.

12. A system for detecting a change in the position of an entity relative to its surroundings, comprising

at least two smart devices, each smart device having one camera, the cameras of the smart devices being set up to obtain a respective picture frame of an entity present in the field of view of the cameras and the smart devices are configured to be connected to each other via a wireless network and/or a direct communication link; and the system is configured to process the captured picture frames to obtain information about the change in the position of the entity.

13. A system according to claim 12, wherein

the cameras of the smart devices are set up such that the respective picture frames are in a predetermined spatial relationship to each other; preferably

the smart devices are configured to be placed relative to a surface, preferably in a close position to each other, more preferably in a parallel position to each other; and more preferably

the smart devices are configured to be placed facing downwards with the respective camera facing upwards on a stable surface in a parallel position to each other.

14. A system according to claim 11 or 12, wherein the cameras of the smart devices are further set up to capture the picture frames, wherein preferably the cameras have at least one of the following characteristics: a same resolution, a same rate and a same shutter speed.

15. The system according to any one of claims 11 to 13, wherein each of the cameras of the smart devices is configured to be calibrated to establish a predetermined spatial relationship between the three dimensional world and the two dimensional image plane; and/or to train a motion detection algorithm and/or to perform a motion tracking algorithm.

16. The system according to any one of claims 11 to 14, wherein the two smart devices are configured to be connected to a network and the network comprises at least one further computing system.

17. The system according to any one of claims 11 to 16, wherein computational effort is distributed within the network, and data is transmitted within the network to complete the computational effort, wherein

at least one of the at least two smart devices is configured to perform image capturing and computationally inexpensive steps, preferably basic image processing steps, and

the further computing system is configured to perform computationally expensive steps, preferably by processing the captured picture frames to obtain information about the entity; and the at least two smart devices and the further computing system are configured to perform data transmission between the at least two smart devices directly and/or between the at least two smart devices and the further computing system.

18. The system according to any one of claims 11 to 17, wherein the performing computing system is configured to carry out the computational effort implemented as a framework which is located at an abstraction layer close to the operating system kernel layer.

19. The system according to any one of claims 11 to 18, wherein the smart devices and/or the further computing device is configured to perform the method according to any one of claims 1 to 11.

Description:
System and Method for Human Computer Interaction

The present invention relates to a system and method for Human Computer Interaction. More specifically, the invention relates to a system and method for detecting a change in the position of an entity relative to its surroundings.

Human Computer Interaction (HCI) is a field that approaches the development of computer systems with the end user at the center of every phase of this development. In particular, HCI focuses on the understanding of user behavior, which serves as the basis of system design and implementation. The general understanding is that HCI is a discipline that primarily aims to create easy-to-use interfaces. However, there is a more purposeful reason why it is studied, namely observing the effects that the interaction with the system eventually has on users. As computers become more and more pervasive, designers seek new ways to make interfacing with devices more efficient, safer and easier, with the goal of building machines that fit the human environment, thereby avoiding the need to force humans to learn how to enter the environment of the machines.

Aiming to combine the digital with the natural world flawlessly, the field of pervasive or ubiquitous computing has arisen. More specifically, as technological advances dictate a shrinkage of hardware, it becomes easier for devices of daily use to blend in with the natural world, as their small size and enhanced connectivity enable them to communicate discreetly. On the basis that electronic circuits can be embedded in every device, whether digital or natural, pervasive computing plays a significant role in interconnecting the desired device with a massive network of other devices.

The majority of daily human computer interaction is limited to a set of two dimensional (2D) control movements. Being able to navigate only by moving the pointer of a mouse backward and forward, left and right, or, in the now established touch screen world, by pinching to zoom in and out and tapping to select, the user experience is far from the pervasive model described above. However, a three dimensional (3D) interface is ideal when the user needs an intuitive gesture model that makes it easy even for a young child to navigate with its hands across all axes of the space.

One exemplary embodiment of a 3D sensor has been developed by Microsoft. This sensor, the Kinect™, consists of an orthogonal bar and a base that enables it to tilt upwards and downwards, and it can be positioned parallel to the video display. More specifically, precise face and voice recognition, along with full human skeleton tracking, are achieved by using an RGB camera together with a depth sensor and a multi-array microphone; the Kinect™ runs proprietary software. One of the advantages of the Kinect™ is capturing under any ambient light. This capability is ensured by the depth sensor's design, which consists of an infrared laser projector and a monochrome CMOS sensor. Also, according to the user's position, the Kinect™ can detect the user and not only adjust the range of the depth sensor but also calibrate the camera.

Another exemplary embodiment of a 3D sensor is the Leap Motion™ Controller. It is able to track movements down to a hundredth of a millimeter and to track all 10 fingers simultaneously when pointing, i.e., gestures that involve more gross movements are avoided. The Leap Motion™ Controller is able to track objects within a hemispherical area. It consists of two monochromatic infrared cameras and three infrared LEDs, and its design enables it to face upwards when plugged into a USB port. The LEDs generate pattern-less IR light and the cameras generate almost 300 frames per second of reflected data, which is then sent through a USB cable to the host computer. The host computer runs the Leap Motion™ Controller software, which analyzes the data by synthesizing the 3D position data; by comparing the 2D frames generated by the two cameras, the position of the object can be extracted. The following movement patterns are recognized by the Leap Motion™ Controller software: a circle with a single finger, a swipe with a single finger (as if tracing a line), a key tap by a finger (as if tapping a keyboard key), and a screen tap, i.e., a tapping movement by the finger as if tapping a vertical computer screen.

The devices mentioned above provide only a limited selection of useful applications to interact with and, as a major disadvantage, they lack portability and are costly. It is an object of the invention to provide a system and method for Human Computer Interaction. The object is achieved with the subject-matter of the independent claims. The dependent claims relate to further aspects of the invention.

Since the majority of users choose a keyboard, a mouse or a touch screen to interact with a computer, it might be perceived that there is no need for gesture recognition to emerge in the market, and making gestures to interact with a mobile phone might seem to be a mismatch, as a mobile phone is designed to be kept close at hand. However, there is still a need for gesture recognition technology for mobile devices, especially if challenging requirements are met, such as effectiveness of the technology in adverse light conditions, robustness to variations in the background, and low power consumption.

In one aspect the invention relates to a method for detecting a change in the position of an entity relative to its surroundings, wherein the method uses at least two smart devices, each smart device having one camera. The cameras of the smart devices are set up to obtain a respective picture frame of an entity present in the field of view of the cameras, and the smart devices are connected to each other via a wireless network and/or a direct communication link. The method comprises the steps of capturing picture frames of the entity with each camera of the at least two smart devices, and processing the captured picture frames to obtain information about the change in the position of the entity.

In another aspect of the invention the method further comprises a step of setting up the cameras of the smart devices such that the respective picture frames are in a predetermined spatial relationship to each other. This aspect preferably comprises placing the smart devices relative to a surface, preferably in a close position to each other, more preferably in a parallel position to each other. More preferably the method comprises placing the smart devices facing downwards with the respective camera facing upwards on a stable surface in a parallel position to each other.

In another aspect of the invention the method further comprises the step of setting up the cameras of the smart devices to capture the picture frames, wherein preferably the cameras have at least one of the following characteristics: a same resolution, a same rate and a same shutter speed.

In one embodiment of the invention the computational effort is further reduced if the picture frames are captured as binocular vision picture frames, i.e., frames having similar characteristics. Preferably both smart devices have identical cameras and/or the cameras are set to a same resolution, a same rate and/or a same shutter speed.

In another aspect of the invention the method further comprises a step of calibrating each of the cameras of the smart devices to establish a predetermined spatial relationship between the three dimensional world and the two dimensional image plane; and/or training a motion detection algorithm and/or motion tracking algorithm.

In another aspect of the invention at least one of the two smart devices is connected to a network and the network comprises at least one further computing system.

In another aspect of the invention the computational effort is distributed within the network, and data is transmitted within the network to complete the computational effort. The image capturing and computationally inexpensive steps, preferably basic image processing steps, are performed in at least one of the at least two smart devices. The computationally expensive steps, preferably processing the captured picture frames to obtain information about the entity, are performed in the further computing system. Data transmission is performed between the at least two smart devices directly and/or via the network.

In another aspect of the invention the step of processing the captured image frames comprises at least one of the following further steps: a step of motion detection, a step of motion tracking, and/or a step of gesture recognition, preferably by using the Hidden Markov Model.

In another aspect of the invention the entity is at least one hand region, preferably at least one finger, and the step of motion tracking comprises a step of tracking at least one hand region, a step of determining the 3D orientation of at least one hand, and/or a step of determining the motion of at least one finger of at least one hand; and/or the step of gesture recognition comprises a step of recognizing a gesture based on the tracked motion of the at least one finger.

In another aspect of the invention the computational effort is implemented as a framework which is located at an abstraction layer close to the operating system kernel layer of the performing computing system.

In another aspect of the invention at least one of the following gestures is recognized: a swipe, a circle, a tap; and/or a hand orientation is recognized, preferably a roll, a pitch and a yaw.

In another aspect of the invention, when no motion is detected, the two smart devices are operated in an energy-saving idling mode until motion is detected again.

In one aspect of the invention a system for detecting a change in the position of an entity relative to its surroundings is provided. The system comprises at least two smart devices, each smart device having one camera, the cameras of the smart devices being set up to obtain a respective picture frame of an entity present in the field of view of the cameras, and the smart devices are configured to be connected to each other via a wireless network and/or a direct communication link; and the system is configured to process the captured picture frames to obtain information about the change in the position of the entity.

In another aspect of the invention the cameras of the smart devices are set up such that the respective picture frames are in a predetermined spatial relationship to each other; preferably the smart devices are configured to be placed relative to a surface, preferably in a close position to each other, more preferably in a parallel position to each other; and more preferably the smart devices are configured to be placed facing downwards with the respective camera facing upwards on a stable surface in a parallel position to each other.

In another aspect of the invention the cameras of the smart devices are further set up to capture the picture frames, preferably binocular vision picture frames, wherein preferably the cameras have at least one of the following characteristics: a same resolution, a same rate and a same shutter speed.

In another aspect of the invention the cameras of the smart devices are configured to be calibrated to establish a predetermined spatial relationship between the three dimensional world and the two dimensional image plane; and/or to train a motion detection algorithm and/or to perform a motion tracking algorithm.

In another aspect of the invention at least one of the two smart devices is configured to be connected to a network and the network comprises at least one further computing system.

In another aspect of the invention the computational effort is distributed within the network, and data is transmitted within the network to complete the computational effort, wherein at least one of the at least two smart devices is configured to perform image capturing and computationally inexpensive steps, preferably basic image processing steps, and the further computing system is configured to perform computationally expensive steps, preferably processing the captured picture frames to obtain information about the entity; and the at least two smart devices and the further computing system are configured to perform data transmission between the at least two smart devices directly and/or between the at least two smart devices and the further computing system.

In another aspect of the invention the performing computing system is configured to carry out the computational effort implemented as a framework which is located at an abstraction layer close to the operating system kernel layer.

In another aspect of the invention the smart devices and/or the further computing device is configured to perform the method according to any one of the preceding aspects of the invention.

The invention lets its users experience the ability to control a computer or interact with augmented reality applications in a three-dimensional space by making touch-free gestures. The aforementioned method may be implemented as software provided by a server which can be either a computer or the cloud.

A hand gesture recognition system is a key element of HCI, as using hand gestures provides an attractive alternative to the cumbersome interface devices used for HCI. With the advent of a mobile society, portable devices such as cellular phones have become ubiquitous. The present invention provides an inexpensive solution for a gesture control mechanism that is also portable and wireless.

The term smart device refers to an electronic device that preferably has one or more of the following properties: it is enabled to connect to at least one network with at least one network connecting means, it is wireless, it is mobile, and it has at least one camera. The network connecting means can be one of the following: an NFC connecting means, a wireless LAN connecting means, a LAN connecting means, a mobile data connecting means, and a Bluetooth connecting means. The smart devices can be one of the following: a mobile phone, a smart watch, a smart camera, and a tablet device. None of the lists presented above for the network connecting means, the sensors and the smart devices is exhaustive or limits the scope of the invention. Furthermore, it is noted that the terms mobile device and smart device are used interchangeably herein.

The term entity encompasses both a rigid entity, i.e., an object that has a more or less fixed shape and may or may not change its orientation and/or position in 3D space, and a flexible entity, i.e., an object that may or may not change its shape and may or may not change its orientation and/or position in 3D space.

The present invention has the advantage that depth sensing can be performed without using lasers or infrared sensors. The invention is based on the idea of binocular vision, which allows a pair of eyes to be used together in harmony. While most organisms have their eyes located in a way that enables them to stare in front of them, the overlapping of the captured images creates a different view of the same entity. In this way, creatures with binocular vision are capable of depth sensing.

For detecting a change in the position of an entity relative to its surroundings, the two picture frames preferably contain roughly the same visual information. This is preferably achieved by positioning the two cameras close to each other and by pointing them in a similar direction. When the three dimensional relative positions of the two cameras and their respective lines of sight are known, 3D position data for entities within the acquired picture frames can be calculated. In other words, if the two cameras are set up in a predetermined spatial relationship, the picture frames can be used to calculate 3D position data for entities in the acquired picture frames. A parallel or quasi-parallel positioning is preferred, since the computational effort is reduced compared to an arbitrary orientation of the cameras. A stable surface may help to arrange the smart devices. In addition or alternatively, a reference surface may help to arrange the smart devices. In addition or alternatively, software using data from at least one sensor of the smart device may help to arrange the smart devices. When the cameras are arranged side by side according to the stereo vision principle, the following relation applies: for every inch the cameras are apart, the subject whose 3D position is to be calculated has to be 30 inches away. Following this relation, the viewer is able to fuse the images together. This is typically referred to as the "1/30 rule".
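For illustration only, the "1/30 rule" mentioned above can be expressed as simple arithmetic. The following Python sketch is not part of the patent; the function and variable names are illustrative assumptions.

```python
# Illustration of the "1/30 rule": for every unit of separation between the
# two cameras, the subject whose 3D position is to be calculated should be
# roughly 30 units away so the two picture frames can be fused.

def minimum_subject_distance(camera_separation: float) -> float:
    """Minimum comfortable subject distance for a given camera separation."""
    return 30.0 * camera_separation

# Two phones lying side by side with their cameras about 3 inches apart:
print(minimum_subject_distance(3.0))  # -> 90.0 (inches)
```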

In one embodiment of the invention the framework is preferably located on a layer directly above the operating system's kernel and below the application framework layer. In one embodiment of the invention the framework for the respective computational effort will preferably be automated to ensure protection from any kind of attempt that threatens the user's privacy. More specifically, the system may be protected in order to avoid malicious actions intended to invade the user's privacy and manipulate personal data. However, the exact location of this mechanism depends on the architecture of the respective operating system. In order to protect the framework against manipulation, it is preferred that the method is implemented as a program which is located at a layer near the operating system layer or the kernel, which is not accessible by malicious applications programmed by application developers. Preferably, the mentioned layer should be interpreted as an abstraction layer.

The above requirement is considered in order for the framework to be protected against hacking attacks; thus the preferred location for the framework is below the layers to which developers do not have access. The importance of the requirement can be assessed if one considers the ease with which developers can write code to access services (camera, speaker, call log) for their applications.

In particular, almost all advanced operating systems of devices can be divided into abstraction layers. These layers typically separate different functional units of the operating system. Although the fine-grained detail of these abstraction layers may differ between operating systems, on a higher level these operating systems are typically divided into the kernel/hardware layer, the library layer, the service layer and the application layer.

In most embodiments of the invention the available amount of energy is limited, in particular if the smart device runs on a battery. In order to reduce the energy consumption, the method preferably works in an idle state if no hand is detected in the tracking space and no state signal is generated, such that the tracking step is not performed. In the context of the present invention, the term "mechanism" or "framework" can relate to a set of methods and a system.
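The idle behaviour described above can be pictured with a minimal sketch: if no hand is detected, the tracking step is skipped and the device waits before sampling again. The callbacks detect_hand() and track_hand() are hypothetical placeholders, not functions defined by the patent.

```python
# Minimal sketch of the energy-saving idle state described above.
import time

def process_frames(frames, detect_hand, track_hand, idle_delay_s: float = 0.5):
    if not detect_hand(frames):
        time.sleep(idle_delay_s)  # idle state: tracking skipped, reduced energy use
        return None
    return track_hand(frames)     # normal tracking path once a hand is present
```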

In one embodiment a tracking module, a human detection module, a gesture module, and a control module are preferably located on a layer directly above the operating system's kernel and below the application framework layer.

Other aspects and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

Brief Description of the Drawings

The accompanying drawings, which are included to provide a further understanding of the present invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the present invention and together with the description serve to explain the principle of the present invention. In the drawings:

Figure 1 schematically shows the real-time gesture recognition system comprising two smartphones,

Figure 2 schematically shows the triangulation method that is used for depth sensing,

Figure 3 schematically shows a flowchart of the method for hand position and orientation acquisition, and

Figure 4 schematically shows a flowchart of the gesture recognition method.

Detailed Description of the Invention

According to one aspect of the present invention, which focuses on the camera feature of each device, an in-device camera framework for a smart phone is proposed, while in another aspect of the invention another framework is designed to detect the hands of the user's body, extracted from the images taken by the on-device camera. Moreover, based on the results of detection, the use of a robust tracking algorithm is preferable, as any error in tracking will prevent flawless interaction with the computer. After these steps, the framework provides the results of detection and tracking as an input to the desired software application.

For ease of explanation the embodiments are explained in a way that the system resources are available on local desktop computers with fast processors and adequate memory. However, in addition or alternatively the present invention may at least partially be implemented in a cloud and/or may be also capable of working with the cloud.

In the following, the invention is described in an embodiment using image and depth data captured by two identical mobile device cameras to control an interactive session on a computer. Broadly speaking, the session can be any type of software that takes input from a user. The camera which is embedded in the mobile device can provide data to the computer for analysis and processing. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it is apparent to one skilled in the art that the present invention may be practiced without some or all of these specific details. In other instances, well known process steps may not be described in detail in order not to unnecessarily obscure the present invention.

Figure 1 schematically shows a real-time gesture recognition system comprising a space where a pair of identical mobile devices 10a, 10b is used in accordance with one embodiment of the present invention. The space preferably comprises a flat surface 20 where the pair of mobile devices is set side by side and facing downwards. The system comprises a computer 30 that serves as further computing system and may also be the unit that is being controlled by the mobile devices 10a, 10b. Further shown are the left hand 40a and the right hand 40b of a user. In this embodiment of the invention two cameras 11a, 11b of the same model are used. However, any two cameras may alternatively be used. Preferably the two cameras can record at the same resolution, the same rate and the same shutter speed. The mobile devices according to one embodiment of the invention, more specifically the cameras of the mobile devices, form an array of smart device cameras for hand gesture recognition. Notwithstanding the embodiment shown in Fig. 1, any number of devices can alternatively be used to carry out the invention. In other words, other embodiments comprise 3, 4, 5, 6, 7, 8, or any other number of mobile devices. Furthermore, the number of cameras existing and/or used per device is also not limited to one. In other words, each mobile device may also comprise one or more cameras, which may or may not be used to carry out the invention. For example, a device comprising a front and a rear camera may be used in a way that only the rear or the front camera is used to carry out the invention.

Figure 2 shows the deployment of the cameras based on the theory of binocular vision.

Binocular vision can be scientifically explained with geometry, and more specifically triangulation. Triangulation is any kind of distance calculation based on given lengths and angles, using trigonometric relations. Triangulation with two parallel cameras, or stereo triangulation, may be performed based on parameters such as, for example, the distance between the cameras, the focal length of the cameras, the spatial angles of the lines of sight from the imaged entity to each camera, and/or other suitable parameters known to the skilled person.

More specifically, as shown in Fig. 2, two cameras 10a, 10b, preferably with the same focal length, are placed preferably parallel or quasi-parallel to each other. X is the X-axis of both cameras, Z is the optical axis of both cameras, k is the focal length, d is the distance between the two cameras 10a, 10b, O is the entity captured, R is the projection of the real-world entity O in the image acquired by the right camera 10b, and L is the projection of the real-world entity O in the image acquired by the left camera 10a. As both cameras 10a, 10b are separated by d, they view the same entity O at a different location in the two-dimensional captured images. The distance between the two projected points R and L is called disparity and is used to calculate depth information, which is the distance between the real-world entity O and the stereo vision system.
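The relation described for Fig. 2 can be sketched as follows. The Python function below is illustrative only and assumes that the focal length k and the projected x-coordinates are expressed in pixel units, so that depth follows Z = k · d / disparity.

```python
# Depth from disparity for two parallel cameras (stereo triangulation sketch):
# k is the focal length in pixels, d the separation between the cameras, and
# the disparity is the difference between the projections L and R of entity O.

def depth_from_disparity(k_px: float, d: float, x_left_px: float, x_right_px: float) -> float:
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        raise ValueError("disparity must be positive for an entity in front of both cameras")
    return k_px * d / disparity

# Example: k = 800 px, cameras 0.08 m apart, disparity of 40 px -> depth of 1.6 m
print(depth_from_disparity(800.0, 0.08, 420.0, 380.0))
```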

Depth imaging acquired by triangulation may not be ideal when real-time and continuous imaging is required. In this case a different method is employed to generate the depth data. At this point, it is noted that according to one aspect of the invention, the cameras are calibrated before the beginning of the method of Human Computer Interaction. The purpose of calibration is to establish a relation between the three dimensional world (corresponding to a real-world coordinate system) and the two dimensional plane (corresponding to the image plane of the respective camera).
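The patent does not prescribe a particular calibration procedure. One common way to establish the relation between the 3D world and the 2D image plane is chessboard calibration, sketched below with OpenCV as an assumed toolkit; the pattern size and square size are illustrative parameters.

```python
# Hedged sketch of per-camera calibration using a chessboard target (OpenCV).
import cv2
import numpy as np

def calibrate_camera(image_paths, pattern_size=(9, 6), square_size=1.0):
    # 3D positions of the chessboard corners in the board's own coordinate frame
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
    objp *= square_size

    obj_points, img_points, image_size = [], [], None
    for path in image_paths:
        image = cv2.imread(path)
        if image is None:
            continue
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, pattern_size)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # The intrinsic matrix and distortion coefficients relate 3D world points
    # to their 2D projections on this camera's image plane.
    _, camera_matrix, dist_coeffs, _, _ = cv2.calibrateCamera(
        obj_points, img_points, image_size, None, None)
    return camera_matrix, dist_coeffs
```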

Figure 3 shows the steps performed in one embodiment of the invention. First, a hand region may be tracked. An image region that correlates with a hand is extracted in each of the captured frames. As the gesture recognition might take place in a cluttered environment, it is assumed that the background is complex and dynamically changing, while illumination may be a stable factor, at least when the invention is carried out in an indoor environment. In order to track the hands of the user, the hands have to be separated from the background. Since the functions of tracking, recognition and 3D processing require advanced algorithms, sufficient computational power is needed. Thus the implementation of computational offloading is a major advantage. Preferably at least part of the method is executed on a computing system configured accordingly, preferably a further computing system, e.g., a laptop or a desktop computer (cf. 30 in Fig. 1).
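The offloading path can be pictured as each smart device streaming its captured frames to the further computing system, which performs the expensive processing. The sketch below shows a minimal length-prefixed TCP framing for JPEG-encoded frames; the transport, encoding and message format are assumptions for illustration, not prescribed by the patent.

```python
# Minimal sketch of sending captured frames from a smart device to the
# further computing system over a TCP connection (length-prefixed messages).
import socket
import struct

def send_frame(sock: socket.socket, jpeg_bytes: bytes) -> None:
    # Prefix each JPEG-encoded frame with its length so the receiver can frame it.
    sock.sendall(struct.pack("!I", len(jpeg_bytes)) + jpeg_bytes)

def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before full message arrived")
        buf += chunk
    return buf

def recv_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
    return _recv_exactly(sock, length)
```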

In this embodiment of the invention, in order for the background image to be extracted and the human skin to be recognized as a candidate region, the images taken by both cameras 11a, 11b can be captured in step S1 as YUV color images that are then converted in step S2 to HSV images. After this, a filter, preferably a median filter, is applied in step S3 to minimize the effects of image noise. The human skin is differentiated from its background where saturation values are high and hue values are close to those of the skin. In step S4 a detection of the hand is performed.
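The preprocessing chain S1 to S4 can be sketched as follows. This is an illustration with OpenCV (version 4.x assumed): OpenCV delivers frames as BGR rather than YUV, and the skin hue/saturation thresholds are assumed values, not values taken from the patent.

```python
# Sketch of steps S2-S4: HSV conversion, median filtering, skin-colour
# thresholding and extraction of the largest candidate hand contour.
import cv2
import numpy as np

def detect_hand_region(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)   # S2: colour-space conversion
    hsv = cv2.medianBlur(hsv, 5)                        # S3: median filter against noise
    lower = np.array([0, 48, 80], dtype=np.uint8)       # assumed skin bounds (hue, sat, val)
    upper = np.array([20, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)               # S4: candidate skin pixels
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea) if contours else None
```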

In this embodiment of the invention, by capturing images and using a triangulation method in step S5, the 3D position of the hand is obtained in step S6. Next, the 3D orientation of the hand is determined. The method of triangulation can further identify the roll of the hand and the pitch and yaw angles of the fingers. To estimate these angles, preferably three more parameters are determined, which are preferably the tip of a hand region, both the right and left points of the region, and the center of gravity of the hand region. Having the center of gravity and the tip point, the direction of the hand in 3D space reveals the roll and pitch of the hand. In a similar way, the left and right points determine whether a yaw takes place.

Figure 4 shows the gesture recognition method according to one embodiment of the invention. As the method focuses on real-time dynamic hand gesture recognition, the Hidden Markov Model is used to identify those gestures. The possible moving patterns are the following four: the circle, the swipe, the key tap and the screen tap, which can be visualized as a single finger tracing a circle, a long linear movement of a finger, a tapping movement by a finger as if tapping a keyboard key, and a tapping movement by the finger as if tapping a vertical computer screen, respectively. While for static gestures neural networks are preferably used for identification, for dynamic gestures the Hidden Markov Model is preferred.
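As an illustration of the orientation step described above for Figure 3, the following sketch shows one possible way of turning the center of gravity, tip point and left/right edge points of the hand region (each given as triangulated 3D coordinates) into approximate roll, pitch and yaw angles. The exact formulas are assumptions made for illustration; the patent does not spell them out.

```python
# Hypothetical derivation of hand orientation from four reference points.
import math

def hand_orientation(cog, tip, left, right):
    # Direction from the center of gravity to the fingertip
    dx, dy, dz = (tip[0] - cog[0], tip[1] - cog[1], tip[2] - cog[2])
    pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # up/down tilt
    roll = math.degrees(math.atan2(dx, dy))                   # sideways lean
    # Yaw from the depth difference between the left and right edge points
    yaw = math.degrees(math.atan2(right[2] - left[2], right[0] - left[0]))
    return roll, pitch, yaw
```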

The Hidden Markov Model (HMM) is a doubly stochastic process with an underlying process of transitions between hidden states of the system and a process of emitting observable outputs. The method described in this embodiment of the invention may have to undergo an initialization process, where the HMM is trained and tested. Preferably, the Baum-Welch and Viterbi algorithms are used for the training and recognition tasks, as they are well known for their computational savings. The initialization of the HMM parameters may be done by setting the initial probability of the first state to 1, while the transition probability distribution for each state is uniformly distributed. On starting the hand-gesture recognition method, the preprocessed images are given as input to the HMM as sequences of quantized vectors. A recognizer based on the HMM interprets those sequences, which are directional codewords that characterize the trajectory of the motion. In order for a particular sequence to be correlated with a gesture, its likelihood should exceed that of other models. As the method is used for the tracking of dynamic gestures, setting a static threshold value is impractical, since the recognition likelihoods of gestures varying in size and complexity vary significantly. C. Keskin, O. Aran and L. Akarun, "Real Time Gestural Interface for Generic Applications", Computer Engineering Dept. of Bogazici University, have constructed an adaptive threshold model by connecting the states of all models. If the likelihood of a model calculated for a sequence exceeds that of the threshold model for this sequence, then the result is the recognition of the gesture. Otherwise the classification is rejected. It should be noted that the training step preferably takes place only once and before the initialization of the process, because it needs substantial time to be executed and also needs the vast resources of the computing system. For this, a data set that consists of images may be given as input.
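The initialization described above can be written down directly: the initial state probability is 1 for the first state and 0 elsewhere, and each row of the transition matrix is uniform. The following numpy sketch shows only this initialization; the uniform emission matrix is an additional assumption, and training with Baum-Welch and decoding with Viterbi would then operate on these matrices.

```python
# Minimal sketch of the HMM parameter initialization described above.
import numpy as np

def init_hmm(n_states: int, n_symbols: int):
    start_prob = np.zeros(n_states)
    start_prob[0] = 1.0                                           # first state certain
    trans_prob = np.full((n_states, n_states), 1.0 / n_states)    # uniform transition rows
    emit_prob = np.full((n_states, n_symbols), 1.0 / n_symbols)   # uniform emissions (assumed)
    return start_prob, trans_prob, emit_prob
```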

In one embodiment of the invention, to enable the interactivity with the further computing system, the mobile devices are interconnected in order to function. The mobile devices and the further computing system are connected to the same network and/or directly to each other. Once the respective mobile device cameras begin capturing frames, the frames are sent via the network and/or the direct connection to the further computing system, where the image processing, tracking and gesture recognition take place. More specifically, the software that resides in the computer and is responsible for the above functionality preferably stores the IP addresses of the mobile devices temporarily, which may be resolved by the software when and while the user is interacting with an application and/or software. Once the connection is established, the data transfer begins and the mobile devices send the frames to the further computing system for further processing. The output of the software is then given as input to the desired application or software that the user is interacting with.

While the present invention has been described in connection with certain preferred embodiments, it is to be understood that the subject-matter encompassed by the present invention is not limited to those specific embodiments. On the contrary, it is intended to include any alternatives and modifications within the scope of the appended claims.