


Title:
TARGET LOCALIZATION AND SIZE ESTIMATION VIA MULTIPLE MODEL LEARNING IN VISUAL TRACKING
Document Type and Number:
WIPO Patent Application WO/2015/163830
Kind Code:
A1
Abstract:
Visual target tracking has many challenges such as robustness to occlusion, noise, drifts, stabilization etc. Although various algorithms have been proposed as remedies for these problems, the solutions should be narrowed to algorithms with low computational cost when real time systems are in consideration. In this manner, the family of tracking methods based on correlation filters is a prominent option, since many of the algorithms in this family are efficient and simple to implement. In order to achieve an efficient and robust tracking system, the present invention relates to a correlation based tracker combined with a target localization and size estimation method with a feedback mechanism. In this sense, the target model is dynamically learnt and extracted in the tracker window encapsulating the actual target, and this model is used both for target localization and size estimation, in addition to track window correction, which introduces robustness to improper initializations. Moreover, a multiple model visual tracking methodology is also presented in order to adapt to changes in the target model occurring at different rates, caused either by changes in the target or in its surroundings. The overall system can be used as a real-time visual tracking system with an adaptive learning mechanism and provides a minimum sized target bounding box as output. Furthermore, the method presented in this invention is capable of target model extraction, which can be considered as a preprocessing step of a shape based object classification algorithm.

Inventors:
GUNDOGDU ERHAN (TR)
TUNALI EMRE (TR)
TANISIK GÖKHAN (TR)
OZ SINAN (TR)
Application Number:
PCT/TR2014/000117
Publication Date:
October 29, 2015
Filing Date:
April 22, 2014
Assignee:
ASELSAN ELEKTRONIK SANAYI VE TICARET ANONIM SIRKETI (TR)
International Classes:
G06T7/20
Foreign References:
US20120288152A1 (2012-11-15)
US8520956B2 (2013-08-27)
EP2202671A2 (2010-06-30)
US20080304740A1 (2008-12-11)
US20130101210A1 (2013-04-25)
US8477998B1 (2013-07-02)
Other References:
BAKER S ET AL: "The Template Update Problem", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE COMPUTER SOCIETY, USA, vol. 26, no. 6, 1 June 2004 (2004-06-01), pages 810 - 815, XP011111532, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2004.77
NOBUYUKI OTSU: "A threshold selection method from gray-level histograms", IEEE TRANS. SYS., MAN, CYBER., vol. 9, no. 1, 1979, pages 62 - 66
Y. WEI; F. WEN; W. ZHU; J. SUN: "Geodesic Saliency Using Background Priors", ECCV, 2012
Attorney, Agent or Firm:
ANKARA PATENT BUREAU LIMITED (Kavaklıdere, Ankara, TR)
Claims:
CLAIMS

1. A real-time visual tracking method for target localization and size estimation, based on correlation filters with multiple models, the method comprising the following steps:

- Given track initialization (101),

- Multiple model visual target tracking for the purpose of responding to abrupt and indiscernible changes of the target appearance (102),

- Target bounding box generation and feedback decision using the updated saliency map (103).

2. The real-time visual tracking method according to claim 1, wherein the multiple model visual tracking step further comprises the sub-steps of:

- Applying a necessary and appropriate preprocessing procedure to the incoming frame (201),

- Correlation matching using the first filter group of the multiple model visual tracking (202),

- Querying the quality of the response of the first filter group to the target in the current frame (203),

- Updating the target location according to the first filter group (204) if the querying result of the response of the first filter group is greater than a selected threshold (203),

- Updating the first filter group with a low learning rate and updating the second filter group with a high learning rate (207).

3. The real-time visual tracking method according to claim 1, wherein the multiple model visual tracking step further comprises the sub-steps of:

- Correlation matching using the second filter group of the multiple model visual tracking (205), if the result of querying the quality of the first filter group is less than the selected threshold,

- Querying the quality of the response of the second filter group to the target in the current frame (206),

- Updating the target location according to the second filter group if the result of querying the quality of the second filter group is greater than a predefined threshold (208),

- Updating the first filter group with a high learning rate (210),

- Updating the second filter group with a high learning rate (211).

4. The real-time visual tracking method according to claim 1, wherein the multiple model visual tracking step further comprising:

detecting occlusion (209), if the result of querying the second filter group is less than the predefined threshold (206).

5. The real-time visual tracking method according to claims 2 and 3, wherein the low learning rate is used to learn the target more slowly than the instantaneous target is learnt with the high learning rate.

6. The real-time visual tracking method according to claim 1, wherein the target bounding box generation and feedback decision step further comprises the sub-steps of:

- Calculating current saliency map (301),

- Calculating saliency ratio for the current saliency map (303),

- Calculating the correlation score using normalized cross correlation between the target model selected (304) from the updated saliency map using target selection procedure (307) and current saliency map,

- Calculating the learning rate using the saliency ratio and the correlation score (305),

- Updating the saliency map according to the calculated learning rate (306), to obtain the updated saliency map as input to the system for the next frame (302),

- Selecting target according to the updated saliency map (307),

- Querying tracking feedback (308).

7. The real-time visual tracking method according to claim 6, wherein the target bounding box generation and feedback decision step further comprising:

locating the object in the region of interest by generating a minimum-size bounding box that includes the target, instead of the region of interest given by the real-time visual tracking method.

8. The real-time visual tracking method according to claim 6, wherein the target bounding box generation and feedback decision step further comprising:

compensating for false target initialization via the learning mechanism, where targets which are not well-localized at initialization are centralized via the feedback mechanism.

9. The real-time visual tracking method according to claim 6, wherein the target bounding box generation and feedback decision step further comprising:

detecting the scale changes of the target through the video frames, adapting to the scale changes, and updating the visual target model more appropriately than a visual tracking method without the ability of detecting the scale changes updates its tracking model.

10. The real-time visual tracking method according to claim 6, wherein the target bounding box generation and feedback decision step further comprising:

using an adaptive learning rate selection algorithm, which prevents mislearning of the target model in the cases of clutter or occlusion.

11. The real-time visual tracking method according to claim 10, wherein, since redetection of the target is a merit of any tracking system, the method further comprises:

constructing a target model which is appropriate to be used in order to boost redetection of target after the target is lost due to occlusion, clutter or noise.

12. The real-time visual tracking method according to claim 6, wherein the binarization is performed with a modified version of Otsu's between-class variance equation, which results in L fewer multiplications, L being the number of histogram bins.

13. The real-time visual tracking method according to claim 1, wherein the multiple model visual target tracking further comprising:

adapting to abrupt changes of the region of interest, which is the benefit of multiple modeling.

14. The real-time visual tracking method according to claim 1, wherein the multiple model visual tracking step further comprising:

sensing both the rapid and imperceptible changes of the target at the same time without sacrificing either low or high learning rates.

15. The real-time visual tracking method according to claim 2, wherein the multiple model visual tracking step further comprising:

making the multiple filter groups interact to tolerate each other's errors in different conditions, by using their corresponding learning rates when one of the filter groups starts to give low quality tracking results.

16. The real-time visual tracking method according to claim 6, wherein the target bounding box generation and feedback decision step further comprising:

constructing a target model, which is appropriate to be used in a shape based classifier.

17. The real-time visual tracking method according to claim 1, wherein the multiple model visual target tracking step further comprises: robust tracking adaptation under different temporal variation rates by changing the update parameters.

Description:
TARGET LOCALIZATION AND SIZE ESTIMATION VIA MULTIPLE MODEL LEARNING IN VISUAL TRACKING

Field of the Invention

The present invention relates to a target localization and size estimation method for visual tracking purposes using an intelligent system including a dynamic and adaptive localization algorithm, which is robust to improper target initializations, as well as a multiple model structure with a model selection algorithm for tracking.

Background of the Invention

Visual target tracking is a well known topic in the computer vision and machine learning disciplines. Like many machine vision problems, visual target tracking also involves trade-offs, such as computational complexity and robustness to various problems including occlusion, noise, drifts etc. Although a diverse set of algorithms exists to solve the mentioned problems, they may not be appropriate for real-time systems since they tend to be computationally costly. To achieve target tracking in real-time systems with less computational burden, trackers based on correlation filters can be considered as an option.

In correlation filter based trackers, there is a plurality of templates used to find the location of an object by searching a predefined area in a video frame. According to a cost function, one can find the location of a target using the predefined search space and the prior information provided at the beginning of the tracking.

In most scenarios, trackers based on correlation filters assume a fixed object size and limit the target search to a predefined window. Moreover, most of them do not use appropriate computer vision tools to extract the semantic information behind the data taken from the sensors. Furthermore, many algorithms model target objects and surroundings which may differ in time. These model based algorithms may have limits on adaptation to changes in the scene, which results in conflicts with the current model parameters; hence the performance of the algorithms is confined to a limited range.

This disclosure provides solutions for the previously mentioned problems by using a biologically inspired framework which is capable of target model extraction in awareness of changes in the scene. The proposed disclosure interprets the target model and decides on its learning rates for both localization and size estimation, which yields better track maintenance. In addition, a multiple model visual tracking method is also proposed to extend the limits of adaptation to changes in the target and its background.

United States patent document US8520956B2 discloses an efficient method for visual tracking based on correlation filters. In this approach, the prior information is a plurality of images used for learning a correlation filter which is optimal for a defined cost function. There are basically three different options to find the optimal correlation filter: ASEF (Average of Synthetic Exact Filters), the MOSSE (Minimizing Output Sum of Squared Error) filter, and a cost function minimizing filter. In all of the methods of this patent document, the filter is assumed to have a fixed size. Although the basic idea is novel and works well in many scenarios, the assumption of a fixed size object does not hold all the time. As the object starts to magnify in the region of interest, the tracker may not compensate for the enlarged object inside the tracker window, or vice versa. Moreover, the method presented in US8520956B2 selects the target to be nearly the whole image patch in the window. Hence, the target together with its background is matched in consecutive frames. Since the shape (boundaries) of the target is not extracted in this method, neither is background suppression utilized in consideration of the target size, nor is any mechanism included to centralize the tracked target. However, in the cases of erroneous track initialization or mismatching between consecutive frames, the target may shift from the center of the track window, which is not desired and may result in premature track loss if it is not corrected. Centralization of the target might be achieved by obtaining the silhouette of the object. For this purpose, the saliency map calculation is exploited in the target window and the target model is learnt by using saliency maps over time. Then, the most salient object in the updated saliency map is defined as the target. Selection of the most salient region also means a more distinctive target, which in turn increases the probability of longer track maintenance.

United States patent document US2008304740 discloses methods for detecting a salient object in an input image. A set of local, regional, and global representations is exploited, including multi-scale contrast, center-surround histogram, and color spatial distribution, to find the salient region in the image. Conditional random field learning is used to combine the relationships between the features mentioned above. The learned conditional random field helps to locate the salient object in the image. Although image segmentation is proposed to find the salient region in the scene, it may not be appropriate for real time applications.

United States patent document US20130101210 discloses a method for determining a region to crop from a digital image using a saliency map. In their method, the image or the region of interest is processed to obtain the saliency map. The pixels in the image are assigned saliency values within a number range. Their method further contains analyzing the saliency map to find a potential cropping rectangle. In order to find a rectangle to crop, candidate rectangles are extracted. Every candidate rectangle has a score corresponding to the sum of the saliency values inside the rectangle. The rectangle with the highest saliency score is chosen to be cropped. However, instead of finding a potential cropping rectangle, the method presented here exploits the connected component labeling of the binarized saliency maps, as well as the saliency values of each connected component, to find a dominant connected component to be used in the target boundary calculation. Another important contribution of the proposed disclosure is the temporal refinement of the saliency map with an adaptive learning rate. More clearly, the decision does not depend on a single frame, unlike the method in US20130101210; on the contrary, the saliency map is generated from a weighted average of previous frames, in which the weights depend on an adaptively changing object model.

Change of environmental conditions, such as clutter in the background, noise, and deformations of the object, is another important problem which should be compensated. In order to solve these problems, adaptive algorithms are proposed, such as the method in US8520956B2. There are also methods which use models to track objects, such as the method in US8477998B1. In that method, an input image is taken with a target selection at the first frame. A plurality of images chosen based on the selected target region is processed to construct a generalized target model with different poses. In each frame, the model is used to find a recognition score in many candidate regions. Moreover, the model is adaptively trained using the content of the most likely object region. The adaptive model of the method in US8477998B1 helps to reduce the drifts.

Although adaptive tracking methods can solve drifting problems up to a scale, infinitesimal drifting of targets or sensors (camera) may not be compensated using standard adaptive algorithms. A real world example of such a problem is very small drifting of the camera, such as a one pixel drift over 100 frames. As a matter of fact, the slightly shifted target is considered as the original target and learnt in adaptive tracking methods. This can cause considerable drifts as the actual target model starts to deform. In order to handle both abrupt and indiscernible appearance changes, multiple models with different adaptation rates can be used interchangeably. Therefore, in this disclosure a multiple model visual tracking system is also presented with a model selection mechanism including two filter groups with different learning parameters.

References

[1] Nobuyuki Otsu, "A threshold selection method from gray-level histograms", IEEE Trans. Sys., Man, Cyber., vol. 9, no. 1, 1979, pages 62-66. doi: 10.1109/TSMC.1979.4310076

[2] Y. Wei, F. Wen, W. Zhu and J. Sun, "Geodesic Saliency Using Background Priors", ECCV, 2012.

Summary of the Invention

This invention proposes a method for the generation of a minimum sized bounding box for the target, which means target location and size estimation, in a visual tracking system. Moreover, a multiple model learning methodology is also presented to improve the adaptation limits of the system. Since the proposed methodologies require learning of the target model, they can be adapted to any model based tracking algorithm.

In the presented visual tracking framework, track initialization (101) is given by a user or a system by inputting an image patch that includes the target to be tracked. Since target bounding box generation requires that the track window contain the target completely, the drifts caused by the tracker should be prevented in order not to violate this condition. Therefore, a tracker robust to drifts should be preferred, to keep the whole target in the track window. In order to prevent drifts, the track is maintained by a multiple model visual target tracking procedure designed to adapt to different rates of temporal variations of the target (102), in which the model updates are achieved with different learning rates to compensate for abrupt and indiscernible changes in the target appearance, and the appropriate model is selected interchangeably.

The target bounding box generation and feedback decision (103) procedure operates on the updated saliency map. In each frame, a saliency map, referred to as the current saliency map, is calculated in the region of interest determined by the tracker. Since the aim is to use the temporal information of the saliency map, another saliency map, defined as the updated saliency map, is calculated from the current saliency map with an adaptive learning rate. One should note that using a constant learning rate may cause mislearning of the target silhouette in cases of full/partial occlusions or with noisy data; hence, an adaptive learning rate is utilized. The details of the calculation of the learning rate will be explained in the next section. After updating the saliency map with the calculated learning rate, the target selection is performed. At the output of the target selection process, the silhouette of the target, which is used for size estimation, as well as the location of the target, is determined. Using the current saliency map and the updated saliency map, a feedback is applied to the system. In the consecutive frames the procedure explained above is repeated and the target bounding box is obtained from the updated saliency map for each frame.

Brief Description of the Drawings

A system and method realized to fulfill the objective of the present invention is illustrated in the accompanying figures, in which:

Figure 1 shows the flowchart of the overall system.

Figure 2 shows the multiple model visual tracking algorithm steps.

Figure 3 shows the target bounding box generation and feedback decision steps.

Figure 4 shows the target bounding box with the track window.

Figure 5 shows the current and updated saliency maps with their binarized images at the beginning of the tracking.

Figure 6 shows the current and updated saliency maps with their binarized images after 380 frames.

Figure 7 shows the current and updated saliency maps with their binarized images after 510 frames.

Detailed Description of the Invention

Hereinafter, the present invention will be described in more detail with reference to attached drawings. The following embodiments are presented only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention.

The real-time visual tracking method for target localization and size estimation, based on correlation filters with multiple models, comprises the following steps:

- Given track initialization (101),

- Multiple model visual target tracking for the purpose of responding to abrupt and indiscernible changes of the target appearance (102),

- Target bounding box generation and feedback decision using the updated saliency map (103).

Multiple Model Target Tracking

Visual tracking is the method for tracking a selected region of interest throughout the video frames. The output of the visual tracker is a region of interest in which the saliency map is to be calculated. In each frame, the output of the target localization system is used to feed the visual tracker with the target size and location if the necessary conditions are satisfied. The details of the necessary conditions are explained in the part named 'Learning Rate Calculation and Temporal Refinement of the Saliency Map'.

In the present invention, a multiple model visual tracker is designed to respond successfully to abrupt and indiscernible changes of the target appearance. To achieve this goal, two filter groups are constructed. The first filter group is intended to resolve the drifts which are undetectably small for the second filter group, while the second filter group is designed to adapt to high variations in target appearance. The flow of the algorithm is shown in Figure 2. The presented tracking algorithm takes a region of interest in the new frame as input. Then, the necessary and appropriate preprocessing procedure for the incoming frame, such as low-pass filtering, contrast stretching, spatial windowing etc. (201), is performed according to the tracking method.
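A plausible reconstruction of the update rules Eqn.1 to Eqn.3, assuming standard exponential running averages with learning rates γ_low < γ_high; these symbols, and F̂(t) for the per-frame filter estimate, are notational assumptions:

$$F_1(t+1) = (1-\gamma_{low})\,F_1(t) + \gamma_{low}\,\hat{F}(t) \qquad \text{(Eqn.1)}$$

$$F_1(t+1) = (1-\gamma_{high})\,F_1(t) + \gamma_{high}\,\hat{F}(t) \qquad \text{(Eqn.2)}$$

$$F_2(t+1) = (1-\gamma_{high})\,F_2(t) + \gamma_{high}\,\hat{F}(t) \qquad \text{(Eqn.3)}$$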

In the above equations, F represents the filter groups, with the subscript denoting the filter group ID; t and t+1 denote the previous and the next filters, respectively, and F̂(t) represents the currently calculated value of the filter using the current frame. The first filter group takes the preprocessed image patch as input and the tracker algorithm is run according to the first filter group, by correlation matching using the first filter group of the multiple model visual tracking (202). The output of the filter generates a quality measure for the resulting target location. If the quality of the response of the first filter group to the target in the current frame is higher than a predefined threshold (203), the target location is updated according to the first filter group (204), and the first filter group is updated with a low learning rate while the second filter group is updated with a high learning rate (207), as in Eqn.1 and Eqn.3. Yet, if the quality measure of the first filter group is less than the predefined threshold (203), the second filter group generates a target location output, by correlation matching using the second filter group of the multiple model visual tracking (205), together with a quality measure. If the quality of the response of the second filter group in the current frame is higher than a predefined threshold (206), the target location is updated with respect to the second filter group (208) and both of the filter groups are updated with a high learning rate (210, 211), as in Eqn.2 and Eqn.3. If the quality of the second filter response is not high enough compared to the predefined threshold (206), the system detects occlusion (209). When the new frame is taken with the target information, the same procedure is applied, as shown in Figure 2. The thresholds used to decide on the quality of the response of the filter groups are design parameters and can be adjusted according to the specific system requirements.
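As an illustration of the selection logic above, a minimal Python sketch follows, assuming FFT-domain correlation and a MOSSE-style single-frame filter estimate; all identifiers, threshold values and learning rates (correlate, estimate_filter, tau1, tau2, g_low, g_high) are hypothetical choices for this sketch, not values taken from the disclosure:

    import numpy as np

    def gaussian_peak(shape, center, sigma=2.0):
        # Desired correlation output: a Gaussian centred on the target location.
        y, x = np.ogrid[:shape[0], :shape[1]]
        return np.exp(-((y - center[0]) ** 2 + (x - center[1]) ** 2) / (2 * sigma ** 2))

    def correlate(F, roi):
        # Correlation response in the Fourier domain; the peak gives the target
        # location and the peak value serves as a simple quality measure.
        resp = np.real(np.fft.ifft2(np.fft.fft2(roi) * F))
        loc = np.unravel_index(np.argmax(resp), resp.shape)
        return loc, float(resp.max())

    def estimate_filter(roi, loc):
        # MOSSE-style single-frame estimate, standing in for F-hat(t) above.
        X = np.fft.fft2(roi)
        G = np.fft.fft2(gaussian_peak(roi.shape, loc))
        return (G * np.conj(X)) / (X * np.conj(X) + 1e-5)

    def track_frame(roi, F1, F2, g_low=0.01, g_high=0.15, tau1=0.5, tau2=0.3):
        loc1, q1 = correlate(F1, roi)                    # step 202
        if q1 > tau1:                                    # step 203
            F_hat = estimate_filter(roi, loc1)
            F1 = (1 - g_low) * F1 + g_low * F_hat        # Eqn.1, step 207
            F2 = (1 - g_high) * F2 + g_high * F_hat      # Eqn.3, step 207
            return loc1, F1, F2                          # step 204
        loc2, q2 = correlate(F2, roi)                    # step 205
        if q2 > tau2:                                    # step 206
            F_hat = estimate_filter(roi, loc2)
            F1 = (1 - g_high) * F1 + g_high * F_hat      # Eqn.2, step 210
            F2 = (1 - g_high) * F2 + g_high * F_hat      # Eqn.3, step 211
            return loc2, F1, F2                          # step 208
        return None, F1, F2                              # step 209: occlusion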

To summarize the philosophy behind this methodology, the filter group which is designed to adapt to smaller changes has the priority to be used as the actual response, since abrupt changes are not normally expected. When rapid changes start to occur, the filter group designed to adapt to abrupt changes becomes active if the filter group for smaller changes cannot satisfy the quality requirements. In order to maintain the sustainability of the two filter groups, the filter group for smaller changes starts to be updated with the update parameters of the filter group for abrupt changes. Hence, the interaction between the multiple filter groups is used to tolerate each other's errors in different conditions, by using their corresponding learning rates when one of the filter groups starts to give low quality tracking results.

The procedure of the multiple model visual tracking can be summarized as follows:

- Applying a necessary and appropriate preprocessing procedure to the incoming frame (201),

- Correlation matching using the first filter group of the multiple model visual tracking (202),

- Querying the quality of the response of the first filter group to the target in the current frame (203),

- Updating the target location according to the first filter group (204) if the querying result of the response of the first filter group is greater than a selected threshold (203),

- Updating the first filter group with a low learning rate and updating the second filter group with a high learning rate (207),

- Correlation matching using the second filter group of the multiple model visual tracking (205), if the result of querying the quality of the first filter group is less than the selected threshold,

- Querying the quality of the response of the second filter group to the target in the current frame (206),

- Updating the target location according to the second filter group if the result of querying the quality of the second filter group is greater than a predefined threshold (208),

- Updating the first filter group with a high learning rate (210),

- Updating the second filter group with a high learning rate (211),

- Detecting occlusion (209), if the result of querying the second filter group is less than the predefined threshold (206).

Target Bounding Box Generation

Target bounding box generation actually means target location and size estimation and is divided into three substages: saliency map generation; learning rate calculation and temporal refinement of the saliency map; and target selection.

The target bounding box generation method provides the ability to detect the scale changes of the target through the video frames, to adapt to the scale changes, and to update the visual target model more appropriately than a visual tracking method without the ability of detecting the scale changes updates its tracking model. Moreover, this phase also provides an adaptive learning rate selection algorithm, which is designed to prevent mislearning of the target model in the cases of clutter or occlusion. By avoiding mislearning of the target through the frames, a target model is constructed which is appropriate to be used to boost redetection of the target after the target is lost due to occlusion, clutter or noise. This extracted target model is also appropriate to be used in a shape based classifier.

Saliency Map Generation

The saliency map of the region of interest, which is selected by the algorithm above, can be extracted by a saliency score calculation method. The fast saliency extraction method of Y. Wei et al., "Geodesic Saliency Using Background Priors", ECCV, 2012 [2], in which the saliency problem is tackled from a different perspective by focusing on the background more than the object, can be used as a saliency calculation tool. This method is capable of extracting a saliency map within a few milliseconds even on embedded systems; however, it has two basic assumptions about the input image that should be guaranteed, namely boundary and connectivity. The boundary assumption reflects the basic tendency of photographers/cameramen not to crop salient objects at the frame boundaries. Therefore, the image boundary is usually background. The connectivity assumption comes from the fact that background regions generally tend to be large and homogeneous, e.g. sky or grass. In other words, most background patches can be easily connected to each other piecewise. In our case, these two assumptions are fulfilled during tracking by simply selecting the initial target window to include the target, roughly centralized, and its immediate surroundings. With these two conditions satisfied, the salient regions are assumed to be the patches, extracted by downscaling or by any superpixel extraction algorithm, with a high geodesic distance from the boundaries of the image, which are assumed to correspond to piecewise-connected background regions. The geodesic saliency of a patch p is the accumulated edge weight along the shortest path from p to the virtual background node b in an undirected weighted graph, as in Eqn.4.
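In the notation of Wei et al. [2], Eqn.4 reads:

$$\mathrm{saliency}(p) \;=\; \min_{p_1 = p,\, p_2, \ldots, p_n = b} \;\sum_{i=1}^{n-1} \mathrm{weight}(p_i, p_{i+1}) \qquad \text{(Eqn.4)}$$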

Note that, since patches close to the center of the image require a longer path to reach the background, the accumulated weights tend to be larger for the center patches. Therefore, this method also favors the center image regions as more salient, which is reasonable since salient regions tend to occur around the center of the image.

Learning Rate Calculation and Temporal Refinement of the Saliency Map

Since target tracking is a continual process, it includes temporal information which can be used for target localization and size estimation. In the proposed methodology, the current saliency map is calculated for each frame (301), and then, considering the previously generated saliency maps, the updated saliency map is continuously learnt and given as input to the system for the next frame (302). The important point is that each saliency map may not represent the target with the same quality. Therefore, samples of higher quality should be weighted more in the updated saliency map. In effect, this quality measures the temporal consistency of the target. In this sense, adaptive learning rate calculation becomes very important for two reasons. First, due to noise or any imperfection of the sensor data, the saliency map may deviate from frame to frame. However, learning the saliency map from the deviated versions will extract the common structure as the target model; hence it can compensate for the noise. Second, when the target is fully or partially occluded, the abrupt change in the saliency map is detected and the learning rate is adjusted in a way that preserves the target model that existed before the occlusion.

For the learning rate calculation, two parameters, the saliency ratio and the correlation score, are used, which are calculated by using both the updated and the current saliency maps. Firstly, the current saliency map is calculated and binarized for each image. Then the first parameter for the learning rate, the saliency ratio, is calculated for the current saliency map (303) as in Eqn.5, where the dominant components represent the saliency values greater than the binarization threshold in the current saliency map. To be clearer, this metric is designed to measure the distinctiveness of the target. In the cases where only the target has high saliency values, the saliency ratio will be 1, which means the target in the scene is very distinctive. Hence, for extraction of the target model such a frame is very reliable and should be learnt with a high learning rate.
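A plausible form of Eqn.5 consistent with this description, writing S_c(t) for the current saliency map (the symbols are notational assumptions): the saliency ratio is the fraction of the total saliency carried by the dominant components,

$$R_S(t) \;=\; \frac{\sum_{p \,\in\, \text{dominant}} S_c(t)(p)}{\sum_{p} S_c(t)(p)} \qquad \text{(Eqn.5)}$$

so that R_S(t) approaches 1 when only the target region carries high saliency values.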

The second metric is the correlation score, D_NCC(t), which is a very strong cue for detecting abrupt changes from the updated target model. To achieve this goal, the normalized cross correlation between the target model, selected (304) from the updated saliency map using the target selection procedure (307) that will be described in detail in the target selection part, and the current saliency map is taken as in Eqn.6.
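Normalized cross correlation has a standard form; with M_t the target model and S_c the current saliency map (symbols assumed for illustration), Eqn.6 can be written as:

$$D_{NCC}(t) \;=\; \max_{u,v}\; \frac{\sum_{x,y}\big(M_t(x,y)-\bar{M}_t\big)\big(S_c(x+u,y+v)-\bar{S}_{c,u,v}\big)}{\sqrt{\sum_{x,y}\big(M_t(x,y)-\bar{M}_t\big)^2 \,\sum_{x,y}\big(S_c(x+u,y+v)-\bar{S}_{c,u,v}\big)^2}} \qquad \text{(Eqn.6)}$$

where the bars denote means over the correlation window.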

Using the saliency ratio and the correlation score, the learning rate is calculated at each frame as in Eqn.7 (305) and represented with the symbol λ(t) at time t. Note that the ranges of both metrics extend from 0 to 1, and if both are 1 the current target overwrites the updated target, which is not desired since it clears out all temporal information. In order to prevent this, the maximum learning rate is restricted to a constant α. Moreover, in order to prevent mislearning of the target, a penalization constant β is used whenever the target model and the best possible match have a resemblance below the feedback threshold, which simply means the system fully updates the target model only when the measurement is considered to be secure.
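One plausible form of Eqn.7 consistent with the constraints above (maximum rate α, penalization 0 < β < 1 below the feedback threshold τ; the exact combination of the two metrics is an assumption):

$$\lambda(t) \;=\; \begin{cases} \alpha\, R_S(t)\, D_{NCC}(t), & D_{NCC}(t) \ge \tau \\ \beta\,\alpha\, R_S(t)\, D_{NCC}(t), & D_{NCC}(t) < \tau \end{cases} \qquad \text{(Eqn.7)}$$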

After calculation of the learning rate, the saliency map is updated according to the calculated learning rate at each frame, as in Eqn.8 (306). The natural response of such a learning framework is to learn more if the current salient component is worth considering. Moreover, the components which are consistent with the learnt saliency map are also learnt more.
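Since the updated map is described as a weighted average of previous frames, Eqn.8 is presumably the exponential update, with S_u the updated and S_c the current saliency map (symbols assumed):

$$S_u(t+1) \;=\; \big(1-\lambda(t)\big)\, S_u(t) \;+\; \lambda(t)\, S_c(t) \qquad \text{(Eqn.8)}$$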

Since the correlation metric shows the resemblance between the target model and the current saliency map, it is also used to answer the question of when the feedback should be given to the visual tracking system as the actual location of the object. The formulation in Eqn.9 is used for querying tracking feedback (308).
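A form of Eqn.9 consistent with the description, assuming the same feedback threshold τ as above:

$$\mathrm{IsFeedBack}(t) \;=\; \begin{cases} 1, & D_{NCC}(t) \ge \tau \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Eqn.9)}$$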

In Eqn.9, the IsFeedBack variable is a binary variable controlling the decision of giving feedback or not (feedback is given if 1, withheld if 0). If the correlation score is high enough, then the current saliency map is consistent with the previous behavior of the region of interest. This results in giving feedback to the visual tracking system, since it is the signature of a secure measurement. Moreover, exploiting the feedback mechanism results in compensation for false target initialization, where targets which are not well-localized at initialization are centralized via the feedback mechanism. On the other hand, when the correlation score is low, the system should not accept the location information coming from the current saliency map, since a low score indicates the existence of obstacles or occlusion in the current frame.

One should note that all these saliency calculations are carried out in the window which is the output of the visual tracking algorithm, referred to as the region of interest in this context and illustrated as the outer bounding box in Figure 4. After applying the presented target bounding box generation method, the inner bounding box, which identifies the target location and size, is generated. This results in locating the object in the region of interest by generating a minimum size bounding box including the target, instead of the region of interest given by the real time visual tracking method; the target bounding box and region of interest illustrations are given in Figure 4.

A simple illustration is given in Figure 5, Figure 6 and Figure 7, the illustrations for the 68th, 437th and 567th frames, respectively. In these figures, the top-left window is dedicated to the original gray-scale image, in which the track bounding box, the larger rectangle determining the region of interest, is visualized together with the target bounding box, the small rectangle revealing the target location and size. The top-middle figure shows the updated saliency map and the top-right figure illustrates the binarization result of the updated saliency map. The bottom-middle figure shows the current saliency map and the bottom-right figure shows the binarization of the current saliency map. The bottom-left figure shows the normalized cross correlation result of the updated and current saliency maps. When the target is partly or fully occluded, as in the case shown in Figure 6, the updated and current saliency maps (top-middle and bottom-middle in Figure 6, respectively) would be dissimilar. This is the case when the temporal consistency is spoilt. The dissimilarity would yield a low cross correlation between the target model and the current saliency map, which prevents giving the location feedback. Moreover, the learning rate is decreased with the penalty term β, as in Eqn.7, to preserve the target model. After 510 frames (Figure 7), the occlusions coming from the trees and the moving person disappear, and the system first starts to increase the learning rate due to the effect of the saliency ratio metric. This then yields an increase in the correlation score, and when the upper condition in Eqn.7 is satisfied, the system both omits the penalization term β in Eqn.7 and starts to give the location of the target as feedback to the tracker. In this way, the algorithm presented here is not affected by occlusion and clutter more than the classical feedback mechanism for the visual tracker.

Target Selection

Target selection procedure (307) is achieved in two steps: binarization and maximization of the regularization energy. Then, the target bounding box is outputted as the bounding box of the selected connected component.

Although minimum computational cost is desired in each step, using a static threshold or suboptimal methods for binarization may be problematic. Thus, Otsu's method is used with a slight refinement. The method of N. Otsu, "A threshold selection method from gray-level histograms", 1979 [1], can be defined as an exhaustive search for the threshold that either minimizes the within-class variance or maximizes the between-class variance. The between-class variance is often calculated as given in Eqn.10:

$$\sigma_B^2(T) \;=\; \omega_0(T)\,\omega_1(T)\,\big[\mu_0(T)-\mu_1(T)\big]^2 \qquad \text{(Eqn.10)}$$

where ω₀(T), ω₁(T) are referred to as the class probabilities and μ₀(T), μ₁(T) are the class means. After some manipulations, Eqn.10 can be written as in Eqn.11:

$$\sigma_B^2(T) \;=\; \omega_0(T)\,\mu_0^2(T) + \omega_1(T)\,\mu_1^2(T) - \mu^2 \qquad \text{(Eqn.11)}$$

where μ is the mean value of the histogram. Since the purpose is to calculate the optimal threshold value T that maximizes the between-class variance, the problem can be solved by inserting either Eqn.10 or Eqn.11 into Eqn.12:

$$T^{*} \;=\; \arg\max_{0 \le T < L} \; \sigma_B^2(T) \qquad \text{(Eqn.12)}$$

Note that using Eqn.10 and Eqn.11 directly results in Eqn.13 and Eqn.14 respectively, i.e. the corresponding expressions written in terms of the histogram counts, where the number of pixels with gray level i is given by nᵢ. As can be seen, using Eqn.11 is slightly advantageous since the constant μ term drops out of the maximization. This slight modification results in one less multiplication in Eqn.14 than in Eqn.13, which results in L fewer multiplications in the exhaustive search used in Otsu's methodology.
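For illustration, a minimal Python sketch of this exhaustive search follows, written against the Eqn.11-style criterion with the constant μ² term dropped; it assumes an 8-bit saliency map, and the function and variable names are hypothetical:

    import numpy as np

    def otsu_threshold(saliency_map, n_bins=256):
        # Normalized gray-level histogram of the (8-bit) saliency map.
        hist, _ = np.histogram(saliency_map, bins=n_bins, range=(0, n_bins))
        p = hist.astype(np.float64) / hist.sum()
        levels = np.arange(n_bins)

        w0 = np.cumsum(p)              # class-0 probability for each candidate T
        m0 = np.cumsum(p * levels)     # class-0 cumulative first moment
        mu = m0[-1]                    # global histogram mean

        eps = 1e-12                    # guard against empty classes
        mu0 = m0 / np.maximum(w0, eps)
        mu1 = (mu - m0) / np.maximum(1.0 - w0, eps)

        # Eqn.11 criterion with the constant mu^2 dropped:
        # maximize w0*mu0^2 + w1*mu1^2 over all candidate thresholds T at once.
        score = w0 * mu0 ** 2 + (1.0 - w0) * mu1 ** 2
        return int(np.argmax(score))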

After thresholding the saliency map, the connected component maximizing the regularization energy given by Eqn.15, i.e. the most salient region with the minimum distance to the center, is selected as the target.
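A plausible form of the regularization energy, rewarding the total saliency of a component while penalizing its distance to the window center (the exact trade-off and the symbols are assumptions):

$$E(i) \;=\; \frac{l_i^{\top} S}{1 + \lVert c_i - c_w \rVert_2}, \qquad i^{*} = \arg\max_i \, E(i) \qquad \text{(Eqn.15)}$$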

where l_i is the vectorized form obtained by raster scanning the 2D label matrix of the i-th connected component, with values 1 and 0 denoting foreground and background respectively, S is the saliency map vectorized similarly, and c_i and c_w are the centers of the i-th connected component and the initial window, respectively.

The target bounding box generation and feedback decision procedure can be summarized as:

- Calculating current saliency map (301),

- Calculating saliency ratio for the current saliency map (303),

- Calculating the correlation score using normalized cross correlation between the target model selected (304) from the updated saliency map using target selection procedure (307) and current saliency map,

- Calculating the learning rate using the saliency ratio and the correlation score (305),

- Updating the saliency map according to the calculated learning rate (306), to obtain the updated saliency map as input to the system for the next frame (302),

- Selecting target according to the updated saliency map (307),

- Querying tracking feedback (308).