Title:
METHOD AND SYSTEM FOR CREATING IMAGES
Document Type and Number:
WIPO Patent Application WO/2018/104700
Kind Code:
A1
Abstract:
A method of generating an orthomosaic image of a geographical survey area is disclosed. The method comprises: a. recording a set of still images of the survey area using a land-borne camera, in which the camera has an optical axis that is angled downwardly with respect to a surface of the survey area and recording a geographical position corresponding to each image; b. comparing pairs of partially overlapping images to identify common features in the images to derive a model of the motion between the images; c. calculating a pose in respect of each image describing where the camera was in world coordinates and where it was looking when the image was captured; d. projecting each image onto a ground plane to create a set of image tiles; and e. stitching together a plurality of image tiles to create an orthomosaic image corresponding to an area that is present in a plurality of the images recorded in step a. Procedures for detecting overlaps in images and identifying common features are also disclosed.

Inventors:
REMDE STEPHEN (GB)
TANATHONG SUPANNEE (GB)
SMITH WILLIAM ALFRED PETER (GB)
Application Number:
PCT/GB2017/053410
Publication Date:
June 14, 2018
Filing Date:
November 13, 2017
Assignee:
GAIST SOLUTIONS LTD (GB)
International Classes:
G06T3/40; H04N5/232
Domestic Patent References:
WO2001015081A22001-03-01
WO2008044911A12008-04-17
Foreign References:
CN101893443B2012-03-21
Other References:
GI-HONG KIM ET AL: "Road Infrastructure Data Acquisition Using a Vehicle-Based Mobile Mapping System", COMPUTER-AIDED CIVIL AND INFRASTRUCTURE ENGINEERING, BLACKWELL PUBLISHERS, MALDEN, US, vol. 21, no. 5, 1 July 2006 (2006-07-01), pages 346 - 356, XP002434071, ISSN: 1093-9687, DOI: 10.1111/J.1467-8667.2006.00441.X
Attorney, Agent or Firm:
HAMILTON, Alistair (GB)
Claims:

1. A method of generating an orthomosaic image of a geographical survey area comprising: a. recording a set of still images of the survey area using a land-borne camera, in which the camera has an optical axis that is angled downwardly with respect to a surface of the survey area and recording a geographical position corresponding to each image; b. comparing pairs of partially overlapping images to identify common features in the images to derive a model of the motion between the images;

c. calculating a pose in respect of each image describing where the camera was in world coordinates and where it was looking when the image was captured; d. projecting each image onto a ground plane to create a set of image tiles; and e. stitching together a plurality of image tiles to create an orthomosaic image corresponding to an area that is present in a plurality of the images recorded in step a.

2. A method according to claim 1 in which, prior to step b., a search is performed to identify a set of pairs of images that are likely to contain overlapping regions for comparison in step b.

3. A method according to claim 2 in which the search includes use of geometric heuristics to identify pairs of images that are likely to contain overlapping regions.

4. A method according to any preceding claim in which, in step b., motion between pairs of images is mapped using a homography, a transformation that is valid when two images view the same planar scene from different positions.

5. A method according to claim 4 in which step b., includes finding feature matches between pairs of images that are consistent with a homography and computing the homography that is consistent with these matches.

6. A method according to claim 4 or claim 5 including, in step b., a step of optimising the estimates of camera poses created in step b., by identifying poses that would give homographies between views that are consistent with those estimated from the image pairs in step b.

7. A method according to claim 4 or claim 5 in which, in step b., images are projected onto a ground plane prior to comparison.

8. A method according to claim 7 in which, in making the comparison in step b., it is assumed that a pair of images differ only by a 2D rotation and/or a translation.

9. A method according to any one of claims 4 to 8 in which estimated homographies are filtered to discard those likely to be a poor model of the transformation between two views.

10. A method according to claim 9 in which a homography is deemed to be poor if the number of common features identified in step b., is below a threshold.

11. A method according to claim 9 or claim 10 in which a homography is deemed to be poor if, when applied to a first of an image pair, the number of common features identified in step b., that it maps erroneously onto a second of an image pair is above a threshold.

12. A method according to any one of claims 9 to 11 in which a pose is deemed to be poor if it deviates from an initial estimate derived from a recorded geographical position by more than a threshold.

13. A method according to any one of claims 9 to 12 in which a homography is deemed to be poor if it implies non-smooth motion.

14. A method according to any preceding claim in which an objective function is applied in step c, ensuring that the estimated poses give homographies that are consistent with those estimated between overlapping images.

15. A method according to claim 14 in which, for each of a pair of images, a homography is computed from the image to the ground plane, and this homography is applied to the 2D positions of matched image features to yield matched features on the ground plane, wherein the distances between pairs of matched features are residuals that are added to the objective function such that the square of these residuals is minimised.

16. A method according to any preceding claim in which one or more prior assumptions is made about the pose as part of the calculation in step c.

17. A method according to claim 16 in which the prior assumptions include one or more of a constant camera roll angle, a constant camera height, a constant camera pitch angle and a ground control point of known position.

18. A method according to claim 16 or claim 17 in which the prior assumptions include a GPS position.

19. A method according to any preceding claim in which, after step d., with each orthomosaic tile to be stitched, there is associated the set of images whose projections overlap that tile.

20. A method according to claim 19 in which the associated set is derived by computing the projection of the corners of each image and then performing an inside-polygon test for each tile.

21. A method according to claim 19 or claim 20 in which for each pixel in a tile there is applied a perspective projection operation into each image, giving a non-integer 2D image coordinate.

22. A method according to claim 21 in which the nonlinear distortion induced by the camera lens is applied to the 2D position to give the final image coordinate.

23. A method according to any preceding claim in which linear interpolation is used to compute the colour at each point in an image to be stitched to give a reprojection of the image to the ground plane.

24. A method according to any preceding claim in which, in step e., a weighting function that chooses the "best" gradient for each tile pixel from each image is applied to create the stitched image.

25. A method according to claim 24 in which the weighting function is chosen such that the gradient is taken from the image that observed that point at the highest resolution.

26. A method according to any preceding claim in which a mask is applied to exclude one or more regions of a recorded image.

27. A method according to claim 26 in which a static mask is applied to each recorded image, the static mask being the same for each image to which it is applied.

28. A method according to claim 26 or claim 27 in which a dynamic mask is applied to at least some recorded images, the dynamic mask varying from one image to another.

29. A method according to any preceding claim in which the stitched image is computed by solving a sparse system of linear equations.

30. A method according to claim 29 in which the selected gradients provide targets that the stitched image must try to match and guide intensities that the stitched image must also try to match during the stitching process.

31. A method according to claim 30 in which the guide is the average of the projected images.

32. A method according to any preceding claim in which, in step e., a check is made as to whether any adjacent tiles have already been stitched, and if they have, a small overlap into the existing tiles is added to the new tile.

33. A method according to claim 32 wherein in the overlap region, there is only the guide intensity constraint which is weighted very highly in the objective function and there is a smooth transition to a lower weighted guide constraint for the interior of the tile.

34. A method according to any preceding claim in which, in step a., an initial estimate of a pose is derived from the recorded geographical position.

35. A method according to claim 34 in which the initial estimate includes the direction of the optical axis of the camera being determined by the recorded GPS direction of orientation, the viewing angle of the imaged surface being as determined during an initial calibration of the system, and the transverse rotation being zero.

36. A method according to any preceding claim in which step a. is carried out on a land vehicle.

37. A method according to claim 36 in which the vehicle is a road vehicle.

38. A method according to any preceding claim in which steps b. to e. are carried out on a programmed computer.

39. A method of image processing comprising performing steps b. to e. according to any preceding claim.

40. Computing apparatus programmed to perform steps b. to e. according to any preceding claim.

Description:
METHOD AND SYSTEM FOR CREATING IMAGES

This invention relates to a method and system for creating images. In particular, it relates to a system capable of building virtual top-down view images of very large areas, for example, of road networks simply by driving a vehicle around the road network and collecting images with a camera oriented towards the road surface.

The term "mosaic" means an image that is constructed from a number of overlapping images. The type of images created by a system embodying the invention are referred to as "orthomosaics", reflecting the fact that they are a mosaic of many images (perhaps millions) and that they approximate an orthographic view of an area being surveyed from above, similar to an image that might be captured by a satellite.

There is a demand for high-resolution orthographic imagery of large areas of terrain, typically road surfaces, to enable surveying and assessment, for example to enable maintenance to be scheduled. Such imagery can be obtained from satellites, but this is a very costly operation, and requires favourable atmospheric conditions if sufficient detail is to be resolved.

An aim of this invention is to produce orthomosaic images using a ground-based system at a significantly lower cost and in a more versatile manner than satellite imaging.

To this end, this invention provides a method of generating an orthomosaic image of a geographical survey area comprising:

a. recording a set of still images of the survey area using a land-borne camera, in which the camera has an optical axis that is angled downwardly with respect to a surface of the survey area and recording a geographical position corresponding to each image;

b. comparing pairs of partially overlapping images to identify common features in the images to derive a model of the motion between the images;

c. calculating a pose in respect of each image describing where the camera was in world coordinates and where it was looking when the image was captured;

d. projecting each image onto a ground plane to create a set of image tiles; and

e. stitching together a plurality of image tiles to create an orthomosaic image corresponding to an area that is present in a plurality of the images recorded in step a.

This method computes camera poses without ever estimating 3D positions of feature points. This makes the dimensionality of the problem much lower than conventional structure-from-motion analysis, meaning that it can be solved more quickly, allowing it to scale to much larger sets of images. This is made possible by assuming that the surface being imaged is approximately planar.

Prior to step b., a search may be performed to identify a set of pairs of images that are likely to contain overlapping regions for comparison in step b. The search may include use of geometric heuristics to identify pairs of images that are likely to contain overlapping regions. This means that the content of the images need not be compared at this stage. In step b., motion between pairs of images may be modelled using a homography, a transformation that is valid when two images view the same planar scene from different positions.

Step b., may include simultaneously finding feature matches between pairs of images that are consistent with a homography and computing the homography that is consistent with these matches. Step b. may include a step of optimising the estimates of camera poses created in step b., by identifying poses that would give homographies between views that are consistent with those estimated from the image pairs in step b.

Advantageously, in step b., images are projected onto a ground plane prior to comparison. In such cases, it may be assumed that a pair of images differ only by a 2D rotation and/or a translation. Estimated homographies are, most advantageously, filtered to discard those likely to be a poor model of the transformation between two views. This allows unreliable transitions to be removed from the process of optimisation, so avoiding corruption of the motion estimation in step c. A homography can be deemed to be "poor" using several alternative heuristic tests. For example, a homography may be deemed to be poor if the number of common features identified in step b., is below a threshold. A homography may be deemed to be poor if, when applied to a first of an image pair, the number of common features identified in step b., that it maps erroneously onto a second of an image pair is above a threshold.

In step b., the method may include a step of optimising the estimated pose of the camera, by identifying poses that would give homographies between views that are consistent with those estimated from the image pairs in step b.

An objective function may advantageously be optimised in step c, to ensure that the estimated poses give homographies that are consistent with those estimated between overlapping images. To create the objective function, for each of a pair of images, a homography is computed from the image to the ground plane, and this homography is applied to the 2D positions of matched image features to yield matched features on the ground plane, wherein the distances between pairs of matched features are residuals that are added to the objective function such that the square of these residuals is minimised. These steps directly optimise how well the images will align when they are projected to the ground plane. This is entirely appropriate since production of a seamless mosaic is an aim of the invention. Other residual errors in the camera pose are included as appropriate. For example, a pose may be deemed to be poor if it deviates from an initial estimate derived from a recorded geographical position by more than a threshold, and a pose may be deemed to be poor if it implies non-smooth motion. Other tests may be used, and tests may be applied individually or in combination.

In step c. one or more prior assumptions may be made about the pose as part of the calculation. These may include one or more of a constant camera roll angle, a constant camera height, a constant camera pitch angle and a ground control point of known position. Alternatively or additionally, the prior assumptions include a GPS position. In a preferred embodiment, after step d., with each orthomosaic tile to be stitched, there is associated the set of images whose projections overlap that tile. This can be achieved by computing the projection of the corners of each image and then performing an inside-polygon test for each tile. For each pixel in a tile there may be applied a perspective projection operation into each image, giving a non-integer 2D image coordinate. The nonlinear distortion induced by the camera lens may then be applied to the 2D position to give the final image coordinate. Hence, for each image there is produced a set of 2D coordinates corresponding to each pixel in the ground plane tile. Linear interpolation may then be used to compute the colour at each point in an image to be stitched to give a reprojection of the image to the ground plane.

The final step in the pipeline is to combine the information from the ground plane projections of the images into a single, seamless orthomosaic. If the method were to compute a single mosaiced image covering the whole surveyed area, the problem would be intractable. Moreover, to enable viewing of the images over a slow network, the mosaic must be broken up into tiles covering different areas at different zoom levels. For this reason, stitching is most advantageously performed for tiles independently.

In step e., a weighting function that chooses the "best" gradient for each tile pixel from each image is typically applied to create the stitched image. The weighting scheme means that the gradient is taken from the image that observed that point at the highest resolution. This encourages the inclusion of maximum detail in the orthomosaic. Additionally or alternatively to the weighting function, a mask may be applied to the image, the mask indicating pixels of the image that are to be excluded from the mosaic. The mask can be applied to remove artefacts from the image or other environmental objects that are not of interest. The mask may be a static mask applied to each recorded image, the static mask being the same for each image to which it is applied. This may be used to exclude static artefacts such as visible parts of a vehicle. Alternatively or additionally, a dynamic mask is applied to at least some recorded images, the dynamic mask varying from one image to another. This may be used to exclude changing artefacts such as shadows.

Advantageously, the stitched image may be computed by solving a sparse system of linear equations. Such systems can be solved efficiently, meaning that the tile stitching process could potentially be done in real time, on demand as requests for tiles are made from the viewing interface. The selected gradients typically provide targets that the stitched image must try to match and guide intensities that the stitched image must also try to match during the stitching process. The guide may be the average of the projected images.

To avoid seams at the boundaries of tiles, in step e., a check may be made as to whether any adjacent tiles have already been stitched, and if they have, a small overlap into the existing tiles is added to the new tile. In the overlap region, there may be only the guide intensity constraint, which is weighted very highly in the objective function, with a smooth transition to a lower weighted guide constraint for the interior of the tile. The result is that intensities at tile boundaries match exactly and any large intensity changes that would have occurred are made less visible as low frequency transitions. This is a beneficial step that allows tiles to be stitched one at a time while still producing a seamless mosaic when adjacent tiles are viewed next to each other.

In order to optimise and accelerate the estimation of poses in step a., an initial estimate of a pose may be derived from the recorded GPS position. The initial estimate typically includes the direction of the optical axis of the camera (yaw) being determined by the recorded GPS direction of orientation, the viewing angle of the imaged surface (pitch) as determined during an initial calibration of the system, and a transverse rotation (roll) of zero.

Step a., is typically carried out on a land vehicle such as a road vehicle, such as a motor road vehicle (a van, or similar) or a human-powered vehicle, such as a bicycle. Steps b. to e. are typically carried out on a programmed computer.

From a second aspect, this invention provides a method of image processing comprising performing steps b. to e. according to any preceding claim.

From a third aspect, this invention provides computing apparatus programmed to perform steps b. to e. according to any preceding claim.

An embodiment of the invention will now be described in detail, by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows, diagrammatically, a vehicle equipped with a camera that is suitable for use with a system embodying the invention;

Figure 2 is a schematic illustration of a system to generate an orthomosaic map in an embodiment of the present invention;

Figure 3 is a high-level process flow diagram showing steps used to produce an orthomosaic for a tile;

Figure 4 illustrates projecting an image captured from a capturing device to the ground plane, presented as a tile grid with respect to the 2D Cartesian ground/reference coordinate system;

Figure 5 shows steps in matching features between a pair of images in an embodiment of the invention;

Figure 6 illustrates multiple projected images overlapping on the same tile;

Figure 7 illustrates a Map Tile Index data file;

Figure 8 is an illustration of the creation of a ground plane image tile from a source image in reverse order via a reverse remapping function;

Figure 9 is a process flow diagram showing steps to associate 2D ground coordinates in a tile with pixel coordinates of the source image in order to create a ground plane image in reverse order;

Figure 10 is an illustration of assigning weight coefficients to pixels in the captured image, in which the weight assigned is inversely proportional to the distance of the pixel from the camera;

Figure 11 is a process flow diagram showing steps to construct the inputs from the captured images in a tile for the linear equation system used to generate an orthomosaic tile;

Figure 12 is an illustration of an image of guide intensity as part of generating an orthomosaic tile, in which the size of the guide image is enlarged such that the guide image includes the intensity of the neighbouring orthomosaic tile at the overlap regions to avoid seams at the boundaries of the tile;

Figure 13 is an illustration of determining weight values when there exist stitched tiles at the boundaries of the current tile, in which the tile size is extended from n x n to m x m such that it includes the overlap regions with its neighbour tiles; and

Figure 14 illustrates masking of images to exclude artefacts.

Introduction

A vehicle 10 for use with the present invention includes a camera 12 that is mounted on a high part of the vehicle, pointing generally forward and angled downward. The vehicle 10 is also equipped with global navigation satellite system apparatus 14, typically using the Global Positioning System (GPS). As the vehicle is driven forward, the camera captures a sequence of images of the terrain immediately in front of the vehicle. These images are stored in a data capture and storage system 16 and the geographical location of each image, as determined by the global navigation satellite system, is recorded.

An initial step is to perform a camera calibration in order to compute an intrinsic calibration matrix K:

K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}    (1)

where $f_x$ and $f_y$ are the focal lengths in the x and y directions and $c_x$ and $c_y$ define the centre of projection. This calibration matrix describes the specific properties of how the camera projects the 3D world into a 2D image. In addition, a calibration is made for distortion parameters that model how the camera optics deviate from a simple pinhole camera model. These distortion parameters can be used for a nonlinear adjustment to images or image locations to remove the effect of these distortions.

Image Processing

Data from the data storage device are transferred to a computer for processing in order to generate an orthomosaic image. The computer is programmed to process the data in a manner that will now be described.
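By way of illustration, the calibration products described above might be represented as follows (a minimal sketch in Python using OpenCV; the numeric values and the five-coefficient distortion model are assumptions for illustration, not values from this disclosure):

    import numpy as np
    import cv2

    # Assumed calibration values for illustration only.
    fx, fy, cx, cy = 1400.0, 1400.0, 960.0, 540.0
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    dist = np.array([-0.30, 0.10, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

    # Undistort 2D feature locations (pixel coordinates) using the
    # distortion parameters, as is done for feature matching below.
    pts = np.array([[[100.0, 200.0]], [[640.0, 480.0]]], dtype=np.float32)
    undistorted = cv2.undistortPoints(pts, K, dist, P=K)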

Motion-from-homographies

The first stage of the processing applied to the image sequence is to compute an accurate pose (orientation and position) for the camera in every captured image. The pose needs to be sufficiently accurate that, when images are later projected to the ground plane, overlapping images have at least pixel-accurate alignment. Otherwise, there will be misalignment artefacts in the generated orthomosaic images. This is essentially a structure-from-motion (SFM) problem. However, previous approaches to SFM are not applicable in this setting, for two reasons:

• Images in which the scene is primarily the road surface are largely planar. This is a degenerate case for estimating a fundamental matrix and subsequently reconstructing 3D scene points.

• Typically, the number of 3D scene points matched between images and reconstructed by the SFM process is much larger than the number of pose parameters to be estimated. This means that SFM does not scale well to very large problems, e.g. those containing millions of images.

Similarly, methods based on Simultaneous Localisation and Mapping (SLAM) are not applicable. They require high frame rates in order to robustly track feature points over time. Sampling images at this frequency is simply not feasible when the aim is to build orthomosaics of thousands of kilometres of road.

Therefore, this embodiment provides an alternative that addresses the drawbacks of using SFM or SLAM for our problem. This approach will be called "motion-from-homographies".

Motion-from-homographies assumes that the scene being viewed is locally planar. This allows it to relate images that are close together in the sequence by a planar homography. The approach exploits the temporal constraints that arise from knowing the images in a trace come from an ordered motion sequence by only finding feature matches between pairs of images that are close together in the sequence. Therefore, it is only necessary to compute pairwise matches between image features and not reconstruct the 3D world position of image features. This vastly reduces the complexity of the optimisation problem that must be solved compared with known approaches. The number of unknowns is simply 6N for an image sequence of N images.

Motion Model

The pose of the camera is represented by a rotation matrix and a translation vector. The pose associated with the $i$th captured image in a sequence is $R_i$ and $t_i$. The position in world coordinates of a camera can be computed from its pose as $C_i = -R_i^T t_i$.

A world point is represented by coordinates $\mathbf{w} = [u\ v\ w]^T$, where $(u, v)$ is a 2D UTM coordinate representing position on the ground plane and $w$ is altitude above sea level. Each camera has a standard right-handed coordinate system with the optical axis aligned with the $w$ axis. A world point in the coordinate system of the $i$th camera is given by:

\mathbf{w}_i = R_i \mathbf{w} + t_i    (2)

The rotation is a composition of a number of rotation matrices. The first, $R_{w2c}$, aligns the world $w$ axis with the optical axis of the camera (3).

The system models vehicle orientation by three angles. This representation is chosen because the vehicle motion model leads to constraints that can be expressed very simply in terms of these angles. Hence, three rotation matrices are defined in terms of these angles: a yaw rotation $R_{\text{yaw}}(\alpha)$ (4), and the pitch and roll rotations

R_{\text{pitch}}(\beta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\beta & -\sin\beta \\ 0 & \sin\beta & \cos\beta \end{pmatrix}    (5)

R_{\text{roll}}(\gamma) = \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}    (6)

The overall rotation as a function of these three angles is given by:

R(\alpha, \beta, \gamma) = R_{\text{roll}}(\gamma)\, R_{\text{pitch}}(\beta)\, R_{\text{yaw}}(\alpha)\, R_{w2c}    (7)

Hence, the rotation of the $i$th camera depends upon the estimate of the three angles for that camera:

R_i = R(\alpha_i, \beta_i, \gamma_i)    (8)
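A sketch of the rotation composition (5)-(7) and the camera centre computation in Python/numpy follows. The exact forms of $R_{\text{yaw}}$ and $R_{w2c}$ are not reproduced in the extracted text, so the yaw matrix below (a rotation about the camera's vertical axis) and the identity placeholder for $R_{w2c}$ are assumptions:

    import numpy as np

    def rot_pitch(b):                       # equation (5)
        return np.array([[1, 0, 0],
                         [0, np.cos(b), -np.sin(b)],
                         [0, np.sin(b),  np.cos(b)]])

    def rot_roll(g):                        # equation (6)
        return np.array([[np.cos(g), -np.sin(g), 0],
                         [np.sin(g),  np.cos(g), 0],
                         [0, 0, 1]])

    def rot_yaw(a):                         # assumed form, not from the source
        return np.array([[np.cos(a), 0, np.sin(a)],
                         [0, 1, 0],
                         [-np.sin(a), 0, np.cos(a)]])

    R_w2c = np.eye(3)                       # placeholder for equation (3)

    def rotation(a, b, g):                  # composition, equation (7)
        return rot_roll(g) @ rot_pitch(b) @ rot_yaw(a) @ R_w2c

    def camera_centre(R, t):                # C_i = -R_i^T t_i
        return -R.T @ t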

Initialisation

The system relies on GPS and an initial estimate of the camera height above the road surface to initialise the location of each camera:

C_i^{\text{init}} = \begin{pmatrix} u_i^{\text{GPS}} \\ v_i^{\text{GPS}} \\ w^{\text{measured}} \end{pmatrix}    (9)

where $(u_i^{\text{GPS}}, v_i^{\text{GPS}})$ is the GPS estimate of the ground plane position of the $i$th camera and $w^{\text{measured}}$ is the measured height of the camera above the road surface in metres. This need only be a rough estimate as the value is subsequently refined.

To initialise rotation, the yaw angle from the GPS bearing is computed first. A first step in doing this is to compute a bearing vector using a central difference approximation:

b_i = \begin{pmatrix} u_{i+1}^{\text{GPS}} - u_{i-1}^{\text{GPS}} \\ v_{i+1}^{\text{GPS}} - v_{i-1}^{\text{GPS}} \end{pmatrix}    (10)

Second, this is converted into a yaw angle estimate:

\alpha_i = \operatorname{atan2}(-b_{i,1}, b_{i,2})    (11)

The pitch is initialised to a measured value for the angle between the camera optical axis and the road surface, and the roll to zero:

\beta_i^{\text{init}} = \beta^{\text{measured}}    (12)

\gamma_i^{\text{init}} = 0    (13)

Again, $\beta^{\text{measured}}$ need only be roughly estimated since it is later refined.
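The initialisation of equations (9)-(13) can be sketched as follows (Python/numpy; np.gradient supplies the central differences of equation (10), falling back to one-sided differences at the ends of the trace):

    import numpy as np

    def initialise_poses(u_gps, v_gps, w_measured, pitch_measured):
        """Rough initial pose estimates from GPS, later refined by the
        pose optimisation. u_gps, v_gps: per-image ground plane coordinates."""
        u_gps, v_gps = np.asarray(u_gps, float), np.asarray(v_gps, float)
        n = len(u_gps)
        centres = np.stack([u_gps, v_gps, np.full(n, w_measured)], axis=1)  # (9)
        du, dv = np.gradient(u_gps), np.gradient(v_gps)  # bearing vector, (10)
        yaw = np.arctan2(-du, dv)                        # equation (11)
        pitch = np.full(n, pitch_measured)               # equation (12)
        roll = np.zeros(n)                               # equation (13)
        return centres, yaw, pitch, roll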

Fast geometric heuristic techniques are used to identify pairs of images that may contain overlapping regions of the ground plane. Two different procedures are used in this embodiment to identify potential overlaps between: 1. images from the same trace (which will be referred to as "self-overlaps") and 2. images from different traces (which will be referred to as "between-trace overlaps"). The goal of this part of the image processing pipeline is to reduce the number of image pairs for which feature matching and match filtering must be performed.

Self-overlaps. Self-overlaps are detected according to two heuristics. First, the images come from a sequence. Moreover, if image capture is triggered using GPS and wheel tick odometry information, there is an approximately fixed distance between consecutive images. This means that it is possible to choose a constant offset $O \in \mathbb{N}^+$ within which images can be expected to overlap: that is, image $i$ can be expected to contain overlaps with images in the range $i - O$ to $i + O$. This is referred to as a sequence heuristic, and it defines the set of such potentially overlapping pairs for trace $t$ as:

\mathcal{O}_t^{\text{seq}} = \{(i, j) \mid 1 \le i, j \le N_t \wedge 0 < j - i \le O\}    (14)

and hence $|\mathcal{O}_t^{\text{seq}}| = N_t O - O(O + 1)/2$.

Second, since the trajectory of the capture vehicle may intersect with itself or even retrace previously covered ground, it is preferable to detect potential overlaps between images that are not close in the image sequence. This is done by comparing the ground plane projection of image centres. The distance between projected image centres is given by:

d_{\text{centroid}}(i, j) = \left\| h\!\left(H_i^{-1} \tilde{c}\right) - h\!\left(H_j^{-1} \tilde{c}\right) \right\|    (15)

where $\tilde{c}$ is the homogeneous coordinate of the image centre and $H_i$ is the homography that transforms from the ground plane to the $i$th image, computed according to (21). The homography depends upon camera pose estimates which have not yet been optimised. Hence, for this test the homography is constructed using the initialisation from GPS location and bearing.

The search aims to find all pairs where the Euclidean distance between centroids is less than a threshold:

\mathcal{O}_t^{\text{centroid}} = \{(i, j) \mid d_{\text{centroid}}(i, j) < \tau\}    (16)

This is a 2D fixed-radius all nearest neighbours search, which can be computed in $O(N_t + |\mathcal{O}_t^{\text{centroid}}|)$ time.
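Both heuristics are straightforward to implement; the following is a sketch assuming scipy is available (cKDTree.query_pairs performs the fixed-radius all-pairs search):

    import numpy as np
    from scipy.spatial import cKDTree

    def sequence_pairs(n_images, O):
        """Sequence heuristic, equation (14), with 1-indexed images."""
        return {(i, j) for i in range(1, n_images + 1)
                       for j in range(i + 1, min(i + O, n_images) + 1)}

    def centroid_pairs(centres, tau):
        """Fixed-radius all nearest neighbours search, equation (16).
        centres: (N, 2) array of projected image centre positions."""
        return cKDTree(np.asarray(centres, float)).query_pairs(r=tau)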

The set of potentially self-overlapping image pairs is then given by:

\mathcal{O}_t = \mathcal{O}_t^{\text{seq}} \cup \mathcal{O}_t^{\text{centroid}}    (17)

Between-trace overlaps: It is also useful to find potential overlaps between images from different traces. At this point in the process, pose optimisation on individual traces has already been performed, so pose estimates $(R_{i,t}, t_{i,t})$ are available for every image in every trace.

If traces $s$ and $t$ contain potentially overlapping images, then a set $\mathcal{O}_{s,t}$, $s < t$, is stored such that $(i, j) \in \mathcal{O}_{s,t}$ if image $i$ in trace $s$ is a potential overlap for image $j$ in trace $t$. To compute this set, a sequence of tests of increasing complexity is performed, so that the number of images considered by the more expensive and accurate tests is kept to a minimum.

To find candidate pairs from the entire image set, images are first spatially binned into tiles. Image corners are projected onto the ground plane to form a 4-sided polygon. An image is binned into a tile if the polygon overlaps the tile boundaries. These tile-to-image maps are later used during mosaic stitching, and the details of the binning process are described below. This operation is linear in the number of images. Any tile that contains images from more than one trace could contain images with between-trace overlaps.

For all pairs of images from different traces that are binned to the same tile, there is now performed a test for intersections between the bounding boxes of the ground plane projections of the images. If the bounding boxes overlap, a point-in-polygon test is performed. Two images are considered potentially overlapping if:

1. the centroid of one polygon is inside the other, or

2. at least three corners of one polygon are inside the other, or

3. two corners are inside the other and those are not adjacent, or

4. at least one corner is inside the other and the area of overlap is greater than a threshold. For case 4, the area of overlap must be calculated. First, a polygon is formed from the corners inside the other polygon and the intersection points with the other polygon's edges. Then the area is computed using Meister's shoelace formula.
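A sketch of the two geometric primitives used by these tests (a ray-casting point-in-polygon test and the shoelace area formula), in Python:

    import numpy as np

    def shoelace_area(poly):
        """Area of a simple polygon given as an (N, 2) array of corners
        (Meister's shoelace formula)."""
        x, y = poly[:, 0], poly[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    def point_in_polygon(pt, poly):
        """Ray-casting point-in-polygon test for a simple polygon."""
        x, y = pt
        inside = False
        n = len(poly)
        for k in range(n):
            x1, y1 = poly[k]
            x2, y2 = poly[(k + 1) % n]
            # Toggle on each polygon edge the horizontal ray from pt crosses.
            if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
        return inside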

Feature Matching and Filtering

Having found pairs of images that potentially contain overlapping regions of the ground plane, the second step in the pipeline is to compute feature matches between images. This process is made robust by filtering the matches according to a restricted motion model. Any pair of images with an insufficient number of inlying matches is discarded. The feature matches between all other pairs are retained and used later in the pose optimisation process. Unlike in a conventional 3D structure-from-motion pipeline, this method computes only pairwise matches between image features. This is because only pose estimates are required, so it does not reconstruct the 3D position of matched features. This vastly reduces the complexity of the pose optimisation process while being sufficient for the final goal of orthomosaic stitching.

The part of the scene which is of interest (the road surface) can be assumed to be locally planar. This assumption can be exploited during feature matching in two ways. First, features are extracted in images that have been approximately projected to the ground plane. This has the effect of normalising the appearance of features that lie on the ground plane (by approximately undoing the effect of perspective distortion) and improving the likelihood of matching the same point viewed from different locations. It also has the added benefit of distorting the appearance of features that do not lie on the road plane (such as buildings, signage or vehicles) so that they are less likely to be matched. Second, a restricted motion model is assumed with only three degrees of freedom, which means that feature matches can be filtered efficiently and robustly.

The matched features could be any feature that is repeatably detectable and that has a distinctive descriptor. For example, the "scale-invariant feature transform" algorithm as described in US-A-6 711 293 may be used. The 2D location of each feature is undistorted using the distortion parameters obtained during calibration. The system then computes greedy matches between features in pairs of images that are within the overlap threshold. These matches are filtered for distinctiveness using Lowe's ratio test with a threshold of 0.6. This means that the distance in feature space to the first closest match must be no more than 0.6 of the distance to the second closest match. Even with this filter applied, the matches will still contain noise that will disrupt the alignment process. Specifically, they may include matches between features that do not lie on the road plane (such as buildings or signage) or between dynamically moving objects (such as other vehicles). If such matches were retained, they would introduce significant noise into the pose refinement process. Therefore, these matches are removed by enforcing a constraint that is consistent with a planar scene (the homography model) and further restricting motion to a model with only three degrees of freedom.
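The greedy matching and ratio test might be sketched as follows with OpenCV (the L2 norm assumes a SIFT-like descriptor; the 0.6 threshold is the one quoted above):

    import cv2

    def ratio_filtered_matches(des1, des2, ratio=0.6):
        """Greedy matching followed by Lowe's ratio test: keep a match only
        if it is at most `ratio` times the distance of the second-best match."""
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        knn = matcher.knnMatch(des1, des2, k=2)
        return [pair[0] for pair in knn
                if len(pair) == 2 and pair[0].distance <= ratio * pair[1].distance]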

Since it is assumed that the scene is locally planar, feature matches can be described by a homography. In other words, if a feature with 2D image position $x \in \mathbb{R}^2$ in image $i$ is matched to a feature with image position $x' \in \mathbb{R}^2$ in image $j$, then there is expected to exist a $3 \times 3$ matrix $H$ that satisfies:

s \begin{pmatrix} x' \\ 1 \end{pmatrix} = H \begin{pmatrix} x \\ 1 \end{pmatrix}    (18)

where $s$ is an arbitrary scale. This homography constraint is used to filter the feature matches. However, for the purposes of filtering, a stricter motion model is assumed than elsewhere in the process. Specifically, the assumption is made that the vehicle has two degrees of freedom to move in the ground plane and that its yaw angle may change arbitrarily. However, pitch and the height of the camera above the ground plane are assumed to be fixed to their measured values, and roll is assumed to be zero. This allows a homography to be parameterised by only three parameters and enables the rejection of matches that would otherwise be consistent with a homography but which would lead to incorrect motion estimates. For example, if a planar surface (such as the side of a lorry) is visible in a pair of images, then feature matches between the two planes would be consistent with a homography model. However, they would not be consistent with the stricter motion model and will therefore be removed.

Under this motion model, a homography $H(u, v, \alpha)$ can be constructed between a pair of images based on the 2D displacement in the ground plane $(u, v)$ and the change in the yaw angle $\alpha$. This homography is constructed as follows. First, place the first image in a canonical pose:

R_1 = R(0, \beta^{\text{measured}}, 0)    (19)

t_1 = [0\ \ 0\ \ w^{\text{measured}}]^T    (20)

The homography from the ground plane to this first image is given by:

H_1 = K\,[\,(R_1)_{:,1:2}\ \ t_1\,]    (21)

where $(R_1)_{:,1:2}$ denotes the first two columns of $R_1$.

We define the pose of the second image relative to the first image as:

R_2(\alpha) = R(\alpha, \beta^{\text{measured}}, 0)    (22)

t_2(u, v) = [u\ \ v\ \ w^{\text{measured}}]^T    (23)

Therefore, the homography from the ground plane to the second image is parameterised by the ground plane displacement and change in yaw angle:

H_2(u, v, \alpha) = K\,[\,(R_2(\alpha))_{:,1:2}\ \ t_2(u, v)\,]    (24)

Finally, the homography from the first image to the second image can be defined as:

H_{1\to2}(u, v, \alpha) = H_2(u, v, \alpha)\, H_1^{-1}    (25)

Local features in the ground plane images are computed. Let $u_f$ denote the location in the ground plane image of feature $f$. Also computed and stored is the corresponding image location $x_f$, obtained by projecting $u_f$ back into the image. (Note that this location is not distorted, since it will later be projected back to the ground plane during pose optimisation.) The $x_f$ are the undistorted locations of features in the image plane.

Features in the projected ground plane images can be expected to approximately retain their appearance and scale, but the orientations of the same features in two different images will differ by a global 2D rotation. Hence, invariance to affine distortion is not required, but rotational invariance at the initial matching stage is. While there are a number of feature descriptors that satisfy these criteria, in this example a scale-invariant feature transform (SIFT) algorithm is used, though any other suitable descriptor could be substituted. Greedy matches are computed between features in pairs of images identified as potentially overlapping using the methods described above for identifying potential overlapping images. These matches are filtered for distinctiveness using Lowe's ratio test with a threshold of 0.6. This means that the distance in feature space to the first closest match must be no more than 0.6 of the distance to the second closest match. Even with this filter applied, the matches will still contain noise that will disrupt the alignment process. Specifically, they may include matches between features that do not lie on the road plane (such as buildings or signage) or between dynamically moving objects (such as other vehicles). If such matches were retained, they would introduce significant noise into the pose refinement process.

Filtering matches with a restricted motion model

Feature matches are now filtered such that they are geometrically consistent with a restricted motion model. Specifically, it is assumed that between a pair of images, the vehicle may translate in the u, v plane and that its bearing may change. In other words, pitch, roll and the height of the camera above the ground plane are assumed to remain fixed. This restricted transformation therefore has only three degrees of freedom. In the projected ground plane images, this means that a 2D proper rigid transformation (i.e., a translation and rotation) is sought. The least squares best fitting such transformation can be found in closed form using a restricted version of Procrustes alignment (allowing only translation and rotation). To measure the error between a pair of matched feature locations after computing a best fitting transformation, the Procrustes distance is used, symmetrised by taking an average of the forward/backward transformation.

This restricted motion model is used in conjunction with the random sample consensus (RANSAC) algorithm to compute the set of inlying feature matches. From the greedy feature matches, two matches are randomly selected (the minimum required to fit a 2D proper rigid transformation). The transformation is fitted and then inliers are defined as those with a Procrustes distance of less than 10 pixels. An example of the final result is shown in Figure 5. The top row shows a pair of input images. The feature matches after RANSAC filtering between features in the projected ground plane images are shown in the third row. For comparison, the second row shows the result of matching directly in the original images (that is: first undistort the images, compute greedy feature matches, then fit a general homography to filter the matches). It is clear that the ground plane images give more matches than the naive method and that they correspond to the correct transformation. Finally, the bottom of Figure 5 shows the alignment using the fitted 2D proper rigid transformation. Note that this is not the alignment used in the final orthomosaic stitching; it is just a visualisation of the coarse transformation used to filter feature matches.
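A minimal sketch of the restricted Procrustes fit and the RANSAC loop follows (Python/numpy; the symmetrised forward/backward distance described above is simplified here to a one-way residual):

    import numpy as np

    def fit_rigid_2d(p, q):
        """Least squares 2D proper rigid transform (rotation + translation)
        mapping points p onto q: a restricted Procrustes alignment."""
        pc, qc = p.mean(axis=0), q.mean(axis=0)
        H = (p - pc).T @ (q - qc)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, np.linalg.det(Vt.T @ U.T)])  # keep a proper rotation
        R = Vt.T @ D @ U.T
        t = qc - R @ pc
        return R, t

    def ransac_rigid(p, q, n_iter=500, thresh=10.0,
                     rng=np.random.default_rng(0)):
        """RANSAC over 2-point samples; inliers lie within `thresh` pixels."""
        best = np.zeros(len(p), dtype=bool)
        for _ in range(n_iter):
            idx = rng.choice(len(p), size=2, replace=False)
            R, t = fit_rigid_2d(p[idx], q[idx])
            resid = np.linalg.norm(q - (p @ R.T + t), axis=1)
            inliers = resid < thresh
            if inliers.sum() > best.sum():
                best = inliers
        return best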

The final set of inlying matches is stored as $\mathcal{M}_{i,j}^{s,t}$, where $(f, g) \in \mathcal{M}_{i,j}^{s,t}$ if feature $f$ in the $i$th image in trace $s$ is matched to feature $g$ in the $j$th image of trace $t$. The set of matches between images in the same trace $t$ is written as $\mathcal{M}_{i,j}^{t}$. The locations of the features in the ground plane projected images are not used in the following pose optimisation and can be discarded at this point.

Alternatively, given a set of tentative matches between a pair of images, RANSAC is used to simultaneously fit the constrained homography model to the matches and remove matches that are outliers under the fitted model. Since the constrained homography depends on only three parameters, two matched points are sufficient to fit a homography. The fit is obtained by solving a nonlinear least squares optimisation problem. The RANSAC algorithm proceeds by randomly selecting a pair of matches, fitting the homography to the matches and then testing the number of inliers under the fitted homography. An "inlier" is defined as a point whose symmetrised distance under the estimated homography is less than a threshold (for example, 20 pixels in this embodiment). This process is repeated, keeping track of the model estimate that maximised the number of inliers. Once RANSAC has completed, there exists a set of filtered matches between a pair of images that are consistent with the constrained motion model. Although the model is overly strict, the use of a relaxed threshold means that matches survive even when there is motion due to roll, changes in pitch or changes in the height of the camera.

Pose optimisation

Single trace data objective: The system now has initial estimates for the pose of every camera and also pairs of matched features between images that are close together in the sequence. A large-scale nonlinear refinement of the estimated pose of every camera can now be performed. Key to this is the definition of an objective function comprised of a number of terms. The first term, $\varepsilon_{\text{data}}$, measures how well matched features align when projected to the ground plane. This is referred to as the data term:

\varepsilon_{\text{data}}(\{R_i, t_i\}) = \sum_{(i,j) \in \mathcal{O}_t} \sum_{(f,g) \in \mathcal{M}_{i,j}} \left\| h\!\left(H_i^{-1} \tilde{x}_{i,f}\right) - h\!\left(H_j^{-1} \tilde{x}_{j,g}\right) \right\|^2    (26)

where $x_{i,f}$ is the 2D position in the image plane of the $f$th feature in image $i$, obtained by projecting the ground plane feature location back to the image, as described under the heading "Feature Matching and Filtering", and a tilde denotes homogeneous coordinates. $H_i$ is the homography from the ground plane to the $i$th image and is given by:

H_i = K\,[\,(R_i)_{:,1:2}\ \ t_i\,]    (27)

The function $h$ homogenises a point:

h\!\left([x\ \ y\ \ z]^T\right) = [x/z\ \ y/z]^T    (28)

Equation 26 shares some similarities with a process known as "bundle adjustment" in the general structure-from-motion problem. However, there are some important differences. First, rather than measuring "reprojection error" in the image plane, error when image features are projected to the ground plane is measured. Second, the objective depends only on the camera poses: it does not need to estimate any 3D world point positions. The first difference is important because it encourages exactly what is wanted: namely, that corresponding image positions should align in the final orthomosaic. The second difference is important because it vastly reduces the complexity of the problem and makes it viable to process very large sets of images.

To solve Equation 26, the system initialises using the process described above and then optimises using nonlinear least squares. Specifically, the Levenberg-Marquardt algorithm is used, with an implementation that exploits the sparsity of the Jacobian matrix to improve efficiency. Moreover, some additional terms (described in the next section) are included to softly enforce additional prior constraints on the problem.
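As an indication of how such a refinement might be set up with off-the-shelf tools, the following is a toy Python sketch, not the disclosed implementation: poses are reduced to (u, v, yaw) per image, and scipy's trust-region solver with a sparse Jacobian pattern stands in for the sparse Levenberg-Marquardt implementation described above:

    import numpy as np
    from scipy.optimize import least_squares

    def residuals(params, matches, gps, w_gps=0.1):
        """Toy objective: a ground plane alignment term over matched features
        plus an absolute GPS prior, cf. equations (26) and (33)."""
        poses = params.reshape(-1, 3)            # per image: u, v, yaw
        res = []
        for i, j, p, q in matches:               # p, q: (2,) arrays of local
            ai, aj = poses[i, 2], poses[j, 2]    # feature coordinates
            Ri = np.array([[np.cos(ai), -np.sin(ai)], [np.sin(ai), np.cos(ai)]])
            Rj = np.array([[np.cos(aj), -np.sin(aj)], [np.sin(aj), np.cos(aj)]])
            res.append((Ri @ p + poses[i, :2]) - (Rj @ q + poses[j, :2]))
        res.append(w_gps * (poses[:, :2] - gps).ravel())
        return np.concatenate([np.ravel(r) for r in res])

    # x0: initial poses from GPS, flattened. method="trf" accepts a
    # jac_sparsity pattern, exploiting the fact that each residual touches
    # at most two poses:
    # result = least_squares(residuals, x0, args=(matches, gps), method="trf")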

Priors

Since it is to be expected that the orientation of the vehicle with respect to the road surface will remain approximately constant, it is possible to impose priors on two of the angles. First, side-to-side "roll" is expected to be small, only being non-zero when the vehicle is cornering. Hence, the first prior simply penalises deviation of the roll angle estimates from zero:

\varepsilon_{\text{roll}} = \sum_i \gamma_i^2    (29)

The second angular prior penalises variance in the angle between the camera optical axis and the road plane, i.e. the pitch angle:

\varepsilon_{\text{pitch}} = \sum_i \left( \beta_i - \tfrac{1}{N} \textstyle\sum_j \beta_j \right)^2    (30)

Next, variance in the estimated height of the camera above the road surface is penalised, because this is expected to remain relatively constant:

\varepsilon_{\text{height}} = \sum_i \left( C_{i,3} - \tfrac{1}{N} \textstyle\sum_j C_{j,3} \right)^2    (31)

where $C_{i,3}$ is the height component of the $i$th camera position. Fourth, ground control points (GCPs) may be used. These are a set of $N_g$ points for which the ground plane coordinates $(u_1^{\text{GCP}}, v_1^{\text{GCP}}), \ldots, (u_{N_g}^{\text{GCP}}, v_{N_g}^{\text{GCP}})$ are known and for which the correspondence of these points to pixels in some images is known. GCP correspondences are stored in the set $G$ such that $(i, j, (x, y)) \in G$ if pixel $(x, y)$ in image $i$ corresponds to the $j$th GCP. The distance in the ground plane between the actual GCP position and that predicted by the pose of a camera which has a correspondence to a GCP is penalised:

\varepsilon_{\text{GCP}}(\{R_i, t_i\}, G) = \sum_{(i,j,(x,y)) \in G} \left\| h\!\left(H_i^{-1} [x\ \ y\ \ 1]^T\right) - [u_j^{\text{GCP}}\ \ v_j^{\text{GCP}}]^T \right\|^2    (32)

Finally, the estimated position of the camera in each image is encouraged to remain close to that provided by GPS, in two ways. First, deviation from the absolute GPS position is penalised:

\varepsilon_{\text{GPS}} = \sum_i \left\| \begin{pmatrix} C_{i,1} \\ C_{i,2} \end{pmatrix} - \begin{pmatrix} u_i^{\text{GPS}} \\ v_i^{\text{GPS}} \end{pmatrix} \right\|^2    (33)

However, GPS can provide more accurate relative, rather than absolute, position information. Hence, a second prior can penalise deviation from the relative position change provided by GPS:

\varepsilon_{\text{relGPS}}(\{R_i, t_i\}) = \sum_{i=1}^{N_t - 1} \left\| (C_i - C_{i+1}) - (C_i^{\text{init}} - C_{i+1}^{\text{init}}) \right\|^2    (34)

The hybrid objective that is ultimately optimised is a weighted sum of the data term and all priors.

Between-trace Overlaps

An orthomosaic of a complete town, city or road network is likely to comprise many traces (perhaps thousands). Each trace may overlap with many others, and to obtain a seamless orthomosaic, all overlapping regions must be optimised together. One solution would be to combine all traces into a single one and run the optimisation previously described. However, even with the simplifying assumptions made, this approach does not scale to millions of images. Instead, an incremental process is used in which a single trace is optimised at a time while ensuring that the new trace aligns with the previously optimised traces. This ensures that the size of any one optimisation problem is kept small.

Whenever a new trace is added into the network, the fast search technique discussed above is used to find potential overlaps between the new trace and the existing traces and to obtain filtered matches between them. Then, for each previously optimised trace $s$ that overlaps with the new trace $t$, an additional cost of the same form as the data term (26) is added:

\varepsilon_{s,t} = \sum_{(i,j) \in \mathcal{O}_{s,t}} \sum_{(f,g) \in \mathcal{M}_{i,j}^{s,t}} \left\| h\!\left(H_{i,s}^{-1}\, \tilde{x}_{i,f}^{s}\right) - h\!\left(H_{j,t}^{-1}\, \tilde{x}_{j,g}^{t}\right) \right\|^2    (35)

where $x_{j,g}^{t}$ refers to the $g$th feature in the $j$th image in trace $t$ and $H_{j,t}$ is the homography that transforms the ground plane to the $j$th image in trace $t$, i.e. that is constructed from $R_{j,t}$ and $t_{j,t}$.

The overall optimisation for trace $t$ comprises the single trace objective, as described under "Single trace pose optimisation", with additional objectives of the form (35), one for each overlapping trace.

Map Tile Index Generation

The image of the entire ground plane is presented as a raster map in which any pixel of the image is associated with coordinates in the ground/reference coordinate system. Depending on the level of detail of the ground plane (which defines the ground resolution, indicating how much distance on the ground is represented by a single pixel), the dimensions of the map can be extremely large. To enable creating and viewing of the image of the entire ground plane over the end product platform (an example of the implementation of this invention is the web), the image is cut into non-overlapping grids of equal size called tiles. Each tile is identified by a coordinate describing its position on the ground plane.

Given the estimated pose of the camera (as discussed above) for every captured image, together with the intrinsic camera parameters obtained during calibration, the captured images are then projected onto the ground plane as illustrated in Figure 4. As the position determined as part of the camera pose is defined with respect to the ground/reference frame, the projected ground-plane image is thus georeferenced. Depending on the camera trajectory, velocity of the camera motion, frame rate, field of view of the camera and other relevant factors, multiple projected images can overlap on the ground plane of the same tile.

Since the orthomosaic for each tile is created by combining into a single image the information from all the captured images that are projected into that tile, and because each tile typically contains many overlapping images as presented in Figure 6, the Map Tile Index database needs to be generated and updated whenever new images are acquired. The Map Tile Index database stores for each tile a data file containing the indices of the images that are included in that tile.

The procedure to determine in which tile index (or indices) a captured image is visible is as follows. Referring to Figure 2, after the camera poses are estimated, the image pixel coordinates $x = [x\ \ y]^T$ of the four corners of the captured images (for all images of the same size, the coordinates are equal) are projected onto the ground plane by using the equation below, which returns the 2D ground coordinates $u$. Optionally, given the camera intrinsic parameters and distortion coefficients, the pixel coordinates of the four image corners may be undistorted to obtain distortion-free coordinates prior to projection.

u = h\!\left(H^{-1} [x\ \ y\ \ 1]^T\right)    (36)

where $H$ is the homography defined in Equation (27) and $h$ is the function defined in Equation (28). The projected ground coordinates $u$ are then intersected with the tile boundary to determine in which tile index (or tile indices) the image is to be included. A simple method to perform the boundary intersection is to compute the bounding box coordinates from the four projected corners. The index of the current captured image is then included in the Map Tile Index data file of the intersected tiles.

A Map Tile Index data file for a tile may contain its level of detail, its coordinates with respect to the 2D Cartesian ground coordinate system and all the image indices that are visible in the tile coverage. Since the images that are visible in the tile area may come from different sources, for example multiple image sequences, the data file may include the index of the image sequences together with the indices of the images in the sequences. An example structure of a Map Tile Index data file for a tile index is illustrated in Figure 7. For the data presented in Figure 7, Z denotes the level of detail of the ground plane (the zoom factor), and U and V indicate the coordinates of the tile on the reference ground plane. $S_i$ indicates the index of the image sequence and $I_i$ represents the index of the images in the sequence $S_i$.

Orthomosaic Tile Creation

Since each tile typically contains many overlapping images as illustrated in Figure 6, in order to produce a single image for that tile, the information from all these images must be combined in some way. The naive approach of simply averaging the images leads to very bad results (seams at projected image boundaries, loss of detail through blurring caused by imprecise alignment). To overcome these problems, the present invention combines the overlapping images by blending in the gradient domain. This is a standard technique in image stitching and seeks to preserve high frequency detail while hiding gradual changes in low frequency variations. However, the present approach contains a number of significant novelties. First, the image-to-ground-plane projection procedure is encapsulated into a single function operating in reverse order, which offers two advantages: speed, and avoiding linear interpolation of a nonlinear mapping. Second, the weighting function used in this invention ensures that, among the overlapping images, the "best" gradient is selected for each pixel in the tile image, which encourages the inclusion of maximum detail. Third, the intensity values of all the pixels in the final tile images are computed by solving a sparse system of linear equations, which can be solved in or near real time. Fourth, the present invention ensures that there are no seams at the boundaries of tiles even when each tile is created independently. The detailed description of the implementation procedure, as shown in Figure 3, is explained as follows.

Image to Ground Plane Remapping

The process of creating an orthomosaic for a tile starts by projecting all the images that belong to the tile, as stored in the Map Tile Index data file, onto the area on the ground plane that is covered by the tile. "Projecting an image to the ground plane" theoretically means projecting every pixel coordinate of the source image to the ground plane defined with respect to the 2D Cartesian ground coordinate system. Instead of deriving the 2D ground coordinates from the pixel coordinates, the system of the present embodiment encapsulates the image-to-ground-plane projection procedure into a single remapping function in the reverse direction. The process of creating a ground plane image from the captured image in reverse order is illustrated in Figure 8, and the process flow diagram of the remapping function is shown in Figure 9. The single remapping function described in the present invention offers two advantages: speed, and avoiding linear interpolation of a nonlinear mapping.

The process of projecting a captured image to a tile on the ground plane is as follows. Since a tile is a fixed-size plane on the ground and has a fixed coordinate in the georeferenced ground coordinate system, the coordinates of each pixel in the tile image (also referred to as the ground-plane image) can be determined. The process of generating a ground-plane image from a captured image is done in reverse order, in which the colours of pixels in the ground-plane image are obtained from their corresponding pixels in the captured image. The relationship between pixels in the ground-plane image and the $i$th camera's captured image is defined by the perspective projection matrix $P_i$:

P_i = K \begin{bmatrix} R_i & t_i \\ 0_{1\times3} & 1 \end{bmatrix}    (37)

The pixel coordinates x = (x', y') corresponding to a ground-plane pixel (u, v) are determined by forward perspective projection using the projection matrix P.

(x', y') are distortion-free pixel coordinates. Since the captured image is distorted, the lens distortion factors must also be applied to the obtained coordinates, so that they become the distorted coordinates (x, y) in the same way as the captured image. Knowing which location in the source image the pixel (u, v) of the ground-plane image corresponds to, it is then possible to assign the image intensity I(x, y) to the pixel (u, v). Finally, a bilinear interpolation is performed to obtain the projected ground-plane image.
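The reverse remapping may be sketched as follows. This is an illustration only: it assumes a 3x4 projection matrix P = K[R | t], a caller-supplied lens distortion function, and a ground sampling distance gsd; these names and the functions bilinear_sample and remap_tile are all hypothetical.

    import numpy as np

    def bilinear_sample(img, x, y):
        # Bilinearly interpolate img at the real-valued position (x, y);
        # positions falling outside the image return zero.
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        if x0 < 0 or y0 < 0 or x0 + 1 >= img.shape[1] or y0 + 1 >= img.shape[0]:
            return np.zeros(img.shape[2]) if img.ndim == 3 else 0.0
        dx, dy = x - x0, y - y0
        return ((1 - dx) * (1 - dy) * img[y0, x0] +
                dx * (1 - dy) * img[y0, x0 + 1] +
                (1 - dx) * dy * img[y0 + 1, x0] +
                dx * dy * img[y0 + 1, x0 + 1])

    def remap_tile(src, P, tile_origin, gsd, n, distort=lambda x, y: (x, y)):
        # Reverse remapping: each tile pixel (u, v) is mapped to a world
        # point on the ground plane (Z = 0), projected into the source
        # image by P, distorted by the lens model, and sampled bilinearly.
        # gsd is the ground sampling distance in metres per tile pixel
        # (an assumed parameter for this sketch).
        out = np.zeros((n, n) + src.shape[2:], dtype=np.float64)
        for v in range(n):
            for u in range(n):
                X = np.array([tile_origin[0] + u * gsd,
                              tile_origin[1] + v * gsd, 0.0, 1.0])
                x_h = P @ X                                # forward perspective projection
                xp, yp = x_h[0] / x_h[2], x_h[1] / x_h[2]  # distortion-free (x', y')
                xd, yd = distort(xp, yp)                   # apply lens distortion
                out[v, u] = bilinear_sample(src, xd, yd)
        return out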

Seamless Orthomosaic Tile Stitching

Each tile will typically contain many overlapping ground-plane images, which must be combined to generate a single tile image. The naive approach of simply averaging the images leads to very low-quality results (seams at projected image boundaries, and loss of detail through blurring caused by imprecise alignment). To overcome this, blending is performed in the gradient domain. This is a standard technique in image stitching and seeks to preserve high-frequency detail while hiding gradual changes in low-frequency variations. However, some significant modifications to gradient-domain stitching are incorporated which make it possible to stitch tiles independently whilst still ensuring that they are seamless when placed side by side.

This is done by constructing a sparse linear equation system in which the unknowns are the image intensities of every pixel in the output orthomosaic tile. The constant terms of the linear system, which are computed as illustrated in Figure 11, are composed of the gradients in the horizontal and vertical directions, denoted as G_x and G_y, and the guide intensity image, defined as I_g. The procedure for computing G_x, G_y and I_g is described as follows.

For each ground-plane image, a weight coefficient in the range [0, 1] is determined for every pixel as an element of the weight image V of size n×n, which is the same as the tile size. Pixels with high weight are those observed at high resolution, while pixels with low weight are those observed at low resolution. For example, high weight coefficients are assigned to pixels at short distances from the camera, which in this system are the pixels in the bottom part of the captured image, while pixels at large distances from the camera are assigned low weights. This arises because the camera is not directed vertically downwards but is tilted towards the road surface. Defining the size of the captured image to be r rows and c columns, with the origin (0, 0) of the image at the top-left corner, the weight assigned to the pixels therefore increases with the row index towards the bottom of the image. This scheme is illustrated in Figure 10.

The procedure for computing V for a ground-plane image is performed in the same way as the image-to-ground-plane projection procedure illustrated in Figure 8. Note that the raw-image weight, W, is computed only once, since it is the same for every image in the system. However, a slight change is made in Steps 3 and 4 of the process in Figure 9: in Step 3, the weight coefficient at the pixel (x, y) of the captured image is obtained instead, and is assigned to V at the pixel (u, v) in Step 4. The final weight image V for the ground-plane image is then generated by performing a bilinear interpolation.
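Since the precise weight formula is not reproduced above, the following sketch assumes, purely for illustration, a linear ramp in the row index: zero weight at the top row and full weight at the bottom row, which is nearest the camera.

    import numpy as np

    def raw_weight_image(r, c):
        # Per-pixel weights for an r-by-c captured image. The text
        # specifies only that weight grows with proximity to the camera
        # (the bottom rows); a linear ramp in the row index is assumed
        # here as one plausible choice.
        rows = np.arange(r, dtype=np.float64) / (r - 1)   # 0 at top, 1 at bottom
        return np.tile(rows[:, None], (1, c))             # constant along each row

    # W is computed once and reused for every image; the ground-plane
    # weight image V is then obtained by pushing W through the same
    # reverse remapping as the colour images (Figures 8 and 9).
    W = raw_weight_image(1080, 1920)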

We denote the intensity at pixel (x, y) in the desired ground-plane image as I(x, y). The gradient in the x direction can be approximated using forward differences as:

$$\partial_x I(x, y) \approx I(x+1, y) - I(x, y) \qquad (39)$$

Similarly, for the y direction:

$$\partial_y I(x, y) \approx I(x, y+1) - I(x, y) \qquad (40)$$

Given the gradient images for each ground-plane image, the target gradient images for the horizontal and vertical directions, G_x and G_y, are created by choosing the "best" gradient among all the gradient images. This is done by, for each pixel (x, y) of G_x, finding which ground-plane image has the highest value of V(x, y) and then assigning the gradient of that ground-plane image at (x, y) to the best gradient image G_x. The same procedure is applied to G_y. This scheme encourages the inclusion of maximum detail.
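A minimal sketch of this per-pixel selection, assuming greyscale n×n ground-plane images and their weight images V supplied as numpy arrays (the function name is illustrative):

    import numpy as np

    def best_gradients(ground_images, weight_images):
        # Select, per pixel, the gradient from the ground-plane image
        # with the highest weight V(x, y).
        V = np.stack(weight_images)                      # shape (k, n, n)
        I = np.stack(ground_images).astype(np.float64)
        gx = np.zeros_like(I)
        gy = np.zeros_like(I)
        gx[:, :, :-1] = I[:, :, 1:] - I[:, :, :-1]       # forward differences, eq. (39)
        gy[:, :-1, :] = I[:, 1:, :] - I[:, :-1, :]       # forward differences, eq. (40)
        best = np.argmax(V, axis=0)                      # highest-weight image per pixel
        rows, cols = np.indices(best.shape)
        Gx = gx[best, rows, cols]                        # target horizontal gradients
        Gy = gy[best, rows, cols]                        # target vertical gradients
        return Gx, Gy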

In addition to using the weight image as an indication of the level of detail for each pixel, the weighting may be combined with a mask to remove undesired regions, as presented in Figure 14 (middle row). Pixels in the mask whose value is set to zero (black) indicate the parts of the scene to be excluded from the final results. These may include shadows cast by the vehicle and undesired objects in the scene. A mask can be produced for each image, or fixed across a whole dataset to remove a part of the vehicle, such as the bonnet, that is visible in the captured images.

This process is illustrated in Figure 14, where an integration of the weight image and the mask is used to remove undesired parts of the scene. In Figure 14, the top row contains images captured with unwanted objects; the middle row shows the masks corresponding to those unwanted objects; and the bottom row combines the weight image with the mask. The left column illustrates masking of a shadow cast by the vehicle. The middle column illustrates masking of undesired objects on the road. The right column illustrates masking of a part of the vehicle present in the captured image (the bonnet in this case). Pixels with darker colour have lower weight, and black equals zero weight.

Masks can be provided manually, computed using a separate semantic segmentation process or, in the case of shadows, ray-traced using a 3D model of the van and an estimate of the sun position. A mask comprises a per-pixel weight. This can be used to deal with a variety of phenomena, as illustrated in the sketch following this list:

• Shadows: Shadow boundaries introduce a large gradient that causes artefacts in the blended result. Therefore, shadow boundaries, or points close to the boundary, are given a low or zero weight. The interiors of shadows can still be used for stitching and can be given non-zero weight. However, such pixels should not be used as part of the "guide" image, as they would cause the blended result to be overly darkened in shadowed regions.

• Visibility of part of the vehicle: If part of the vehicle is visible in the captured images, this can be removed by using a static mask; that is, the same mask for every image.

• Other vehicles and real-world objects not of interest: Semantic segmentation can be used to classify each pixel as belonging to one of a set of possible categories. Any pixel given a non-road label can be assigned a zero weight in the mask.
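As noted above, the following minimal sketch illustrates combining the raw weight image with such a mask; the static bonnet mask used here is a purely hypothetical example.

    import numpy as np

    def apply_mask(W, mask):
        # Combine the raw weight image W with a per-pixel mask in [0, 1].
        # Zero-valued mask pixels (shadow boundaries, other vehicles, the
        # bonnet) drop out of the gradient selection entirely.
        return W * mask

    # Example: a static mask removing the bottom 100 rows, where the
    # bonnet is assumed (hypothetically) to appear in every image.
    W = np.ones((1080, 1920))
    mask = np.ones_like(W)
    mask[-100:, :] = 0.0
    W_masked = apply_mask(W, mask)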

The selected gradients provide the final orthomosaic detail that the stitched image must try to match. However, providing gradient constraints alone would leave an unknown constant offset. To remove this indeterminacy, and to encourage the overall brightness and colour to match the original captured images, guide intensities are included, which the stitched image must also try to match. The guide intensity image is simply the average of all the ground-plane images, as illustrated in Step 4 of Figure 11. In other words, detail is recovered from the gradient of the best image, but overall brightness from the average of all images. The overall problem can be written as a sparse linear system of equations. Such systems can be solved efficiently, meaning the tile stitching process could potentially be done in real time, on demand, as requests for tiles are made from the viewing interface. However, because the orthomosaic is computed independently for each tile, seams would appear at the boundaries of tiles when they are presented alongside other tiles. To avoid such seams, overlaps are included with previously stitched neighbouring orthomosaic tiles, as presented in Figure 12. This is done by extending the average image by e pixels equally on each side to create a small overlap with the neighbouring tiles. This enlarged image, of size (n+2e)×(n+2e), is referred to as the guide intensity image, I_g, presented as Step 5 in Figure 11. To create the guide intensity image in the boundary region, the first step is to check whether any adjacent tiles have already been stitched. If they have, a strip of width e pixels from the neighbouring tile is copied into the enlarged tile, as presented in Figure 12. If not, the overlap region for that neighbour tile is set to an invalid number (NaN) so that those pixels are not included in the linear equation system.
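A sketch of building the enlarged guide intensity image under these rules, assuming greyscale n×n tiles and a dictionary of already-stitched neighbours (the function name and dictionary keys are illustrative):

    import numpy as np

    def enlarge_guide(avg, neighbours, e):
        # Build the enlarged guide intensity image I_g: the n-by-n average
        # image padded by e pixels per side. Where a neighbouring tile has
        # already been stitched, its adjacent strip is copied in; otherwise
        # the overlap stays NaN so those pixels drop out of the linear
        # system. neighbours maps optional keys 'left', 'right', 'top',
        # 'bottom' to stitched n-by-n tiles.
        n = avg.shape[0]
        Ig = np.full((n + 2 * e, n + 2 * e), np.nan)
        Ig[e:e + n, e:e + n] = avg
        if 'left' in neighbours:
            Ig[e:e + n, :e] = neighbours['left'][:, -e:]
        if 'right' in neighbours:
            Ig[e:e + n, -e:] = neighbours['right'][:, :e]
        if 'top' in neighbours:
            Ig[:e, e:e + n] = neighbours['top'][-e:, :]
        if 'bottom' in neighbours:
            Ig[-e:, e:e + n] = neighbours['bottom'][:e, :]
        return Ig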

In the overlap region, there is only the "guide intensity" constraint, which is weighted very highly in the objective function; in the implementation, the weights for the pixels in that region are set to 1. This forces the intensities of the pixels near the boundaries to blend smoothly with the adjacent tiles. In order to retain the detail of the current tile while still transitioning smoothly from its boundaries, the weights of the interior pixels increase steadily from 0 at the outer pixels of the original tile over a distance of e pixels until they reach λ; that is, the weight increases by λ/e with every step towards the centre of the tile. The weights then remain at this value for every pixel in the interior region. The result is that intensities at tile boundaries match exactly with the adjacent stitched tiles, and any large intensity changes that would have occurred are rendered less visible as low-frequency transitions. This is a key step that allows us to stitch tiles one at a time but still produce a seamless mosaic when adjacent tiles are viewed next to each other.
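The boundary weighting scheme may be sketched as follows, assuming the enlarged (n+2e)×(n+2e) tile layout described above (the function name is illustrative):

    import numpy as np

    def guide_weights(n, e, lam):
        # Per-pixel guide-intensity weights for the enlarged tile:
        # weight 1 in the e-pixel overlap border, then a ramp from 0 at
        # the edge of the original tile up to lam over e pixels (lam / e
        # per step), and lam throughout the interior.
        size = n + 2 * e
        idx = np.arange(size)
        # Signed distance (in pixels) inward from the border of the
        # original n-by-n tile; negative inside the overlap border.
        d = np.minimum(idx - e, size - 1 - e - idx)
        dist = np.minimum(d[:, None], d[None, :])
        ramp = np.clip(dist, 0, e) * (lam / e)   # 0 at tile edge, lam after e pixels
        return np.where(dist < 0, 1.0, ramp)     # overlap border pinned to 1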

At this point, the constant terms of the linear equation system, namely G_x, G_y and I_g, are all defined. The sparse system of linear equations can then be constructed as follows.

If all the image intensities of the target orthomosaic are stacked into a long vector $I \in \mathbb{R}^{n^2}$, the gradients in the horizontal and vertical directions can be computed by multiplication with gradient coefficient matrices $J_{G_x} \in \mathbb{R}^{n^2 \times n^2}$ and $J_{G_y} \in \mathbb{R}^{n^2 \times n^2}$ respectively: $J_{G_x} I$ yields a vector containing the horizontal gradient at each pixel, and $J_{G_y} I$ a vector containing the vertical gradient. Both $J_{G_x}$ and $J_{G_y}$ are sparse, since only two elements in each row are non-zero. Hence, the structure of $J_{G_x}$ looks like:

$$J_{G_x} = \begin{bmatrix} -1 & 1 & 0 & \cdots \\ 0 & -1 & 1 & \cdots \\ \vdots & & \ddots & \end{bmatrix} \qquad (41)$$

For pixels on the bottom or right boundary, it is possible to switch to backward finite differences to compute the gradient.

The sparse system of linear equations for the seamless image stitching is constructed as:

$$\begin{bmatrix} J_{G_x} \\ J_{G_y} \\ \operatorname{diag}(\lambda_1, \ldots, \lambda_{n^2}) \end{bmatrix} I = \begin{bmatrix} g_x \\ g_y \\ \operatorname{diag}(\lambda_1, \ldots, \lambda_{n^2})\, i_g \end{bmatrix}$$

where g_x is a vector containing the horizontal gradients for all pixels for which the gradient is defined; similarly, g_y contains the vertical gradients; λ_i is the weight associated with the i-th pixel, computed according to the scheme shown in Figure 13; and i_g is the vectorised guide intensity image. Although the system is sparse and thus quick to solve, the speed can be improved further by reducing the number of equations for the guide intensity constraints. Instead of including all the pixels of I_g, only a small number of guide intensity pixels, namely those whose weights are equal to λ, are included. This is possible because the system is over-constrained: the number of equations is sufficient for it to be solved using only the gradient terms and the guide intensities at the boundaries. Not only does this improve speed, the quality of the stitched image is also improved, because it avoids the solution being drawn towards the overly smooth average image.
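For illustration, the system can be assembled and solved with standard sparse least-squares machinery, here sketched with scipy. The NaN handling and forward-difference operators follow the description above; all function and variable names are illustrative.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import lsqr

    def stitch_tile(Gx, Gy, Ig, weights):
        # Solve for the m-by-m tile intensities in least squares: the
        # gradients of the solution should match Gx and Gy, and pixels
        # with non-zero weight should match the guide image Ig. NaN
        # entries of Ig (unstitched neighbours) are excluded.
        m = Ig.shape[0]
        N = m * m
        # 1-D forward-difference operator; the last row is zeroed since
        # no forward difference is defined at that boundary.
        D1 = sp.diags([-np.ones(m), np.ones(m - 1)], [0, 1]).tolil()
        D1[m - 1, :] = 0
        D1 = D1.tocsr()
        Dx = sp.kron(sp.eye(m), D1, format='csr')   # horizontal gradients
        Dy = sp.kron(D1, sp.eye(m), format='csr')   # vertical gradients
        w = weights.ravel()
        ig = Ig.ravel()
        valid = ~np.isnan(ig) & (w > 0)             # usable guide pixels
        Sel = sp.eye(N, format='csr')[valid]        # selects those pixels
        Wm = sp.diags(w[valid])
        A = sp.vstack([Dx, Dy, Wm @ Sel]).tocsr()
        b = np.concatenate([Gx.ravel(), Gy.ravel(), w[valid] * ig[valid]])
        return lsqr(A, b)[0].reshape(m, m)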

Orthomosaic Map Generation

Since each orthomosaic tile is generated in such a way that it produces no seams at the boundaries between tiles, an orthomosaic for any region can be constructed by simply placing an appropriate subset of the tiles side by side according to their tile positions. In practice, a user interacts with the map by changing the region of interest (e.g. by dragging the map) or the zoom level, which generates requests for a set of tiles. These are either returned directly (if they were precomputed in advance) or stitched on the fly in an on-demand implementation. In principle, the entire orthomosaic for the whole captured region could be created by concatenating all the stitched tiles into one large map.
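Purely as an illustration of such an on-demand implementation, a tile request might be served as follows, where cache and stitch are hypothetical stand-ins for a tile store and the stitching procedure sketched above:

    def get_tile(z, u, v, cache, stitch):
        # Return the orthomosaic tile at zoom z and position (u, v):
        # from the cache if it was precomputed, otherwise stitched on
        # the fly and cached for subsequent requests.
        key = (z, u, v)
        if key not in cache:
            cache[key] = stitch(z, u, v)
        return cache[key]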