Title:
IMAGE ALIGNMENT
Document Type and Number:
WIPO Patent Application WO/2023/275555
Kind Code:
A1
Abstract:
A computer processing system for use in image alignment is configured to use a machine learning model (6) to extract a first set of features (26) from a first image (2), extract a second set of features from a second image (4; S103), and compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both sets of features. A mapping between each of the first subset of features (26) and a corresponding feature of the second subset of features is generated (S104) and used to determine a predicted transformation for aligning the first image (2) with the second image (4; S105).

Inventors:
ZHENG JIANQING (GB)
LIM NGEE HAN (GB)
PAPIEZ BARTLOMIEJ (GB)
Application Number:
PCT/GB2022/051687
Publication Date:
January 05, 2023
Filing Date:
June 30, 2022
Assignee:
UNIV OXFORD INNOVATION LTD (GB)
International Classes:
G06T3/00; G06T7/00
Foreign References:
CN112639880A (2021-04-09)
Other References:
JIAN-QING ZHENG ET AL: "D-Net: Siamese based Network with Mutual Attention for Volume Alignment", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 January 2021 (2021-01-25), XP081866915
ZHENG JIAN-QING ET AL: "D-net: Siamese Based Network for Arbitrarily Oriented Volume Alignment", 3 October 2020, COMPUTER VISION - ECCV 2020 : 16TH EUROPEAN CONFERENCE, GLASGOW, UK, AUGUST 23-28, 2020 : PROCEEDINGS; [LECTURE NOTES IN COMPUTER SCIENCE ; ISSN 0302-9743], PAGE(S) 73 - 84, ISBN: 978-3-030-58594-5, XP047581552
VICTOR VILLENA-MARTINEZ ET AL: "When Deep Learning Meets Data Alignment: A Review on Deep Registration Networks (DRNs)", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 March 2020 (2020-03-06), XP081616139
Attorney, Agent or Firm:
DEHNS (GB)
Claims:
Claims

1. A computer processing system for use in image alignment, the computer processing system configured to use a machine learning model to: extract a first set of features from a first image; extract a second set of features from a second image; compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generate a mapping between each of the first subset of features and a corresponding feature of the second subset of features; and use the mapping to determine a predicted transformation for aligning the first image with the second image.

2. The computer processing system of claim 1, wherein the first image and the second image are three-dimensional volumes.

3. The computer processing system of claim 1 or 2, wherein the first image and the second image comprise computer tomography scans.

4. The computer processing system of any preceding claim, wherein the machine learning model comprises an encoder stage comprising one or more convolution blocks configured to extract the first and/or second sets of features from the first and/or second images respectively.

5. The computer processing system of claim 4, wherein the convolution blocks comprise residual down-sampling blocks.

6. The computer processing system of claim 4 or 5, wherein one or more of the convolution blocks comprises one or more atrous convolution layers.

7. The computer processing system of any of claims 4 to 6, wherein the machine learning model comprises a decoder stage comprising one or more convolution blocks, each comprising a skip connection received from the encoder stage of the model.

8. The computer processing system of any preceding claim, wherein the machine learning model comprises: a first branch and a second branch, wherein the first branch is arranged to extract the first set of features from the first image and the second branch is arranged to extract the second set of features from the second image; and an inter-branch module, configured to compare the first and second sets of features to identify the respective first and second subsets of features that are common to both the first and second sets of features.

9. The computer processing system of claim 8, wherein the first and second branches respectively comprise one or more convolution blocks, and wherein the inter-branch module is connected to one or more of the convolution blocks of the first and second branches.

10. The computer processing system of claim 9, wherein parameters between the first branch and the second branch are shared.

11. The computer processing system of any of claims 8 to 10, wherein the inter-branch module is configured to use, in an attention mechanism: i) the first set of features as a query and the second set of features as a key, and ii) the second set of features as a query and the first set of features as a key.

12. The computer processing system of any of claims 8 to 11, wherein the inter-branch module is configured to generate the mapping between each of the first subset of features and a corresponding feature of the second subset of features.

13. The computer processing system of any preceding claim, wherein the machine learning model is configured to generate the mapping based on a similarity between each of the first subset of features and the corresponding feature of the second subset of features.

14. A method of implementing a machine learning model for use in image alignment, the method comprising: inputting a first image and a second image to a machine learning model; using the machine learning model to: extract a first set of features from the first image; extract a second set of features from the second image; compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generate a mapping between each of the first subset of features and a corresponding feature of the second subset of features; and use the mapping to determine a predicted transformation for aligning the first image with the second image.

15. The method of claim 14, further comprising applying the predicted transformation to the first image to generate an aligned first image.

16. The method of claim 15, further comprising superimposing the aligned first image onto the second image to generate a combined image.

17. The method of claim 16, further comprising determining, from the combined image, an estimate of cartilage thickness.

18. The method of any of claims 14 to 17, further comprising cascading the machine learning model and a further machine learning model.

19. A method of training a machine learning model for use in image alignment, wherein the machine learning model comprises a plurality of updatable parameters and is configured, in an inference mode, to: receive a first image and a second image; and according to the parameters, determine a predicted transformation for aligning the first image with the second image; the training method comprising: receiving a first training image and a second training image, wherein the first training image is misaligned with the second training image by a known transformation; inputting the first training image and the second training image to the model; extracting a first set of features from the first training image; extracting a second set of features from the second training image; comparing the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generating a mapping between each of the first subset of features and a corresponding feature of the second subset of features; using the mapping to determine a predicted transformation for aligning the first training image with the second training image; comparing the predicted transformation with the known transformation to update the parameters of the model.

20. A computer readable storage medium storing computer software code comprising instructions which, when executed on a computer processing system, cause the computer processing system to carry out the method of any of claims 14 to 19.

Description:
Image Alignment

The present invention relates to image alignment techniques, particularly to methods and systems employing machine learning for use in image alignment.

Damage to cartilage is an important indicator of osteoarthritis. Magnetic Resonance Imaging (MRI) is a commonly used imaging method, but this technique can be time-consuming and expensive, with a poor spatial resolution (on the order of 20 μm) for preclinical imaging of small animals’ cartilage (e.g. for cartilage thicknesses ≤ 100 μm in mice).

Computer Tomography (CT) and micro-CT are less expensive techniques, but contrast enhancement may be required in order to obtain an image of cartilage using CT scans. The shape of the cartilage can be extracted by comparing images of cartilage with and without contrast-enhancement.

However, this process typically requires accurate alignment of the images. Conventional alignment methods involve manual positioning and annotation of the images, meaning that they can be time-consuming. Furthermore, in many scenarios, such as the alignment of animal tibiae, it is difficult to identify similar features manually owing to the significant shape variation. Therefore, the alignments resulting from such methods can be inaccurate.

Therefore, the Applicant has identified that there is a need for an improved method of determining a misalignment in images that overcomes the above-mentioned shortcomings.

When viewed from a first aspect, the invention provides a computer processing system for use in image alignment, the computer processing system configured to use a machine learning model to: extract a first set of features from a first image; extract a second set of features from a second image; compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generate a mapping between each of the first subset of features and a corresponding feature of the second subset of features; and use the mapping to determine a predicted transformation for aligning the first image with the second image.

When viewed from a second aspect, the invention provides a method of implementing a machine learning model for use in image alignment, the method comprising: inputting a first image and a second image to a machine learning model; using the machine learning model to: extract a first set of features from the first image; extract a second set of features from the second image; compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generate a mapping between each of the first subset of features and a corresponding feature of the second subset of features; and use the mapping to determine a predicted transformation for aligning the first image with the second image.

Thus, it will be appreciated that the present invention allows a misalignment between a pair of images to be quantified in terms of a predicted transformation (e.g. a rotation, translation, scaling etc.) that should be applied to one of the images in order to align it with the other. A machine learning model is used to extract features from each of the input images and to determine a mapping between the features that are common to both images. The model is configured to use this mapping to determine a prediction of the transformation that will map the first image onto the second image.

Machine learning models comprise a plurality of parameters (e.g. weights or biases) that determine how an input to the model is propagated through the various layers of the model in order to generate an output. The parameters can be tuned, so as to increase the accuracy of the model, by “training” the model on a set of training data.

Thus, when viewed from a third aspect, the invention provides a method of training a machine learning model for use in image alignment, wherein the machine learning model comprises a plurality of updatable parameters and is configured, in an inference mode, to: receive a first image and a second image; and according to the parameters, determine a predicted transformation for aligning the first image with the second image; the training method comprising: receiving a first training image and a second training image, wherein the first training image is misaligned with the second training image by a known transformation; inputting the first training image and the second training image to the model; extracting a first set of features from the first training image; extracting a second set of features from the second training image; comparing the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generating a mapping between each of the first subset of features and a corresponding feature of the second subset of features; using the mapping to determine a predicted transformation for aligning the first training image with the second training image; comparing the predicted transformation with the known transformation to update the parameters of the model.

The first (e.g. training) image and/or the second (e.g. training) image may be two-dimensional (2D) images or three-dimensional (3D) volumes. It will be appreciated by the skilled person that references to the first and second images herein may be equally applicable to the first and second training images. Indeed, it is preferable that there is a similarity between the (e.g. types and/or formats of the) first/second images (input to the model during inference mode) and the training images (input to the model during training mode), as this can help to improve the accuracy of the trained model. In a set of preferred embodiments, the first (e.g. training) image and/or the second (e.g. training) image comprise medical scans. The images may comprise computer tomography (CT) or ultrasound scans. The first (e.g. training) image and/or the second (e.g. training) image may comprise scans of a human body or an animal body. In a set of preferred embodiments, one of the first or second images does not show a contrast agent, whereas the other one of the images does show a contrast agent. The image that does not show a contrast agent may be obtained before a contrast agent is applied to the subject of the image, while the other image may show the subject after the contrast agent has been applied. By comparing aligned images of cartilage with and without contrast-enhancement, the shape of the cartilage may be extracted.

In some embodiments, the method of training a model for use in image alignment further comprises applying a synthetic transformation function to an unmodified (i.e. pre-modification) first training image to generate the first training image. The unmodified first training image may be in alignment with the second training image before being modified by the transformation function. Preferably, applying the synthetic transformation function to the unmodified first training image comprises applying the known transformation. The known transformation may comprise a known rotation and/or a known translation and/or a known scaling. The known transformation may take any suitable or desired value(s), including zero. Preferably, the first training image is misaligned with the second training image by both a known rotation and a known translation.
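By way of illustration, a training pair of this kind might be generated as follows (a minimal sketch using SciPy; the rotation axes, angle range and displacement range are assumptions, not values taken from this application):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def make_training_pair(volume, max_angle=30.0, max_shift=5.0, rng=None):
    """Misalign a copy of `volume` by a known rotation and translation.

    Returns the moving (misaligned) volume, the fixed volume and the
    known transformation, which later serves as the training target.
    """
    rng = np.random.default_rng() if rng is None else rng
    angle = rng.uniform(-max_angle, max_angle)       # known rotation (degrees)
    t = rng.uniform(-max_shift, max_shift, size=3)   # known translation (voxels)

    moving = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
    moving = shift(moving, t, order=1)
    return moving, volume, {"rotation": angle, "translation": t}
```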

The predicted transformation may comprise a predicted rotation and/or a predicted translation and/or a predicted scaling. Comparing the predicted transformation with the known transformation to update the parameters of the model may comprise inputting the predicted transformation and the known transformation into a loss function. The parameters may be updated using back propagation.
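For example, the comparison could be expressed as a simple loss over the rotation and translation errors, with back propagation driving the parameter update (a sketch only; the actual loss function is not specified here, and the weighting factor lam is an assumption):

```python
import torch

def alignment_loss(pred_rot, pred_trans, true_rot, true_trans, lam=1.0):
    """Mean-squared error between the predicted and known transformation."""
    rot_err = torch.mean((pred_rot - true_rot) ** 2)
    trans_err = torch.mean((pred_trans - true_trans) ** 2)
    return rot_err + lam * trans_err   # lam balances rotation vs translation terms

# One hypothetical update step using back propagation:
# loss = alignment_loss(pred_rot, pred_trans, true_rot, true_trans)
# loss.backward()        # gradients with respect to the updatable parameters
# optimiser.step()       # update the parameters
# optimiser.zero_grad()
```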

In some embodiments, the method of implementing a machine learning model further comprises applying the predicted transformation to the first image to generate an aligned first image. The method may comprise superimposing the aligned first image onto the second image (or vice versa) to generate a combined image. The method may comprise determining, from the combined image, an estimate of cartilage thickness.
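Purely as an illustration of the last step, a crude thickness estimate could be derived from the voxels that appear only in the contrast-enhanced volume of the combined image (the threshold and the distance-transform approach below are assumptions for the sketch, not details taken from this application):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def estimate_thickness(aligned_contrast, non_contrast, threshold=0.5):
    """Segment the cartilage as the region visible only with contrast, then
    take twice the largest inscribed distance as a thickness proxy (in voxels)."""
    cartilage = (aligned_contrast > threshold) & (non_contrast <= threshold)
    if not cartilage.any():
        return 0.0
    distance = distance_transform_edt(cartilage)  # distance to nearest background voxel
    return 2.0 * distance.max()
```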

The machine learning model may be configured to extract any suitable or desired features. This may depend on the type of images that are to be aligned (i.e. the first image and the second image). The machine learning model may be configured to extract any of: edges, corners, shapes, textures, contrast gradients and/or semantic features. The first subset is a subset of the first set of features. The second subset is a subset of the second set of features.

The first and second sets of features may respectively comprise only one feature. However, preferably the first and second sets of features each comprise a plurality of features. The first and second subsets of features may respectively comprise only one feature. However, preferably the first and second subsets of features each comprise a plurality of features. Preferably the first set of features does not include features extracted from the second image (e.g. it includes only features extracted from the first image). Preferably the second set of features does not include features extracted from the first image (e.g. it includes only features extracted from the second image).

The machine learning model may comprise an encoder stage, a decoder stage and/or a regression stage. The (e.g. encoder stage of the) model may comprise one or more convolution blocks, configured to extract the first and/or second sets of features from the first and/or second images respectively. The convolution blocks may comprise residual down-sampling (Res-down) blocks, i.e. comprising skip connections between one or more layers. This may help to improve the learning capacity of the model.
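A residual down-sampling block of this general kind might be sketched in PyTorch as follows (the layer counts, kernel sizes and use of max pooling are assumptions):

```python
import torch
import torch.nn as nn

class ResDownBlock(nn.Module):
    """Two 3x3x3 convolutions with a residual (skip) connection,
    followed by down-sampling via max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1)  # match channels for the residual
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(2)

    def forward(self, x):
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        y = self.act(y + self.skip(x))  # residual connection
        return self.pool(y)             # halve the spatial resolution
```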

The decoder stage of the model may comprise one or more convolution blocks. The decoder-stage convolution blocks may comprise one or more (e.g. four) residual up-sampling (Res-up) blocks. Preferably the decoder stage comprises a single branch (e.g. of convolution blocks). The convolution blocks may be arranged in series, such that the output of a former block is provided as an input to a latter block. Each convolution block of the decoder stage may comprise a skip connection received from (e.g. a corresponding (e.g. convolution) block in) the encoder stage of the model. Preferably each convolution block of the decoder stage comprises a skip connection received from (e.g. a corresponding convolution block connected to) the inter-branch module. In some embodiments, the decoder stage comprises a (e.g. single) decoder-stage convolution block for each encoder-stage convolution block that is connected to the inter-branch module. The decoder stage of the model allows the spatial information of the extracted features to be restored so that the predicted transformation can be determined.

One or more of the convolution blocks may comprise one or more atrous convolution layers. Atrous convolution may help to improve the accuracy of the model by providing a wider field of view in the convolution for the same computational cost. The atrous convolution layer(s) may obtain more absolute spatial information by enlarging the receptive field of the convolution block. One or more of the convolution blocks may comprise a pooling layer, e.g. implementing max pooling.
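In PyTorch, for example, an atrous convolution is an ordinary convolution with a dilation rate greater than one, giving a larger receptive field for the same number of weights (the channel sizes below are placeholders):

```python
import torch
import torch.nn as nn

# Standard 3x3x3 convolution: receptive field of 3 voxels per axis.
standard = nn.Conv3d(16, 16, kernel_size=3, padding=1, dilation=1)

# Atrous 3x3x3 convolution with rate 2: same number of weights,
# but the receptive field grows to 5 voxels per axis.
atrous = nn.Conv3d(16, 16, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 16, 32, 32, 32)
assert standard(x).shape == atrous(x).shape  # spatial size preserved in both cases
```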

In some embodiments, the machine learning model comprises a Siamese Encoder Decoder Structure. The (e.g. encoder stage of the) model may comprise a first and a second branch (e.g. of convolution blocks), wherein the first branch is arranged to receive the first (training) image and the second branch is arranged to receive the second (training) image. The (e.g. convolution blocks of the) first branch and the (e.g. the convolution blocks of the) second branch may be configured to extract the first set of features and the second set of features respectively from the first and second (training) images. The first set and/or the second set of features may comprise information for identifying the respective location (e.g. co-ordinates) of the features within the images.

The parameters (e.g. weights) between the first branch and the second branch may be shared. The provision of two branches allows for the sharing of learned features between the first and second images. The model may be configured to concatenate the respective outputs of the first and second branches. The respective outputs of the first and second branches (or the output of the concatenate block) may be provided to the decoder stage of the model. Preferably, the (e.g. encoder stage of the) machine learning model further comprises an inter-branch module configured to compare the first and second sets of features to identify the respective first and second subsets of features that are common to both the first and second sets of features. The inter-branch module may be known as a Mutual Non-Local link (MNL) module, as the features extracted to the first and second subsets are mutual (i.e. common to both the first and second images) and non-local (i.e. the features may exist in any part of the first or second images). The inter-branch module may comprise one or more inter-branch blocks.
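Parameter sharing between the two branches can be realised simply by applying the same encoder module to both inputs, e.g. (a minimal sketch; ResDownBlock is the hypothetical block sketched earlier, and the channel sizes are assumptions):

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """One set of weights applied to both images, so the two branches
    share parameters by construction."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            ResDownBlock(1, 16),
            ResDownBlock(16, 32),
            ResDownBlock(32, 64),
        )

    def forward(self, first_image, second_image):
        feats_1 = self.blocks(first_image)    # first set of features
        feats_2 = self.blocks(second_image)   # second set of features
        return torch.cat([feats_1, feats_2], dim=1)  # concatenated branch outputs
```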

The number of convolution blocks and/or inter-branch blocks in the model may depend on the size of memory available. It may depend on the size of the first and second images. In some embodiments, the first branch of the model comprises e.g. between four and eight, e.g. six convolution (e.g. Res-down) blocks. In some embodiments, the second branch of the model comprises between two and ten, e.g. between four and eight, e.g. six convolution (e.g. Res-down) blocks. Preferably the first branch comprises the same number of convolution blocks as the second branch. The convolution blocks may be arranged in series, such that the output of a former block is provided as an input to a latter block. The (e.g. inter-branch blocks of the) inter-branch module may be (e.g. respectively) connected to one or more (e.g. four) of the convolution blocks of the first and second branches. In some embodiments, the inter-branch module is connected to one or more (e.g. four) latter convolution blocks of the first and second branches. The inter-branch module may be connected to all of the convolution blocks of the first and second branches.

The first image and the second image (and thus, the first set of features and the second set of features respectively) may comprise similar “inter-branch” features, i.e. where, for a particular feature in the first image, there is a similar feature in the second image. The first image and/or the second image may comprise similar “intra-branch” features, i.e. where, for a particular feature in the first image (or the second image), there is a similar feature in the same image. It will be understood that the presence of similar intra-branch features is not necessarily indicative of a similarity between the first image and the second image.

The (e.g. inter-branch blocks of the) inter-branch (e.g. MNL) module may implement an extended attention mechanism that covers mutual connection between the first and second branches. A naive approach to identify common features between the first image and the second image might be to concatenate the first image and the second image to generate a concatenated image, and then, in an attention mechanism, use the concatenated image as a query and a copy of the concatenated image as a key. However, using this method, both inter-branch features and intra-branch features may be identified.

In embodiments of the present invention, however, preferably the inter-branch (e.g. MNL) module is configured to use, in an extended ‘mutual attention’ mechanism, the first set of features as a query, and the second set of features as a key. The inter-branch module may be configured to use the second set of features as a query, and the first set of features as a key. Preferably the inter-branch module is configured to do both, i.e. in a first operation, to use the first set of features as a query and the second set of features as a key and, in a second operation, to use the second set of features as a query and the first set of features as the key. This can avoid the scenario in which intra-branch features are identified unnecessarily.

Thus, the inter-branch module may be configured to identify similar “inter-branch” features between the first image (e.g. the first set of features) and the second image (e.g. the second set of features), but to exclude (e.g. from the first and second subsets) similar “intra-branch” features, as defined above. By excluding intra-branch features from the first and/or second subsets of features, the accuracy of the model may be improved.

The inter-branch (e.g. MNL) module is preferably configured to calculate an embedded Gaussian similarity representation from the first and second sets of features. This can allow the model to highlight the correspondence between features of the first and second images with high precision.

Preferably, the inter-branch module is configured to generate the mapping between each of the first subset of features and a corresponding feature of the second subset of features. The inter-branch module may comprise a retrieval mapping function arranged to receive the respective outputs from (e.g. one or more convolution layers in) the (e.g. (encoder stage) convolution blocks of the) first branch and the second branch. This allows first and second sets of features to be compared in order to generate the mapping. Preferably, the machine learning model is configured to generate the mapping based on a similarity between each of the first subset of features and the corresponding feature of the second subset of features, e.g. a particular feature in the first subset may be mapped to a most similar feature in the second subset.

Preferably the regression stage comprises one or more (e.g. two) fully connected layers configured to use the mapping to determine the predicted transformation for aligning the first image with the second image. The predicted transformation may be determined using location information (e.g. coordinates) for each of the first (or second) subset of features.
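For instance, the regression stage could be two fully connected layers mapping the decoded feature map to six values, three for rotation and three for translation (a sketch; the hidden size and the six-parameter output are assumptions):

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Two fully connected layers producing the predicted transformation."""
    def __init__(self, in_features, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, 6)   # 3 rotation + 3 translation parameters
        self.act = nn.ReLU(inplace=True)

    def forward(self, feature_map):
        x = torch.flatten(feature_map, start_dim=1)  # restore a flat feature vector
        x = self.act(self.fc1(x))
        out = self.fc2(x)
        rotation, translation = out[:, :3], out[:, 3:]
        return rotation, translation
```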

In some embodiments, the method comprises cascading the machine learning model and one or more (e.g. substantially similar) further machine learning models. The method may comprise applying the predicted transformation to the first image (or the second image) to generate an adjusted first image (or adjusted second image), and inputting the adjusted first image (or adjusted second image) and the second image (or first image) to the further machine learning model. Cascading two or more models in this manner can improve the prediction accuracy of the multi-stage model. The first model may be used to provide a coarse alignment of the first image and the second image. The further model(s) may be used to provide a finer alignment of the first image and the second image.
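Cascading might be arranged along the following lines, where each stage predicts a residual transformation for the image aligned by the previous stage (a sketch; apply_transform is a hypothetical resampling helper, not something defined in this application):

```python
def cascade_align(first_image, second_image, models, apply_transform):
    """Run a coarse-to-fine cascade: each model predicts a residual
    transformation for the image aligned by the previous stage."""
    moving = first_image
    transforms = []
    for model in models:                          # first model = coarse, later models = fine
        rotation, translation = model(moving, second_image)
        moving = apply_transform(moving, rotation, translation)
        transforms.append((rotation, translation))
    return moving, transforms
```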

The computer processing system of the present invention may comprise a PC, laptop, tablet, smartphone or any other suitable processing device. The computer processing system may comprise one or more processors. The computer processing system may comprise memory for storing data (e.g. the first and/or second images or parameters for the model). The computer processing system may comprise memory for storing computer software for execution by one or more processors of the processing system.

When viewed from a further aspect, the invention provides a computer readable storage medium storing computer software code comprising instructions which, when executed on a computer processing system, cause the computer processing system to train a model for use in image alignment by: receiving a first training image and a second training image, wherein the first training image is misaligned with the second training image by a known transformation; inputting the first training image and the second training image to the model; extracting a first set of features from the first training image; extracting a second set of features from the second training image; comparing the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generating a mapping between each of the first subset of features and a corresponding feature of the second subset of features; using the mapping to determine a predicted transformation for aligning the first training image with the second training image; updating weights of the model by comparing the predicted transformation with the known transformation.

When viewed from a further aspect, the invention provides a computer readable storage medium storing computer software code comprising instructions which, when executed on a computer processing system, cause the computer processing system to implement a machine learning model for use in image alignment by: inputting a first image and a second image to a machine learning model; using the machine learning model to: extract a first set of features from the first image; extract a second set of features from the second image; compare the first and second sets of features to identify respective first and second subsets of one or more features that are common to both the first and second sets of features; generate a mapping between each of the first subset of features and a corresponding feature of the second subset of features; and use the mapping to determine a predicted transformation for aligning the first image with the second image.

The computer processing system may comprise one or more input peripherals for accessing (e.g. receiving) the first and/or second (e.g. training) images. The computer processing system may comprise a camera or (e.g. CT) scanner for capturing the first and/or second images.

The software may be stored on a non-transitory computer-readable storage medium, such as a hard-drive, a CD-ROM, a solid-state memory, etc., or may be communicated by a transitory signal such as data over a network.

Features of any aspect or embodiment described herein may, wherever appropriate, be applied to any other aspect or embodiment described herein.

Where reference is made to different embodiments or sets of embodiments, it should be understood that these are not necessarily distinct but may overlap.

Certain preferred embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a process diagram illustrating the alignment process according to an embodiment of the present invention;

Figures 2a-d show more detailed block diagrams of the neural network shown in Figure 1;

Figure 3 is a flowchart of the steps of running the neural network of Figure 1 in ‘inference mode’;

Figure 4 is a flowchart of the steps for training the neural network of Figure 1;

Figure 5 is a process diagram illustrating the process of training and implementing the neural network of Figure 1; and

Figure 6 shows the results of testing a neural network in accordance with an embodiment of the present invention.

There are many different industrial situations in which there is a desire to quantify the misalignment between a pair of images, e.g. and to subsequently align the images. As will now be described, embodiments of the present invention provide systems and methods that are able to achieve this.

Figure 1 shows a process diagram illustrating the alignment of a first image volume 2 and a second image volume 4. The first and second image volumes 2, 4 are Computer Tomography (CT) scans of a mouse tibia. The size s of the three-dimensional (3D) image volumes 2, 4 is given by s = d × h × w, where d, h and w refer respectively to the thickness, height and width of the input volumes 2, 4.

The first volume 2 is a scan of the tibia with contrast-enhancement, while the second volume 4 is a scan of the same tibia without contrast-enhancement. By super-imposing the first volume 2 onto the second volume 4, the thickness of the cartilage in the tibia can be determined.

However, as the first and second volumes 2, 4 are obtained from separate scans, it cannot be guaranteed (and indeed it is unlikely) that the two volumes 2, 4 will be in alignment in terms of either rotation or translation.

The first volume 2 and the second volume 4 are processed, augmented and synthesised using intensity and spatial transformations, and are then provided as inputs to a neural network 6.

The neural network 6 comprises an encoder stage 8, a decoder stage 10 and a regression stage 12.

The encoder stage 8 comprises two Siamese encoder branches 9a, 9b of six residual down-sampling (Res-down) blocks 14, wherein the Res-down blocks 14 have shared parameters between the two branches 9a, 9b. The encoder stage 8 further comprises four Mutual Non-Local link (MNL) blocks 16 that mutually link the four latter pairs of the Res-down blocks 14a.

The decoder stage 10 comprises a concatenation block 18 and four residual up-sampling (Res-up) blocks 20 with skip connections received from the Res-down blocks 14 with the same indices.

The regression stage 12 comprises two fully connected layers 22 for regression, the output from which is a rotation value θ and a translation value t. The rotation θ and translation t values output by a trained model are those values which, when applied to the first volume 2, rotate and translate the first volume 2 into (or towards) alignment with the second volume 4. Applying the predicted rotation θ and translation t to the first volume 2 and superimposing the resulting volume with the second volume 4 gives the aligned output image volumes 24.

The structure of the neural network 6 is shown in more detail in Figures 2a-d.

Figure 2a shows an overview of the structure of the neural network 6 of Figure 1. Figure 2b shows the structure of each Res-down block 14. Figure 2c shows the structure of each Res-down block 14a that is linked with the MNL block 16. Figure 2d shows the structure of each Res-up block 20.

In Figures 2a-2d, the concatenation symbol denotes a concatenate function, ⊗ is a matrix product, ⊕ is a pixel-wise addition, Φ is a retrieval function and Ψ is a unary function. C3/a refers to a 3x3x3 convolution with atrous rate = a, A refers to the activation function, T refers to ‘transpose’, MP refers to a max-pool operation, and FC refers to a fully-connected layer. i is the block number; d, h, w and c respectively denote the thickness, height, width and channel number of the input volume/feature maps for each branch. Solid arrows indicate data forward connections, whereas dotted arrows indicate skip connections.

Operation of the neural network 6 in inference mode (i.e. once it has been trained) will now be described with reference to Figures 1 and 2a-d, as well as the flowchart of Figure 3.

In a first step S101, a trained neural network model is initialised. The method of training the neural network 6 is described below with reference to Figure 4.

In a second step S102, the pair of image volumes 2, 4 to be aligned are input to the trained model. The first volume 2 is the “moving” volume, as it is the volume to which the predicted rotation θ and translation t values will be applied in order to align the first volume 2 with the “fixed” second volume 4.

Each Res-down block 14 comprises a plurality of convolution layers followed by an activation layer. In a third step S103, the Res-down blocks 14 are configured to use convolution techniques to extract feature points 26 from the first volume 2 and the second volume 4. The first branch 9a is configured to extract a first set of feature points 26 from the first volume 2, while the second branch 9b is configured to extract a second set of feature points 26 from the second volume 4.

The feature points 26 are distinctive geometries located within the image volumes 2, 4 that can be used as guides for determining how the first volume 2 should be rotated and/or translated in order to be mapped onto the second volume 4. There may be feature points 26 within the first set that do not appear in the second set, and vice versa.

In a fourth step S104, the MNL blocks 16 are configured to identify the feature points 26 that are common to both the first set of feature points 26 and the second set of feature points 26, and to determine a mapping between the common feature points 26.

MNL is generally defined by the following:

Y_m2f := Φ(Y_m, Y_f) ⊗ Ψ(Y_f),   Y_f2m := Φ(Y_f, Y_m) ⊗ Ψ(Y_m)     (1)

where ⊗ denotes the matrix multiplication; Y_m, Y_f denote the respective first and second sets of features from the two branches 9a, 9b; Y_m2f, Y_f2m are the output signals from an MNL block 16; Φ is a retrieval function for the similarity measurement between the two input volumes 2, 4, based on the embedded Gaussian similarity representation for retrieving:

Φ(Y_1, Y_2) := softmax(Y_1^T ⊗ W^T ⊗ W ⊗ Y_2)     (2)

and Ψ is a unary function:

Ψ(Y) := W ⊗ Y     (3)

for applying a matrix of trainable weights W to a set of features Y. The use of the embedded Gaussian function as the retrieval function may allow the neural network 6 to highlight the correspondence between features of the two input volumes 2, 4 more precisely than other retrieval functions, as the softmax function is focussed on a narrower range of features.

The MNL blocks 16 are configured to establish inter-branch links between feature points 26 (i.e. features that are common to both input volumes 2, 4), but to ignore intra-branch links (i.e. features that occur multiple times in an input volume 2, 4). In the MNL blocks 16, two matching matrices, from the second input branch 9b to the first input branch 9a and the inverse, are computed by a retrieval mapping Φ of each pair of inter-branch feature vectors, and are used to correspond and connect the voxels of the feature maps between the two branches 9a, 9b. This means that the MNL blocks 16 are able to capture the global-range connection of similar features between the two branches 9a, 9b. The retrieval function Φ only takes into account the features that are common to both input volumes 2, 4.
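A minimal sketch of an MNL block along the lines of equations (1) to (3), written in PyTorch, might be as follows (the flattening of the feature maps to a (batch, voxels, channels) layout and the embedding dimension are assumptions):

```python
import torch
import torch.nn as nn

class MutualNonLocal(nn.Module):
    """Mutual attention between two branches: each branch's features are
    used as a query against the other branch's features as a key."""
    def __init__(self, channels, embed=64):
        super().__init__()
        self.W = nn.Linear(channels, embed, bias=False)            # embedding used in Phi
        self.W_unary = nn.Linear(channels, channels, bias=False)   # unary function Psi

    def phi(self, y1, y2):
        # Embedded Gaussian similarity: softmax over the matching matrix of
        # the embedded feature vectors (query y1 against key y2).
        return torch.softmax(self.W(y1) @ self.W(y2).transpose(-1, -2), dim=-1)

    def forward(self, y_m, y_f):
        # y_m, y_f: (batch, voxels, channels) flattened feature maps.
        y_m2f = self.phi(y_m, y_f) @ self.W_unary(y_f)  # first features as query, second as key
        y_f2m = self.phi(y_f, y_m) @ self.W_unary(y_m)  # second features as query, first as key
        return y_m2f, y_f2m
```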

Although the down-sampling in the encoder stage 8 allows features within each input volume 2, 4 to be determined and mapped together, this operation also causes the loss of spatial information. In a fifth step S105, the Res-up blocks 20 of the decoder stage 10 restore this spatial information of the feature points 26, and the fully connected layers are configured to obtain the predicted rotation θ and translation t values that are output from the network 6.

In a sixth step S106, the first input volume 2 is rotated by θ and translated by t to align the first volume 2 with the second volume 4. The aligned volumes 24 are output in a final step S107.

Figure 4 shows a flowchart of the steps for training the neural network 6 of Figure 1.

In a first step S201, the network is initialised and default internal parameters are set.

In a second step S202, two training image volumes 3, 5 from a set of training volumes are selected and are input to the neural network 6. Each pair of training volumes 3, 5 are initially aligned, but are subsequently misaligned by applying a spatial transform f_syn to one 3 (i.e. the “moving” volume) of each pair. The spatial transform function f_syn introduces a known synthetic rotation θ and a known synthetic translation t, which are applied to the moving volume 3 before the training volumes 3, 5 are input to the neural network 6. Different transform functions f_syn are applied to each pair of training volumes 3, 5 so that the neural network 6 is trained on training data with a wide range of variability. This helps to improve the accuracy of the neural network 6 when it is run in inference mode. In subsequent steps S203 - S205, the neural network 6 operates in substantially the same way as in inference mode, which is described above with reference to steps S103 - S105 of Figure 3. Thus, the network 6 is configured to output a predicted rotation value θ and a predicted translation value t.

In a sixth step S206, the predicted rotation θ and translation t values are compared, using a loss function ℒ, with the synthetic rotation θ and translation t values that were actually applied to the moving volume 3.

In a seventh step S207, the parameters of the neural network 6 are updated using back propagation according to the output of the loss function ℒ, and it is determined in a subsequent step S208 whether the neural network 6 is sufficiently trained.

If it is determined in step S208 that further training is required, then the process returns to step S202, and a further pair of training volumes 3, 5 is input to the neural network 6, with a different synthetic rotation θ and synthetic translation t having been applied to the moving volume 3.

Once it is determined in step S208 that no further training is required, the parameters of the model 6 are saved. The neural network 6 may now be run in inference mode in order to predict a rotation θ and translation t for mapping two input volumes of unknown misalignment.

Multiple networks 6 may be trained with different respective training sets having varying ranges of synthetic rotation and translation. The networks 6 may then be cascaded into a multi-stage model, which can then be run in inference mode to align pairs of volumes with greater accuracy.

Figure 5 is a process diagram illustrating the process of training and implementing the neural network 6 of Figure 1. Figure 5 shows an implementation in which trained models 6 to 6n are cascaded in order to improve the accuracy in the alignment of the output aligned volumes 24. Each subsequent network 6a-n in the multi-stage model enables progressively finer alignment of the input volumes 2, 4.

It will be appreciated by those skilled in the art that many variations and modifications to the embodiments described above may be made within the scope of the various aspects and embodiments of the invention set out herein.

Experimental Results

One hundred ex-vivo micro-CT volumes were acquired from the tibiae of fifty subjects (mice) with a volume size and resolution of 512 × 512 × 512 and 10 × 10 × 10 μm³/vox. The subjects varied from zero weeks to twenty weeks post osteoarthritic surgery. Before the data acquisition, each subject was cut to a suitable size for the machine. The tibial bones were scanned twice: immediately after being contrasted, and again over forty-eight hours after washing out the contrast.

The collected volumes were thresholded to eliminate the impacts of their spatial transformations on network prediction. The voxel values of the volumes were then normalised into the range 0-1 for stable gradient propagation. Finally, the input volumes were sub-sampled with linear interpolation to a volume size of 64 × 64 × 64 and a resolution of 80 × 80 × 80 μm³/vox in order to fit the scale of the network.
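Such a pre-processing pipeline might be sketched as follows (the threshold value and the use of scipy.ndimage.zoom for the linear-interpolation sub-sampling are assumptions):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, threshold=0.1, target_shape=(64, 64, 64)):
    """Threshold, normalise to the range 0-1, and sub-sample a CT volume."""
    v = np.where(volume > threshold, volume, 0.0)        # thresholding
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)       # normalise voxel values to 0-1
    factors = [t / s for t, s in zip(target_shape, v.shape)]
    return zoom(v, factors, order=1)                     # linear interpolation to 64^3
```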

In the training phase, each pre-processed CT volume X was augmented before being fed into the network by random translation with a displacement along the three axes, rotation around an arbitrary axis uniformly distributed over the sphere surface, intensity scaling with a random coefficient, and application of a threshold value, in order to extend the dataset size. Then the two inputs of the network, the fixed volume X_f := X and the moving volume X_m := f_syn(X), were synthesised from the augmented volume X and the synthetic transformation f_syn for training, as described above. The network was trained for various numbers of iterations to determine whether it converged.

A multi-stage (two-stage) network was also tested in a synthetic and a real testing setting.

For large-range transformations (LRT), the one-stage neural network achieved a rotation error (RE) of less than 5° in around 88% of cases, a sub-voxel translation error (TE) in around 68% of cases and a Dice Similarity Coefficient (DSC) of over 80% in around 64% of cases.

For small-range transformations (SRT), the one-stage network achieved an RE of less than 5° in over 99% of cases, and a sub-voxel TE in over 80% of cases.

The multi-stage (two-stage) network was found to outperform the one-stage network. In over 97% of cases, RE < 5° and sub-voxel TE were achieved.

Figure 6 shows the output aligned volumes 124 for four pairs of input volumes 102 used in a real experiment, for a one-stage network (g1) (left) and a multi-stage (two-stage) network (g2 ∘ g1) (right). Each pair of input volumes 102 is shown with the volumes superimposed, so that the misalignment can be seen.

The top row (D-net) of outputs 124 shows the results of networks using standard convolution, whereas the bottom row (Atrous D-net) shows the results of networks using atrous convolution.

It can be seen that there is little overlap between the volume pairs in the initial input volumes 102, compared with substantially greater overlap in the output aligned volumes 124.

It can also be seen that both the one-stage network and the multi-stage network were able to substantially align all of the pairs of input volumes, with the multi-stage network providing an improved alignment in most cases.