Title:
METHOD OF PARALLEL IMPLEMENTATION IN DISTRIBUTED MEMORY ARCHITECTURES
Document Type and Number:
WIPO Patent Application WO/2022/162386
Kind Code:
A1
Abstract:
The present techniques relate to a method for a parallel implementation of a sequential Monte Carlo (SMC) method of modelling an industrial process on a distributed memory architecture, and a system for implementing the same. The method may comprise receiving, from at least one sensor, a measurement of at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and implementing, on a server comprising a distributed memory architecture, a sequential Monte Carlo (SMC) process using a plurality of statistically independent particles and the at least one measured parameter to estimate the true state of the physical system, wherein the distributed memory architecture has a plurality of cores each of which is ranked. The method and architecture provide an efficient parallel implementation by effectively parallelising a redistribute step which may be considered to be a constituent part of a resampling step. The SMC method may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise.

Inventors:
VARSI ALESSANDRO (GB)
MASKELL SIMON (GB)
Application Number:
PCT/GB2022/050238
Publication Date:
August 04, 2022
Filing Date:
January 28, 2022
Assignee:
UNIV LIVERPOOL (GB)
International Classes:
G06F9/50
Other References:
A. Varsi et al., "A Single SMC Sampler on MPI that Outperforms a Single MCMC Sampler", 24 May 2019.
A. Varsi et al., "A Fast Parallel Particle Filter for Shared Memory Systems", IEEE Signal Processing Letters, vol. 27, 2020, pages 1570-1574, DOI: 10.1109/LSP.2020.3014035.
F. Gustafsson, "Particle filter theory and practice with positioning applications", IEEE Aerospace and Electronic Systems Magazine, vol. 25, July 2010, pages 53-82.
R. Delgado-Gonzalo, N. Chenouard, M. Unser, "A new hybrid Bayesian-variational particle filter with application to mitotic cell tracking", 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, March 2011, pages 1917-1920, DOI: 10.1109/ISBI.2011.5872784.
J. M. Costa, "Estimation of tumor size evolution using particle filters", Journal of Computational Biology, 2015.
Z. Wu, S. Li, "Reliability evaluation and sensitivity analysis to AC/UHVDC systems based on sequential Monte Carlo simulation", IEEE Transactions on Power Systems, vol. 34, July 2019, pages 3156-3167, DOI: 10.1109/TPWRS.2019.2896228.
P. J. van Leeuwen, H. Künsch, L. Nerger, R. Potthast, S. Reich, "Particle filters for high dimensional geoscience applications: A review", Quarterly Journal of the Royal Meteorological Society, vol. 145, 2019.
H. Lopes, R. Tsay, "Particle filters and Bayesian inference in financial econometrics", Journal of Forecasting, vol. 30, 2011, pages 168-209.
C. A. Naesseth, F. Lindsten, T. B. Schön, "High-dimensional filtering using nested sequential Monte Carlo", IEEE Transactions on Signal Processing, vol. 67, August 2019, pages 4177-4188, DOI: 10.1109/TSP.2019.2926035.
J. Zhang, H. Ji, "Distributed multi-sensor particle filter for bearings-only tracking", International Journal of Electronics, vol. 99, 2012, pages 239-254.
F. Lopez, L. Zhang, J. Beaman, A. Mok, "Implementation of a particle filter on a GPU for nonlinear estimation in a manufacturing remelting process", 2014 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, July 2014, pages 340-345, DOI: 10.1109/AIM.2014.6878102.
F. Lopez, L. Zhang, A. Mok, J. Beaman, "Particle filtering on GPU architectures for manufacturing applications", Computers in Industry, vol. 71, 2015, pages 116-127, DOI: 10.1016/j.compind.2015.03.013.
L. M. Murray, A. Lee, P. E. Jacob, "Parallel resampling in the particle filter", Journal of Computational and Graphical Statistics, vol. 25, no. 3, 2016, pages 789-805.
A. Varsi, J. Taylor, L. Kekempanos, E. Pyzer-Knapp, S. Maskell, "A fast parallel particle filter for shared memory systems", IEEE Signal Processing Letters, vol. 27, 2020, pages 1570-1574, DOI: 10.1109/LSP.2020.3014035.
M. Bolic, P. M. Djuric, Sangjin Hong, "Resampling algorithms and architectures for distributed particle filters", IEEE Transactions on Signal Processing, vol. 53, July 2005, pages 2442-2450, DOI: 10.1109/TSP.2005.849185.
M. Bolic, A. Athalye, S. Hong, P. Djuric, "Study of algorithmic and architectural characteristics of Gaussian particle filters", Signal Processing Systems, vol. 61, 2010, pages 205-218.
R. Zhu, Y. Long, Y. Zeng, W. An, "Parallel particle PHD filter implemented on multicore and cluster systems", Signal Processing, vol. 127, October 2016, pages 206-216.
F. Bai, F. Gu, X. Hu, S. Guo, "Particle routing in distributed particle filters for large-scale spatial temporal systems", IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 2, 2016, pages 481-493, DOI: 10.1109/TPDS.2015.2405912.
K. Heine, N. Whiteley, A. Cemgil, "Parallelizing particle filters with butterfly interactions", Scandinavian Journal of Statistics, vol. 47, no. 2, 2020, pages 361-396.
S. Sutharsan, T. Kirubarajan, T. Lang, M. McDonald, "An optimization-based parallel particle filter for multitarget tracking", IEEE Transactions on Aerospace and Electronic Systems, vol. 48, April 2012, pages 1601-1618.
A. Varsi, L. Kekempanos, J. Thiyagalingam, S. Maskell, "Parallelising particle filters with deterministic runtime on distributed memory systems", IET Conference Proceedings, 2017, pages 11-18.
O. Demirel, I. Smal, W. Niessen, E. Meijering, I. Sbalzarini, "PPF - a parallel particle filtering library", IET Conference Publications, vol. 2014, no. 10, 2013.
S. Maskell, B. Alun-Jones, M. Macleod, "A single instruction multiple data particle filter", IEEE Nonlinear Statistical Signal Processing Workshop, 2006, pages 51-54.
J. Thiyagalingam, L. Kekempanos, S. Maskell, "MapReduce particle filtering with exact resampling and deterministic runtime", EURASIP Journal on Advances in Signal Processing, vol. 2017, October 2017, page 71.
A. Varsi, L. Kekempanos, J. Thiyagalingam, S. Maskell, "A single SMC sampler on MPI that outperforms a single MCMC sampler", arXiv:1905.10252, 2019.
M. Arulampalam, S. Maskell, N. Gordon, T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking", IEEE Transactions on Signal Processing, vol. 50, no. 2, 2002, pages 174-188.
J. D. Hol, T. B. Schön, F. Gustafsson, "On resampling algorithms for particle filters", 2006 IEEE Nonlinear Statistical Signal Processing Workshop, September 2006, pages 79-82.
T. Li, M. Bolic, P. M. Djuric, "Resampling methods for particle filtering: Classification, implementation, and strategies", IEEE Signal Processing Magazine, vol. 32, no. 3, 2015, pages 70-86, DOI: 10.1109/MSP.2014.2330626.
A. Hegyi, L. Mihaylova, R. Boel, Z. Lendek, "Parallelised particle filtering for freeway traffic state tracking", 2007 European Control Conference (ECC), 2007, pages 2442-2449.
K. Emami, T. Fernando, H. H. Lu, H. Trinh, K. P. Wong, "Particle filter approach to dynamic state estimation of generators in power systems", IEEE Transactions on Power Systems, vol. 30, no. 5, 2015, pages 2665-2675.
A. Doucet, S. Sénécal, "Fixed-lag sequential Monte Carlo", 2004 12th European Signal Processing Conference, 2004, pages 861-864.
P. Del Moral, A. Doucet, A. Jasra, "Sequential Monte Carlo samplers", Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 68, no. 3, 2006, pages 411-436.
T. L. T. Nguyen, F. Septier, G. W. Peters, Y. Delignon, "Efficient sequential Monte Carlo samplers for Bayesian inference", IEEE Transactions on Signal Processing, vol. 64, March 2016, pages 1305-1319, DOI: 10.1109/TSP.2015.2504342.
K. Petersen, M. Nielsen, S. Brandt, "A static SMC sampler on shapes for the automated segmentation of aortic calcifications", in Computer Vision - ECCV 2010, Springer, 2010, pages 666-679.
X. Luo, T. Kitasaka, K. Mori, "ManiSMC: A new method using manifold modeling and sequential Monte Carlo sampler for boosting navigated bronchoscopy", in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2011, Springer, 2011, pages 248-255.
F. Nielsen, Introduction to HPC with MPI for Data Science, September 2016.
S. White, N. Verosky, T. Newhall, "A CUDA-MPI hybrid bitonic sorting algorithm for GPU clusters", 2012 41st International Conference on Parallel Processing Workshops, September 2012, pages 588-589, DOI: 10.1109/ICPPW.2012.82.
E. Jeremiah, S. Sisson, L. Marshall Price, R. Mehrotra, A. Sharma, "Bayesian calibration and uncertainty analysis of hydrological models: A comparison of adaptive Metropolis and sequential Monte Carlo samplers", Water Resources Research, vol. 47, July 2011.
G. Zhu, X. Li, J. Ma, Y. Wang, S. Liu, C. Huang, K. Zhang, X. Hu, "A new moving strategy for the sequential Monte Carlo approach in optimizing the hydrological model parameters", Advances in Water Resources, vol. 114, 2018, pages 164-179.
Attorney, Agent or Firm:
APPLEYARD LEES IP LLP (GB)
Claims:
Claims

1. A method for estimating a true state of a physical system, the method comprising receiving, from at least one sensor, a measurement of at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and implementing, on a server comprising a distributed memory architecture, a sequential Monte Carlo (SMC) process using a plurality of statistically independent particles and the at least one measured parameter to estimate the true state of the physical system, wherein the distributed memory architecture has a plurality of cores each of which is ranked; wherein implementing the SMC method comprises determining the number of copies of particles required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture, wherein redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process; wherein after redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.

2. The method of claim 1, wherein moving the particles using the rotational nearly sort process comprises determining, for each particle to be duplicated, the number of particles in lower ranked cores which are not to be duplicated; and obtaining, for each particle to be duplicated, an associated binary expression of the number of particles which are not to be duplicated.

3. The method of claim 2, wherein moving the particles which are to be duplicated is an iterative process comprising scanning, for each of the particles to be duplicated, a bit of the associated binary expression; when the scanned bit is equal to one, rotating the associated particle to a lower ranked core; and repeating in sequence the scanning and rotating steps for each bit.

4. The method of claim 3, wherein the iterative process initially scans a least significant bit of each binary expression and finally scans a most significant bit of each binary expression.

5. The method of any one of the preceding claims, wherein filling the gaps using the rotational split process comprises computing a minimum shift value for each of the particles to be duplicated, wherein the minimum shift value is a value representing the minimum number of shifts each of the particles must take to securely distance itself from the particles in the lower ranked cores; and computing a maximum shift value for each of the particles to be duplicated, wherein the maximum shift value is a value representing the maximum number of shifts each of the particles may take to end up in a gap.

6. The method of claim 5, wherein computing the minimum shift value associated with a particle to be duplicated comprises summing, for each lower ranked core, one less than the number of copies which are required of each particle in the lower ranked cores.

7. The method of claim 6, wherein computing the maximum shift value associated with a particle to be duplicated comprises summing the minimum shift value with one less than the number of copies which are required of the associated particle.

8. The method of any one of claims 5 to 7, comprising obtaining, for each particle to be duplicated, an associated binary expression of the minimum shift value and the maximum shift value.

9. The method of claim 8, wherein creating the gaps is an iterative process comprising scanning, for each of the particles to be duplicated, a bit of the associated binary expression of the minimum shift value and the maximum shift value; when both scanned bits are equal to one, rotating the associated particle to a higher ranked core by a number of positions corresponding to the scanned bits; and repeating in sequence the scanning and rotating steps for each bit.

10. The method of claim 9, further comprising, when only one scanned bit is equal to zero, splitting the number of copies to be duplicated to determine excess duplicates and rotating the excess duplicates to a higher ranked core by a number of positions corresponding to the scanned bits.

11. The method of any one of the preceding claims, wherein filling the remaining gaps in each core using a sequential redistribute process comprises duplicating particles within the memory of each core.

12. The method of any preceding claim, further comprising obtaining values for each of the particles; computing weights for each of the particles, wherein the weight indicates the resemblance of the obtained values to the true state; normalising the computed weights; determining the number of copies which are required based on the normalised weights; and, after redistributing the particles, resetting the weights.

13. The method of claim 12, wherein determining the number of copies ncopies_i required for each particle i is computed from:

ncopies_i = ⌈N · cdf_i − u⌉ − ⌈N · cdf_{i−1} − u⌉

where cdf_i is the cumulative density function of the weights up to and including the i-th particle and u is drawn from a uniform distribution with u ~ [0,1).

14. The method of any preceding claim, wherein the SMC process is selected from a particle filter, a fixed-lag SMC sampler and an SMC sampler.

15. An architecture for estimating a true state of a physical system, the architecture comprising at least one sensor for measuring at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and a server comprising a distributed memory architecture having a plurality of cores, wherein the cores are uniquely identified by a rank and the cores are configured to implement a sequential Monte Carlo (SMC) process on the distributed memory architecture using a plurality of statistically independent particles to estimate the true state of the physical system using the at least one measured parameter; wherein implementing the SMC process by the cores comprises determining the number of copies of particles required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture, wherein redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process; wherein after redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.

Description:
Method of Parallel Implementation in Distributed Memory Architectures

Field

[0001] The present techniques relate to a method for a parallel implementation of a sequential Monte Carlo method of modelling an industrial process on a distributed memory architecture, and a system for implementing the same.

Background

[0002] Sequential Monte Carlo (SMC) methods are a well-established family of algorithms for solving state estimation problems related to dynamic and static models, given a stream of data (i.e. measurements). The key idea is to employ a user-defined proposal distribution to draw N samples (i.e. particles) which approximate the probability density function of the state of the model. A correction step, called resampling, is then used to correct the particle degeneracy that the sampling technique introduces. The application domain is vast and diverse, ranging, for example, from positioning [1] to medical research [2] [3], risk analysis [4], weather forecasting [5], financial econometrics [6] and, broadly speaking, any research or industrial field in which it is important to collect data and make predictions afterwards.

[0003] Modern applications of SMC methods have increasingly demanding accuracy and run-time constraints which can be met in several ways. For example, the number of particles could be increased [9] [10], more sophisticated proposal distributions could be used as described in [7], or more measurements could be used if possible [8]. However, applying any of these solutions is likely to slow down the run-time significantly, which becomes even more problematic if the constraint on the measurement rate is strict. An alternative way to meet the run-time constraints without losing accuracy is to employ parallel computing.

[0004] SMC methods are parallelisable, but it is not trivial to achieve an efficient parallelisation.
The resampling step, which is necessary to respond to particle degeneracy [24], is a challenging task to parallelise. This is due to the problems encountered in parallelising the constituent redistribute step, whose textbook implementations achieve O(N) time complexity on a single core.

[0005] On Shared Memory Architectures (SMAs), it has been proved that redistribute can achieve O(log N) time complexity. Examples can be found in [9] to [12] for GPUs. However, High Performance Computing (HPC) applications need to use Distributed Memory Architectures (DMAs) to overcome the limitations of modern SMAs, namely low memory capacity and low degree of parallelism (DOP).

[0006] On DMAs, parallelisation is more complicated because the cores cannot directly access each other's memory without exchanging messages. Three DMA solutions (along with mixed versions of them) are presented in [13], [14]: Centralized Resampling (C-R), Resampling with Proportional Allocation (RPA) and Resampling with Non-Proportional Allocation (RNA). These approaches have been re-interpreted and improved several times, such as in [15]-[18]. In C-R, a central unit gathers the particles from all cores, performs redistribution and scatters the result back to the cores, making the communication increase linearly with the degree of parallelism (DOP). RPA is randomised, as redistribute is partly or potentially entirely centralised to a single master core or a subset of master cores, leading to strongly non-deterministic data movement. In RNA, although the central unit is simpler than in RPA, after local resampling is performed, particle exchange between neighbouring cores is repeated cyclically until redistribution is achieved, risking heavy redundant communication. In [18] to [20], it has been proved that such strategies have accuracy or stability issues, especially for unbalanced workloads, large N or large DOP.
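For concreteness, the textbook single-core redistribute mentioned above can be sketched as follows. This is a minimal illustrative implementation, not the parallel algorithm of the present techniques; the function name is invented for illustration:

```python
def sequential_redistribute(particles, ncopies):
    """Textbook O(N) single-core redistribute: emit ncopies[i] copies of
    particle i, producing the resampled population in order."""
    out = []
    for x, c in zip(particles, ncopies):
        out.extend([x] * c)  # particles with c == 0 are dropped
    return out

# e.g. sequential_redistribute(['a', 'b', 'c', 'd'], [2, 0, 1, 1])
# -> ['a', 'a', 'c', 'd']
```

Because the loop touches every particle in turn, the cost is linear in N on one core, which is the bottleneck the parallel redistribute of the present techniques is designed to remove.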
Summary

[0007] According to the present invention there is provided an apparatus and method as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims and the description which follows. We also describe a method for estimating a true state of a physical system, the method comprising receiving, from at least one sensor, a measurement of at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and implementing, on a server comprising a distributed memory architecture, a sequential Monte Carlo (SMC) process using a plurality of statistically independent particles and the at least one measured parameter to estimate the true state of the physical system, wherein the distributed memory architecture has a plurality of cores each of which is ranked. Implementing the SMC method may comprise determining the number of copies of particles required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture. Redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process. After redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.
[0008] We also describe an architecture for estimating a true state of a physical system, the architecture comprising at least one sensor for measuring at least one parameter within the physical system, wherein the at least one parameter is related to the true state of the physical system; and a server comprising a distributed memory architecture having a plurality of cores, wherein the cores are uniquely identified by a rank. The cores are configured to implement a sequential Monte Carlo (SMC) process on the distributed memory architecture using a plurality of statistically independent particles to estimate the true state of the physical system using the at least one measured parameter. Implementing the SMC process by the cores may comprise determining the number of copies of particles required for each particle; and redistributing copies of particles which are to be duplicated across the distributed memory architecture. Redistributing includes moving the particles which are to be duplicated to the cores having lowest ranks to create gaps in the higher ranked cores using a rotational nearly sort process; when the rotational nearly sort process is complete, filling the gaps with the required number of copies of the particles which are to be duplicated using a rotational split process; and when the rotational split process is complete, filling the remaining gaps in each core using a sequential redistribute process. After redistributing, each of the plurality of cores only has copies of the particles which are to be duplicated and the total number of copies of each particle across the plurality of cores within the distributed memory architecture equals the determined number of copies.

[0009] The above method and architecture provide an efficient parallel implementation by effectively parallelising the redistribute step, which may be considered to be a constituent part of a resampling step.
The SMC method may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise. The plurality of particles may be termed hypotheses or samples and may be generated at every step. The plurality of particles may be sampled from a user-defined proposal distribution such that each particle represents the probability density function (pdf) of the true state of the physical system. Each particle may be assigned a weight such that the weights provide information on which particle best describes the real state of the physical system.

[0010] The following features may apply to the method and the architecture.

[0011] The number of particles (N) may be a power of two. Similarly, the number of cores (P) may be a power of two. Each core may own n = N/P elements.

[0012] The rotational nearly sort process may be described as a process which ensures that the particles which are to be duplicated are separated from those which are not to be duplicated. The particles which are not to be duplicated are considered empty spaces (or gaps; the terms may be used interchangeably) which can be substituted with copies of the particles which are to be duplicated. Moving the particles using the rotational nearly sort process may comprise determining, for each particle to be duplicated, the number of particles in lower ranked cores which are not to be duplicated; and obtaining, for each particle to be duplicated, an associated binary expression of the number of particles which are not to be duplicated. Determining the number of particles not to be duplicated may be considered as counting the zeros that each of the particles can see in the lower ranked cores. The number of zeros may be considered as the number of positions to shift by down into the lower ranked cores. Shifting may be shifting to a lower ranked core and also, where the core contains more than one particle, i.e. more than one partition, shifting to a lower partition within a core.
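The zero-counting described above can be illustrated with a short sequential sketch. In the parallel algorithm this count would come from a cumulative sum across cores; the function name is illustrative:

```python
def nearly_sort_shifts(ncopies):
    """For each position, count the not-to-be-duplicated slots ("zeros")
    below it; for a particle to be duplicated (ncopies > 0) this is the
    number of positions it must shift down in the rotational nearly sort."""
    shifts, zeros = [], 0
    for c in ncopies:
        shifts.append(zeros)  # zeros already seen in lower positions
        if c == 0:            # a particle that will not be duplicated
            zeros += 1
    return shifts

# e.g. nearly_sort_shifts([0, 2, 0, 1, 1, 0, 3, 0])
# -> [0, 1, 1, 2, 2, 2, 3, 3]
```

Only the entries belonging to particles with ncopies > 0 are used; the binary expression of each such shift drives the bit-by-bit rotations of the nearly sort.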
[0013] Moving the particles which are to be duplicated may be an iterative process. The iterative process may comprise scanning, for each of the particles to be duplicated, a bit of the associated binary expression; rotating based on the scanned bit; and repeating in sequence the scanning and rotating steps for each bit. When the scanned bit is equal to one, the associated particle may be rotated to a lower ranked core. When the scanned bit is equal to zero, the associated particle may not be rotated (or shifted; the terms are used interchangeably, as are the related nouns shifts and rotations). A least significant bit of each binary expression may be scanned first. Then each of the adjacent bits may be scanned in sequence. Finally, a most significant bit of each binary expression may be scanned.

[0014] There may be an optional leaf shift stage if the number of cores (P) is fewer than the number of particles (N). The leaf shift stage may be performed before the scanning and rotating steps. In the leaf shift stage, a core may transfer all of its particles to a neighbouring core.

[0015] The goal of the rotational split phase is to fill the gaps in a way that balances the workload across the cores. This can be achieved by making enough room between the particles that have to be copied and creating the copies in these new empty spaces. While these two steps could be performed one after the other, here it is shown that it is possible to perform both at the same time.

[0016] Filling the gaps using the rotational split process may comprise computing a minimum shift value for each of the particles to be duplicated and computing a maximum shift value for each of the particles to be duplicated.
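A sequential simulation of the bit-by-bit rotation in the nearly sort might look like the sketch below. It assumes, as in the text, that an element whose bit k is set shifts down by 2^k positions; the names are illustrative and the per-bit moves that would happen in parallel across cores are emulated in a loop:

```python
def rotational_nearly_sort(ncopies):
    """Move particles to be duplicated (ncopies > 0) to the lowest
    positions by scanning their shift values from LSB to MSB and
    shifting down by 2**k whenever bit k of the shift is set."""
    items, zeros = [], 0
    for i, c in enumerate(ncopies):
        if c > 0:
            items.append([i, i, zeros])  # [position, original index, shift]
        else:
            zeros += 1                   # a gap the later particles skip over
    for k in range(max(1, len(ncopies)).bit_length()):
        for it in items:
            if (it[2] >> k) & 1:
                it[0] -= 1 << k          # rotate down by 2**k positions
                it[2] -= 1 << k          # clear bit k of the remaining shift
        # in the parallel algorithm these simultaneous moves never collide
        assert len({it[0] for it in items}) == len(items)
    return [it[1] for it in sorted(items)]  # original indices, packed low

# e.g. rotational_nearly_sort([2, 0, 1, 1, 0, 0, 3, 0]) -> [0, 2, 3, 6]
```

The result is a stable compaction: the particles to be duplicated end up in the lowest positions, in their original order, leaving all gaps in the higher positions.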
The minimum shift value may be a value representing the minimum number of shifts each of the particles must take to securely distance itself from the particles in the lower ranked cores; the maximum shift value may be a value representing the maximum number of shifts each of the particles may take to end up in a gap.

[0017] Computing the minimum shift value associated with a particle to be duplicated may comprise summing, for each lower ranked core, one less than the number of copies which are required of each particle in the lower ranked cores. Computing the maximum shift value associated with a particle to be duplicated may comprise summing the minimum shift value with one less than the number of copies which are required of the associated particle. Using 0-based indexing, the calculations may be:

min_shifts_i = csum_{i−1} − i

max_shifts_i = min_shifts_i + ncopies_i − 1 = csum_i − i − 1

where ncopies_i is the number of copies which is required for each particle, and csum_i is the inclusive cumulative sum over ncopies up to the i-th element. More details are provided below.

[0018] For each particle to be duplicated, an associated binary expression of the minimum shift value and the maximum shift value may be obtained. Creating the gaps may be an iterative process comprising scanning and rotating steps which are repeated. For each of the particles to be duplicated, a bit of the associated binary expression of the minimum shift value and the maximum shift value may be scanned. When both scanned bits are equal to one, the associated particle may be rotated to a higher ranked core by a number of positions corresponding to the scanned bits. When only one scanned bit is equal to zero, the number of copies to be duplicated may be split to determine excess duplicates, and the excess duplicates may be rotated to a higher ranked core by a number of positions corresponding to the scanned bits. When no scanned bit is equal to one, there is no rotation.
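Under the 0-based indexing above, both shift values fall out of one pass over the cumulative sum of ncopies. A small sketch with illustrative names, assuming the particles to be duplicated are already packed at the lowest positions by the nearly sort:

```python
def shift_bounds(ncopies):
    """min_shifts[i] = csum[i-1] - i : room needed below particle i
       max_shifts[i] = csum[i] - i - 1 : last gap its copies may occupy"""
    mins, maxs, csum = [], [], 0
    for i, c in enumerate(ncopies):
        mins.append(csum - i)   # exclusive cumulative sum minus index
        csum += c               # inclusive cumulative sum of ncopies
        maxs.append(csum - i - 1)
    return mins, maxs

# e.g. for ncopies = [3, 1, 2, 1, 1] (N = 8, already nearly sorted):
# mins = [0, 2, 2, 3, 3], maxs = [2, 2, 3, 3, 3]
```

In this example, copy j of particle i finally occupies slot i + mins[i] + j, so the shifts of the copies of particle i span exactly [mins[i], maxs[i]], which is what the bitwise rotational split exploits.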
The rotation may be determined by the bit which is scanned; for example, particles are rotated by decreasing powers of two numbers of bit positions.

[0019] The scanning may be in the opposite direction to the scanning in the rotational nearly sort process. A most significant bit of each binary expression may be scanned first. Then each of the adjacent bits may be scanned in sequence. Finally, a least significant bit of each binary expression may be scanned.

[0020] There may be an optional leaf shift stage if the number of cores (P) is fewer than the number of particles (N). The leaf shift stage may be performed after a certain number of scanning and rotating steps.

[0021] The goal of the sequential redistribute phase is to fill the remaining gaps in each core which are not filled by the rotational split phase. Filling the remaining gaps in each core using a sequential redistribute process may comprise duplicating particles within the memory of each core.

[0022] The SMC method may further comprise obtaining values for each of the particles; computing weights for each of the particles, wherein the weight indicates the resemblance of the obtained values to the true state; normalising the computed weights; determining the number of copies which are required based on the normalised weights; and, after redistributing the particles, resetting the weights. The obtaining, computing, normalising, determining, redistributing and resetting steps may be repeated at each iteration. The values for each of the particles may be obtained from the proposal distribution. This step, along with computing weights for each of the particles, may be termed importance sampling (IS). Determining the number of copies ncopies_i required for each particle i may be computed from:

ncopies_i = ⌈N · cdf_i − u⌉ − ⌈N · cdf_{i−1} − u⌉

where cdf_i is the cumulative density function of the weights up to and including the i-th particle and u is drawn from a uniform distribution with u ~ [0,1).
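The copy counts described above can be computed with a systematic (minimum-variance) resampling rule consistent with the cdf and u ~ [0,1) defined in the text; this is a sketch under that assumption, with an illustrative helper name:

```python
import math

def ncopies_from_weights(weights, u):
    """Systematic resampling counts:
    ncopies[i] = ceil(N*cdf[i] - u) - ceil(N*cdf[i-1] - u),
    with u drawn once from [0, 1). The counts always sum to N."""
    N, total = len(weights), sum(weights)
    counts, cdf, prev = [], 0.0, 0
    for w in weights:
        cdf += w / total               # normalised cumulative weights
        cur = math.ceil(N * cdf - u)
        counts.append(cur - prev)
        prev = cur
    return counts

# e.g. ncopies_from_weights([0.5, 0.25, 0.125, 0.125], u=0.5)
# -> [2, 1, 0, 1]
```

Because the final cumulative value is N − u with u in [0,1), the last ceiling is always N, so the counts sum to N exactly and the redistribute step always has a full population to place.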
The bracket operator is the ceiling function which rounds up the input number to the next integer, unless the input number is an integer already in which case the output of the ceiling function is identical to the input. [0023] There are various embodiments of the SMC process. For example, the SMC process may be selected from a particle filter, a fixed-lag SMC sampler and an SMC sampler. For particle filters, there may be one measurement per iteration and resampling is typically only used if needed. For SMC samplers, there may be only one measurement which is collected, and the iterations are run given the same measurement. The method finalises estimates with recycling and resampling is used only if needed. For fixed-lag SMC, a history of l ≥ 1 measurements may be used at each iteration (where l is the lag). The method samples and resamples trajectories of l + 1 particles. [0024] For a particle filter, the weight may be calculated from: w t i = w t−1 i · p(x t i , Y t | x t−1 i ) / q(x t i | x t−1 i ) where w t i and w t−1 i are the weights for time steps t and t − 1, x t i and x t−1 i are the values of the particles for time steps t and t − 1, p(x t i , Y t | x t−1 i ) is the incremental posterior, commonly called target or posterior distribution, q() is the proposal distribution defined by the user, and Y t is a measurement of the system collected at time t. [0025] For an SMC sampler, the weight may be calculated from: w t i = w t−1 i · p(x t i | Y t ) · L(x t−1 i | x t i ) / ( p(x t−1 i | Y t ) · q(x t i | x t−1 i ) ) where w t i and w t−1 i are the weights for iterations t and t − 1, x t i and x t−1 i are the values of the particles for iterations t and t − 1, p() is the pdf of the state, q() is the proposal distribution defined by the user, Y t is the measurement of the system used in every iteration and L() is a user defined backward kernel. [0026] For a fixed-lag SMC sampler, a new measurement may be collected from the at least one sensor but the previous l measurements may also be used.
The weight may be calculated from: w t i = w t−1 i · p(x t−l:t i | Y t−l:t ) · L(x t−l−1:t−1 i | x t−l:t i ) / ( p(x t−l−1:t−1 i | Y t−l−1:t−1 ) · q(x t−l:t i | x t−l−1:t−1 i ) ) where w t i and w t−1 i are the weights for time steps t and t − 1, x t−l−1:t−1 i are the sampled particles from the old trajectories, x t−l:t i are the sampled particles for the new trajectories, l is the lag, the p() terms are the posteriors for the new and the old trajectories, q() is the proposal distribution defined by the user, Y t−l:t is the set of previously collected measurements of the system and L() is a user defined backward kernel. [0027] We also describe a (non-transitory) computer readable medium carrying processor control code which when implemented in a system causes the system to carry out the method described above. [0028] Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims. Brief description of drawings [0029] For a better understanding of the present techniques, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which: [0030] Figure 1a is a schematic illustration of an infrastructure comprising a distributed memory architecture; [0031] Figure 1b is a schematic illustration of the server layer of Figure 1a; [0032] Figure 2a is a flowchart of a method of implementing the SMC module of the system of Figure 1b; [0033] Figure 2b is a schematic illustration of the steps in a reduction algorithm which is used to calculate sum; [0034] Figure 3 is a schematic illustration of the redistribute step in the method shown in Figure 2a; [0035] Figure 4 is a flowchart of the steps within the rotational nearly sort phase of Figure 3; [0036] Figure 5a is a schematic illustration of the rotational nearly sort algorithm of Figure 4; [0037] Figure 5b is a schematic illustration of the inclusive/exclusive cumulative sum phase which is used in the
rotational nearly sort algorithm of Figure 4; [0038] Figure 6 is a flowchart of the steps within the rotational split phase of Figure 3; [0039] Figure 7 is a schematic illustration of the rotational split algorithm of Figure 6 along with the final sequential redistribute phase; [0040] Figure 8a plots, for N = 2^16 and M = 1 (where M is the particle dimension), the run-time of the redistribute phase of the overall algorithm of Figure 2a against increasing numbers of cores (P) comparing the results for the new RoSS method shown in Figure 3 with two prior art methods labelled B-R and N-R; [0041] Figures 8b and 8c, like Figure 8a, plot the comparative run-time of the redistribute phase for the new method and prior art methods against increasing numbers of cores (P) for N = 2^20 and for N = 2^24, respectively; [0042] Figure 8d plots the run-time of the redistribute phase for the new method and prior art method N-R against increasing numbers of cores (P) for N = 2^24 with measured results shown for P < 256 and expected results shown for P > 256; [0043] Figures 9a to 9c plot the run-time of the overall implementation in a real-world example of the particle filter against increasing numbers of cores (P) comparing the results which incorporate the new RoSS method for the redistribute phase with two prior art methods labelled B-R and N-R, the number of particles increasing for each of Figures 9a to 9c; [0044] Figure 9d plots the run-time of the particle filter for the new method and prior art method N-R against increasing numbers of cores (P) for N = 2^24 with measured results shown for P < 256 and expected results shown for P > 256; [0045] Figure 10a shows the workload percentage for each phase of the overall algorithm as it changes with increasing number of cores (P) for the real-world example and compares the results which incorporate the new RoSS method for the redistribute phase with two prior art methods labelled B-R and N-R; [0046] Figure 10b shows the workload
percentage for each phase of the overall algorithm as it changes with increasing number of cores (P) for a second real-world example, in this case using an SMC sampler, and compares the results which incorporate the new RoSS method for the redistribute phase with two prior art methods labelled B-R and N-R. [0047] Figures 11a to 11c plot the run-time of the overall implementation in the real-world example of the SMC sampler against increasing numbers of cores (P) comparing the results which incorporate the new RoSS method for the redistribute phase with two prior art methods labelled B-R and N-R, the number of particles increasing for each of Figures 11a to 11c; and [0048] Figure 11d plots the run-time of the SMC sampler for the new method and prior art method N-R against increasing numbers of cores (P) for N = 2^24 with measured results shown for P < 256 and expected results shown for P > 256. Detailed description of drawings [0049] Figure 1a schematically illustrates an architecture for implementing the present techniques. The infrastructure comprises four separate layers: the system layer 100, the sensor layer 200, the client layer 300 and the server layer 400. The system layer 100 comprises the physical system which is to be modelled, i.e. the system for which the true state X t is to be estimated. As explained in more detail below, the estimate is obtained using an SMC method which requires a measurement Y t which is somehow related to the true state X t . The measurement Y t is obtained by the sensors in the sensor layer. The sensor layer 200 comprises at least one sensor for measuring parameters of the system to estimate the true state. The sensors may be wireless, including satellites, may be embedded within the physical system or connected by cables. [0050] The client layer 300 may comprise any suitable client device, e.g. a laptop or desktop PC, which may monitor the estimate f t of the state. The client layer 300 is connected both to the sensors, e.g.
via an antenna or other wireless connection, and to the server layer 400. The server layer 400 may be remote from the client layer 300 (i.e. in a different location). As shown, the client layer may request the new measurements from the sensor layer 200 and receive the new measurements from the sensor layer 200. The client layer 300 may then send the received measurements to the server layer 400 together with a model of the physical system 100. In other words, the client layer has two roles: the first is to request and collect the measurements and pass them to the server; the second is to receive a new estimate of the state f t from the server such that it can be printed out for the users. The user may be an engineer who designs the model, i.e. a source code describing X t , Y t , the proposal, the target, the initial distributions and the backward kernel L(). [0051] The server layer 400 may be a supercomputer on which parallel methods may be installed and run. The server layer 400 may be a high performance computing machine comprising hardware cores (or processors, the terms may be used interchangeably) connected via ethernet cables or InfiniBand. In particular, the server layer 400 may run an SMC method as described in more detail below. At each iteration, the server waits for the client to send a new measurement and then runs a full SMC iteration (i.e. IS, normalise, ESS, MVR, redistribute, reset and estimate in sequence) to produce a new estimate f t and send it back to the client. All parallelisation methods (including the one for redistribute which is the focus of the method described below) are installed in the server, since they parallelise tasks which are fully compatible with any model. [0052] The client layer must also send the model to the server and compile it with the rest of the SMC method components to generate the executable file. A job request may then be submitted to the server such that the process can start.
For example, a job request may be done by submitting a bash file containing the path to the executable file, the number of cores and the memory to be allocated, and other required inputs such as N, N T , T SMC , l and variant which are described in more detail below. [0053] Further details of the server layer are shown in Figure 1b. The server comprises a distributed memory architecture (DMA) executing an SMC module 40. DMAs are a type of parallel system which are inherently different from shared memory architectures (SMAs). The DMA comprises a plurality of memory blocks 10, 110, 210 and a plurality of discrete cores 20, 120, 220, connected via a common communication network 30. The depiction of three memory blocks and associated cores is merely illustrative. In this environment, the memory is distributed over the cores and each core 20, 120, 220 can only directly access its own private memory 10, 110, 210. Exchange of information stored in the memory of the other cores is achieved by sending or receiving explicit messages through the common communication network 30. [0054] The main advantages of a DMA relative to an SMA include larger memory and number of cores, scalable memory and computation capability with the number of cores and a guarantee of there being no interference when a core accesses its own memory. The main disadvantages are the cost of communication and the consequent data movement. This may affect the speed-up relative to a single core. [0055] In order to implement the methods (or algorithms, the terms are used interchangeably) discussed below, a Message Passing Interface (MPI) may be used. MPI is one of the most common application programming interfaces (APIs) for DMAs. In this model, the total number of cores is represented by P and each core is uniquely identified by a rank p = 0, 1, …, P − 1. The cores are connected via communicators and use send or receive communication routines to exchange messages. [0056] In this arrangement, the SMC module 40 may be distributed across multiple components.
The SMC module implements a sequential Monte Carlo method and may be used to perform state estimation of dynamic or static models under non-linear, non-Gaussian noise. There are a range of different implementations of SMC methods, including sequential importance resampling (SIR) particle filters. SMC methods apply the importance sampling principle to make Bayesian inferences. [0057] The general idea of SMC methods consists of generating N statistically independent hypotheses called particles or samples at every step t. The population of particles (or samples) x t is sampled from a user-defined proposal distribution such that x t represents the probability density function (pdf) of the state of a model. Each particle x t i is then assigned an unnormalised weight w t i such that the array of weights provides information on which particle best describes the real state of interest. Thus, we have: p(X t | Y t ) ≈ Σ i=1 N w̃ t i δ(X t − x t i ) [0058] As explained below, the methods implemented on this system use the divide-and-conquer paradigm and therefore there are always a power of two number of cores to balance the communication between them. For the same reason, the number of particles N will also always be a power of two number such that the particles x t and every array related to them, e.g. the weights w t , will be equally spread amongst the cores. This means that every core will always own exactly N/P elements of all the listed arrays whose indexes will be spread over the cores in increasing order. More precisely, given a certain N, P pair the i th particle (where 0 ≤ i ≤ N − 1) will always be given to the same core with MPI rank p = ⌊iP/N⌋. The space complexity per core is then O(1) when P = N. [0059] Figure 2a is a flowchart illustrating the method of implementing an SMC method on the DMA. As set out above, the SMC method generates N statistically independent hypotheses called particles (or samples) x t at every given iteration t.
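The block distribution of paragraph [0058] can be sketched as follows; this is an illustrative helper (not from the disclosure), assuming N and P are powers of two and particles are spread over the cores in increasing index order.

```python
def owner_rank(i, N, P):
    """Rank of the core owning particle i when the N particles are spread in
    increasing index order over P cores, N/P particles per core."""
    return i // (N // P)

# With N = 8 particles and P = 4 cores (as in Figure 3):
# particles 0-1 -> rank 0, 2-3 -> rank 1, 4-5 -> rank 2, 6-7 -> rank 3.
print([owner_rank(i, 8, 4) for i in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```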
As shown in Figure 2a, the aim of the final step S118 is to estimate the current true state X t of the dynamic or static system. As explained in more detail below, in the case of SMC samplers only, the estimate is improved by performing a recycling step. Therefore, for brevity, S118 represents the actual estimate operation (which is required in any SMC method) followed by recycling if required. Since the particles (which are possible inferences of the true state) are weighted based on their resemblance with X t , feedback from the system is required to correctly assign a weight to each particle. The initial steps include initialising the weights S100 and obtaining initial values for the particles S102. These two steps may be performed in any order. After that, the SMC iterations start by measuring the system S104 as indicated in Figure 2a. As described in detail below, particle filters and fixed-lag SMC samplers collect a new measurement at every iteration, while SMC samplers simply use the first measurement at all iterations. Therefore, for brevity, S104 is intended to represent generically the measurement strategy of any SMC method. The measurement provides the feedback from the system and at every iteration t a new measurement Y t may be collected. Y t must be some measurable physical quantity which is either part of X t itself or related to X t by some physical law. [0060] At the initial iteration t = 0, no measurement has been collected yet, so initialising the weights may be done by setting each weight to the same value.
The prior data Y 0 may be given and the initial values for the particles may be drawn from the initial distribution because this is the best assumption without feedback, thus we have: x 0 i ~ q 0 (), w 0 i = 1/N, X t ∈ ℝ M , Y t ∈ ℝ D where X t is the current true state of the system, M is the dimension of the state, Y t is a measurement of the system sent to the server at iteration t, D is the dimension of the measurement, w 0 i is the weight at iteration t = 0 to be applied to each particle x 0 i , N is the number of particles, and q 0 () is the initial proposal distribution defined by the user. [0061] Once the initial values are obtained, and for any iteration t > 0, measurements may be collected and the values for the particles may be drawn from the proposal distribution (step S106), for example as defined in equation (1). This phase along with the weighting procedure in equation (2) may be termed importance sampling (IS). The weights may then be computed as a function of x t i , Y t and w t−1 i , and normalised (steps S108 and S110), for example as defined in equations (2) and (3), where equation (2) defines the (variant-specific) unnormalised weight and equation (3) is the normalise step: w̃ t i = w t i / Σ j=1 N w t j , where x t i is a particle at iteration t, q() is the proposal distribution defined by the user, Y t is a measurement of the system, w t i is the weight of the i th particle at iteration t, w̃ t i is the normalised weight of the i th particle at iteration t and N is the number of particles. Equation (3) thus represents a normalise step and is used because the weights need to sum up to 1.0 (i.e. 100% of all probabilities), just like any pdf does. [0062] The particles may be subject to a phenomenon called degeneracy which (within a few iterations) makes all weights but one decrease towards 0. This is because the variance of the weights is proven to increase at every iteration as described in [24]. This is addressed by a resampling phase which may be considered to repopulate the particles by eliminating the most negligible ones and duplicating the most important ones.
This resampling phase may not always be triggered and for example may be only triggered when needed, e.g. when the (approximate) effective sample size N eff is below a threshold N T (e.g. N T = N/2). There is thus a decision at step S111 as to whether resampling is required, and this requires a computation of the effective sample size (ESS) which is defined by: N eff = 1 / Σ i=1 N (w̃ t i ) 2 (4) [0063] In some cases, the user may purposely decide to perform resampling always. This can be done by setting N T = N + 1. In this case, SMC is typically found in the literature under the name of Bootstrap Filter. Several (biased or unbiased) resampling schemes exist [25], [26] but they all repopulate the particles in three steps. [0064] When it is determined that resampling is required, the first step (step S112) of the resampling may be to define the number of copies of the i th particle which are required (ncopies i ). Various techniques may be used to generate ncopies i but regardless of the method of generation, the following statements are true: ncopies i ≥ 0 for every i and Σ i=1 N ncopies i = N. [0065] One method of performing this step is Systematic Resampling, as described in [25], but is also termed Minimum Variance Resampling (MVR) in several referenced works, e.g. [12], [19], and [21]-[23]. The key idea of MVR is to first compute the cumulative density function (CDF) of the normalised weights, then draw u ~ [0,1) from a uniform distribution and finally compute each ncopies i as follows: ncopies i = ⌈N · cdf i − u⌉ − ⌈N · cdf i−1 − u⌉ (6) The bracket operator is the ceiling function which rounds up the input number to the next integer, unless the input number is an integer already. [0066] Once the number of copies is determined, the second step of resampling is a redistribute step (S114) in which each particle is duplicated as many times as ncopies i . The detail of the redistribute step is described in more detail below. The final step of resampling is to reset all the weights to the initial value (i.e. 1/N).
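The ESS test and the MVR computation can be sketched sequentially as follows; this is an illustrative single-core sketch (not part of the disclosure), assuming the systematic-resampling form ncopies i = ⌈N·cdf i − u⌉ − ⌈N·cdf i−1 − u⌉ and that the weights are already normalised.

```python
import math

def effective_sample_size(norm_weights):
    """N_eff = 1 / sum_i (w~_i)^2 for normalised weights."""
    return 1.0 / sum(w * w for w in norm_weights)

def mvr(norm_weights, u):
    """Minimum Variance (systematic) Resampling: each particle receives
    ncopies_i = ceil(N*cdf_i - u) - ceil(N*cdf_{i-1} - u), with u ~ U[0, 1)."""
    n = len(norm_weights)
    ncopies, cdf, prev = [], 0.0, 0  # prev = ceil(N*cdf_{-1} - u) = 0
    for w in norm_weights:
        cdf += w
        cur = math.ceil(n * cdf - u)
        ncopies.append(cur - prev)
        prev = cur
    return ncopies

# Uniform weights leave every particle with exactly one copy, and the ESS
# equals N, so resampling would not be triggered for a threshold of N/2.
print(mvr([0.25, 0.25, 0.25, 0.25], 0.5))  # [1, 1, 1, 1]
```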
[0067] After resampling, every iteration in the SMC method is completed by making an estimate f t of the true state X t by computing the weighted mean of the particles x t as follows: f t = Σ i=1 N w̃ t i x t i (7) [0068] This procedure goes on for T SMC iterations. [0069] There are various embodiments of SMC which may be used, and three variants are described in detail below. The main difference between each variant is the equation defining the weights – equation (2) above. [0070] Particle filters (PFs) are described for example in [24] and are SMC methods that work well on dynamic models, which are typically used in real-time contexts where the true state changes frequently. In the PF, the SMC iterations are commonly called time steps. A new measurement Y t is available at each time step and thus the client and sensor layers of Figure 1a communicate frequently. For example, if we are interested in using particle filters to predict the traffic level over time in a busy city, the sensors, e.g. satellites, may obtain the positions of vehicles (see [27]). Another example could be an application of particle filters to estimate the state of a power system for security purposes which requires cabled sensors to measure the voltage of generators (see [28]). The description below in relation to Figures 9a to 9d shows the results of a particle filter estimating the state of a vacuum arc remelting device for alloy production. [0071] In this SMC variant the weight equation is: w t i = w t−1 i · p(x t i , Y t | x t−1 i ) / q(x t i | x t−1 i ) where the p() term, commonly called target or posterior distribution, is computed as the product of p(x t i | x t−1 i ), the prior distribution, and p(Y t | x t i ), the likelihood distribution. Thus, the weight equation may be expressed as: w t i = w t−1 i · p(x t i | x t−1 i ) · p(Y t | x t i ) / q(x t i | x t−1 i ) where x t i and x t−1 i are particles at time t and t − 1, q() is the proposal distribution defined by the user, Y t is a measurement of the dynamic system collected at time t, and w t i and w t−1 i are the weights of the i th particle at time t and t − 1.
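The likelihood-prior-proposal structure of the particle filter weight update can be illustrated with a hypothetical one-dimensional Gaussian random-walk model; the model, variances and function names below are assumptions for illustration only and are not part of the disclosure.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def pf_weight(w_prev, x_t, x_prev, y_t, p_var=1.0, r_var=1.0, q_var=1.0):
    """One particle filter weight update, w_t = w_{t-1} * prior * likelihood /
    proposal, for a hypothetical 1-D random-walk model:
    prior p(x_t | x_{t-1}) = N(x_{t-1}, p_var),
    likelihood p(Y_t | x_t) = N(x_t, r_var),
    proposal q(x_t | x_{t-1}) = N(x_{t-1}, q_var)."""
    prior = gaussian_pdf(x_t, x_prev, p_var)
    likelihood = gaussian_pdf(y_t, x_t, r_var)
    proposal = gaussian_pdf(x_t, x_prev, q_var)
    return w_prev * prior * likelihood / proposal
```

With the proposal chosen equal to the prior (the bootstrap choice, q_var = p_var), the prior and proposal cancel and the update reduces to multiplying the previous weight by the likelihood.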
[0072] Particle filters typically work well when the sensor layer always provides low-latency measurements. However, that is not always possible due to time delays in the external environment. Fixed-lag (FL) SMC, such as described in [29], is usually used rather than particle filters in this scenario. In FL SMC, a new measurement Y t is collected from the sensors and sent to the server. However, the server will also keep in memory the previous l measurements Y t−l:t−1 , where l is commonly called the lag. Hence l + 1 measurements Y t−l:t are considered every time, such that the filter can retrospectively reprocess the previous measurements in light of that most recently received. Therefore, instead of sampling N particles x t i , FL SMC samples N trajectories of l + 1 particles from the old trajectories as follows: x t−l:t i ~ q(x t−l:t i | x t−l−1:t−1 i ) (9) where x t−l−1:t−1 i are the old trajectories. [0073] Particle filters may be intuitively viewed as FL SMC with l = 0. Therefore, the equation above can be substituted in the generic algorithm which is installed in the server. Depending on the chosen SMC variant, l is automatically set to 0 to work in a general-purpose way. The particle trajectories are weighted by the following equation which defines the weights: w t i = w t−1 i · p(x t−l:t i | Y t−l:t ) · L(x t−l−1:t−1 i | x t−l:t i ) / ( p(x t−l−1:t−1 i | Y t−l−1:t−1 ) · q(x t−l:t i | x t−l−1:t−1 i ) ) where L() is a user defined backward kernel, which describes the correlation between the old trajectory and the new trajectory. Here the p() terms can also be expressed as likelihood-prior products as for the particle filter, but this is omitted for brevity. [0074] SMC samplers as described in [30] work well on static models, hence those scenarios where the true state is assumed to be constant for a considerably large amount of time, such that very frequent measurements (e.g. every few seconds as in real-time applications) will not lead to a significantly different state estimation. SMC samplers then collect a single measurement at t = 1.
During the SMC iterations, the same data is re-proposed to the server by the client and each new estimate f t is simply a more accurate state estimate than it was in the previous SMC step. In other words, one can think of SMC samplers as particle filters which are specialised in maximising the state estimation accuracy given a single measurement, which requires lots of computation and hence is only practical if the application is off-line. Here the weight equation is calculated as follows: w t i = w t−1 i · p(x t i | Y t ) · L(x t−1 i | x t i ) / ( p(x t−1 i | Y t ) · q(x t i | x t−1 i ) ) Once again, the p() terms can also be expressed as likelihood-prior products as for the PF, but this is omitted for brevity. After the SMC iterations, SMC samplers optimise each state estimate, f t , such that, at the t th iteration, the output takes the following value: f t ← Σ τ=1 t c τ f τ / Σ τ=1 t c τ (12) where the normalisation constants c τ may be computed from the sums of the unnormalised weights as in equation (13). This optimisation technique is called recycling and was first proposed in the context of SMC samplers in [31]. To be efficient, this operation is computed recursively at each SMC iteration, i.e. by updating step by step its numerator, nsum, and denominator, dsum, as otherwise it would require the cores to hold a lot of state information. [0075] SMC samplers find application in several domains, such as parameter tuning for system design or medical applications for disease diagnosis [32], [33], where the state is assumed to be constant on the testing day. The description below in relation to Figures 11a to 11d shows the results of an SMC sampler estimating the water runoff levels in a hydraulic system.
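The recursive update of the recycling numerator nsum and denominator dsum can be sketched as follows; this is an illustrative single-core sketch (not part of the disclosure), and the streaming interface taking (estimate, constant) pairs is an assumption.

```python
def recycling_stream(pairs):
    """Incrementally maintain nsum = sum_tau c_tau * f_tau and
    dsum = sum_tau c_tau, yielding the recycled estimate nsum/dsum
    after each SMC iteration (the running form of equation (12))."""
    nsum = dsum = 0.0
    estimates = []
    for f_t, c_t in pairs:  # (raw estimate, normalisation constant) per iteration
        nsum += f_t * c_t
        dsum += c_t
        estimates.append(nsum / dsum)
    return estimates

# Two iterations with equal constants simply average the raw estimates:
print(recycling_stream([(1.0, 1.0), (3.0, 1.0)]))  # [1.0, 2.0]
```

Keeping only nsum and dsum between iterations is what avoids the cores holding the full history of estimates and constants.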
[0076] The following is an example of the pseudo code for an algorithm which may be used to implement the SMC method described in Figure 2a:

Algorithm 1 – Sequential Monte Carlo
Input: T SMC , N, N T , l, variant, model
Output: f t
1: Y 0 ← Initialise_Measurement()
2: x 0 , w 0 ← Initialise(model)
3: if variant ≠ FL-SMC then
4:     l ← 0, only FL-SMC uses particle and data trajectories
5: end if
6: for t ← 1; t < T SMC ; t ← t + 1 do
7:     if variant ≠ SMC-S or t = 1 then
8:         Y t ← New_Measurement()
9:     else
10:         Y t ← Y 1 , measurement is constant in SMC-S
11:     end if
12:     if variant = SMC-S then compute the following for Recycling
13:         wsum ← MPI_Allreduce(w t−1 , …), sum of w t−1 in (13)
14:     end if
15:     x t−l:t , w t ← IS(model, x t−l:t−1 , w t−1 , Y t−l:t , variant), see (9) and all variants of (2)
16:     w̃ t ← Normalise(w t ), see (3)
17:     N eff ← ESS(w̃ t ), see (4)
18:     if N eff < N T then perform resampling
19:         ncopies ← MVR(w̃ t ), see (6)
20:         x t−l:t ← Redistribute(x t−l:t , ncopies, N)
21:         w t ← Reset(), reset each weight to 1/N
22:     end if
23:     f t ← Estimate(x t ), see (7)
24:     if variant = SMC-S then improve estimate with recycling
25:         c t ← normalisation constant, see (13)
26:         if t = 1 then initialise numerator and denominator in (12)
27:             nsum ← f t c t , dsum ← c t
28:         else update numerator and denominator in (12)
29:             nsum ← nsum + f t c t , dsum ← dsum + c t
30:         end if
31:         f t ← Recycling(nsum, dsum), see (12)
32:     end if
33: end for

After an initialise step, each iteration produces a new state estimate given one or more measurements. Each iteration requires the following steps (or components): importance sampling (IS), normalise, ESS, resampling and estimate (followed by recycling in the case of SMC samplers). As explained above, particle filters (PF), SMC samplers and fixed-lag SMC are examples of SMC methods. For particle filters, there is one measurement per iteration and resampling is typically only used if needed. For SMC samplers, only one measurement is collected, and the iterations are run given the same measurement.
The method finalises estimates with recycling and resampling is used only if needed. For fixed-lag SMC, a history of l ≥ 1 measurements is used at each iteration. The method samples and resamples trajectories of l + 1 particles. For each of particle filters, SMC samplers and fixed-lag SMC there is a variant in which resampling is always used, and this is called the Bootstrap Filter. [0077] Each of the components of the SMC module may be parallelised as explained below by one of four parallelisation methods: embarrassingly parallel, reduction, cumulative sum and fully-balanced redistribute (such as the new RoSS method). In the process described above, the steps reset, initialise, equations (6), (9) and all variants of (2) are parallelisable using the embarrassingly parallel approach because the cores can work simultaneously and independently on each i th element of these operations. They are thus element-wise operations and achieve O(1) time complexity for P = N cores. The steps represented by equations (3), (4), (7) and (13) require sum and can be easily parallelised by using reduction. The time complexity of any reduction operation scales logarithmically with the number of cores, more precisely as O(N/P + log 2 P). On MPI, reduction can be computed by calling MPI_Reduce or MPI_Allreduce. An example of a reduction to compute sum for N = 8 elements and P = 4 MPI cores is shown in Figure 2b. [0078] Calculating the cumulative density function (CDF) of the weights requires cumulative sum which is also termed prefix sum or scan in the literature. Parallel cumulative sum scales as above. On MPI, parallel cumulative sum can be computed by calling MPI_Scan or MPI_Exscan as described for example in [34]. An example for N = 8 elements and P = 4 MPI cores is shown in Figure 5b. The redistribute step used in the process above and described below in detail is a new fully-balanced and deterministic redistribute step.
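The inclusive and exclusive cumulative sums which MPI_Scan and MPI_Exscan parallelise can be sketched sequentially as follows; an illustrative single-array reference (not part of the disclosure) for checking a parallel implementation.

```python
def cumulative_sums(values):
    """Return (inclusive, exclusive) prefix sums over a single array, i.e. the
    sequential counterparts of MPI_Scan and MPI_Exscan."""
    inclusive, exclusive, total = [], [], 0
    for v in values:
        exclusive.append(total)  # sum of all strictly preceding elements
        total += v
        inclusive.append(total)  # sum up to and including this element
    return inclusive, exclusive

print(cumulative_sums([1, 2, 3, 4]))  # ([1, 3, 6, 10], [0, 1, 3, 6])
```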
By contrast, some known techniques which are used in the redistribute step are impossible to parallelise in an element-wise fashion. For example, a textbook implementation of redistribute is described as sequential redistribute (S-R) and the algorithm is shown in the following pseudo-code:

Algorithm 2 – Sequential Redistribute (S-R)
Input: x, ncopies, n
Output: x new
1: i ← 0, initialise the index of the output array
2: for j ← 0; j < n; j ← j + 1 do
3:     for k ← 0; k < ncopies j ; k ← k + 1 do
4:         x new i ← x j , the particle is duplicated
5:         i ← i + 1, update the index of the output array
6:     end for
7: end for

It is impossible to simply divide the iteration space equally across the cores because each ncopies i randomly changes at every iteration t as it may be equal to any integer number between 0 and N. Therefore, an element-wise parallelisation would be extremely unbalanced. On DMAs, parallelisation is even more problematic because the cores can only directly access their own private memory. [0079] The fully-balanced redistribute step of the present techniques has a perfectly balanced network where the cores deterministically perform the same computation and send/receive the same number of messages. Since the result is exactly the same as S-R, a resampling step using the fully-balanced redistribute process described below provides an exact result. A fully-balanced redistribute step is one in which the cores deterministically perform the same computation and send/receive the same number of messages. An alternative way of parallelising the redistribute step is a central unit approach. In this approach, the central unit(s) may be overworked in comparison with the remaining cores because they perform extra duties, such as decision making, particle duplication, particle routing to the other cores or a combination of them.
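Algorithm 2 (S-R) above can be rendered in runnable form as follows; a minimal sketch (not part of the disclosure) with Python lists standing in for the particle arrays, checked against the Figure 3 example.

```python
def sequential_redistribute(x, ncopies):
    """Duplicate each particle x[j] exactly ncopies[j] times, preserving order
    (the textbook sequential redistribute of Algorithm 2)."""
    x_new = []
    for particle, copies in zip(x, ncopies):
        x_new.extend([particle] * copies)  # inner loop of Algorithm 2
    return x_new

# Figure 3 example: one copy of x3 ('D'), five of x4 ('E'), two of x6 ('G').
x = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
print(sequential_redistribute(x, [0, 0, 0, 1, 5, 0, 2, 0]))
# ['D', 'E', 'E', 'E', 'E', 'E', 'G', 'G']
```

Because the output length contributed by each j is data-dependent, slicing the j loop evenly over cores gives unbalanced work, which is the motivation for the fully-balanced approach.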
Some central unit approaches may sacrifice accuracy to have faster communication between the cores [20], or they might be non-deterministic [14], [15], which potentially translates to little or no scalability when, in the worst case, the workload is centralised to a single core [19]. [0080] As shown below, the redistribution in the new method happens in logarithmic time complexity by using a three-phase approach where the first two phases take O(log 2 N) steps. After that, S-R can be performed locally in constant time, as explained below. Figure 3 illustrates the goal of each phase through a practical example with N = 8 particles and P = 4 MPI cores. In Figure 3, the value of each particle x i (where the subscript t is omitted for brevity) is denoted by a capital letter for brevity but it is noted that the value is actually a vector of M real numbers. [0081] Before the redistribute step, the values of each of the eight particles x 0 to x 7 are distributed across the cores in order, i.e. the lowest ranked core p = 0 has the values of the two lowest numbered particles x 0 and x 1 . The number of copies of each particle (ncopies i ) is determined as described above and is stored with the value of the particle. As shown, zero copies of particles x 0 , x 1 , x 2 , x 5 and x 7 are required together with 1 copy of particle x 3 , 5 copies of particle x 4 , and 2 copies of particle x 6 . In the first phase of the redistribute step, all particles which must be duplicated are moved to the left (i.e. moved to cores having lower ranking) using a technique termed rotational nearly sort. This also automatically creates empty spaces, or gaps, on the higher ranked cores, that can be filled with particles to duplicate. In other words, the purpose of this phase is to ensure that the particles which are to be duplicated are separated from those which are not to be duplicated.
As shown in Figure 3, after rotational nearly sort, the particles x 3 , x 4 and x 6 along with their number of copies to create are moved to the two lowest ranked cores. [0082] The next phase of the redistribute step is to fill the empty spaces on the higher ranked cores with the required number of copies. However, instead of directly filling the gaps, the next phase is termed rotational split and creates room to the right (i.e. in the next highest core) of each particle value to duplicate and fills the gaps at the same time. [0083] The final phase, illustrated in Figure 7, is S-R and is performed after the rotational split phase. This final phase is to redistribute the particles within each MPI core. This way, all the remaining gaps from rotational split will be filled. Thus, as shown, the two copies of the value of particle x 6 are split across the highest ranked core (p = 3) which is being used. [0084] Figure 4 shows the approach of the rotational nearly sort phase which may be described qualitatively as follows. The strategy for getting every particle that must be duplicated to the left is to first count the zeros that each of these particles can see on their left side (step S200), i.e. determining how many particles in lower ranked cores are not to be duplicated. The number of zeros is the number of positions to shift by to the left and thus the terms zeros and shifts may be used interchangeably in this context. Referring to Figure 3, particles x 3 and x 4 see 3 zeros on their left while particle x 6 sees 4 zeros. In the next step S202, the number of zeros may be expressed as a binary number, for example 3 = (011) 2 . The next step is to scan a bit of the binary number (step S204). The scanning starts with the least significant bit (LSB) and moves to the most significant bit (MSB). There is a determination (step S206) as to whether the scanned bit is equal to 1.
If the bit is equal to 1, the value of the particle will be rotated to the left (step S208) by a number of positions which is based on the bit which has been scanned. If the bit is equal to 0, there is no rotation (step S210). If there are more bits, as determined in step S212, the next bit is scanned and steps S204 to S212 are repeated until there are no more bits. [0085] In other words, at every iteration, the bits will be scanned and rotations to the left will be performed if and only if the scanned bit is 1. For the example of particle x_3, at the first iteration, the bit is 1 and thus there is rotation to the left by one position because this is the first iteration. In the second iteration, the relevant bit is also equal to 1 and thus again there is rotation, but in this case by two positions because this is the second bit (and second iteration). In the third (and final) iteration, the relevant bit is 0 and thus there is no rotation. Since all particles shift to the left and follow an LSB to MSB order strategy, it is impossible that two consecutive particles for which ncopies_i > 0 will collide, because during the rotations they will at most catch up with each other. For example, particles x_4 and x_6 are separated by a zero and must rotate by three and four positions respectively. However, by the time x_6 has to rotate by four, x_4 will have already rotated by three positions. [0086] The following is an example of an algorithm which may be used to implement the rotational nearly sort process described in Figure 4:

Algorithm 5 – Rotational Nearly Sort
Input: x, ncopies / Output: x, ncopies
1: x, ncopies, zeros ← S-NS(x, ncopies, n), see Algorithm 3
2: shifts ← MPI_Exscan(zeros, …), compute shifts
3: if P < N then, perform leaf stage of the binary tree
4:   partner ← (p − 1) & (P − 1), i.e. neighbour
5:   if (shifts & (n − 1)) > 0 then, at least one of the LSBs is 1
6:     for i ← 0; i < n; i ← i + 1 do
7:       if i < shifts & (n − 1) then
8:         Send x_i, ncopies_i to partner
9:         ncopies_i ← 0, delete the sent particle
10:      else
11:        Shift particle to the left by shifts & (n − 1) positions
12:      end if
13:    end for
14:    Send shifts − shifts & (n − 1) (number of shifts left) to partner
15:    shifts ← shifts − shifts & (n − 1), update shifts
16:  else
17:    Send an array of −1s to partner (message to reject)
18:  end if
19:  Accept or reject the received particles and shifts
20: end if
21: for k ← 1; k ≤ log2 P; k ← k + 1 do, binary tree
22:   partner ← (p − 2^(k−1)) & (P − 1)
23:   if (shifts & n2^(k−1)) > 0 then, the new LSB of shifts is 1
24:     for i ← 0; i < n; i ← i + 1 do
25:       Send x_i, ncopies_i to partner
26:       ncopies_i ← 0, delete the sent particle
27:     end for
28:     Send shifts − shifts & n2^(k−1) (number of shifts left) to partner
29:     shifts ← shifts − shifts & n2^(k−1), update shifts
30:   else
31:     Send an array of −1s to partner (message to reject)
32:   end if
33:   Accept or reject the received particles and shifts
34: end for

[0087] Figure 5a illustrates a correct and efficient implementation of the rotational nearly sort phase which is described technically below. As set out in Figure 4, the first step of the phase is to determine the number of zeros in the lower ranked cores. In the detailed implementation of Figure 5a, this is implemented in two stages.
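The bit-scanning strategy underlying Algorithm 5 can be illustrated with a serial Python sketch (illustrative only; the real implementation rotates across MPI cores, and the function names below are hypothetical). It decomposes each particle's shift count into LSB-to-MSB power-of-two rotations and applies them, reproducing the layout Figure 3 shows after rotational nearly sort:

```python
def bit_rotations(shift):
    """Decompose a left-shift count into power-of-two rotations, LSB to MSB."""
    return [1 << b for b in range(shift.bit_length()) if (shift >> b) & 1]

def rotational_nearly_sort_serial(x, ncopies):
    """Serial simulation: particles with ncopies > 0 move left past the zeros."""
    n = len(x)
    # Step S200: zeros (= shifts) seen on the left by each particle to duplicate.
    shifts, zeros = [], 0
    for c in ncopies:
        shifts.append(zeros if c > 0 else 0)
        zeros += (c == 0)
    # Steps S204-S212: apply the LSB-to-MSB power-of-two rotations.
    pos = list(range(n))
    for i in range(n):
        for step in bit_rotations(shifts[i]):
            pos[i] -= step
    out_x, out_c = [None] * n, [0] * n
    for i in range(n):
        if ncopies[i] > 0:
            out_x[pos[i]], out_c[pos[i]] = x[i], ncopies[i]
    return out_x, out_c

# Figure 3 example: 3 = (011)_2 means rotate by 1 position, then by 2.
assert bit_rotations(3) == [1, 2] and bit_rotations(4) == [4]
out_x, out_c = rotational_nearly_sort_serial(
    ["A", "B", "C", "D", "E", "F", "G", "H"], [0, 0, 0, 1, 5, 0, 2, 0])
assert out_x[:3] == ["D", "E", "G"]       # duplicated particles packed leftwards
assert out_c == [1, 5, 2, 0, 0, 0, 0, 0]  # shape of equation (15)
```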
In a first stage, a nearly sort routine is applied to the particles within the memory of each core, for example by calling a routine such as the sequential nearly sort (S-NS) illustrated in the following pseudo-code:

Algorithm 3 – Sequential Nearly Sort (S-NS)
Input: x, ncopies, n / Output: x_new, ncopies_new, zeros
1: left ← 0, right ← n − 1, initialise left and right indexes
2: for i ← 0; i < n; i ← i + 1 do
3:   if ncopies_i > 0 then, copy the particle to the left
4:     x_new[left] ← x_i, ncopies_new[left] ← ncopies_i
5:     left ← left + 1, update the left index
6:   else, copy the particle to the right
7:     x_new[right] ← x_i, ncopies_new[right] ← 0
8:     right ← right − 1, update the right index
9:   end if
10: end for
11: zeros ← n − left, compute the number of zeros

Such a routine only takes n iterations. As shown in Figure 5a, the particles are moved to the left of the core in which they are located. For example, particle x_3 is moved from the right side to the left side of the core p = 1. [0088] It is noted that at this point, it is possible to prove that the particles within core p must shift to the left by as many positions as the number of zero elements in ncopies owned by the cores with a lower rank. If zeros is the array which counts the number of ncopies_i = 0 within each core, each element of shifts can be initialised as follows:

shifts_p = zeros_0 + zeros_1 + … + zeros_{p−1}, with 0 ≤ zeros_i ≤ N, ∀i and 0 ≤ shifts_i ≤ N, ∀i (14)

where shifts is the array which keeps track of the remaining shifts, zeros is the array which counts the zeros within each core and p is the rank of the core. [0089] The next stage shown in Figure 5a is to parallelise equation (14) using a parallel exclusive cumulative sum after each core p has initialised zeros_p to the count of zeros within its memory, at the end of the routine S-NS. Figure 5b illustrates an example of the process for exclusive cumulative sum. The first stage is to calculate:

csum_j ← csum_{j−1} + array_j, ∀j = 1, 2, …

As we can see, for core p = 0 and j = 0 within the core, the array has a value of 2 and thus the value for the first location in csum is also 2.
All the other values in the array are 0 and thus the value for each entry in csum is 2. In the next step, the final value from csum is duplicated to form {coreSum, sum}. The cores are partnered, sum is exchanged with the partner and the received value is added (as in the reduction operation). For the higher ranked partner, the received value is also added to the value of coreSum. In the final stage:

csum_j ← csum_{j−1} + array_{j−1}, ∀j = 1, 2, …

[0090] Returning to Figure 5a, we have 2 shifts for core p = 1, 3 shifts for core p = 2 and 4 shifts for core p = 3; the core p = 0 will never perform shifts since it is the leftmost by definition. Once the number of shifts has been calculated, as set out in Figure 4, the number is expressed in binary notation and the particles, together with the number of remaining shifts needed, are shifted by increasing power-of-two numbers of positions, depending on the bits of the binary number (shifts_p). This translates to a bottom-up binary tree structure which will complete the task in O(log2 N) time. For the example, we have 2 = (010)_2 shifts for core p = 1, 3 = (011)_2 shifts for core p = 2 and 4 = (100)_2 shifts for core p = 3. [0091] If P < N, as shown in Figure 5a, a leaf shift stage may be performed before starting the actual tree-structure routine described below. In this additional leaf shift stage, the MPI cores send all the particles to their neighbour core when the bitwise & of shifts_p and n − 1 is positive. In simple terms, the leaf stage takes care at once of the rotations referring to the first bits of shifts_p. For the example in Figure 5a, the number of cores is 4 and the number of particles is 8, and thus the additional leaf shift stage may be used. In this example, only the value E, for which 5 copies are required, is moved from the core p = 2 to its neighbouring core p = 1. [0092] Examples of the bitwise & are shown below. In the first example, N = 64 and P = 8, such that n = N/P = 8. In this example, we wish to mask the LSBs of a number.
If the input number is 43, this is written in binary as (101011)_2. The mask is equal to n − 1, i.e. 7, which in binary is (000111)_2. The bitwise & result is thus (000011)_2. In the second example, we select the δth bit of a number for δ = 1, 2, …, log2 N. In this example, δ = 4 and N = 64. The input number is 43, which is written in binary as (101011)_2. The mask is equal to 2^(δ−1), i.e. 8, which in binary is (001000)_2. The bitwise & result is thus (001000)_2. In both examples, the & operation of two bits equals 1 only if both bits are 1, otherwise it equals 0. [0093] After the optional leaf stage, the actual tree-structure routine can start. At every kth stage of the tree (k = 1, 2, …, log2 P), any core p will send to its partner p − 2^(k−1) all its particles (i.e. x and ncopies) and shifts_p (i.e. the number of remaining shifts after the current rotation), if and only if:

(shifts_p & n2^(k−1)) > 0

[0094] This corresponds to checking a new bit of shifts_p, precisely the one which is significant to this stage. At every stage, the particles will then shift by an increasing power-of-two number of positions and shifts get updated in O(1). Thus, as shown in Figure 5a, when k = 1, each particle in core p = 1 is shifted by 2 and the particle in core p = 3 is not moved. When k = 2, the particles which are moved are shifted by 4, and thus as shown the particle in core p = 3 is moved. Therefore, since shifts_p ≤ N − 1, because a particle must shift at most from one end to the other end, the overall achieved time complexity is equal to O(log2 N). At the end of the rotational nearly sort phase, as shown in Figure 3 and Figure 5a, ncopies has the following shape:

ncopies = [λ_0, λ_1, …, λ_{m−1}, 0, …, 0] (15)

[0095] Figure 6 shows the approach of the rotational split phase which may be described qualitatively as follows.
In order to have enough gaps for the particles to be safely duplicated, for each ncopies_i > 0 there would need to be ncopies_i − 1 zeros following on the right (i.e. available in the higher ranked cores) to anticipate the space needed for the duplicates of the associated particle. It is also necessary to split and place copies of the particle x_i onto the zeros that follow, in order to progressively balance the workload across the cores. Both these aims may be achieved by computing two quantities, namely the minimum shifts for each particle x_i to securely distance itself from all the particles to its left, and the maximum splits and shifts for the copies of x_i to safely end up in one of the zeros that follow. [0096] The minimum number of shifts may be calculated (step S300) by summing up all the (ncopies_i − 1) terms on the left of the ith particle. Thus, for the example in Figure 3 and the particle in position i = 2, the minimum number of shifts is (ncopies_0 − 1) + (ncopies_1 − 1), i.e. (1 − 1) + (5 − 1) = 4. The maximum number of splits and shifts which are required may be calculated (step S302) by summing the minimum number of shifts with the value of (ncopies_i − 1). For the example in Figure 3 and the particle in position i = 2, the minimum number of shifts is 4 and (ncopies_2 − 1) = 1, such that the maximum number of shifts is then 4 + 1 = 5. Similarly, for the particle in position i = 1, the minimum number of shifts is (ncopies_0 − 1) = 0 and the maximum number of shifts is (ncopies_0 − 1) + (ncopies_1 − 1) = 4. [0097] The next steps of the method are to express the minimum and maximum as binary numbers (steps S304, S306). Thus, the minimum and maximum for the particle in position i = 2 are expressed as (100)_2 and (101)_2, respectively. The minimum and maximum for the particle in position i = 1 are expressed as (000)_2 and (100)_2, respectively.
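The minimum and maximum shift counts of steps S300 and S302 can be computed directly from ncopies; the short serial sketch below (illustrative only, with a hypothetical function name) reproduces the values quoted above for the nearly-sorted Figure 3 example:

```python
def min_max_shifts(ncopies, i):
    """Steps S300/S302: minimum and maximum right-shifts for particle i.

    min_shifts is the sum of (ncopies_j - 1) over the particles j < i that
    are to be duplicated; max_shifts adds this particle's own surplus copies.
    """
    min_shifts = sum(c - 1 for c in ncopies[:i] if c > 0)
    max_shifts = min_shifts + ncopies[i] - 1
    return min_shifts, max_shifts

# ncopies after rotational nearly sort in the Figure 3 example (equation (15)).
ncopies = [1, 5, 2, 0, 0, 0, 0, 0]
assert min_max_shifts(ncopies, 2) == (4, 5)   # (100)_2 and (101)_2
assert min_max_shifts(ncopies, 1) == (0, 4)   # (000)_2 and (100)_2
```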
The binary numbers are then scanned and rotations are based on the scanned bits, as in the rotational nearly sort process. However, in the rotational split process, the binary numbers are scanned in the opposite direction. In other words, progressively smaller power-of-two rotations to the right are carried out depending on the MSB of the binary notation for the minimum and maximum shifts. This is because the particles are to be scattered rather than gathered together. For each particle, three scenarios may occur: none of the copies must move; all of them must rotate; or only some must split and shift while the remaining ones keep still. [0098] The scanning and moving may be implemented as shown in steps S310 onwards. For example, there is a determination as to whether one or both of the scanned bits in each of the binary numbers is 1 (step S310). The particle does not move if neither scanned bit is 1 (step S314). If at least one scanned bit is 1, there is a determination as to whether both bits are 1 (step S312). If both scanned bits are 1, all copies are moved to the right by the number of positions corresponding to the bit which has been scanned (step S316). For example, for the particle in position i = 2, the MSB of both the maximum and minimum shifts expressed in binary notation is 1, and therefore all copies of x_2 are misplaced and must rotate by the number of positions corresponding to the bit which has been scanned, e.g. 2^2 = 4. [0099] If only one of the scanned bits is 1, some copies must be split and rotated while the others must stay in place. The number of copies to be split is equal to how many duplicates of x_i are in excess and should stay at least as many positions ahead as the rotations related to the scanned bits. The number of excess duplicates may be obtained (step S318) by subtracting from the sum of all ncopies_i up to x_i the new position after this round of rotation (equal to i plus the rotations to perform).
For example, for the particle in position i = 1, the MSBs of the maximum and minimum shifts are different and the number of excess duplicates is ncopies_0 + ncopies_1 − (i + 4) = 1. The excess duplicates are then split and rotated (step S320). In this example, the single excess duplicate is copied and rotated by 4 positions and the other copies are maintained in place. [00100] Regardless of whether all, some or no copies are moved, if there are more bits, as determined in step S322, the next bit is scanned and steps S310 to S322 are repeated until there are no more bits. As explained in more detail below, it is necessary to use a final process, after rotational split, in order to redistribute within each MPI core. [00101] The following is an example of an algorithm which may be used to implement the rotational split process described in Figure 6:

Algorithm 6 – Rotational Split
Input: x, ncopies, N, P / Output: x, ncopies
1: csum ← MPI_Cumulative_Sum(ncopies, N, P)
2: min_shifts_i ← csum_i − ncopies_i − i, ∀i = 1, 2, …, n − 1 if ncopies_i > 0
3: max_shifts_i ← csum_i − i − 1, ∀i = 1, 2, …, n − 1 if ncopies_i > 0
4: for k ← 1; k ≤ log2 P; k ← k + 1 do, binary tree
5:   partner ← (p + P·2^(−k)) & (P − 1)
6:   for i ← 0; i < n; i ← i + 1 do
7:     if (max_shifts_i & N·2^(−k)) > 0 then, at least one MSB is 1
8:       if (min_shifts_i & N·2^(−k)) > 0 then, two MSBs are 1
9:         copies_to_send_i ← ncopies_i
10:        ncopies_i ← 0
11:      else, only one MSB is 1; compute particles in excess
12:        copies_to_send_i ← csum_i − i − N·2^(−k)
13:        ncopies_i ← ncopies_i − copies_to_send_i
14:      end if
15:      starter ← csum_i − copies_to_send_i, if i is first
16:      Send x_i, copies_to_send_i to partner, and send also starter if i is first
17:    else
18:      Send −1 to partner (message to reject)
19:    end if
20:  end for
21:  Accept or reject the received particles and starter.
     Reset starter to 0 if all particles are sent and none is accepted
22:  csum_0 ← starter + ncopies_0, re-initialise csum
23:  csum_i ← csum_{i−1} + ncopies_i, ∀i = 1, 2, …, n − 1, update csum
24:  Update min_shifts and max_shifts as in steps 2 and 3
25: end for
26: if P < N then, perform leaf stage of the binary tree
27:   for i ← n − 1; i ≥ 0; i ← i − 1 do
28:     if csum_i > n(p + 1) then, do inter-core shifts
29:       copies_to_send_i ← min(csum_i − n(p + 1), ncopies_i)
30:       ncopies_i ← ncopies_i − copies_to_send_i
31:       Send x_i, copies_to_send_i to partner
32:     else
33:       Send −1 to partner (message to reject)
34:     end if
35:     if min_shifts_i > 0 then, do internal shifts
36:       Shift particle to the right by min_shifts_i positions
37:     end if
38:   end for
39:   Accept or reject the received particles
40: end if

[00102] Figure 7 illustrates a correct and efficient implementation of the rotational split and sequential redistribute phases. Rotational split is described technically below. The goal of this phase is to fill the gaps in a way that balances the workload across the MPI cores. Balancing the workload across the MPI cores is equivalent to having:

Σ_{i=0}^{n−1} ncopies_i = n on every core p (18)

which is essentially equation (5) above applied locally. This can be achieved by making enough room between the particles that have to be copied and creating the copies in these new empty spaces. While these two steps could be performed one after another, in this section it is shown that it is possible to perform both at the same time in O(log2 N). For clarity, we believe it is more intuitive to illustrate how to achieve the first step alone and then how to extend that to accomplish the result of equation (18). [00103] As set out above, after the rotational nearly sort phase, ncopies has the shape shown in equation (15).
Therefore, making enough room between the particles that have to be copied easily translates to shifting the particles to the right until ncopies has the new shape:

ncopies = [λ_0, 0, …, 0, λ_1, 0, …, 0, λ_{m−1}, 0, …, 0] (19)

where, for each ncopies_i > 0, ncopies_i − 1 zeros follow. [00104] As shown in Figure 7, the first step of the rotational split process is to perform an inclusive cumulative sum (csum) over ncopies. The inclusive cumulative sum is shown in Figure 5b. The steps are the same as for the exclusive cumulative sum except that in the final stage:

csum_j ← csum_{j−1} + array_j, ∀j = 1, 2, …

[00105] The value of csum_i can be used to calculate the minimum and maximum number of shifts (steps S300 and S302 of Figure 6) for the ith particle, as well as the number of excess duplicates to send, also termed copies_to_send (step S318 of Figure 6), using:

min_shifts_i = csum_i − ncopies_i − i (20)
max_shifts_i = min_shifts_i + ncopies_i − 1 = csum_i − i − 1 (21)
copies_to_send_i = csum_i − i − N·2^(−k) (22)

where csum_i is the inclusive cumulative sum of ncopies up to and including the ith particle. [00106] To achieve the new shape for ncopies in equation (19), we consider each min_shifts_i in binary notation and scan the bits from the MSB to the LSB. The particles are rotated to the right by decreasing power-of-two numbers of positions, if and only if the MSB of min_shifts_i is 1. This corresponds to using a top-down binary tree structure of height O(log2 N). At each stage k of the tree (for k = 1, 2, …, log2 P), any core with rank p sends particles N·2^(−k) positions ahead to its partner with rank p + P·2^(−k). By repeating these steps recursively, ncopies will have the shape in (19) but not (18). That is because, for each x_i that must be copied more than once, we are rotating all its copies by the same number of positions and we are not considering that some of these copies could be split and rotated further to ensure ncopies has the shape of (18). This can be achieved by filling some of the gaps that a strategy for ensuring ncopies has the shape of (19) alone would create.
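As an illustrative serial check (not the parallel implementation), the formulas of equations (20) to (22) can be evaluated on the nearly-sorted Figure 3 example using an inclusive cumulative sum:

```python
from itertools import accumulate

# ncopies after rotational nearly sort (equation (15)) in the Figure 3 example.
ncopies = [1, 5, 2, 0, 0, 0, 0, 0]
N = 8

# Inclusive cumulative sum over ncopies.
csum = list(accumulate(ncopies))
assert csum == [1, 6, 8, 8, 8, 8, 8, 8]

def min_shifts(i):          # equation (20)
    return csum[i] - ncopies[i] - i

def max_shifts(i):          # equation (21)
    return csum[i] - i - 1

def copies_to_send(i, k):   # equation (22), rotation by N*2^-k positions
    return csum[i] - i - (N >> k)

# Values quoted in the example: particles at i = 1 (E) and i = 2 (G).
assert (min_shifts(1), max_shifts(1)) == (0, 4)
assert (min_shifts(2), max_shifts(2)) == (4, 5)
assert copies_to_send(1, 1) == 1   # stage k = 1: one copy of E splits off
```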
[00107] Therefore, as shown in Figure 6, we do not only consider the minimum number of shifts but also the maximum number of shifts that any copy of x_i has to perform without causing collisions. Since (19) aims to create ncopies_i − 1 gaps for each i such that ncopies_i > 0, max_shifts_i is calculated as set out above. Thus, at each kth stage, we check the MSB of both max_shifts_i and min_shifts_i to infer copies_to_send_i, the number of copies of x_i which must rotate by N·2^(−k) positions. If both of them are equal to 1, trivially we send copies_to_send_i = ncopies_i copies of x_i (i.e. all copies are moved as per step S316 of Figure 6). [00108] However, if only the MSB of max_shifts_i is equal to 1, copies_to_send_i < ncopies_i. In this case, the number of copies to split is equal to how many of them must be placed from position i + N·2^(−k) onwards to achieve perfect workload balance. This is equivalent to computing how many copies cause it to be the case that csum_i > i + N·2^(−k). Therefore, the number of excess duplicates to send, also termed copies_to_send (step S318 of Figure 6), is calculated using equation (22) above; this number of copies is sent and the remaining ones do not move. [00109] Considering the example in Figure 7 in more detail, at the first stage k = 1, we can see that the value G, which requires 2 copies (for which i = 2), has values of max_shifts_i = 5 and min_shifts_i = 4. Thus, the MSB of both max_shifts_i and min_shifts_i is 1 and all copies are moved by 4 places. By contrast, at the first stage k = 1, the value E (for which i = 1), which requires 5 copies, has values of max_shifts_i and min_shifts_i of 4 and 0. Thus, only the MSB of max_shifts_i is equal to 1, copies_to_send_i is calculated as 1 and just one value of E is moved by 4 places. In the second stage, k = 2, the value E (for which i = 1), which requires 4 copies, has values of max_shifts_i and min_shifts_i of 3 and 0.
Thus, only the next bit of max_shifts_i is equal to 1, copies_to_send_i is calculated as 2 and two values of E are moved by 2 places. [00110] In case P < N, after the log2 P stages, each core p will perform a leaf stage. Here, the LSBs of max_shifts_i and min_shifts_i are checked at once. Inter-core shifts, or splits and shifts, are performed only if they send copies to the neighbour, i.e. if any csum_i > n(p + 1), where n(p + 1) is the first memory address of the neighbour's private memory. Internal shifts are also considered only if min_shifts_i > 0, which will make enough room to receive particles from the neighbour since min_shifts_i = max_shifts_{i−1} + 1 (see equations (20) and (21)). As shown in Figure 7, a leaf stage is applied after the second stage to move one of the two copies of E to the neighbouring location (from i = 3 to i = 4). [00111] In order to ensure logarithmic time complexity, one needs to update csum_i, max_shifts_i and min_shifts_i in O(1). This can be done if the cores send starter = csum_j − copies_to_send_j, where j is the index of the first particle to send, having ncopies_j > 0. Because the particles are rotated by scanning the bits of max_shifts_i and min_shifts_i and by using an MSB-to-LSB strategy, it is impossible for a particle to overwrite or get past another one. Hence, each core can safely treat the received starter as the sum of ncopies for the cores with lower rank and use it to re-initialise and update csum_i in O(1) as set out below. This strategy guarantees that csum_i is always correct only for any index i such that ncopies_i > 0, but those are the only indexes we are interested in. Once csum_i is computed, equations (20) and (21) are trivially parallelisable. The pseudocode for rotational split is found in Algorithm 6.
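The combined goal of rotational split and S-R can be stated compactly: after redistribution, the copies of the particle at index i occupy the block of positions csum_{i−1} to csum_i − 1, so every core ends up with exactly n particles. The serial sketch below (illustrative only; "-" marks empty slots, and the function name is hypothetical) shows that target layout for the Figure 7 example:

```python
def redistribute_target(x, ncopies):
    """Net effect of rotational split followed by S-R: the copies of the
    particle at index i occupy positions csum_{i-1} ... csum_i - 1."""
    out = []
    for value, copies in zip(x, ncopies):
        out.extend([value] * copies)
    return out

# Nearly-sorted Figure 3/7 example: D, E and G need 1, 5 and 2 copies.
x = ["D", "E", "G", "-", "-", "-", "-", "-"]
ncopies = [1, 5, 2, 0, 0, 0, 0, 0]
out = redistribute_target(x, ncopies)
assert out == ["D", "E", "E", "E", "E", "E", "G", "G"]

# Fully balanced across P = 4 cores of n = 2 slots each (equation (18) locally).
P, n = 4, 2
cores = [out[p*n:(p+1)*n] for p in range(P)]
assert cores == [["D", "E"], ["E", "E"], ["E", "E"], ["G", "G"]]
```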
The update for csum_i is:

csum_i = csum_{i−1} + ncopies_i

[00112] It is easy to infer that both min_shifts_i ≤ N − 1 and max_shifts_i ≤ N − 1, because a particle copy could at most be shifted, or split and shifted, from the first to the last position. Therefore, because the shifts, or the splits and shifts, to perform decrease stage by stage by up to a factor of two, the achieved time complexity of rotational split is O(log2 N). Additionally, no collision can occur because each particle copy is shifted by at least min_shifts_i and by at most max_shifts_i. [00113] The final phase, which is illustrated in Figure 7, is the sequential redistribute (S-R) process, whose pseudo-code is illustrated in Algorithm 2. This process is called to fill the remaining gaps in each core. After rotational split, S-R can be invoked locally and will finish in n iterations because the particles are now equally distributed across the MPI ranks. [00114] Thus, in summary, the overall redistribute step (step S114 of Figure 2a) comprises a rotational nearly sort phase and a rotational split phase, performed only if P > 1, before calling S-R. The achieved time complexity may be inferred as O(N) for P = 1, O(log2 N) for P = N cores and, for any 1 ≤ P ≤ N:

O(N/P + (N/P)·log2 P) (23)

[00115] The first term in (23) represents S-R, which is always performed, and all the steps which are called only once for any P > 1, such as S-NS. The second term describes the log2 P stages of the rotational nearly sort and rotational split phases, during which we update, send and receive up to n particles. The overall algorithm may be termed rotational nearly sort and split (RoSS) redistribute and the pseudo-code may be written as:

Algorithm 7 – Rotational Nearly Sort and Split (RoSS)
Input: x, ncopies, N, P / Output: x
1: if P > 1 then, perform load balancing
2:   x, ncopies ← Rotational_Nearly_Sort(x, ncopies, N, P, n, p), see Algorithm 5; ncopies has now shape (15)
3:   x, ncopies ← Rotational_Split(x, ncopies, N, P, n, p), see Algorithm 6.
     ncopies has now shape (19) and is fully balanced
4: end if
5: x ← S-R(x, ncopies, n), see Algorithm 2. Each core duplicates particles independently within its memory

[00116] Thus, all tasks in the SMC take either O(1) or O(log2 N) time complexity, which brings the overall time complexity down to O(log2 N). Even if it were possible to perform a fully-balanced redistribute in O(1), the time complexity of SMC would still be O(log2 N) due to normalise, ESS, MVR, estimate and recycling, which all require reduction or cumulative sum, as set out in the table below. [00117] The RoSS redistribute process described above could be improved in different ways. For example, a hybrid shared DMA may be used rather than a normal DMA by simply substituting the sequential parts in RoSS with their shared memory equivalents, by using, e.g., GPUs or mainstream CPUs. For example, S-R could be substituted with its OpenMP version described in [12] (which on SMAs also only achieves O(log2 N)). Since hybrid shared DMAs have fewer MPI nodes than pure DMAs, the number of messages will be trivially lower. [00118] Another idea that can save messages is to check the maximum MSB of all shifts_p and all min_shifts_i at the beginning of rotational nearly sort and rotational split respectively. This will allow us to finish the execution of either of the two steps early and avoid sending unnecessary messages. For example, if ncopies happens to be nearly-sorted already, all bits in shifts will be 0, meaning that no messages should be sent and rotational nearly sort could be skipped entirely. However, this practice is only potentially faster as it is highly non-deterministic; therefore, we strongly do not recommend it for real-time applications. [00119] We also note that an r-ary tree structure with r > 2 cores per node would theoretically save MPI messages versus a binary tree structure, because the tree height would decrease from O(log2 N) to O(log_r N).
However, the operations to compute per node would also increase and would be much more complicated to code, meaning that the overall run-time could be equal or worse. This is why, when it comes to arrays, binary trees are widely considered more efficient. The table below illustrates the time complexity of each task of SMC on DMAs using RoSS. [00120] We also note that this document describes a redistribute algorithm for particles having fixed size. In some applications, the particle size can change at run-time, specifically across different particles and/or SMC iterations. To handle this case, one only needs first to identify the size of the biggest particle (which can be computed in O(log2 N) by using reduction) and then extend the other particles by adding extra dimensions with uninformative values, such as NaNs (i.e. not-a-numbers). After that, the particles all have the same size and can be redistributed by using any redistribute algorithm, such as prior art variants or the novel approach described here. [00121] There exist other Monte Carlo methods that require resampling (and hence redistribute). Notable examples are Bootstrap methods and Jackknife methods, where IS, normalise, ESS and recycling are not used, and resampling and estimate are always performed to respectively correct the degeneracy of the initial distribution and produce new state estimates. [00122] The following Figures show the numerical results for redistribute first (Figures 8a to 8d) and then for an exemplary SMC module. All experiments in this section are conducted for N = 2^16, 2^20, 2^24 particles and for up to P = 256 cores. Every reported speed-up or run-time represents the median value of 20 runs collected for the same N, P pair and computed in comparison with the baseline, i.e. the results for P = 1. Each SMC test is run for T_SMC = 100 SMC iterations. Resampling is computed every time to ensure the frequency of redistribution is the same.
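The variable-size handling outlined in [00120] can be sketched serially as follows (an illustrative fragment with a hypothetical function name; in the distributed setting the maximum size would be found with an O(log2 N) reduction, e.g. an MPI_MAX all-reduce):

```python
import math

def pad_particles(particles):
    """Pad variable-size particles with NaNs up to the largest size, so a
    fixed-size redistribute algorithm can then be applied unchanged."""
    max_size = max(len(p) for p in particles)  # reduction in the MPI setting
    return [p + [math.nan] * (max_size - len(p)) for p in particles]

padded = pad_particles([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
assert [len(p) for p in padded] == [3, 3, 3]
assert math.isnan(padded[1][1]) and padded[1][0] == 3.0
```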
All software to run the described algorithms has been written in C++ and run on the same cluster, whose technical specifications are provided in the table below. [00123] Figures 8a to 8d compare single iterations of RoSS redistribute with two other known algorithms, which may be termed N-R and B-R. N-R is the nearly sort based redistribute algorithm described in [23]. B-R is an algorithm which we refer to as Bitonic sort based redistribute and is described in [22]. The pseudo-code for both N-R and B-R is summarised in Algorithm 4 below. A single algorithmic summary is used because they only differ in the sorting strategy they apply to ncopies, and they both achieve O((log2 N)^2) time complexity. When implementing a particle filter, each of RoSS, N-R and B-R is used in the redistribute step. The other phases of the whole algorithm are unaltered and, to make sure that equation (5) is guaranteed when we quantify performance, ncopies is generated using MVR with a random Gaussian vector of normalised importance weights as input. This ensures that the statistics of the weights are similar to those encountered in the particle filter examples, and hence that the performance gains that we quantify for the novel component can be expected to be similar to those encountered when that component is used in a particle filter.
Algorithm 4 – Bitonic/Nearly sort based Redistribute (B-R/N-R)
Input: x, ncopies / Output: x
1: if P > 1 then, Bitonic or nearly sort the particles
2:   if Bitonic sort is used as in [19] then
3:     MPI_Bitonic_Sort(x, ncopies, n)
4:   else, nearly sort is used as in [23]
5:     MPI_Nearly_Sort(x, ncopies, n)
6:   end if
7: end if, ncopies has now shape (15)
8: for k ← 1; k ≤ log2 P; k ← k + 1 do, distribute
9:   …, the cores of each tree node of stage k perform cumulative sum over ncopies
10:  pivot ← Linear_Search(ncopies, csum, n), pivot is 0 if not found
11:  pivot ← MPI_Allreduce(pivot, …), broadcast pivot to all cores of the same tree node
12:  …
13:  x, ncopies ← MPI_Rotational_Shifts(…), the cores of each tree node of stage k express r in base-2 and rotate the particles by r positions according to its bits
14: end for
15: x ← S-R(x, ncopies, n), see Algorithm 2. Each core duplicates particles independently within its memory

[00124] The table below summarises the time complexity of each task of SMC on DMAs in the single core and parallel cases for B-R or N-R. [00125] Bitonic sort, which is used in B-R, is a fast, deterministic, comparison-based parallel sorting algorithm which has recently been implemented on a cluster of graphics cards [35]. In [21], it has been shown that, by using a top-down divide-and-conquer approach, redistribute can be parallelised in a fully-balanced fashion. Starting from the root node, the key idea consists of sorting ncopies and moving the particles at every stage of a binary tree. This can be achieved by searching for a particular index called pivot which perfectly divides the node into two balanced leaves. In order to find pivot, a parallel inclusive cumulative sum is performed over ncopies, and then pivot is the first index at which the cumulative sum is equal to or greater than half of the total number of copies within the node. Once pivot is identified, the node is split. Sorting the particles, using Bitonic sort, is required to make sure that the splitting phase can be performed deterministically in O(1).
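The pivot search described in [00125] can be illustrated serially (an illustrative sketch with a hypothetical function name; in Algorithm 4 the cumulative sum and the search are performed by the cores of each tree node): pivot is the first index whose inclusive cumulative sum reaches half of the node's total number of copies.

```python
from itertools import accumulate

def find_pivot(ncopies):
    """Return the first index where the inclusive cumulative sum of ncopies
    reaches half the total, i.e. the index that splits the node into two
    balanced halves."""
    total = sum(ncopies)
    for i, c in enumerate(accumulate(ncopies)):
        if 2 * c >= total:
            return i
    return 0  # not found (empty node)

# Nearly-sorted Figure 3 example: total = 8, half = 4; csum = [1, 6, 8, ...],
# so the first index with csum >= 4 is 1.
assert find_pivot([1, 5, 2, 0, 0, 0, 0, 0]) == 1
```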
This routine is repeated recursively log2 P times. For any number of cores P ≤ N, it is demonstrated that Bitonic sort performs the number of comparisons defined by equation (16). [00126] The first term in equation (16) describes the number of steps to perform Bitonic sort locally and the second term represents the data movement to merge the keys between the cores. Since (16) converges to O((log2 N)^2) for P = N cores and Bitonic sort is performed log2 P times, we can infer that the redistribute in [21] achieves O((log2 N)^3) time complexity. In [20], the idea of using sort recursively is also applied to a dynamic scheduler for RPA and RNA on MPI. [00127] As described in [22], B-R improves the algorithm in [21] by making a subtle consideration: the workload can still be divided deterministically if we perform Bitonic sort only once. After this single sort, the algorithm moves on to another top-down routine in which rotational shifts are used to move all particles on the left side of pivot up to the left side of the node. This way, the parent node gets split into two balanced leaves. This routine is performed recursively log2 P times until the workload is equally distributed across the cores; then S-R is called. Rotational shifts are faster than Bitonic sort because the achieved time complexity is equal to O(log2 N) and, since Bitonic sort is performed only once, the overall time complexity is improved to O((log2 N)^2) for P = N, or equal to (16) for any P ≤ N. In [22], B-R has been implemented on MapReduce and, although it was significantly better than the algorithm in [21], its run-time for 512 cores was at best 20 times worse than S-R. In [19], B-R was ported to DMAs by using MPI. The results showed substantial speed-up for up to 128 cores, proving that MPI is a more suitable framework than MapReduce for performing B-R. [00128] B-R can be further improved as described in [23] to give the proposed nearly sort based redistribute (N-R) algorithm.
This improvement substitutes Bitonic sort with an alternative algorithm called nearly sort. It has been proven that one does not actually need to perfectly sort the particles to divide the workload deterministically; it is only necessary to guarantee that ncopies has the shape of (15), i.e. all non-zero copy counts appear before all zeros: ncopies = [n₁, n₂, …, n_m, 0, 0, …, 0], with nᵢ > 0 for i ≤ m. In order to achieve this property, the particles are first nearly sorted within each core by calling sequential nearly sort (S-NS) in Algorithm 3. They are then recursively merged, using the same sorting network as Bitonic sort. Since S-NS can be performed in linear time and the number of sent/received messages is equal to (log₂ P)², we can infer that nearly sort costs the number of iterations given by the equation below. [00129] Figures 8a to 8d show that the gap between the proposed approach and the two other algorithms tends to increase significantly with P. RoSS is comparable with N-R, and roughly two to three times faster than B-R, for P = 2 cores, but is up to eight times faster than both N-R and B-R for P = 256. [00130] The cluster used, Barkla, is also very small in comparison with large existing systems such as Summit, which can provide up to 2.4 million cores. Therefore, the gap between RoSS and N-R/B-R is likely to grow to much larger than eight, and this will be realised in settings involving more than P = 256 cores. Figure 8d shows the expected run-times for N = 2²⁴ and for P > 256, computed by linearly extrapolating the measured run-times for P ≤ 256. RoSS is expected to be roughly 75 times faster than N-R for P = 2¹⁶ cores. [00131] Particle filters are used to solve filtering problems of a dynamic model. Figures 9a onwards illustrate the impact of the RoSS redistribute phase, compared to using N-R or B-R, on an exemplary model for which N > 2¹⁶ particles were proven necessary to meet accuracy constraints and up to N = 2²⁴ were used.
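The effect of nearly sort on ncopies, i.e. producing the shape of (15) without fully sorting, can be sketched as a stable linear-time partition. This Python function is an illustrative stand-in for the sequential S-NS routine, not a reproduction of the patent's Algorithm 3:

```python
def nearly_sort(x, ncopies):
    """Stable linear-time pass: keep (particle, count) pairs with
    count > 0 first and zeros last; no full sort is required to
    satisfy the nearly-sorted shape."""
    keep = [(xi, c) for xi, c in zip(x, ncopies) if c > 0]
    drop = [(xi, c) for xi, c in zip(x, ncopies) if c == 0]
    pairs = keep + drop
    return [p for p, _ in pairs], [c for _, c in pairs]

x, nc = nearly_sort(['a', 'b', 'c', 'd'], [0, 2, 0, 2])
print(nc)  # → [2, 2, 0, 0]
```

Because the pass touches each element once, it runs in linear time, which is why substituting it for Bitonic sort within each core reduces the local cost of the redistribute.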
The chosen model is a vacuum arc remelting (VAR) process, a secondary melting process used in the final stage of alloy production. In this stage, a continuous arc strikes between an electrode and a solidifying ingot, making the metal melt off the electrode and fall onto a melt pool to solidify. Since this process is performed in a vacuum, many impurities are removed, resulting in a higher-quality finished product. During the VAR process, it is fundamental to keep track of the liquid pool depth which, however, cannot be measured directly. A parallel particle filter can therefore be used in this case. The following dynamic model is for alloy 718 and is thoroughly explained and used in [9], [10]. Here, for brevity, we only highlight the most important details. [00132] X_t is nine-dimensional and contains information about the electrode thermal boundary layer Δ, the electrode gap g, the ram position X_ram, the electrode mass M_e, the melting efficiency μ, the centreline pool depth S_c, the mid-radius pool depth S_m, the Helium pressure p_he, and the current I. The time interval dt, the melting current I_c and the ram speed V_ram are controlled by the user. Here we use dt = 5 s, I_c = 6000 A, and V_ram is inferred by setting to 0 the expression between square brackets in (24b). The process noise terms dI, dV_ram, dμ and dp_he are Gaussian with zero mean and their respective covariances (see Table IV). Every quantity in X_t but μ_t and Δ_t, together with the voltage across R_i (which is related to I_t), can be measured directly. Therefore, the set of measurements Y_t is given by (25). The particles are initialised as in (26), i.e. drawn from a Gaussian with diagonal covariance Σ = diag(σ²_Δ, σ²_g, σ²_Xram, σ²_Me, σ²_μ, σ²_Sc, σ²_Sm, σ²_phe, σ²_I). [00133] Tables IV, V and VI below provide all the constants and σ terms in (24), (25), and (26).
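For orientation only, the nine components of X_t listed above can be collected in a simple container. This is a hypothetical structure for illustration; the patent does not prescribe any particular in-memory layout for the particles:

```python
from dataclasses import dataclass

@dataclass
class VarState:
    """One particle's nine-dimensional VAR state X_t."""
    delta: float    # electrode thermal boundary layer Δ
    g: float        # electrode gap
    x_ram: float    # ram position X_ram
    m_e: float      # electrode mass M_e
    mu: float       # melting efficiency μ
    s_c: float      # centreline pool depth S_c
    s_m: float      # mid-radius pool depth S_m
    p_he: float     # Helium pressure p_he
    current: float  # current I

print(len(VarState.__dataclass_fields__))  # → 9
```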
The dynamics is chosen as the proposal, such that (2) becomes: w_t^(i) ∝ w_(t-1)^(i) p(Y_t | X_t^(i)) [00134] Table IV – Standard deviations for noise terms. Table V – Parameters for alloy 718. Table VI – Nominal values, furnace and alloy 718 properties. [00135] Figures 9a to 9d show that the speed-up of all particle filters increases as P increases, as expected from theory. Furthermore, Figures 9a to 9c show that a particle filter with RoSS outperforms a particle filter with N-R or B-R by up to a factor of three, a factor which again increases with P. The maximum recorded speed-up is 160 for a particle filter with RoSS and about 50 for a particle filter with N-R or B-R. Figure 9d shows the run-times for RoSS and N-R, measured for up to 256 cores and extrapolated above 256 cores. As shown, a particle filter using RoSS could be up to 35 times faster than one using N-R for N = 2²⁴ particles and up to P = 2¹⁶ cores. [00136] Figure 10a shows the bottleneck analysis for N = 2²⁴ and increasing P, for the particle filter using the new redistribute method and for the prior art methods. IS takes up a large percentage of the total run-time for low P; indeed, IS is the bottleneck, rather than any redistribute, for P = 2. However, as P increases, both N-R and B-R eventually emerge as the bottleneck for 8 ≤ P ≤ 32. RoSS, on the other hand, remains faster than IS for any P ≤ 256. [00137] Hydrological models are commonly used in the context of urban planning, policy making and water management. The goal is typically to estimate the daily or hourly water runoff levels in a specific and relatively small geographic area, given on-site measurements of rainfall, evapotranspiration, and water flow. Historically, the significant runoff parameters could be tuned by the modellers based on their experience. This approach is, however, highly prone to human error. Today, SMC methods can be used instead.
More precisely, SMC samplers can be employed to provide daily or hourly estimates of the runoff parameters because the application is not real-time. [00138] Figures 10b and 11a to 11d provide results for an application of the well-known rainfall-runoff Australian Water Balance (AWB) model. Here, for brevity, we highlight the most relevant features; extra details about all inputs and functions can be found in [36] and [37]. In the generic version of this model, the physical system has a baseflow storage (BS) and a chain of v surface storage areas, having increasing capacities C_1 < C_2 < … < C_d < … < C_v. The surface storages have frictional areas A_1, A_2, …, A_v, respectively. As time goes on, part of the rainfall in excess at the surface refills the BS. More precisely, the baseflow recharge is equal to BFI × Excess and the surface runoff is equal to (1 − BFI) × Excess, where BFI is the baseflow index, which determines the percentage of excess rainfall that recharges the BS. [00139] Part of the baseflow recharge runs off too, according to another index, K, called the daily recession constant. The state has the following shape: X_t = [K, BFI, C_1, …, C_v, A_1, …, A_v] (27) and therefore has M = 2v + 2 dimensions. Measurements of water flow at the catchment, Y_t, are provided daily or hourly and are related to the state X_t by a non-linear objective function g() as follows: Y_t = g(X_t, Φ_t) + ε_t (28) where Φ_t are rainfall and evapotranspiration input data, and ε_t is a homoscedastic Gaussian measurement error with zero mean. In this experiment, the specific application of the model is for the Bass River, a 52 km² catchment located at Loch in the South Gippsland Basin in Victoria, on the western slope of the Strzelecki Ranges. Here v = 3, such that we estimate M = 8 parameters.
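The excess-splitting rule above can be sketched for a single surface storage. This is a deliberately simplified, illustrative step (one storage, illustrative constants); the real AWB model chains v storages and is specified fully in [36], [37]:

```python
def awb_step(store, rain, evap, capacity, bfi, k):
    """One daily step of a one-storage water-balance sketch:
    moisture above capacity becomes 'excess', which is split into
    baseflow recharge (BFI * excess) and surface runoff ((1 - BFI) * excess);
    a fraction (1 - k) of the recharge then runs off as baseflow."""
    store = max(store + rain - evap, 0.0)
    excess = max(store - capacity, 0.0)
    store = min(store, capacity)
    recharge = bfi * excess                  # refills the baseflow storage
    surface_runoff = (1.0 - bfi) * excess
    baseflow = (1.0 - k) * recharge          # K is the daily recession constant
    return store, surface_runoff + baseflow

s, q = awb_step(store=90.0, rain=25.0, evap=5.0, capacity=100.0, bfi=0.4, k=0.95)
print(round(q, 3))
```

In the SMC sampler, K, BFI and the capacities and areas of (27) are the unknowns being estimated, so a function like this would sit inside the objective g() that maps a candidate parameter vector to a simulated flow series.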
[00140] Every new run of the SMC sampler is initialised as follows:
K ~ 0.271 × Beta(51.9, 4.17) + (1 − 0.271) × Beta(255, 9.6)
A_1 ~ Beta(1.4, 2.6)
A_2, A_3 ~ Beta(2.0, 2.5)
C_d ~ Weibull(2.16, 136 c_d)
where c_d = {0.5, 0.75, 1.5}. In the experiment, we focus on single calls of SMC samplers, under the same testing conditions as the VAR model. At every SMC iteration, the same measurement Y_t = Y_1 is resent to the server, and the particles are sampled from a Gaussian distribution and weighted using (28) as the target, with backward kernel L() = q(). [00141] As can be observed in Figures 10b and 11a to 11d, the results are very similar to the previous example: an SMC sampler with RoSS is again up to three times faster than the same sampler using N-R/B-R, and RoSS is still the only redistribute option that has not emerged as a bottleneck. This is not surprising, for two reasons: first, this example has almost the same dimensionality as the VAR model in the previous example; second, particle filters and SMC samplers are nearly identical and differ mostly in the particle weighting strategy. [00142] These results underline the importance of having a fast, scalable redistribute. Since modern models may be very detailed and complex (e.g. requiring a sophisticated numerical integrator), the IS step also becomes highly computationally intensive. A fast redistribute therefore allows SMC methods to maintain a near-linear speed-up for higher P, which is what theory predicts but is rarely achieved in practice. [00143] These findings are encouraging because, on the one hand, applications of SMC are becoming increasingly demanding in terms of accuracy and run-times and, on the other hand, modern computers and supercomputers are progressively being built with higher numbers of cores.
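The initialisation above can be sketched with Python's standard library. Two assumptions are made here: the expression for K is read as a two-component Beta mixture with weights 0.271 and 0.729, and the Weibull parameters are taken as (shape 2.16, scale 136 c_d); neither reading is stated explicitly in the text:

```python
import random

def sample_prior(rng=random):
    """Draw one particle's K, frictional areas and capacities from the
    priors above (mixture reading and Weibull ordering are assumptions)."""
    # K: mixture of two Betas, chosen with probability 0.271 / 0.729
    if rng.random() < 0.271:
        K = rng.betavariate(51.9, 4.17)
    else:
        K = rng.betavariate(255, 9.6)
    A1 = rng.betavariate(1.4, 2.6)
    A2 = rng.betavariate(2.0, 2.5)
    A3 = rng.betavariate(2.0, 2.5)
    # random.weibullvariate takes (scale alpha, shape beta)
    C = [rng.weibullvariate(136 * c_d, 2.16) for c_d in (0.5, 0.75, 1.5)]
    return K, (A1, A2, A3), C

K, A, C = sample_prior()
print(0.0 < K < 1.0)  # → True
```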
Therefore, as time goes on, it becomes increasingly important to have a fast, parallelisable redistribute, such that in real-world problems the SMC module can maintain a near-linear speed-up for a higher level of parallelism. [00144] The novel fully-balanced redistribute algorithm described above applies to SMC methods running on distributed memory environments. The algorithm has been implemented on MPI. The baselines for comparison are B-R and N-R, two similar state-of-the-art fully-balanced redistribute algorithms which both achieve O((log₂ N)²) time complexity and whose implementations are already available on MPI. In contrast, it has been shown that RoSS achieves O(log₂ N) time complexity. The results show that RoSS is up to eight times faster than B-R and N-R for up to P = 256 cores. Similar results are observed on two real-world problems: for the same level of parallelism, an SMC method using RoSS is up to three times faster than the same method using B-R/N-R and provides a maximum speed-up of 160. Under the same conditions, RoSS is no longer the bottleneck. [00145] At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated special-purpose hardware. Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more cores or processors.
These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements. Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others. [00146] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. [00147] The list of references includes:
[1] F. Gustafsson, “Particle filter theory and practice with positioning applications,” IEEE Aerospace and Electronic Systems Magazine, vol. 25, pp. 53–82, Jul. 2010.
[2] R. Delgado-Gonzalo, N. Chenouard, and M. Unser, “A new hybrid Bayesian-variational particle filter with application to mitotic cell tracking,” in 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1917–1920, Mar. 2011.
[3] J. M. Costa, “Estimation of tumor size evolution using particle filters,” Journal of Computational Biology, Jan. 2015.
[4] Z. Wu and S. Li, “Reliability evaluation and sensitivity analysis to AC/UHVDC systems based on sequential Monte Carlo simulation,” IEEE Transactions on Power Systems, vol. 34, pp. 3156–3167, Jul. 2019.
[5] P. J. Van Leeuwen, H. Künsch, L. Nerger, R. Potthast, and S. Reich, “Particle filters for high dimensional geoscience applications: A review,” Quarterly Journal of the Royal Meteorological Society, vol. 145, Apr. 2019.
[6] H. Lopes and R. Tsay, “Particle filters and Bayesian inference in financial econometrics,” Journal of Forecasting, vol. 30, pp. 168–209, Jan. 2011.
[7] C. A. Naesseth, F. Lindsten, and T. B. Schön, “High-dimensional filtering using nested sequential Monte Carlo,” IEEE Transactions on Signal Processing, vol. 67, pp. 4177–4188, Aug. 2019.
[8] J. Zhang and H. Ji, “Distributed multi-sensor particle filter for bearings-only tracking,” International Journal of Electronics, vol. 99, pp. 239–254, Feb. 2012.
[9] F. Lopez, L. Zhang, J. Beaman, and A. Mok, “Implementation of a particle filter on a GPU for nonlinear estimation in a manufacturing remelting process,” in 2014 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, pp. 340–345, Jul. 2014.
[10] F. Lopez, L. Zhang, A. Mok, and J. Beaman, “Particle filtering on GPU architectures for manufacturing applications,” Computers in Industry, vol. 71, pp. 116–127, 2015.
[11] L. M. Murray, A. Lee, and P. E. Jacob, “Parallel resampling in the particle filter,” Journal of Computational and Graphical Statistics, vol. 25, no. 3, pp. 789–805, 2016.
[12] A. Varsi, J. Taylor, L. Kekempanos, E. Pyzer-Knapp, and S. Maskell, “A fast parallel particle filter for shared memory systems,” IEEE Signal Processing Letters, vol. 27, pp. 1570–1574, 2020.
[13] M. Bolic, P. M. Djuric, and S. Hong, “Resampling algorithms and architectures for distributed particle filters,” IEEE Transactions on Signal Processing, vol. 53, pp. 2442–2450, Jul. 2005.
[14] M. Bolic, A. Athalye, S. Hong, and P. Djuric, “Study of algorithmic and architectural characteristics of Gaussian particle filters,” Journal of Signal Processing Systems, vol. 61, pp. 205–218, Nov. 2010.
[15] R. Zhu, Y. Long, Y. Zeng, and W. An, “Parallel particle PHD filter implemented on multicore and cluster systems,” Signal Processing, vol. 127, pp. 206–216, Oct. 2016.
[16] F. Bai, F. Gu, X. Hu, and S. Guo, “Particle routing in distributed particle filters for large-scale spatial temporal systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 2, pp. 481–493, 2016.
[17] K. Heine, N. Whiteley, and A. Cemgil, “Parallelizing particle filters with butterfly interactions,” Scandinavian Journal of Statistics, vol. 47, no. 2, pp. 361–396, 2020.
[18] S. Sutharsan, T. Kirubarajan, T. Lang, and M. McDonald, “An optimization-based parallel particle filter for multitarget tracking,” IEEE Transactions on Aerospace and Electronic Systems, vol. 48, pp. 1601–1618, Apr. 2012.
[19] A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell, “Parallelising particle filters with deterministic runtime on distributed memory systems,” IET Conference Proceedings, pp. 11–18, 2017.
[20] O. Demirel, I. Smal, W. Niessen, E. Meijering, and I. Sbalzarini, “PPF – a parallel particle filtering library,” IET Conference Publications, vol. 2014, Oct. 2013.
[21] S. Maskell, B. Alun-Jones, and M. Macleod, “A single instruction multiple data particle filter,” in IEEE Nonlinear Statistical Signal Processing Workshop, pp. 51–54, 2006.
[22] J. Thiyagalingam, L. Kekempanos, and S. Maskell, “MapReduce particle filtering with exact resampling and deterministic runtime,” EURASIP Journal on Advances in Signal Processing, vol. 2017, p. 71, Oct. 2017.
[23] A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell, “A single SMC sampler on MPI that outperforms a single MCMC sampler,” arXiv preprint arXiv:1905.10252, 2019.
[24] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.
[25] J. D. Hol, T. B. Schön, and F. Gustafsson, “On resampling algorithms for particle filters,” in 2006 IEEE Nonlinear Statistical Signal Processing Workshop, pp. 79–82, Sep. 2006.
[26] T. Li, M. Bolic, and P. M. Djuric, “Resampling methods for particle filtering: Classification, implementation, and strategies,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 70–86, 2015.
[27] A. Hegyi, L. Mihaylova, R. Boel, and Z. Lendek, “Parallelised particle filtering for freeway traffic state tracking,” in 2007 European Control Conference (ECC), pp. 2442–2449, 2007.
[28] K. Emami, T. Fernando, H. H. Iu, H. Trinh, and K. P. Wong, “Particle filter approach to dynamic state estimation of generators in power systems,” IEEE Transactions on Power Systems, vol. 30, no. 5, pp. 2665–2675, 2015.
[29] A. Doucet and S. Sénécal, “Fixed-lag sequential Monte Carlo,” in 2004 12th European Signal Processing Conference, pp. 861–864, 2004.
[30] P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo samplers,” Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
[31] T. L. T. Nguyen, F. Septier, G. W. Peters, and Y. Delignon, “Efficient sequential Monte Carlo samplers for Bayesian inference,” IEEE Transactions on Signal Processing, vol. 64, pp. 1305–1319, Mar. 2016.
[32] K. Petersen, M. Nielsen, and S. Brandt, “A static SMC sampler on shapes for the automated segmentation of aortic calcifications,” in Computer Vision – ECCV 2010 (K. Daniilidis, P. Maragos, and N. Paragios, eds.), pp. 666–679, Springer, Berlin, Heidelberg, 2010.
[33] X. Luo, T. Kitasaka, and K. Mori, “ManiSMC: A new method using manifold modeling and sequential Monte Carlo sampler for boosting navigated bronchoscopy,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2011 (G. Fichtinger, A. Martel, and T. Peters, eds.), pp. 248–255, Springer, Berlin, Heidelberg, 2011.
[34] F. Nielsen, Introduction to HPC with MPI for Data Science. Springer, Sep. 2016.
[35] S. White, N. Verosky, and T. Newhall, “A CUDA-MPI hybrid bitonic sorting algorithm for GPU clusters,” in 2012 41st International Conference on Parallel Processing Workshops, pp. 588–589, Sep. 2012.
[36] E. Jeremiah, S. Sisson, L. Marshall Price, R. Mehrotra, and A. Sharma, “Bayesian calibration and uncertainty analysis of hydrological models: A comparison of adaptive Metropolis and sequential Monte Carlo samplers,” Water Resources Research, vol. 47, Jul. 2011.
[37] G. Zhu, X. Li, J. Ma, Y. Wang, S. Liu, C. Huang, K. Zhang, and X. Hu, “A new moving strategy for the sequential Monte Carlo approach in optimizing the hydrological model parameters,” Advances in Water Resources, vol. 114, pp. 164–179, 2018.
[00148] All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features. [00149] The invention is not restricted to the details of the foregoing embodiment(s).
The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.



 