Title:
PARALLEL PROCESSING USING A MICROCONTROLLER BASED COMPUTING APPARATUS
Document Type and Number:
WIPO Patent Application WO/2023/194772
Kind Code:
A1
Abstract:
The disclosure relates to a computing apparatus, for parallel computing, a method, a system, and a non-transitory computer readable media. The computing apparatus comprises a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply. The master microcontroller is operative to receive executable files through the I/O interface, to distribute the executable files to the at least two slave microcontrollers, to receive a response from the at least two slave microcontrollers and to transmit an aggregated response through the I/O interface.

Inventors:
TELEFONAKTIEBOLAGET LM ERICSSON (PUBL) (SE)
SELLIER JEAN MICHEL (CA)
Application Number:
PCT/IB2022/053166
Publication Date:
October 12, 2023
Filing Date:
April 05, 2022
Assignee:
ERICSSON TELEFON AB L M (SE)
SELLIER JEAN MICHEL (CA)
International Classes:
G06F15/163
Foreign References:
US20060294290A12006-12-28
EP0589499B11999-04-07
Attorney, Agent or Firm:
DUFORT, Julie et al. (CA)
Claims:
CLAIMS

1. A computing apparatus, comprising:

- a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers;

- a bus, operatively interconnecting the plurality of microcontrollers;

- an input/output (I/O) interface operatively interconnected to the master microcontroller; and

- a power supply;

wherein the master microcontroller is operative to: receive executable files through the I/O interface; distribute the executable files to the at least two slave microcontrollers; receive a response from the at least two slave microcontrollers; and transmit an aggregated response through the I/O interface.

2. The computing apparatus of claim 1, wherein the bus operatively interconnects the plurality of microcontrollers in parallel.

3. The computing apparatus of claim 1, wherein the bus is composed of hard wires.

4. The computing apparatus of claim 2 or 3, wherein the bus is composed of two hard wires, one for serial data (SDA) signal and one for serial clock (SCL) signal.

5. The computing apparatus of claim 4, wherein the master microcontroller distributes the executable files and receives a response from the at least two slave microcontrollers using the inter-integrated circuit (I2C) communication protocol.

6. The computing apparatus of claim 1, wherein the master microcontroller receives the executable files and transmits the aggregated response through the I/O interface using the universal asynchronous receiver-transmitter (UART) protocol.

7. The computing apparatus of claim 1, wherein each of the executable files received through the I/O interface comprise specific instructions, to be executed by each of the plurality of microcontrollers, for executing a parallel computation.

8. The computing apparatus of claim 7, wherein the executable files are compiled on a computer connected to the master node and are received through the I/O interface.

9. The computing apparatus of claim 1, wherein the master microcontroller is further operative to send a command to the at least two slave microcontrollers, to set the at least two slave microcontrollers in programming mode.

10. The computing apparatus of claim 9, wherein, when in programming mode, each of the at least two slave microcontrollers is operative to: write the executable file on the slave microcontroller flash memory; restart itself; and wait in idle mode.

11. The computing apparatus of claim 10, wherein the master microcontroller is further operative to send a command to the at least two slave microcontrollers, to run the executable files.

12. The computing apparatus of claim 11, wherein the master microcontroller is further operative to send a command to the at least two slave microcontrollers, to send a response to the master microcontroller.

13. A method for parallel computing, using a computing apparatus comprising: a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers; a bus, operatively interconnecting the plurality of microcontrollers; an input/output (I/O) interface operatively interconnected to the master microcontroller; and a power supply; the method comprising, the master microcontroller: receiving executable files through the I/O interface; distributing the executable files to the at least two slave microcontrollers; receiving a response from the at least two slave microcontrollers; and transmitting an aggregated response through the I/O interface.

14. The method of claim 13, further comprising the master microcontroller receiving the executable files through the I/O interface, wherein the executable files are compiled on a computer connected to the master node through the I/O interface.

15. The method of claim 13, further comprising the master microcontroller sending a command, to the at least two slave microcontrollers, to set the at least two slave microcontrollers in programming mode.

16. The method of claim 15, further comprising, when in programming mode, each of the at least two slave microcontrollers: writing the executable file on the slave microcontroller flash memory; restarting itself; and waiting in idle mode.

17. The method of claim 16, further comprising, the master microcontroller sending a command to the at least two slave microcontrollers, to run the executable files.

18. The method of claim 17, further comprising, the master microcontroller sending a command to the at least two slave microcontrollers, to send a response to the master microcontroller.

19. A system comprising the computing apparatus of claim 1 and a computer connected to the master node through the I/O interface.

20. A non-transitory computer readable media having stored thereon instructions for parallel computing, using a computing apparatus comprising: a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers; a bus, operatively interconnecting the plurality of microcontrollers; an input/output (I/O) interface operatively interconnected to the master microcontroller; and a power supply; the instructions comprising, the master microcontroller: receiving executable files through the I/O interface; distributing the executable files to the at least two slave microcontrollers; receiving a response from the at least two slave microcontrollers; and transmitting an aggregated response through the I/O interface.

21. The non-transitory computer readable media of claim 20, further comprising instructions for executing any of the steps of claims 14 to 18.

Description:
PARALLEL PROCESSING USING A MICROCONTROLLER BASED

COMPUTING APPARATUS

TECHNICAL FIELD

[0001] The present disclosure relates to parallel processing using a microcontroller based computing apparatus.

BACKGROUND

[0002] Fifth generation (5G) networks (and beyond) come with the great promise of faster speeds, eventually reaching up to 10 Gbit/second in download. These impressive speeds are going to be achieved in mainly two ways: 1) on the one hand, additional higher frequencies will be in use, creating the need to develop, and deploy, smaller cells in a dense manner in urban areas (about 250 meters apart from each other) and 2) on the other hand, by means of a massive use of artificial intelligence (AI).

[0003] Therefore, it can be easily foreseen that data centers and edge computing devices, which can be seen as an application of the Internet of things (IoT), are going to be critical for the development and deployment of this new type of networks. These two different computing device paradigms (centers and edges) will require important computational capabilities to effectively support a level of automation (through AI techniques) never seen in previous communication networks. Because the use of these computational devices might be on both sides (centers or edges), and because they will both be required/installed in large numbers, it is also important that their carbon footprint remains acceptable (which is, unfortunately, not the case with the hardware necessary to run AI algorithms nowadays).

[0004] From a purely data center perspective, a common technique to speed up computations consists in using parallel hardware which can perform a certain number of mathematical operations at the same time. The greater the number of operations that can be performed simultaneously, the faster the computation. Nowadays, two main parallel computational resources are available on the market, i.e., mainstream (a.k.a. Beowulf) clusters and graphics processing units (GPUs). The former technology can be considered as a cluster of commodity-grade computers which are connected to each other into local area networks. These computing networks are, thus, obtained by utilizing both specialized hardware (server motherboards, ethernet boards, etc.) and software (operating systems, communication protocols, etc.), consequently requiring a significant amount of power to function. The second technology can be seen as a specialized electronic circuit which performs rapid manipulations and alterations of memory data (mainly for computer graphics related purposes). Neither technology is suitable for the processing needs of 5G networks (and beyond).

[0005] From a purely edge computing perspective, the most common hardware in use is represented by either simple microcontrollers (MCUs) or systems on chip (SoC) (such as Raspberry Pi or Nvidia Jetson). The MCU is known to have limitations which prevent any meaningful use of AI on the edge. In the SoC, part of the computational resources is constantly consumed by the operating system which needs to run on these machines; moreover, their carbon footprint can be quite significant too.

SUMMARY

[0006] Because of the reasons discussed above, it becomes desirable to develop a parallel computing device having the following features (important in the context of 5G networks):

- the device should provide relevant computational resources in terms of speed and memory;

- the cost for such device should be reasonable to allow the construction of powerful clusters;

- the device design should be flexible enough to easily adapt to any situation from the edge to the center;

- the device should have a low carbon footprint to allow the deployment of low power, energy efficient communication networks;

- the device should come in a small package, possibly without the need for a cooling system (so that its dimensions remain small, and its carbon footprint remains low).

[0007] Clearly, a new kind of computational parallel resource needs to be designed as no currently available technology includes all the above features. Without this new technology, it is difficult to imagine efficient and low carbon footprint 5G networks that are able to run AI algorithms on both the center and the edges.

[0008] The parallel machine and its software framework, presented and discussed herein, present a new flexible, affordable, and reliable solution for the computational needs of 5G networks and beyond.

[0009] There is provided a computing apparatus, comprising a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply. The master microcontroller is operative to receive executable files through the I/O interface. The master microcontroller is operative to distribute the executable files to the at least two slave microcontrollers. The master microcontroller is operative to receive a response from the at least two slave microcontrollers. The master microcontroller is operative to transmit an aggregated response through the I/O interface.

[0010] There is provided a method for parallel computing, using a computing apparatus. The computing apparatus comprises a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply. The method comprises, the master microcontroller receiving executable files through the I/O interface. The method comprises, the master microcontroller distributing the executable files to the at least two slave microcontrollers. The method comprises, the master microcontroller receiving a response from the at least two slave microcontrollers. The method comprises, the master microcontroller transmitting an aggregated response through the I/O interface.

[0011] There is provided a system comprising the computing apparatus, as previously described, and a computer connected to the master node through the I/O interface.

[0012] There is provided a non-transitory computer readable media having stored thereon instructions for parallel computing, using a computing apparatus comprising a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply. The instructions comprise, the master microcontroller receiving executable files through the I/O interface. The instructions comprise, the master microcontroller distributing the executable files to the at least two slave microcontrollers. The instructions comprise, the master microcontroller receiving a response from the at least two slave microcontrollers. The instructions comprise, the master microcontroller transmitting an aggregated response through the I/O interface.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] Figure 1a is a picture of an example microcontroller.

[0014] Figure 1b is a picture of an example printed circuit board.

[0015] Figure 2 is a picture of an example assembly of microcontrollers.

[0016] Figures 3a and 3b are pictures of a prototype that was built as a proof of concept.

[0017] Figure 4 is a schematic illustration of the computing apparatus.

[0018] Figure 5 is a block diagram of an example microcontroller.

[0019] Figure 6 is a sequence diagram illustrating steps for using the prototype of figure 3.

[0020] Figure 7 is a flowchart of a method for parallel computing using the computing apparatus.

DETAILED DESCRIPTION

[0021] Various features will now be described with reference to the drawings to fully convey the scope of the disclosure to those skilled in the art.

[0022] Sequences of actions or functions may be used within this disclosure.

[0023] The functions/actions described herein may occur out of the order noted in the sequence of actions or simultaneously. Furthermore, in some illustrations, some blocks, functions or actions may be optional and may or may not be executed; these are generally illustrated with dashed lines.

[0024] Further, a computer readable carrier or carrier wave may contain an appropriate set of computer instructions that would cause a processor to carry out the techniques described herein.

[0025] The main computational resources on the market for center and edge computing can be classified into MCUs, SoCs, GPUs and clusters. While they continue to provide interesting capabilities in certain fields of science and technology, it is difficult to see how they could concretely contribute to the deployment of the various AI algorithms which future networks will rely on.

[0026] In particular:

Computer motherboards in the so-called “pizza boxes” form factor, which tend to be flat and wide, consume power just as any other desktop computer, i.e., a minimum of 430 Watt. Thus, although they provide relevant computational power, they do not qualify as low impact technology. Moreover, they come in cumbersome dimensions. The same applies, consequently, to clusters.

GPUs represent a relevant computational resource in several fields such as neural network training and computer graphics, since their hardware has been highly specialized for this kind of use. Unfortunately, though, they are not suitable for general purpose computing. Moreover, their power consumption is at least around 500 Watt, not to mention that they can work only in the presence of a computer to which they must be connected (therefore, they cannot come in small packages).

MCUs come in small packages with various computational capabilities. At the filing date of this patent application, the best MCUs belong to the family of advanced RISC machine (ARM) devices known as the Cortex-M. A typical MCU from this family runs at 600 MHz and comes with a handful of megabytes of random-access memory (RAM). While they consume very low power, it is hard to imagine how to achieve anything computationally meaningful on such limited resources.

SoCs are essentially affected by the same issues that affect motherboards. They come in relatively big packages and their power consumption is relatively high.

[0027] Clearly, the computational devices on the market today do not provide a reasonable solution to the computational needs of 5G networks and beyond.

[0028] The parallel computing device presented herein can be considered as a cluster of connected, modern, low cost and energy efficient MCUs capable of performing relevant and meaningful computations, reaching the very same capabilities and flexibility of mainstream clusters, but at lower costs, lower power consumption, and in relatively small packages. Specific MCUs (as will be described below) have been selected and connected to obtain a machine capable of relevant performance without the need for cooling systems and/or a high-power supply, and without the need for an operating system running on the MCUs or the equivalent of the message passing interface (MPI) library, which is a standard library for node communications, therefore further reducing the need for electric power.

[0029] In order to concretely assess the performance and feasibility of the proposed technology, a prototype of the proposed architecture (made of 22 MCUs) has been built and tested on a quite complex computational validation benchmark, i.e., the simulation of the H2 and HeH+ molecules by means of the well-known Hartree-Fock method. Not only are these archetypal validation tests very well known (and recognized) in quantum chemistry, they also represent quite complex computational problems, therefore clearly showing that this new machine is capable of providing the computational resources required in the context of 5G networks (and beyond).

[0030] Recently, a new series of ARM 32-bit RISC, low cost and energy efficient MCUs has been released, under the name of "Cortex-M". In particular, these new MCUs come with very interesting computational capabilities: for instance, the Cortex-M7 runs at 600 MHz without any cooling system and can address memory up to 4 Gigabytes (the Cortex-M7 MCU is, obviously, one possible choice, but the parallel device described in this document is not limited to any particular MCU model selected from this family).

[0031] This new kind of MCUs allows the design and construction of interesting parallel devices capable of quite impressive computations, paving the way towards the development of flexible/adaptable parallel devices in small packages, being low powered and at drastically lower costs (literally orders of magnitude smaller), especially if compared to mainstream clusters. The parallel computing device described herein represents an excellent solution for the issues mentioned previously in the context of 5G networks and beyond.

[0032] In more technical details, this parallel device is constituted of MCUs connected with each other in a master-slave fashion, through the inter-integrated circuit (I2C) communication protocol, which is a protocol for circuit communications, and thus the MCUs are hard wired with each other.

[0033] An accompanying software framework has also been developed to handle and program such a machine (which is described in great detail below). Consequently, this machine does not necessitate any operating system (such as GNU/Linux on mainstream clusters) and/or communication libraries (such as the standard MPI on mainstream clusters).

[0034] The parallel device described in detail below provides relevant/meaningful computational resources, and an effective and simple way to program it, with the following important benefits/advantages:

- lower cost;

- lower power consumption, which can be hundreds of times lower for equivalent computing power;

- no need for cooling systems;

- small package;

- versatile architecture: the parallel machine described herein can have any number of MCUs and the proposed architecture scales easily; the computational capabilities can be increased, or decreased, by adding, or removing, one MCU at a time, while following the same connection scheme for every MCU added/removed;

- simplicity of programming general computational algorithms: the network of MCUs can be used in exactly the same way one can program mainstream clusters through the MPI protocol, i.e., as a general parallel computing machine; programming the machine does not require programming every single MCU by hand as everything is handled automatically by the framework proposed herein;

- channel capacity: the rate of information that can be reliably transmitted over the available communication channel is handled through the serial peripheral interface (SPI) protocol; this protocol is usually embedded electronically, i.e., in hardware, in the MCUs available nowadays, which represents an advantage over the channel capacity exploited by a software implementation of a protocol such as MPI.

[0035] The hardware architecture.

[0036] The parallel computing device presented herein can be considered as an improvement over the mainstream Beowulf cluster technology. In essence, these clusters are made by connecting motherboards (called the nodes) to each other, usually through ethernet cards, and by running a common operating system and a common communication protocol at each node. In this paradigm, a master-slave scheme is usually adopted, thus having one node behaving as the master and all others as the slaves. This approach has allowed the development of efficient parallel machines which achieve general computing capabilities and computational speeds. In the current status of this technology, every node is represented by a rather cumbersome computing device (both in terms of dimensions and power consumption), and, consequently, the final outcome is a powerful but cumbersome parallel machine (which comprises several racks of "pizza-box" computation units). Moreover, these machines require significant cooling systems, critical for their correct functioning (thus, more power consumption). This approach, i.e., nodes made of common motherboards with an operating system and a communication protocol, is well-known to be a serious burden in terms of dimensions and energy consumption.

[0037] Thanks to the latest developments in the field of MCU technology, a solution to this problem is within reach nowadays, which provides smaller and energy efficient MCUs, rather than motherboards, and could be utilized to create a new kind of parallel computing machine.

[0038] A software framework can be developed to enable easy programming of such a machine, which would be a sort of MPI for clusters made of MCUs. A prototype of such a machine was built as a proof of concept and is shown in figures 3a and 3b (top and side view of the prototype). Further, several parallel computational algorithms have been run on the prototype to validate this new approach to parallel computing.

[0039] The prototype machine is made of 22 MCUs (1 master and 21 slaves; this number is arbitrary and other machines with a different number of nodes can be readily obtained by using the same techniques). In particular, 22 Cortex-M7 MCUs embedded on development boards known as Teensy 4.0 (figure 1a) have been arranged on two printed circuit boards (figure 1b), to obtain the configuration shown in figure 2. The (arbitrary) number of MCUs does not constitute a limitation as it could be easily increased and/or decreased depending on the computational needs (just like one would add and/or remove motherboards from a mainstream cluster); in other words, generalizations are quite straightforward. Therefore, the prototype consists of 2 boards containing 12 and 10 MCUs respectively, where one MCU is utilized as the master while the other MCUs represent the computing nodes (in other words, the slaves).

[0040] The communication bus consists of simple wires with pull-up resistors (when needed to strengthen the signal over the bus). Only one power supply is necessary, which consists of a 110 Volt-12 Volt converter (0.5 Ampere in output) which is, in turn, transformed into 5 Volt.

[0041] The connection between the MCUs happens through the I2C communication protocol, which is a standard communication protocol in digital electronics, and which is embedded in hardware form in the vast majority of MCUs available on the market nowadays (rather than the usual MPI protocol which is used for mainstream clusters). The reason for selecting this type of connection/protocol is twofold: on one hand, it does not require the implementation of any software (as it is directly embedded in the hardware of the MCU); on the other hand, it requires less wiring than other communication protocols (for instance, SPI requires 4 wires between 2 MCUs to work properly, while I2C requires only 3 wires) while still providing fast communication capabilities.

[0042] Finally, the connection between the master node of the prototype and a personal computer (PC) can be easily achieved, if/when necessary (for instance, for programming purposes), by connecting the master MCU of the prototype to the USB port of the PC and then, by using the universal asynchronous receiver-transmitter (UART) and/or universal synchronous and asynchronous receiver-transmitter (USART) protocol.

[0043] A schematic diagram of the interconnections of the computing apparatus 400 is provided in figure 4. The schematic of figure 4 represents N+1 microcontrollers 500 connected in a master-slave network, with the 0-th node being the master unit and the other nodes being the slaves. Communication between the master and the slaves is achieved by utilizing the I2C protocol and, therefore, requires two lines or buses 402, one for the serial data (SDA) signal, the other for the serial clock (SCL) signal. All units are connected to the same ground point (GND) and two pull-up resistors are added (top left of the figure) with a value of 100 Ohm. The master node is connected 404 to the PC 408 through the UART protocol. It receives, in this way, instructions from the user for the computation to be performed.

[0044] Referring to figure 5, there is provided a microcontroller 500.

[0045] The microcontroller 500 comprises processing circuitry 503 and memory 505. The memory 505 can contain instructions executable by the processing circuitry 503 whereby functions and steps described herein may be executed to provide any of the relevant features and benefits disclosed herein.

[0046] The microcontroller 500 may also include non-transitory, persistent, machine-readable storage media 507 having stored therein software and/or instructions 509 executable by the processing circuitry 503 to execute functions and steps described herein. The microcontroller may also include I/O interface(s) and a power source or way to attach to a power source.

[0047] The instructions 509 may include a computer program for configuring the processing circuitry 503. The computer program may be stored in a physical memory local to the microcontroller.

[0048] The computer program may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium external to the microcontroller.

[0049] Referring to figures 4 and 5, there is provided a computing apparatus 400. The computing apparatus comprises a plurality of microcontrollers 500, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus 402, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface 404 operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply 406. The master microcontroller is operative to receive executable files through the I/O interface. The master microcontroller is operative to distribute the executable files to the at least two slave microcontrollers. The master microcontroller is operative to receive a response from the at least two slave microcontrollers. The master microcontroller is operative to transmit an aggregated response through the I/O interface.

[0050] In the context of this specification, aggregated can take different meanings, but is generally used in the sense of transmitting in one transmission. Sometimes, computation results from the microcontrollers will be in the form of arrays, sometimes results will be scalars (i.e., single values), etc. The master microcontroller will regroup all the computation results (aggregate the results) and transmit these results through the I/O interface. The master microcontroller could transmit a unique value, one value per microcontroller, an array of values, multiple arrays of values, or any other suitable data format, as would be apparent to a person skilled in the art. The master microcontroller may or may not apply calculations to the computation results received from the slave microcontrollers before transmission.
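As a purely illustrative sketch (the bus addresses, result format and function name are assumptions made for this example), the aggregation step on the master could be as simple as collecting one value per slave and forwarding the resulting array over the serial (UART) link:

#include <Arduino.h>
#include <Wire.h>

#define FIRST_SLAVE_ADDR 1
#define NUM_SLAVES       21

/* Collect one double from every slave and transmit the aggregated
   response to the connected computer through the I/O interface. */
void aggregate_and_transmit(void)
{
    double results[NUM_SLAVES];

    for (int s = 0; s < NUM_SLAVES; s++) {
        Wire.requestFrom(FIRST_SLAVE_ADDR + s, (int)sizeof(double));
        Wire.readBytes((char *)&results[s], sizeof(double));
    }
    Serial.write((const uint8_t *)results, sizeof(results));
}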

[0051] The bus 402 operatively interconnects the plurality of microcontrollers in parallel. The bus may be composed of hard wires. More specifically, the bus may be composed of two hard wires, one for serial data (SDA) signal and one for serial clock (SCL) signal.

[0052] The master microcontroller may distribute the executable files and receive a response from the at least two slave microcontrollers using the inter-integrated circuit (I2C) communication protocol. It should be noted that I2C is not the only possible communication protocol that could be used. There are other protocols one could use, for instance serial peripheral interface (SPI). In practice, one should select only one protocol, but many protocol alternatives are possible.

[0053] The master microcontroller may receive the executable files and transmit the aggregated response through the I/O interface using the universal asynchronous receiver-transmitter (UART) protocol. UART is not the only possible way to connect the master MCU to a computer. For instance, USART (and many others) could alternatively be used as would be apparent to a person skilled in the art.

[0054] Each of the executable files received through the I/O interface may comprise specific instructions, to be executed by each of the plurality of microcontrollers, for executing a parallel computation. The executable files may be compiled on a computer connected to the master node and may be received through the I/O interface. The master microcontroller may further be operative to send a command to the at least two slave microcontrollers, to set the at least two slave microcontrollers in programming mode. When in programming mode, each of the at least two slave microcontrollers may be operative to write the executable file on the slave microcontroller flash memory, restart itself and wait in idle mode. The master microcontroller may further be operative to send a command to the at least two slave microcontrollers, to run the executable files. The master microcontroller may further be operative to send a command to the at least two slave microcontrollers, to send a response to the master microcontroller.

[0055] There is provided a system comprising the computing apparatus 400 and a computer 408 connected to the master node through the I/O interface.

[0056] There is also provided a non-transitory computer readable media 507 having stored thereon instructions 509 for parallel computing, using a computing apparatus. The computing apparatus comprising a plurality of microcontrollers, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprising a bus, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprising an input/output (I/O) interface operatively interconnected to the master microcontroller. The computing apparatus comprising a power supply. The instructions comprising, the master microcontroller receiving executable files through the I/O interface, distributing the executable files to the at least two slave microcontrollers, receiving a response from the at least two slave microcontrollers and transmitting an aggregated response through the I/O interface. The non-transitory computer readable media may further comprise instructions for executing any of the steps described herein.

[0057] The software framework.

[0058] To obtain general computing capabilities by means of the above architecture, one cannot rely on the typical software infrastructure utilized by mainstream Beowulf clusters (usually a mixture of the Linux OS and the MPI protocol) since they simply cannot run on MCUs. Therefore, a software framework is necessary, which has been developed for the prototype. This software framework makes the proposed architecture easily programmable for the purpose of general parallel computing.

[0059] The standard library "Wire", for Atmel and/or ARM MCUs, was used and a working programmable machine was developed which handles the situation in which a user develops a general parallel computing code to run on the machine. At a high level of description, this framework provides four important components: a way for the user to program the machine from a common PC through the universal serial bus (USB) port; a way for the master MCU to broadcast the above program to the slave MCUs; a way for the master MCU to run a particular computational task on the slave MCUs; and a way for the master MCU to request the results computed by the slave MCUs.

[0060] To provide a simple and effective way for the user to program any parallel computational algorithm on the machine, a template implemented in the C language is provided to the user for the description of three components of the parallel algorithm to run on the machine, which involves 1) the input structure for the slave MCUs (i.e., the input data the slave MCUs need to run the computation), 2) the output structure for the slave MCUs (i.e., the result provided by the slave MCUs at the end of the computation), and 3) the computational routine itself (i.e., the actual computation performed by the slave MCUs, which is slave identity (ID) dependent).

[0061] The template reads as:
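For illustration only, a minimal sketch of what such a template could look like is given below; the reserved words input_structure, output_structure and computation come from the framework described above, while every other identifier is an illustrative assumption:

/* Reserved template components: input_structure, output_structure, computation. */

typedef struct {
    double x;            /* input data the slave MCUs need to run the computation */
} input_structure;

typedef struct {
    double y;            /* result provided by the slave MCUs at the end of the computation */
} output_structure;

/* Computational routine executed on every slave MCU; the slave identity (ID)
   allows the work to differ from one slave to another. */
void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    out->y = in->x;      /* the user-defined parallel task goes here */
}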

[0062] Thus, the words “input structure”, “output structure” and “computation” are now reserved words of the proposed software infrastructure.

[0063] For the sake of clarity, a simple and concrete example is provided below which consists of two inputs (two double variables, called “a”, “b”), one output (one double variable, called “sum”) and a function adding the two inputs on every slave MCU. The input is the same for every slave MCU (i.e., it is independent from the MCU ID), where “a” is equal to 1 and “b” is equal to 2, therefore each slave MCU performs the same computation:
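A hedged sketch of this first example, following the template above (the exact struct layout and function signature are illustrative assumptions), could read:

typedef struct {
    double a;            /* first addend, equal to 1 for every slave MCU */
    double b;            /* second addend, equal to 2 for every slave MCU */
} input_structure;

typedef struct {
    double sum;          /* result of the addition */
} output_structure;

/* Every slave performs the same computation, independently of its ID. */
void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    (void)slave_id;              /* unused: the task is ID-independent */
    out->sum = in->a + in->b;    /* with a = 1 and b = 2, every slave returns 3 */
}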

[0064] For completeness, another example is reported below where the computation now depends on the ID of the slave MCU:
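Again purely as an illustrative sketch (the particular ID dependence shown here is an assumption made for the example), the ID-dependent variant could read:

typedef struct {
    double a;
    double b;
} input_structure;

typedef struct {
    double sum;
} output_structure;

/* The result now depends on the identity of the slave MCU. */
void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    out->sum = (in->a + in->b) * (double)slave_id;   /* each slave returns a different value */
}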

[0065] Once the programmer provides the above input/output structures and the computational algorithm which runs on the slave MCUs, the next step consists in creating the HEX files, i.e., the executable/binary files for the slave MCUs. In other words, an automatic compilation step is required which prepares a HEX file for every MCU. This is performed by using the above template and by embedding it in a bigger software infrastructure which can be properly compiled by a compiler for the MCU type at hand.

[0066] It should be noted that the examples provided herein are simplified for the sake of brevity and, as such, it will be apparent to a person skilled in the art that the above template would not work as it is incomplete, for instance the "main()" function is completely missing, along with other important functions which are going to be described below (in particular, all communication functions are missing as well).

[0067] A template for the bigger software infrastructure discussed above can therefore read like the following code:
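By way of example only, and assuming an Arduino-style environment with the standard Wire (I2C) library mentioned in paragraphs [0041] and [0059] (the command bytes, bus address and all identifiers other than the reserved template words are assumptions, and the actual prototype may use a different communication layer), such an infrastructure could be sketched as follows:

#include <Arduino.h>
#include <Wire.h>

/* --- user template section (reserved words: input_structure, output_structure, computation) --- */
typedef struct { double a; double b; } input_structure;
typedef struct { double sum; } output_structure;

void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    out->sum = in->a + in->b;
}

/* --- framework section: communication with the master MCU --- */
#define SLAVE_ID 3                       /* illustrative: unique per slave, also used as bus address */

static input_structure  in;
static output_structure out;
static volatile int run_requested = 0;

/* Interrupt handler triggered when the master sends a command or input data. */
void on_receive(int n)
{
    uint8_t cmd = Wire.read();
    if (cmd == 'I') {                    /* 'I': the master sends the input structure */
        Wire.readBytes((char *)&in, sizeof(in));
    } else if (cmd == 'R') {             /* 'R': the master asks to run the computation */
        run_requested = 1;
    }
}

/* Interrupt handler triggered when the master requests the results. */
void on_request(void)
{
    Wire.write((const uint8_t *)&out, sizeof(out));
}

void setup(void)
{
    Wire.begin(SLAVE_ID);                /* join the bus as a slave */
    Wire.onReceive(on_receive);
    Wire.onRequest(on_request);
}

void loop(void)
{
    if (run_requested) {                 /* wait idle until the master sends the run command */
        computation(SLAVE_ID, &in, &out);
        run_requested = 0;
    }
}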

[0068] First, a series of relevant headers is added to the template. Then, a series of SPI communication routines are added as well, which allow the slave MCUs to communicate with the master MCU and to interpret the various commands sent by the master MCU (for instance, the master can send messages to program the slaves, to run computations or to collect the computed results). In particular, every time the master MCU sends a message to the slave MCUs, an interrupt is triggered on the slaves and the SPI communication starts.

[0069] Finally, this code is compiled on the connected PC, which creates all the relevant HEX files which need to run on the slave MCUs to achieve the specific parallel computation previously described in the template. When the compiler builds the HEX files, all global variables, static variables, and compiled code are assigned to dedicated locations of the memory. This is called static allocation since all memory addresses are fixed. By default, the compiler tries to use the ultra-fast data tightly coupled memory (DTCM) & instruction tightly coupled memory (ITCM) available on the MCU. When compiling, it is possible to control the allocation by means of keywords and one can choose where the compiler will place the variables and the code within the memory. The prototype described in this document achieves this step by using parts of the code available at https://github.com/PaulStoffregen/teensy_loader_cli at the time of filing of this application, running directly on the slave MCUs (i.e., the slave MCUs can reprogram themselves when required by the master MCU).

[0070] At this point, the HEX files are ready to be written on the flash memory of the slave MCUs. This can be done in several ways, such as by static memory allocation (which is readily available for a plethora of different MCU models). As a first step, a command is sent from the master MCU to the slave MCUs to set them in programming mode. Then, the various HEX files are sent to the slaves, which write them on the flash memory, restart themselves and wait idle for the master MCU to provide the command which runs the computations.
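For clarity, a hedged sketch of the master-side command sequence just described could look like the following (the command bytes, chunk size, bus addresses and function names are assumptions made for illustration):

#include <Arduino.h>
#include <Wire.h>

#define FIRST_SLAVE_ADDR 1
#define NUM_SLAVES       21
#define CHUNK            16              /* illustrative I2C payload size per transfer */

/* Put every slave in programming mode, then stream its HEX image to it. */
void program_slaves(const uint8_t *hex_image[], const size_t hex_size[])
{
    for (int s = 0; s < NUM_SLAVES; s++) {
        int addr = FIRST_SLAVE_ADDR + s;

        Wire.beginTransmission(addr);
        Wire.write('P');                 /* 'P': enter programming mode */
        Wire.endTransmission();

        /* Send the HEX file in small chunks; the slave writes it to flash,
           restarts itself and then waits in idle mode. */
        for (size_t off = 0; off < hex_size[s]; off += CHUNK) {
            size_t n = hex_size[s] - off;
            if (n > CHUNK) n = CHUNK;
            Wire.beginTransmission(addr);
            Wire.write('D');             /* 'D': a data chunk follows */
            Wire.write(hex_image[s] + off, n);
            Wire.endTransmission();
        }
    }
}

/* Later, a single command per slave asks it to run the computation. */
void run_slaves(void)
{
    for (int s = 0; s < NUM_SLAVES; s++) {
        Wire.beginTransmission(FIRST_SLAVE_ADDR + s);
        Wire.write('R');                 /* 'R': run the executable */
        Wire.endTransmission();
    }
}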

[0071] To conclude this part, for the sake of clarity, a sequence diagram is presented in figure 6 that shows how the process described above works on the prototype. The process starts with a user writing a C routine 614 on a computer, step 602, using the templates presented previously. The user decides at this time how to use the different slave nodes for achieving parallel computing and can provide different code for the different nodes, with associated ID. Alternatively, in a more advanced embodiment, this could possibly be done automatically, with the use of libraries, for example. Then, the code is compiled on the PC and HEX files are produced, step 604. The HEX files are sent to the master node MCU through a port, such as a USB port (UART/USART), step 606. The HEX files are then broadcast to the slave MCUs through the SPI bus, step 608. The slave MCUs write the HEX files on their flash memory and run them, step 610. The process ends when the slave MCUs send back the result to the master MCU which, in turn, sends it to the PC, step 612.

[0072] A person skilled in the art would understand that a commercial application could differ from the prototype described above and that particular commercial applications are intended to be covered through the general concepts described herein.

[0073] Referring to figures 4 and 7, there is provided a method 700 for parallel computing, using a computing apparatus. The computing apparatus comprises a plurality of microcontrollers 500, including a master microcontroller and at least two slave microcontrollers. The computing apparatus comprises a bus 402, operatively interconnecting the plurality of microcontrollers. The computing apparatus comprises an input/output (I/O) interface 404 operatively interconnected to the master microcontroller. The computing apparatus comprises a power supply 406. The method comprises, the master microcontroller receiving, step 702, executable files through the I/O interface. The master microcontroller distributing, step 704, the executable files to the at least two slave microcontrollers. The master microcontroller receiving, step 706, a response from the at least two slave microcontrollers. The master microcontroller transmitting, step 708, an aggregated response through the I/O interface.

[0074] The master microcontroller may further receive the executable files through the I/O interface, where the executable files are compiled on a computer 408 connected to the master node through the I/O interface 404. The master microcontroller may further send a command, to the at least two slave microcontrollers, to set the at least two slave microcontrollers in programming mode. When in programming mode, each of the at least two slave microcontrollers may write the executable file on the slave microcontroller flash memory, restart itself and wait in idle mode. The master microcontroller may further send a command to the at least two slave microcontrollers, to run the executable files. The master microcontroller may further send a command to the at least two slave microcontrollers, to send a response to the master microcontroller.

[0075] Examples.

[0076] Three simple examples are presented below which involve: 1) the computation of an integral by means of a brute force Monte Carlo technique, 2) the computation of the (irrational) number π and 3) the training of an artificial neural network by means of the stochastic gradient descent (SGD) method. Part of the code is shown as well for the sake of clarity.

[0077] Monte Carlo computation of integrals.

[0078] In this section, the problem of integrating a given function f=f(x) over a given domain Ω by means of the brute force Monte Carlo (MC) method is presented, i.e., the following quantity is computed:

I = ∫Ω f(x) dx

[0079] Although better Monte Carlo methods exist for this problem, the brute force approach is simple enough to provide a comprehensible starting point on how programmers can utilize the machine described herein for parallel computing. The brute force MC method consists in uniformly sampling the domain Ω, by means of N points x1, x2, ..., xN, and then averaging by means of the following formula (in other words, it is an approximation of the exact value):

I ≈ (V(Ω)/N) · [f(x1) + f(x2) + ... + f(xN)]

[0080] where V(Ω) = ∫Ω dx denotes the volume of the integration domain Ω.

[0081] Such algorithm is parallelizable and a possible implementation of the corresponding code for the machine described herein could read as follows:
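A hedged sketch of how the template could be filled in for this problem is given below (the one-dimensional domain, the integrand and the per-slave sample count are illustrative assumptions); each slave draws its own samples and returns a partial mean, and the master averages the partial means and multiplies by the volume of the domain:

#include <stdlib.h>

typedef struct {
    double x_min;                /* lower bound of the (one-dimensional) domain */
    double x_max;                /* upper bound of the domain */
    long   n_samples;            /* number of random samples drawn by each slave */
} input_structure;

typedef struct {
    double partial_mean;         /* mean of f over this slave's samples */
} output_structure;

static double f(double x) { return x * x; }      /* illustrative integrand */

void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    double sum = 0.0;
    srand(slave_id + 1);                          /* different random stream per slave */
    for (long i = 0; i < in->n_samples; i++) {
        double u = (double)rand() / (double)RAND_MAX;
        double x = in->x_min + u * (in->x_max - in->x_min);
        sum += f(x);
    }
    out->partial_mean = sum / (double)in->n_samples;
    /* The master averages the partial means and multiplies by (x_max - x_min)
       to obtain the Monte Carlo estimate of the integral. */
}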

[0082] Estimation of the π number.

[0083] It is possible to compute the irrational number π by means of the following MC method: randomly generate the coordinates of N points (xi, yi) in the domain [-1, 1] x [-1, 1], then count the number of points which lie inside the circle centered at (0, 0) and with radius 1, say M. An estimation of the number π is then readily obtained as 4·M/N. The code to implement such a relatively simple algorithm is provided below:
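A possible sketch of the corresponding slave routine follows (the field names and the per-slave sample count are illustrative assumptions); the master sums the per-slave counts to obtain M and then computes 4·M/N:

#include <stdlib.h>

typedef struct {
    long n_samples;              /* number of random points generated by each slave */
} input_structure;

typedef struct {
    long inside;                 /* number of points falling inside the unit circle */
} output_structure;

void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    long count = 0;
    srand(slave_id + 1);                          /* independent random stream per slave */
    for (long i = 0; i < in->n_samples; i++) {
        double x = 2.0 * ((double)rand() / (double)RAND_MAX) - 1.0;   /* in [-1, 1] */
        double y = 2.0 * ((double)rand() / (double)RAND_MAX) - 1.0;
        if (x * x + y * y <= 1.0)
            count++;
    }
    out->inside = count;
    /* With M the total count over all slaves and N the total number of points,
       the estimate of pi is 4 * M / N. */
}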

[0084] The training of artificial neural networks.

[0085] In machine learning, one important and demanding computational phase is represented by the training of a model, especially for artificial neural networks. In practice, this step consists in updating the weights (and biases) of the network until an objective function is minimized. To achieve such a task, various methods exist, among which the most well-known is the stochastic gradient descent. Such a method can be explained by the following formula:

wi ← wi − (λ/n) · Σj=1..n ∂Lj/∂wi

[0086] where wi is the i-th weight of the network, λ is the learning rate, n is the total number of items in the batch, and the summation is the gradient of the loss function Lj evaluated over the items of the batch.

[0087] It becomes rapidly clear how such training algorithm can be parallelized. In fact, every weight can be updated independently, therefore a certain number of weights can be scattered around the slave MCUs which perform the above computation in parallel. Part of the code for such algorithm is provided below.
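Part of such code could be sketched as follows (the batch layout, array sizes and the way the per-item gradients are provided are illustrative assumptions); each slave applies the SGD update to the subset of weights assigned to it:

#define MAX_WEIGHTS 32
#define BATCH_SIZE  16

typedef struct {
    int    n_weights;                            /* number of weights assigned to this slave */
    double w[MAX_WEIGHTS];                       /* current values of those weights */
    double grad[MAX_WEIGHTS][BATCH_SIZE];        /* per-item gradient of the loss w.r.t. each weight */
    double lambda;                               /* learning rate */
} input_structure;

typedef struct {
    double w[MAX_WEIGHTS];                       /* updated weights returned to the master */
} output_structure;

void computation(unsigned int slave_id, const input_structure *in, output_structure *out)
{
    (void)slave_id;                              /* the weight subset is fixed by the master */
    for (int i = 0; i < in->n_weights; i++) {
        double g = 0.0;
        for (int j = 0; j < BATCH_SIZE; j++)     /* sum of the gradients over the batch */
            g += in->grad[i][j];
        /* Stochastic gradient descent update: wi <- wi - (lambda / n) * sum_j dLj/dwi */
        out->w[i] = in->w[i] - (in->lambda / (double)BATCH_SIZE) * g;
    }
}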

[0088] Validation on a quantum chemistry problem.

[0089] A final (computationally complex) validation test is presented below which has been performed on the prototype described above (based on 22 MCUs).

[0090] A validation test has been performed on the prototype to assess its actual computational capabilities. This test was chosen from the field of quantum chemistry since this field is very well known to provide computationally demanding/complex problems, and consists in the computation of relevant quantities related to molecules (usually more demanding than the usual AI algorithms). The numerical approach implemented for this computation is the Hartree-Fock method, which is considered the archetypal model for quantum chemistry. It is applied to the H2 and HeH+ molecules because they both contain the hydrogen atom which, essentially, means that a high degree of numerical difficulty is introduced by their Coulomb potential (which has a singularity at the position of the nucleus of the atom). For comparison purposes, the same computations are run on a normal PC with an Intel i7 CPU running at 3.2 GHz. The results obtained in these two tests are quite self-explanatory:

The Intel CPU computation time was 1.066 seconds for the H2 molecule, and 1.338 seconds for the HeH+ molecule.

The prototype computation time was 0.212 seconds for the H2 molecule, and 0.223 seconds for the HeH+ molecule.

[0091] Use cases in 5G.

[0092] Finally, a list of possible use cases for this parallel machine, in the context of 5G networks, is reported below.

[0093] In the more specific case of 5G networks (and beyond), the following cases are easily foreseeable, where the available computational resources on the market are either too simple (e.g., advanced virtual reduced instruction set computer (RISC) (AVR) and/or advanced RISC machine (ARM) microcontrollers) or too big and/or power demanding (e.g., systems on chip such as Raspberry Pi or Nvidia Jetson or PC motherboards in pizza boxes form factor):

[0094] Cloud computing and data centers.

[0095] Cloud computing services are going to be more important in the context of 5G for multiple reasons. This will require, eventually, the use of data centers, which are extremely expensive (even the biggest players can only afford them in small numbers). The machine described in this document could represent an affordable way to continue to use and improve data centers while keeping their cost and environmental impact acceptable.

[0096] Deployment of small 5G antennas.

[0097] Small cells, which are essentially low-powered cellular radio access nodes with a range of 10 meters to a few kilometers, are going to be deployed massively in the near future. In fact, small cells are currently being viewed as an important method of increasing cellular network capacity, quality, and resilience. Without small sized, low powered computational devices, it is difficult to imagine how to deploy 5G networks aiming to use AI as a core technology. The machine presented in these pages could represent a doable and effective solution in such a context.

[0098] Internet of Things.

[0099] The IoT describes the network of physical objects that are embedded with sensors, software, and other technologies that are used for the purpose of connecting and exchanging data with other devices and systems over the Internet. A challenge for producers of IoT applications is to clean, process and interpret the vast amount of data which is gathered by the sensors. Therefore, the IoT will require relevant computational power in small packages. The machine described herein can be a valid solution for such problems since it provides a way to develop small, but still relevant, computational capabilities.

[00100] Distributed AI.

[00101] The objectives of distributed artificial intelligence are to solve the reasoning, planning, learning and perception problems of artificial intelligence, especially if they require large data, by distributing the problem to autonomous processing nodes. It requires powerful centralized and decentralized computational devices with various computational capabilities. A simple solution can be provided by the machine depicted in this document, since it can be adapted to both the center and the edges of computing.

[00102] Autonomous vehicles.

[00103] A self-driving vehicle, also known as an autonomous vehicle, is a vehicle capable of sensing its environment and moving safely with little or no human input. Self-driving and semi-autonomous vehicles require completely autonomous capabilities when connection is lost. This will require relevant, low-powered computational resources in small packages. Once again, this problem could be easily solved by the machine presented herein.

[00104] Modifications will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that modifications, such as specific forms other than those described above, are intended to be included within the scope of this disclosure. The previous description is merely illustrative and should not be considered restrictive in any way. The scope sought is given by the appended claims, rather than the preceding description, and all variations and equivalents that fall within the range of the claims are intended to be embraced therein. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.