

Title:
BI-DIRECTIONAL LEARNING FOR PERFORMANCE PREDICTION
Document Type and Number:
WIPO Patent Application WO/2023/247019
Kind Code:
A1
Abstract:
Computing equipment obtains training data that has circular and bidirectional temporal dependencies. The computing equipment trains a memory augmented neural network, over multiple epochs, with the training data. The computing equipment does so by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

Inventors:
TAGHIA JALIL (SE)
PERSSON ISAK (SE)
LAN XIAOYU (SE)
EBRAHIMI MASOUMEH (SE)
Application Number:
PCT/EP2022/066921
Publication Date:
December 28, 2023
Filing Date:
June 21, 2022
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G06N3/0985; G06N3/045; G06N3/063; G06N3/084; G06N3/096
Foreign References:
US20200336398A1, 2020-10-22
Other References:
HE, Yong, et al.: "Attention and Memory-Augmented Networks for Dual-View Sequential Learning", Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ACM, New York, NY, USA, 23 August 2020, pages 125-134, XP058663998, ISBN: 978-1-4503-8037-9, DOI: 10.1145/3394486.3403055
HUISMAN, Mike, et al.: "A Survey of Deep Meta-Learning", arXiv.org, Cornell University Library, Ithaca, NY, 21 April 2021, XP081925868, DOI: 10.1007/S10462-021-10004-4
Attorney, Agent or Firm:
ERICSSON (SE)
Claims:
CLAIMS

1. A method comprising: obtaining training data that has circular and bidirectional temporal dependencies; and training a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

2. The method of embodiment 1, wherein the memory augmented neural network comprises a controller and external memory, wherein experience parameters characterize contents of the external memory resulting from training the memory augmented neural network, and wherein training the memory augmented neural network comprises re-using the experience parameters across epochs.

3. The method of embodiment 2, wherein re-using the experience parameters across epochs comprises using the experience parameters resulting from training the memory augmented neural network in a previous epoch to train the memory augmented neural network in a current epoch.

4. The method of any of embodiments 1-3, wherein the memory augmented neural network comprises external memory, wherein training the memory augmented neural network with the training data in natural temporal order comprises, in a given epoch: training the memory augmented neural network with a first sample of the training data that occurs first in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the reverse temporal order; and training the memory augmented neural network with a last sample of the training data that occurs last in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the natural temporal order.

5. The method of any of embodiments 1-4, wherein the memory augmented neural network comprises external memory, wherein training the memory augmented neural network with the training data in reverse temporal order comprises, in a given epoch: training the memory augmented neural network with a first sample of the training data that occurs first in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the natural temporal order; and training the memory augmented neural network with a last sample of the training data that occurs last in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the reverse temporal order.

6. The method of any of embodiments 4-5, wherein the experience parameters characterize contents of the external memory.

7. The method of any of embodiments 2-6, wherein the experience parameters include: a memory matrix characterizing the external memory; and a temporal link matrix characterizing the external memory.

8. The method of embodiment 7, wherein the experience parameters further include one or more of: a precedence vector; read weight vectors; write weight vectors; or a usage vector.

9. The method of any of embodiments 1-8, wherein training the memory augmented neural network over multiple epochs comprises training the memory augmented neural network until a convergence criterion is reached.

10. The method of embodiment 9, wherein the convergence criterion is either: training of the memory augmented neural network for a maximum number of epochs; or a loss metric changing by less than a threshold between epochs.

11. The method of any of embodiments 1-10, wherein the memory augmented neural network is a Differentiable Neural Computer.

12. The method of any of embodiments 1-11, wherein the training data represents, as a function of time: round trip time for communication in a communication network; power consumption given performance management counters in a communication network; or a metric characterizing a service level from a data center.

13. The method of any of embodiments 1-12, further comprising forming an inference or prediction from input data by inputting the input data into the trained memory augmented neural network in natural temporal order.

14. The method of embodiment 13, wherein the input data is a time series with no future information accessible.

15. A method comprising: obtaining a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data; and forming an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order.

16. The method of embodiment 15, wherein the trained memory augmented neural network is trained, over multiple epochs, with training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

17. The method of any of embodiments 15-16, wherein the memory augmented neural network is a Differentiable Neural Computer.

18. The method of any of embodiments 15-17, wherein the input data represents, as a function of time: round trip time for communication in a communication network; power consumption given performance management counters in a communication network; or a metric characterizing a service level from a data center.

19. The method of any of embodiments 15-18, wherein the input data is a time series with no future information accessible.

20. Computing equipment configured to: obtain training data that has circular and bidirectional temporal dependencies; and train a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

21. The computing equipment of claim 20, configured to perform the method of any of embodiments 2-14.

22. Computing equipment configured to: obtain a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data; and form an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order.

23. The computing equipment of embodiment 22, configured to perform the method of any of embodiments 16-19.

24. A computer program comprising instructions which, when executed by at least one processor of computing equipment, cause the computing equipment to perform the method of any of embodiments 1-19.

25. A carrier containing the computer program of embodiment 24, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

26. Computing equipment comprising processing circuitry configured to: obtain training data that has circular and bidirectional temporal dependencies; and train a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

27. The computing equipment of claim 26, the processing circuitry configured to perform the method of any of embodiments 2-14.

28. Computing equipment comprising processing circuitry configured to: obtain a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data; and form an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order.

29. The computing equipment of embodiment 28, the processing circuitry configured to perform the method of any of embodiments 16-19.

30. A method comprising: obtaining training data that has circular and bidirectional temporal dependencies; and training a memory augmented neural network, over multiple epochs, with the training data, wherein the memory augmented neural network comprises external memory, wherein experience parameters characterize contents of the external memory resulting from training the memory augmented neural network, and wherein training the memory augmented neural network comprises re-using the experience parameters across epochs.

31. The method of embodiment 30, wherein re-using the experience parameters across epochs comprises using the experience parameters resulting from training the memory augmented neural network in a previous epoch to train the memory augmented neural network in a current epoch.

32. The method of any of embodiments 30-31, wherein the experience parameters characterize contents of the external memory.

33. The method of any of embodiments 30-32, wherein the experience parameters include: a memory matrix characterizing the external memory; and a temporal link matrix characterizing the external memory.

34. The method of embodiment 33, wherein the experience parameters further include one or more of: a precedence vector; read weight vectors; write weight vectors; or a usage vector.

35. The method of any of embodiments 30-34, wherein training the memory augmented neural network over multiple epochs comprises training the memory augmented neural network until a convergence criterion is reached.

36. The method of embodiment 35, wherein the convergence criterion is either: training of the memory augmented neural network for a maximum number of epochs; or a loss metric changing by less than a threshold between epochs.

37. The method of any of embodiments 30-36, wherein the memory augmented neural network is a Differentiable Neural Computer.

38. The method of any of embodiments 30-37, wherein the training data represents, as a function of time: round trip time for communication in a communication network; power consumption given performance management counters in a communication network; or a metric characterizing a service level from a data center.

39. The method of any of embodiments 30-38, further comprising forming an inference or prediction from input data.

40. The method of embodiment 39, wherein the input data is a time series with no future information accessible.

41. Computing equipment configured to: obtain training data that has circular and bidirectional temporal dependencies; and train a memory augmented neural network, over multiple epochs, with the training data, wherein the memory augmented neural network comprises external memory, wherein experience parameters characterize contents of the external memory resulting from training the memory augmented neural network, and wherein training the memory augmented neural network comprises re-using the experience parameters across epochs.

42. The computing equipment of claim 41, configured to perform the method of any of embodiments 31-40.

43. A computer program comprising instructions which, when executed by at least one processor of computing equipment, cause the computing equipment to perform the method of any of embodiments 30-40.

44. A carrier containing the computer program of embodiment 43, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

45. Computing equipment comprising processing circuitry configured to: obtain training data that has circular and bidirectional temporal dependencies; and train a memory augmented neural network, over multiple epochs, with the training data, wherein the memory augmented neural network comprises external memory, wherein experience parameters characterize contents of the external memory resulting from training the memory augmented neural network, and wherein training the memory augmented neural network comprises re-using the experience parameters across epochs.

46. The computing equipment of claim 45, the processing circuitry configured to perform the method of any of embodiments 31-40.

Description:
BI-DIRECTIONAL LEARNING FOR PERFORMANCE PREDICTION

TECHNICAL FIELD

The present application relates generally to neural network training, and relates more particularly to training a neural network to model temporal data.

BACKGROUND

Neural networks are a set of algorithms, inspired by the human brain, that attempt and often excel at finding relations between huge amounts of data. Typically, in a supervised setting, the algorithms' goal is to map a data input X to an output target Y. They have proven extremely successful in tasks where the mapping is complex, such as in natural language processing, computer vision, audio, and time series. Some types of neural networks prove effective at modeling dependencies in a time series, including long-term dependencies. These types include, for example, Long Short-Term Memory (LSTM) neural networks and memory augmented neural networks (MANNs), e.g., the Differentiable Neural Computer (DNC). However, the LSTM neural network has problems with vanishing or exploding gradients. And although the external memory of memory augmented neural networks enables them to better model long-term dependencies, memory augmented neural networks have heretofore only been capable of modeling unidirectional dependencies, namely forward dependencies. This strict forward dependency affects model generalization at inference when the nature of the dependencies is bidirectional and/or circular. Memory augmented neural networks thereby prove ineffective at modeling temporal data in various contexts where that temporal data has bidirectional and/or circular dependencies. This may be the case, for example, in a communication network context, e.g., where the data contains various seasonalities.

Franke et al. have enhanced the DNC by creating a bidirectional architecture, enabling it to make an inference based on information in the future. See Franke et al., "Robust and Scalable Differentiable Neural Computer for Question Answering," 2018. However, the bidirectionality in Franke et al. refers to the architecture of the model and therefore also to how the model does inference. Crucially, the application of the model in Franke et al. is question answering tasks in which "future" information is accessible. In this scenario, future means that the context of a question can come after the question is asked. This is why the proposed bidirectional architecture can be useful. However, the bidirectional architecture in Franke et al. is not applicable for modeling time series where there is no future information accessible, e.g., for forecasting round trip time in a communication network. In other words, the bidirectional architecture in Franke et al. is only advantageous when future information is accessible, which may not be the case in some contexts, such as for modeling time series in a communication network context.

SUMMARY

Some embodiments herein train a memory augmented neural network to model circular and/or bidirectional temporal dependencies in data. Some embodiments do so, for example, by alternating (e.g., from training epoch to training epoch) between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order. Alternatively or additionally, some embodiments herein do so by re-using external memory experience parameters across training epochs.
Regardless, training the memory augmented neural network in this way proves advantageous for modeling time series even when there is no future information accessible. The trained memory augmented neural network may thereby be advantageous even for modeling communication network data, such as for forecasting round trip time.

More particularly, embodiments herein include a method. The method comprises obtaining training data that has circular and bidirectional temporal dependencies. The method also comprises training a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order.

In some embodiments, the memory augmented neural network comprises a controller and external memory. In some embodiments, experience parameters characterize contents of the external memory resulting from training the memory augmented neural network. In some embodiments, training the memory augmented neural network comprises re-using the experience parameters across epochs. In one or more of these embodiments, re-using the experience parameters across epochs comprises using the experience parameters resulting from training the memory augmented neural network in a previous epoch to train the memory augmented neural network in a current epoch.

In some embodiments, the memory augmented neural network comprises external memory, and training the memory augmented neural network with the training data in natural temporal order comprises, in a given epoch, training the memory augmented neural network with a first sample of the training data that occurs first in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the reverse temporal order. Training the memory augmented neural network with the training data in natural temporal order also comprises, in a given epoch, training the memory augmented neural network with a last sample of the training data that occurs last in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the natural temporal order.

In some embodiments, the memory augmented neural network comprises external memory, and training the memory augmented neural network with the training data in reverse temporal order comprises, in a given epoch, training the memory augmented neural network with a first sample of the training data that occurs first in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the natural temporal order. Training the memory augmented neural network with the training data in reverse temporal order also comprises, in a given epoch, training the memory augmented neural network with a last sample of the training data that occurs last in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the reverse temporal order.
In one or more of these embodiments, the experience parameters characterize contents of the external memory. In some embodiments, the experience parameters include a memory matrix characterizing the external memory. The experience parameters also include a temporal link matrix characterizing the external memory. In one or more of these embodiments, the experience parameters further include at least a precedence vector. In some embodiments, the experience parameters further include at least read weight vectors. In some embodiments, the experience parameters further include at least write weight vectors. In some embodiments, the experience parameters further include at least a usage vector. In some embodiments, training the memory augmented neural network over multiple epochs comprises training the memory augmented neural network until a convergence criterion is reached. In one or more of these embodiments, the convergence criterion is training of the memory augmented neural network for a maximum number of epochs. Alternatively, the convergence criterion is a loss metric changing by less than a threshold between epochs. In some embodiments, the memory augmented neural network is a Differentiable Neural Computer. In some embodiments, the training data represents, as a function of time, round trip time for communication in a communication network. Alternatively, the training data represents, as a function of time, power consumption given performance management counters in a communication network. Alternatively, the training data represents, as a function of time, a metric characterizing a service level from a data center. In some embodiments, the method further comprises forming an inference or prediction from input data by inputting the input data into the trained memory augmented neural network in natural temporal order. In one or more of these embodiments, the input data is a time series with no future information accessible. Other embodiments herein include a method comprising obtaining a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data. The method also comprises forming an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order. In some embodiments, the trained memory augmented neural network is trained, over multiple epochs, with training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order. In some embodiments, the input data represents, as a function of time, round trip time for communication in a communication network. Alternatively, the input data represents, as a function of time, power consumption given performance management counters in a communication network. Alternatively, the input data represents, as a function of time, a metric characterizing a service level from a data center. In some embodiments, the input data is a time series with no future information accessible. Other embodiments herein include computing equipment. 
The computing equipment is configured to obtain training data that has circular and bidirectional temporal dependencies. The computing equipment is also configured to train a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order. In some embodiments, the computing equipment is configured to perform the steps described above.

Other embodiments herein include computing equipment. The computing equipment is configured to obtain a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data. The computing equipment is also configured to form an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order. In some embodiments, the computing equipment is configured to perform the steps described above.

Other embodiments herein include a computer program comprising instructions which, when executed by at least one processor of computing equipment, cause the computing equipment to perform the steps described above. In some embodiments, a carrier containing the computer program is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

Other embodiments herein include computing equipment comprising processing circuitry. The processing circuitry is configured to obtain training data that has circular and bidirectional temporal dependencies. The processing circuitry is also configured to train a memory augmented neural network, over multiple epochs, with the training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order. In some embodiments, the processing circuitry is configured to perform the steps described above.

Other embodiments herein include computing equipment comprising processing circuitry. The processing circuitry is configured to obtain a trained memory augmented neural network that is trained to model circular and bidirectional temporal dependencies in data. The processing circuitry is also configured to form an inference from input data by inputting the input data into the trained memory augmented neural network in natural temporal order. In some embodiments, the processing circuitry is configured to perform the steps described above.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of computing equipment according to some embodiments.
Figure 2 is a block diagram of computing equipment according to other embodiments.
Figure 3 is a block diagram of computing equipment according to still other embodiments.
Figure 4 is a block diagram of computing equipment according to yet other embodiments.
Figure 5 is a block diagram of a DNC according to some embodiments.
Figure 6 is a logic flow diagram of a learning phase of neural network training according to some embodiments.
Figure 7 is a block diagram of training of a bi-DNC in a single epoch according to some embodiments.
Figure 8 is a block diagram of inference in a bi-DNC according to some embodiments.
Figure 9 is a graph of the service performance metrics of read and write latency according to an example.
Figure 10 is a graph of results of some embodiments herein in an example.
Figure 11 is a logic flow diagram of a method performed by computing equipment according to some embodiments.
Figure 12 is a logic flow diagram of a method performed by computing equipment according to other embodiments.
Figure 13 is a block diagram of computing equipment according to some embodiments.
Figure 14 is a block diagram of a communication system in accordance with some embodiments.
Figure 15 is a block diagram of a user equipment according to some embodiments.
Figure 16 is a block diagram of a network node according to some embodiments.
Figure 17 is a block diagram of a host according to some embodiments.
Figure 18 is a block diagram of a virtualization environment according to some embodiments.

DETAILED DESCRIPTION

Figure 1 shows computing equipment 10 configured to train a memory augmented neural network 12, in order to produce a trained memory augmented neural network 12T. The memory augmented neural network 12 can store and read information from an external memory 12M. In some embodiments, the external memory 12M is external in the sense that the behavior of the neural network is independent of the size of the external memory 12M, as long as the external memory 12M is not filled to capacity.

The computing equipment 10 according to embodiments herein is configured to train the memory augmented neural network 12 with training data 14. In one embodiment, the training data 14 has circular and/or bidirectional temporal dependencies. In this case, the trained memory augmented neural network 12T may model these circular and/or bidirectional temporal dependencies, e.g., in order to account for not only forward dependencies (to the future) but also backward dependencies (to the past) in data.

The training data 14 may have bidirectional temporal dependencies in the sense that it has dependency to the future as well as dependency to the past. As an example, in language modelling, the next word may affect the meaning of the current word. This contrasts with unidirectional temporal dependencies, which may be unidirectional for example due to lack of dependency to the future and/or lack of seasonalities in the data; e.g., the stock price tomorrow does not affect the stock price of today. The training data 14 may alternatively or additionally have circular temporal dependencies, e.g., due to the presence of seasonalities in the training data 14. As an example, when forecasting the weather, the weather tomorrow does not affect the weather today, but seasonalities such as the month of the year can largely influence the forecasts.

In some embodiments, then, the training data 14 may be data that is from or that is associated with a communication network. The training data 14 may for example include performance metrics (PM) data collected from one or more base stations, mobile network data of user activities, and/or data collected from a data center. Such data contains various seasonalities such as daily, weekly, monthly, and yearly dependencies. As an example, in mobile network data of user activities, there will be fewer activities monitored by a base station located at a city center during nighttime than during rush hours. As another example, the training data 14 may represent round-trip-time (RTT) in a communication network.
As other examples, the training data may represent, as a function of time, round trip time for communication in a communication network, power consumption given performance management counters in a communication network, or a metric characterizing a service level from a data center.

Regardless, the computing equipment 10 in Figure 1 is configured to train the memory augmented neural network 12 with the training data 14 over multiple epochs, e.g., multiple training intervals or iterations. Each epoch in this regard may correspond to training of the memory augmented neural network 12 with a certain number of samples from the training data 14. Each epoch may thereby produce an intermediately trained neural network, with subsequent epochs refining that intermediate training through additional training using further samples from the training data 14. Training may continue in this fashion until a convergence criterion is reached, such as a maximum fixed number of epochs being reached or a loss metric changing by less than a threshold between epochs.

In order for the trained memory augmented neural network 12T to model circular and/or bidirectional temporal dependencies in the training data 14, the computing equipment 10 notably exploits the training data 14 in both its natural temporal order and its reverse temporal order. When the training data 14 is in its natural temporal order, the training data 14 is ordered according to the temporal order in which the training data 14 naturally occurred, e.g., samples of the training data 14 are arranged according to the times at which the samples were collected. By contrast, when the training data 14 is in its reverse temporal order, the training data 14 is ordered opposite the temporal order in which the training data 14 naturally occurred, e.g., samples of the training data 14 are arranged in the reverse of the times at which the samples were collected. As shown in Figure 1, then, the computing equipment 10 may include an order reverser 16 configured to reverse the order of the training data 14, so as to exploit both training data 14N in its natural temporal order and training data 14R in its reverse temporal order.

Equipped with both training data 14N in its natural temporal order and training data 14R in its reverse temporal order, the computing equipment 10 in some embodiments alternates from epoch to epoch between training the memory augmented neural network 12 with the training data 14N in natural temporal order and training the memory augmented neural network 12 with the training data 14R in reverse temporal order. As shown in Figure 1, for example, the computing equipment 10 includes a training controller 16 that switches the training data that a per-epoch trainer 18 uses for training the memory augmented neural network 12. The training controller 16 switches the training data that the trainer 18 uses on an epoch-by-epoch basis, as controlled by an epoch controller 16E. After training the memory augmented neural network 12 for one epoch with the training data 14N in natural temporal order (referred to as direct training), the training controller 16 switches the training data to be used by the per-epoch trainer 18 in the next epoch to be the training data 14R in reverse temporal order (referred to as reverse training).
After training the memory augmented neural network 12 for that epoch with the training data 14R in reverse temporal order, the training controller 16 switches the training data to be used by the per-epoch trainer 18 in the next epoch to be the training data 14N in natural temporal order. And so on.

Conceptually, in direct training that uses the training data 14 in natural temporal order, the past temporal dynamics are explored, meaning that the data sample at the current time depends on the data samples in the past. In reverse training that uses the training data in reverse temporal order, the future temporal dynamics are explored, meaning that the data sample at the current time depends on the data samples in the future. To achieve the former, the data samples are fed in their natural order. To achieve the latter, the data samples are fed in reverse order. In the direct phase of training, the external memory 12M at the current time is influenced by the past. In the reverse phase of training, the external memory 12M at the current time is influenced by the future. This allows the memory augmented neural network 12 to capture circular and/or bidirectional temporal dependencies.

Figure 2 shows the computing equipment 10 according to other embodiments herein, which may be implemented separately from or in combination with the embodiments in Figure 1. The computing equipment 10 may for example operate as described above in Figure 1, except as modified and/or supplemented with respect to Figure 2. As shown in Figure 2, the computing equipment 10 likewise includes a per-epoch trainer 18 that trains the memory augmented neural network 12 with the training data 14 over multiple epochs. The per-epoch trainer 18 in Figure 2, though, notably re-uses so-called experience parameter(s) 12E across epochs of training. In some embodiments, for example, re-using the experience parameter(s) 12E across epochs may comprise using the experience parameter(s) 12E resulting from training the memory augmented neural network 12 in a previous epoch to train the memory augmented neural network 12 in a current epoch.

The experience parameter(s) 12E as shown characterize contents of the external memory 12M resulting from training the memory augmented neural network 12. In some embodiments, for example, the experience parameter(s) 12E include a memory matrix characterizing the external memory 12M and/or a temporal link matrix characterizing the external memory 12M. In one or more such embodiments, the experience parameter(s) 12E may also include a precedence vector, read weight vectors, write weight vectors, and/or a usage vector. Regardless, re-using the experience parameter(s) 12E across epochs is notable, as the experience parameter(s) 12E are heretofore reset, e.g., to zeroes, between epochs. Re-using the experience parameter(s) 12E across epochs for training mirrors the way that an inference is made from data and/or contributes meaningful input to training from previous epochs. Note that the experience parameter(s) 12E re-used across epochs may constitute just a subset of the experience parameter(s) 12E that characterize the external memory 12M.

Although illustrated separately in Figures 1 and 2, the embodiments in Figures 1 and 2 may be implemented in combination by the computing equipment 10. In this case, training the memory augmented neural network 12 with the training data 14N in natural temporal order may comprise the following in a given epoch.
The computing equipment 10 in the given epoch trains the memory augmented neural network 12 with a first sample of the training data 14N that occurs first in the natural temporal order, based on experience parameter(s) 12E resulting from training the memory augmented neural network 12 in a previous epoch with a last sample of the training data 14R that occurs last in the reverse temporal order. And the computing equipment 10 trains the memory augmented neural network 12 with a last sample of the training data 14N that occurs last in the natural temporal order, based on experience parameter(s) 12E resulting from training the memory augmented neural network 12 in the given epoch with a second-to-last sample of the training data 14N that occurs second-to-last in the natural temporal order. And so on for other samples of the training data 14N.

Similarly, training the memory augmented neural network 12 with the training data 14R in reverse temporal order may comprise the following in a given epoch. The computing equipment 10 in the given epoch trains the memory augmented neural network 12 with a first sample of the training data 14R that occurs first in the reverse temporal order, based on experience parameter(s) 12E resulting from training the memory augmented neural network 12 in a previous epoch with a last sample of the training data 14N that occurs last in the natural temporal order. And the computing equipment 10 trains the memory augmented neural network 12 with a last sample of the training data 14R that occurs last in the reverse temporal order, based on experience parameter(s) 12E resulting from training the memory augmented neural network 12 in the given epoch with a second-to-last sample of the training data 14R that occurs second-to-last in the reverse temporal order. And so on for other samples of the training data 14R.

In some embodiments, the computing equipment 10 herein trains the memory augmented neural network 12 and stores or sends the resulting trained memory augmented neural network 12T to other equipment for use in making an inference from input data. In other embodiments, though, the computing equipment 10 itself makes this inference from input data. Figure 3 shows one such example. As shown in Figure 3, the computing equipment 10 trains the memory augmented neural network 12, e.g., according to Figure 1 and/or Figure 2, in order to obtain the trained memory augmented neural network 12T. The computing equipment 10 further includes an inference maker 20 that receives the trained memory augmented neural network 12T and input data 22. The input data 22 in some embodiments is a time series with no future information accessible. Regardless, the inference maker 20 forms an inference 24 (e.g., prediction) from the input data 22 and the trained memory augmented neural network 12T. In some embodiments, for example, the inference maker 20 inputs the input data 22 into the trained memory augmented neural network 12T in natural temporal order. In yet other embodiments herein, shown in Figure 4, the computing equipment 10 forms the inference 24 based on a trained memory augmented neural network 12T that the computing equipment 10 receives from other equipment.

Consider now some additional details of some embodiments herein where the memory augmented neural network 12 is exemplified as a Differentiable Neural Computer (DNC).
A DNC is a neural network with an external memory (e.g., an external memory matrix) that the neural network can operate on and manipulate while still being fully differentiable. In some embodiments, the external memory can be selectively written to as well as read, allowing iterative modification of memory content. If the external memory can be thought of as the DNC's random access memory (RAM), then the network, referred to as the 'controller', is a differentiable CPU whose operations are learned with gradient descent.

In some embodiments, a Differentiable Neural Computer (DNC) is an architecture consisting of a neural network (controller) (e.g., a recurrent neural network, RNN) and an external memory matrix from which the neural network can read and write. One example architecture of a DNC is shown in Figure 5. The whole architecture is differentiable, allowing the neural network to learn how to operate and manipulate the external memory end-to-end with gradient descent. By having an external memory that the neural network can read from and write to, the network can encode the input data and store it in the memory, allowing it to remember data over long timescales.

The controller interacts with the external memory based on three different mechanisms: content-based addressing, dynamic memory allocation, and temporal memory linkage. Content-based addressing can be thought of as the mechanism that allows the controller to communicate directly what content to write into and read from the memory. Dynamic memory allocation controls where to free and allocate memory by having a usage counter for each memory cell. Lastly, temporal memory linkage is the mechanism that allows the DNC to remember the order in which data is written into the external memory.

As shown in Figure 5, the input goes through the controller, which then outputs a hidden state h_t. This state is sent to the predictive layer as well as to the memory interaction component. Using the hidden state, the DNC decides what should be written to the memory as well as where it should be written. Furthermore, the DNC uses the hidden state to decide what to read from memory. The writes to and reads from memory happen in the memory interaction component and will be described more in depth later. From the memory interaction component, what is read is sent to the predictive layer as well as to the next time step for the controller. The predictive layer uses what has been read from memory and the hidden state to make a prediction.

In any event, the memory matrix is fully differentiable and contains temporal embeddings of the data that are learned from the data. Memory augmented neural networks in general, and DNCs in particular, provide a structured way to improve the model's capabilities in capturing long-term dependencies.
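To make the data flow of Figure 5 concrete, the following is a minimal sketch of one DNC-style time step in Python with NumPy. It implements only the content-based addressing path described above, with a single write head, a single read head, and a bare recurrent cell as controller; dynamic memory allocation and temporal memory linkage are omitted, and all names, weight shapes, and toy dimensions are illustrative assumptions rather than the patent's or the original DNC's exact interface.

```python
# Simplified single time step of a DNC-like model: controller -> memory
# interaction (content-based write, then content-based read) -> predictive
# layer. A sketch only, not the full DNC of the cited literature.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dnc_step(x_t, h_prev, M, W_h, W_iface, W_out):
    cell = M.shape[1]
    # Controller (a bare recurrent cell here) emits the hidden state h_t.
    h_t = np.tanh(W_h @ np.concatenate([x_t, h_prev]))
    # Interface maps h_t to a write vector, a write key, and a read key.
    iface = W_iface @ h_t
    write_vec = iface[:cell]
    write_key = iface[cell:2 * cell]
    read_key = iface[2 * cell:]
    # Content-based write: address cells by similarity to the write key.
    w_write = softmax(M @ write_key)
    M = M + np.outer(w_write, write_vec)  # additive write, simplified
    # Content-based read: address cells by similarity to the read key.
    r_t = softmax(M @ read_key) @ M       # read vector
    # Predictive layer combines the hidden state and what was read.
    y_t = W_out @ np.concatenate([h_t, r_t])
    return y_t, h_t, M

# Toy dimensions: input 4, hidden 8, 10 memory cells of size 8, scalar output.
rng = np.random.default_rng(0)
x_dim, h_dim, n_cells, cell = 4, 8, 10, 8
W_h = 0.1 * rng.normal(size=(h_dim, x_dim + h_dim))
W_iface = 0.1 * rng.normal(size=(3 * cell, h_dim))
W_out = 0.1 * rng.normal(size=(1, h_dim + cell))
M, h = np.zeros((n_cells, cell)), np.zeros(h_dim)
y, h, M = dnc_step(rng.normal(size=x_dim), h, M, W_h, W_iface, W_out)
```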

Some embodiments herein provide a bi-directional DNC, referred to as bi-DNC, for capturing circular and bidirectional temporal dependencies in data. Some embodiments accordingly account for both forward and backward dependencies in data, and/or for circular dependencies in data. Some embodiments thereby open up application of the DNC to a wider set of problems, examples being language modeling and the telecom domain.

Figure 6 shows processing steps, during the learning phase, for training a DNC to capture circular and bidirectional temporal dependencies in data according to some embodiments. Here, the DNC exemplifies the memory augmented neural network 12 in Figure 1 and/or Figure 2.

Step 1 is to obtain data samples. Data samples are temporal in nature, denoted (X_1, X_2, ..., X_T), where X_t denotes the data feature vector at time stamp t.

Step 2 is initialization of the DNC.

Step 3 is specification of the experience parameter set. The experience parameter set in this example is defined as E = (E_1, E_2, ..., E_T), where E_t denotes the experience at time stamp t. In some embodiments, the experience E_t includes the following parameters: (i) the DNC memory matrix M_t; and (ii) the DNC link matrix L_t. These may be initialized randomly and may be learned by standard application of the DNC [1] at the training phase.

In some embodiments, the experience set E_t may optionally include one or more of the following parameters: (i) the DNC precedence vector at time t; (ii) the DNC read weight vectors at time t; (iii) the DNC write weight vectors at time t; and/or (iv) the DNC usage vector at time t.
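One way to hold the experience parameter set of Steps 2 and 3 in code is a simple container like the sketch below. The class name, field names, and default sizes are illustrative assumptions (the defaults of 20 memory cells and cell size 32 merely mirror the example configuration reported in the experiments later in this description), not an interface defined by the patent.

```python
# Illustrative container for one experience E_t: the memory matrix M_t and
# link matrix L_t, plus the optional addressing vectors listed above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    M: np.ndarray           # DNC memory matrix M_t, shape (n_cells, cell_size)
    L: np.ndarray           # DNC temporal link matrix L_t, shape (n_cells, n_cells)
    precedence: np.ndarray  # optional precedence vector, shape (n_cells,)
    read_w: np.ndarray      # optional read weight vectors, shape (n_reads, n_cells)
    write_w: np.ndarray     # optional write weight vector, shape (n_cells,)
    usage: np.ndarray       # optional usage vector, shape (n_cells,)

def init_experience(n_cells=20, cell_size=32, n_reads=1, seed=0):
    """Step 2: random initialization of the memory-related parameters."""
    rng = np.random.default_rng(seed)
    return Experience(
        M=rng.normal(scale=0.01, size=(n_cells, cell_size)),
        L=np.zeros((n_cells, n_cells)),
        precedence=np.zeros(n_cells),
        read_w=np.zeros((n_reads, n_cells)),
        write_w=np.zeros(n_cells),
        usage=np.zeros(n_cells),
    )
```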

Step 4 is training and includes Steps 4.1 and 4.2. These steps are also described with reference to Figure 7.

Step 4.1 is to train the DNC for a single epoch by passing through the data in its natural order. The first data sample X_1 uses E_T and updates E_1, where E_T is the experience at time stamp T and E_1 is the experience at time stamp 1. The second data sample X_2 uses E_1 and updates E_2. And so on, until the last data sample X_T uses E_T-1 and updates E_T.
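In code, Step 4.1 amounts to carrying the experience from sample to sample within the epoch, seeded by the experience E_T left behind by the previous epoch. In the sketch below, `train_on_sample` is a hypothetical callable standing in for one DNC training update that consumes an experience and returns the updated one; it is an assumption for illustration, not a function defined by the patent.

```python
# Step 4.1 (direct training): one pass over the samples in natural order.
def direct_epoch(samples, experience, train_on_sample):
    """samples = [X_1, ..., X_T]; `experience` is E_T from the previous epoch."""
    for x in samples:                    # X_1 first, ..., X_T last
        experience = train_on_sample(x, experience)
    return experience                    # the new E_T, handed to the next epoch
```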

Step 4.2 is to train the DNC for a single epoch by passing through the data in reverse order. The last data sample X_T uses E_T and updates E_1. X_T-1 uses E_1 and updates E_2. And so on, until X_2 uses E_T-2 and updates E_T-1, and X_1 uses E_T-1 and updates E_T.
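Under the same assumptions, Step 4.2 is the mirror image: the same loop over the samples, just in reverse temporal order, so the experience produced by the direct epoch seeds the reverse epoch and vice versa.

```python
# Step 4.2 (reverse training): one pass over the samples in reverse order.
def reverse_epoch(samples, experience, train_on_sample):
    """samples = [X_1, ..., X_T]; iterated as X_T, X_T-1, ..., X_1."""
    for x in reversed(samples):
        experience = train_on_sample(x, experience)
    return experience
```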

According to Step 5, Steps 4.1 and 4.2 are repeated while a convergence criterion is not reached. One simple case is to stop after a fixed number of epochs (e.g., 100 epochs). Another criterion is to monitor the loss during learning: if the change in loss is smaller than a threshold, the learning stops.
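Putting Steps 4 and 5 together with the two epoch functions sketched above gives the learning phase below. The `compute_loss` callable and the default threshold are illustrative assumptions; any loss monitored during learning would do.

```python
# Steps 4-5: alternate direct and reverse epochs, carrying the experience
# across epochs, until a convergence criterion is reached.
def train_bidnc(samples, experience, train_on_sample, compute_loss,
                max_epochs=100, loss_threshold=1e-4):
    prev_loss = float("inf")
    for epoch in range(max_epochs):      # criterion 1: maximum epoch count
        experience = direct_epoch(samples, experience, train_on_sample)   # Step 4.1
        experience = reverse_epoch(samples, experience, train_on_sample)  # Step 4.2
        loss = compute_loss(samples, experience)
        if abs(prev_loss - loss) < loss_threshold:  # criterion 2: loss plateau
            break
        prev_loss = loss
    return experience
```

At the inference phase of Figure 8, described below, only a single natural-order pass with the learned experience is performed, i.e., something resembling one `direct_epoch` call in which the model parameters are no longer updated.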

Training can be done online or offline. In the case of online training, a batch of data is available to the computing equipment 10, and the computing equipment 10 can only capture temporal dependencies existing within this batch.

Figure 8 illustrates the inference phase that occurs after the learning phase. At the inference phase, the bi-DNC only performs a forward pass through the natural order of the data, using the previously learned experience parameters E_t. The first data sample X_1 uses E_T and updates E_1, where E_T is the previously learned experience at time stamp T and E_1 is the experience at time stamp 1. The second data sample X_2 uses E_1 and updates E_2. And so on, until the last data sample X_T uses E_T-1 and updates E_T.

Some embodiments herein are applicable for predicting round-trip-time (RTT) in a 5G mmWave network. Alternatively or additionally, data samples may include features related to communications in the uplink and/or downlink of a communication network at different time stamps. Alternatively or additionally, data samples may include features related to beamforming at different time stamps in a communication network. Data samples in these and other examples may be collected from one or more base stations in the communication network. As another example, some embodiments herein are applicable for prediction of the power consumption given performance management (PM) counters. As still another example, some embodiments herein are applicable for prediction of service level metrics from features extracted from a data center.

Figures 9 and 10 show exemplary performance according to some embodiments. Here, data traces are collected from a data center located at KTH, named the KTH DC data traces. These data traces are publicly available. The machine learning task underlying the use case is prediction of the read and write latency given 197 features extracted from the data center, such as memory usage, CPU usage, and I/Os. The data is a time series of 24225 samples. Figure 9 shows the service performance metrics of read and write latency. Data features are standardized. The first 60% of time steps are used for training and the remaining 40% for testing. The DNC in this example uses an LSTM controller with 32 hidden units. Furthermore, the memory has a single read head and 20 memory cells, and the cell size is 32, equal to the size of the hidden states. The models are trained to reduce the mean-square-error (MSE) between the model output and the true target with an Adam optimizer and a learning rate of 0.01. During training, a dropout of 0.6 is used. The performance of the bi-DNC is compared against a standard implementation of the DNC. Performance is evaluated by analyzing the MSE loss on the validation set. The method achieving the lower MSE is preferred. The results are shown in Figure 10. For the case of prediction of the average read latency, the bi-DNC achieves superior performance.

In view of the modifications and variations herein, Figure 11 depicts a method in accordance with particular embodiments. The method may be performed by computing equipment 10. The method includes obtaining training data 14, e.g., that has circular and bidirectional temporal dependencies (Block 100).
The method also includes training a memory augmented neural network 12, over multiple epochs, with the training data 14, by alternating from epoch to epoch between training the memory augmented neural network 12 with the training data 14N in natural temporal order and training the memory augmented neural network 12 with the training data 14R in reverse temporal order (Block 110). In some embodiments, the method also includes forming an inference 24 or prediction from input data 22 by inputting the input data 22 into the trained memory augmented neural network 12T in natural temporal order (Block 130). In some embodiments, the memory augmented neural network comprises a controller and external memory. In some embodiments, experience parameters characterize contents of the external memory resulting from training the memory augmented neural network. In some embodiments, training the memory augmented neural network comprises re-using the experience parameters across epochs. In one or more of these embodiments, re-using the experience parameters across epochs comprises using the experience parameters resulting from training the memory augmented neural network in a previous epoch to train the memory augmented neural network in a current epoch. In some embodiments, the memory augmented neural network comprises external memory, and training the memory augmented neural network with the training data in natural temporal order comprises, in a given epoch, training the memory augmented neural network with a first sample of the training data that occurs first in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the reverse temporal order. Training the memory augmented neural network with the training data in natural temporal order also comprises, in a given epoch, training the memory augmented neural network with a last sample of the training data that occurs last in the natural temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the natural temporal order. In some embodiments, the memory augmented neural network comprises external memory, and training the memory augmented neural network with the training data in reverse temporal order comprises, in a given epoch, training the memory augmented neural network with a first sample of the training data that occurs first in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in a previous epoch with a last sample of the training data that occurs last in the natural temporal order. Training the memory augmented neural network with the training data in reverse temporal order also comprises, in a given epoch, training the memory augmented neural network with a last sample of the training data that occurs last in the reverse temporal order, based on experience parameters resulting from training the memory augmented neural network in the given epoch with a second-to-last sample of the training data that occurs second-to-last in the reverse temporal order. In one or more of these embodiments, the experience parameters characterize contents of the external memory. In some embodiments, the experience parameters include a memory matrix characterizing the external memory. 
The experience parameters also include a temporal link matrix characterizing the external memory. In one or more of these embodiments, the experience parameters further include at least a precedence vector. In some embodiments, the experience parameters further include at least read weight vectors. In some embodiments, the experience parameters further include at least write weight vectors. In some embodiments, the experience parameters further include at least a usage vector. In some embodiments, training the memory augmented neural network over multiple epochs comprises training the memory augmented neural network until a convergence criterion is reached. In one or more of these embodiments, the convergence criterion is training of the memory augmented neural network for a maximum number of epochs. Alternatively, the convergence criterion is a loss metric changing by less than a threshold between epochs. In some embodiments, the memory augmented neural network is a Differentiable Neural Computer. In some embodiments, the training data represents, as a function of time, round trip time for communication in a communication network. Alternatively, the training data represents, as a function of time, power consumption given performance management counters in a communication network. Alternatively, the training data represents, as a function of time, a metric characterizing a service level from a data center.

Figure 12 depicts a method in accordance with other particular embodiments. The method may be performed by computing equipment 10. The method includes obtaining a trained memory augmented neural network 12T that is trained to model circular and bidirectional temporal dependencies in data (Block 200). The method also includes forming an inference 24 from input data 22 by inputting the input data 22 into the trained memory augmented neural network 12T in natural temporal order (Block 210). In some embodiments, the trained memory augmented neural network is trained, over multiple epochs, with training data, by alternating from epoch to epoch between training the memory augmented neural network with the training data in natural temporal order and training the memory augmented neural network with the training data in reverse temporal order. In some embodiments, the input data represents, as a function of time, round trip time for communication in a communication network. Alternatively, the input data represents, as a function of time, power consumption given performance management counters in a communication network. Alternatively, the input data represents, as a function of time, a metric characterizing a service level from a data center. In some embodiments, the input data is a time series with no future information accessible.

Embodiments herein also include corresponding apparatuses. Embodiments herein for instance include computing equipment 10 configured to perform any of the steps of any of the embodiments described above for the computing equipment 10. Embodiments also include computing equipment 10 comprising processing circuitry and power supply circuitry. The processing circuitry is configured to perform any of the steps of any of the embodiments described above for the computing equipment 10. The power supply circuitry is configured to supply power to the computing equipment 10. Embodiments further include computing equipment 10 comprising processing circuitry.
Embodiments herein also include corresponding apparatuses. Embodiments herein for instance include computing equipment 10 configured to perform any of the steps of any of the embodiments described above for the computing equipment 10. Embodiments also include computing equipment 10 comprising processing circuitry and power supply circuitry. The processing circuitry is configured to perform any of the steps of any of the embodiments described above for the computing equipment 10. The power supply circuitry is configured to supply power to the computing equipment 10.

Embodiments further include computing equipment 10 comprising processing circuitry. The processing circuitry is configured to perform any of the steps of any of the embodiments described above for the computing equipment 10. In some embodiments, the computing equipment 10 further comprises communication circuitry. Embodiments further include computing equipment 10 comprising processing circuitry and memory. The memory contains instructions executable by the processing circuitry whereby the computing equipment 10 is configured to perform any of the steps of any of the embodiments described above for the computing equipment 10.

More particularly, the apparatuses described above may perform the methods herein and any other processing by implementing any functional means, modules, units, or circuitry. In one embodiment, for example, the apparatuses comprise respective circuits or circuitry configured to perform the steps shown in the method figures. The circuits or circuitry in this regard may comprise circuits dedicated to performing certain functional processing and/or one or more microprocessors in conjunction with memory. For instance, the circuitry may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, and the like. The processing circuitry may be configured to execute program code stored in memory, which may include one or several types of memory such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, etc. Program code stored in memory may include program instructions for executing one or more telecommunications and/or data communications protocols, as well as instructions for carrying out one or more of the techniques described herein, in several embodiments. In embodiments that employ memory, the memory stores program code that, when executed by the one or more processors, carries out the techniques described herein.

Figure 13, for example, illustrates computing equipment 10 as implemented in accordance with one or more embodiments. As shown, the computing equipment 10 includes processing circuitry 310. The processing circuitry 310 is configured to perform processing described above, e.g., in Figure 11 or Figure 12, such as by executing instructions stored in memory 330. The processing circuitry 310 in this regard may implement certain functional means, units, or modules. In some embodiments, the computing equipment 10 also comprises communication circuitry 320 configured to transmit and/or receive information to and/or from one or more other nodes, e.g., via any communication technology.

Those skilled in the art will also appreciate that embodiments herein further include corresponding computer programs. A computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out any of the respective processing described above. A computer program in this regard may comprise one or more code modules corresponding to the means or units described above. Embodiments further include a carrier containing such a computer program. This carrier may comprise one of an electronic signal, optical signal, radio signal, or computer readable storage medium. In this regard, embodiments herein also include a computer program product stored on a non-transitory computer readable (storage or recording) medium and comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform as described above.
Embodiments further include a computer program product comprising program code portions for performing the steps of any of the embodiments herein when the computer program product is executed by a computing device. This computer program product may be stored on a computer readable recording medium. In some embodiments, the computing equipment 10 is equipment in a communication network or system.

Figure 14 shows an example of a communication system 1400 in accordance with some embodiments. In the example, the communication system 1400 includes a telecommunication network 1402 that includes an access network 1404, such as a radio access network (RAN), and a core network 1406, which includes one or more core network nodes 1408. The access network 1404 includes one or more access network nodes, such as network nodes 1410a and 1410b (one or more of which may be generally referred to as network nodes 1410), or any other similar 3rd Generation Partnership Project (3GPP) access node or non-3GPP access point. The network nodes 1410 facilitate direct or indirect connection of user equipment (UE), such as by connecting UEs 1412a, 1412b, 1412c, and 1412d (one or more of which may be generally referred to as UEs 1412) to the core network 1406 over one or more wireless connections.

Example wireless communications over a wireless connection include transmitting and/or receiving wireless signals using electromagnetic waves, radio waves, infrared waves, and/or other types of signals suitable for conveying information without the use of wires, cables, or other material conductors. Moreover, in different embodiments, the communication system 1400 may include any number of wired or wireless networks, network nodes, UEs, and/or any other components or systems that may facilitate or participate in the communication of data and/or signals, whether via wired or wireless connections. The communication system 1400 may include and/or interface with any type of communication, telecommunication, data, cellular, radio network, and/or other similar type of system.

The UEs 1412 may be any of a wide variety of communication devices, including wireless devices arranged, configured, and/or operable to communicate wirelessly with the network nodes 1410 and other communication devices. Similarly, the network nodes 1410 are arranged, capable, configured, and/or operable to communicate directly or indirectly with the UEs 1412 and/or with other network nodes or equipment in the telecommunication network 1402 to enable and/or provide network access, such as wireless network access, and/or to perform other functions, such as administration in the telecommunication network 1402.

In the depicted example, the core network 1406 connects the network nodes 1410 to one or more hosts, such as host 1416. These connections may be direct or indirect via one or more intermediary networks or devices. In other examples, network nodes may be directly coupled to hosts. The core network 1406 includes one or more core network nodes (e.g., core network node 1408) that are structured with hardware and software components. Features of these components may be substantially similar to those described with respect to the UEs, network nodes, and/or hosts, such that the descriptions thereof are generally applicable to the corresponding components of the core network node 1408.
Example core network nodes include functions of one or more of a Mobile Switching Center (MSC), Mobility Management Entity (MME), Home Subscriber Server (HSS), Access and Mobility Management Function (AMF), Session Management Function (SMF), Authentication Server Function (AUSF), Subscription Identifier De-concealing Function (SIDF), Unified Data Management (UDM), Security Edge Protection Proxy (SEPP), Network Exposure Function (NEF), and/or a User Plane Function (UPF).

The host 1416 may be under the ownership or control of a service provider other than an operator or provider of the access network 1404 and/or the telecommunication network 1402, and may be operated by the service provider or on behalf of the service provider. The host 1416 may host a variety of applications to provide one or more services. Examples of such applications include live and pre-recorded audio/video content, data collection services such as retrieving and compiling data on various ambient conditions detected by a plurality of UEs, analytics functionality, social media, functions for controlling or otherwise interacting with remote devices, functions for an alarm and surveillance center, or any other such function performed by a server.

As a whole, the communication system 1400 of Figure 14 enables connectivity between the UEs, network nodes, and hosts. In that sense, the communication system may be configured to operate according to predefined rules or procedures, such as specific standards that include, but are not limited to: Global System for Mobile Communications (GSM); Universal Mobile Telecommunications System (UMTS); Long Term Evolution (LTE), and/or other suitable 2G, 3G, 4G, or 5G standards, or any applicable future generation standard (e.g., 6G); wireless local area network (WLAN) standards, such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (WiFi); and/or any other appropriate wireless communication standard, such as Worldwide Interoperability for Microwave Access (WiMax), Bluetooth, Z-Wave, Near Field Communication (NFC), ZigBee, LiFi, and/or any low-power wide-area network (LPWAN) standard such as LoRa and Sigfox.

In some examples, the telecommunication network 1402 is a cellular network that implements 3GPP standardized features. Accordingly, the telecommunication network 1402 may support network slicing to provide different logical networks to different devices that are connected to the telecommunication network 1402. For example, the telecommunication network 1402 may provide Ultra Reliable Low Latency Communication (URLLC) services to some UEs, while providing Enhanced Mobile Broadband (eMBB) services to other UEs, and/or Massive Machine Type Communication (mMTC)/Massive IoT services to yet further UEs.

In some examples, the UEs 1412 are configured to transmit and/or receive information without direct human interaction. For instance, a UE may be designed to transmit information to the access network 1404 on a predetermined schedule, when triggered by an internal or external event, or in response to requests from the access network 1404. Additionally, a UE may be configured for operating in single- or multi-RAT or multi-standard mode. For example, a UE may operate with any one or combination of Wi-Fi, NR (New Radio), and LTE, i.e., being configured for multi-radio dual connectivity (MR-DC), such as E-UTRAN (Evolved-UMTS Terrestrial Radio Access Network) New Radio – Dual Connectivity (EN-DC).
In the example, the hub 1414 communicates with the access network 1404 to facilitate indirect communication between one or more UEs (e.g., UE 1412c and/or 1412d) and network nodes (e.g., network node 1410b). In some examples, the hub 1414 may be a controller, router, content source, analytics node, or any of the other communication devices described herein regarding UEs. For example, the hub 1414 may be a broadband router enabling access to the core network 1406 for the UEs. As another example, the hub 1414 may be a controller that sends commands or instructions to one or more actuators in the UEs. Commands or instructions may be received from the UEs, network nodes 1410, or by executable code, script, process, or other instructions in the hub 1414. As another example, the hub 1414 may be a data collector that acts as temporary storage for UE data and, in some embodiments, may perform analysis or other processing of the data. As another example, the hub 1414 may be a content source. For example, for a UE that is a VR headset, display, loudspeaker, or other media delivery device, the hub 1414 may retrieve VR assets, video, audio, or other media or data related to sensory information via a network node, which the hub 1414 then provides to the UE either directly, after performing local processing, and/or after adding additional local content. In still another example, the hub 1414 acts as a proxy server or orchestrator for the UEs, in particular if one or more of the UEs are low-energy IoT devices.

The hub 1414 may have a constant/persistent or intermittent connection to the network node 1410b. The hub 1414 may also allow for a different communication scheme and/or schedule between the hub 1414 and UEs (e.g., UE 1412c and/or 1412d), and between the hub 1414 and the core network 1406. In other examples, the hub 1414 is connected to the core network 1406 and/or one or more UEs via a wired connection. Moreover, the hub 1414 may be configured to connect to an M2M service provider over the access network 1404 and/or to another UE over a direct connection. In some scenarios, UEs may establish a wireless connection with the network nodes 1410 while still connected via the hub 1414 via a wired or wireless connection. In some embodiments, the hub 1414 may be a dedicated hub, that is, a hub whose primary function is to route communications to/from the UEs from/to the network node 1410b. In other embodiments, the hub 1414 may be a non-dedicated hub, that is, a device which is capable of operating to route communications between the UEs and network node 1410b, but which is additionally capable of operating as a communication start and/or end point for certain data channels.

Figure 15 shows a UE 1500 in accordance with some embodiments. As used herein, a UE refers to a device capable, configured, arranged, and/or operable to communicate wirelessly with network nodes and/or other UEs. Examples of a UE include, but are not limited to, a smart phone, mobile phone, cell phone, voice over IP (VoIP) phone, wireless local loop phone, desktop computer, personal digital assistant (PDA), wireless camera, gaming console or device, music storage device, playback appliance, wearable terminal device, wireless endpoint, mobile station, tablet, laptop, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), smart device, wireless customer-premise equipment (CPE), vehicle-mounted or vehicle-embedded/integrated wireless device, etc.
Other examples include any UE identified by the 3rd Generation Partnership Project (3GPP), including a narrowband internet of things (NB-IoT) UE, a machine type communication (MTC) UE, and/or an enhanced MTC (eMTC) UE. A UE may support device-to-device (D2D) communication, for example by implementing a 3GPP standard for sidelink communication, Dedicated Short-Range Communication (DSRC), vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), or vehicle-to-everything (V2X). In other examples, a UE may not necessarily have a user in the sense of a human user who owns and/or operates the relevant device. Instead, a UE may represent a device that is intended for sale to, or operation by, a human user but which may not, or which may not initially, be associated with a specific human user (e.g., a smart sprinkler controller). Alternatively, a UE may represent a device that is not intended for sale to, or operation by, an end user but which may be associated with or operated for the benefit of a user (e.g., a smart power meter).

The UE 1500 includes processing circuitry 1502 that is operatively coupled via a bus 1504 to an input/output interface 1506, a power source 1508, a memory 1510, a communication interface 1512, and/or any other component, or any combination thereof. Certain UEs may utilize all or a subset of the components shown in Figure 15. The level of integration between the components may vary from one UE to another UE. Further, certain UEs may contain multiple instances of a component, such as multiple processors, memories, transceivers, transmitters, receivers, etc.

The processing circuitry 1502 is configured to process instructions and data and may be configured to implement any sequential state machine operative to execute instructions stored as machine-readable computer programs in the memory 1510. The processing circuitry 1502 may be implemented as one or more hardware-implemented state machines (e.g., in discrete logic, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc.); programmable logic together with appropriate firmware; one or more stored computer programs, general-purpose processors, such as a microprocessor or digital signal processor (DSP), together with appropriate software; or any combination of the above. For example, the processing circuitry 1502 may include multiple central processing units (CPUs).

In the example, the input/output interface 1506 may be configured to provide an interface or interfaces to an input device, output device, or one or more input and/or output devices. Examples of an output device include a speaker, a sound card, a video card, a display, a monitor, a printer, an actuator, an emitter, a smartcard, another output device, or any combination thereof. An input device may allow a user to capture information into the UE 1500. Examples of an input device include a touch-sensitive or presence-sensitive display, a camera (e.g., a digital camera, a digital video camera, a web camera, etc.), a microphone, a sensor, a mouse, a trackball, a directional pad, a trackpad, a scroll wheel, a smartcard, and the like. The presence-sensitive display may include a capacitive or resistive touch sensor to sense input from a user. A sensor may be, for instance, an accelerometer, a gyroscope, a tilt sensor, a force sensor, a magnetometer, an optical sensor, a proximity sensor, a biometric sensor, etc., or any combination thereof. An output device may use the same type of interface port as an input device.
For example, a Universal Serial Bus (USB) port may be used to provide an input device and an output device.

In some embodiments, the power source 1508 is structured as a battery or battery pack. Other types of power sources, such as an external power source (e.g., an electricity outlet), photovoltaic device, or power cell, may be used. The power source 1508 may further include power circuitry for delivering power from the power source 1508 itself, and/or an external power source, to the various parts of the UE 1500 via input circuitry or an interface such as an electrical power cable. Delivering power may be, for example, for charging of the power source 1508. Power circuitry may perform any formatting, converting, or other modification to the power from the power source 1508 to make the power suitable for the respective components of the UE 1500 to which power is supplied.

The memory 1510 may be or be configured to include memory such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, hard disks, removable cartridges, flash drives, and so forth. In one example, the memory 1510 includes one or more application programs 1514, such as an operating system, web browser application, a widget, gadget engine, or other application, and corresponding data 1516. The memory 1510 may store, for use by the UE 1500, any of a variety of operating systems or combinations of operating systems.

The memory 1510 may be configured to include a number of physical drive units, such as redundant array of independent disks (RAID), flash memory, USB flash drive, external hard disk drive, thumb drive, pen drive, key drive, high-density digital versatile disc (HD-DVD) optical disc drive, internal hard disk drive, Blu-Ray optical disc drive, holographic digital data storage (HDDS) optical disc drive, external mini-dual in-line memory module (DIMM), synchronous dynamic random access memory (SDRAM), external micro-DIMM SDRAM, smartcard memory such as a tamper-resistant module in the form of a universal integrated circuit card (UICC) including one or more subscriber identity modules (SIMs), such as a USIM and/or ISIM, other memory, or any combination thereof. The UICC may for example be an embedded UICC (eUICC), integrated UICC (iUICC), or a removable UICC commonly known as a 'SIM card.' The memory 1510 may allow the UE 1500 to access instructions, application programs, and the like, stored on transitory or non-transitory memory media, to off-load data, or to upload data. An article of manufacture, such as one utilizing a communication system, may be tangibly embodied as or in the memory 1510, which may be or comprise a device-readable storage medium.

The processing circuitry 1502 may be configured to communicate with an access network or other network using the communication interface 1512. The communication interface 1512 may comprise one or more communication subsystems and may include or be communicatively coupled to an antenna 1522. The communication interface 1512 may include one or more transceivers used to communicate, such as by communicating with one or more remote transceivers of another device capable of wireless communication (e.g., another UE or a network node in an access network).
Each transceiver may include a transmitter 1518 and/or a receiver 1520 appropriate to provide network communications (e.g., optical, electrical, frequency allocations, and so forth). Moreover, the transmitter 1518 and receiver 1520 may be coupled to one or more antennas (e.g., antenna 1522) and may share circuit components, software, or firmware, or alternatively be implemented separately.

In the illustrated embodiment, communication functions of the communication interface 1512 may include cellular communication, Wi-Fi communication, LPWAN communication, data communication, voice communication, multimedia communication, short-range communications such as Bluetooth, near-field communication, location-based communication such as the use of the global positioning system (GPS) to determine a location, another like communication function, or any combination thereof. Communications may be implemented according to one or more communication protocols and/or standards, such as IEEE 802.11, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), GSM, LTE, New Radio (NR), UMTS, WiMax, Ethernet, transmission control protocol/internet protocol (TCP/IP), synchronous optical networking (SONET), Asynchronous Transfer Mode (ATM), QUIC, Hypertext Transfer Protocol (HTTP), and so forth.

Regardless of the type of sensor, a UE may provide an output of data captured by its sensors, through its communication interface 1512, via a wireless connection to a network node. Data captured by sensors of a UE can also be communicated through a wireless connection to a network node via another UE. The output may be periodic (e.g., once every 15 minutes if it reports the sensed temperature), random (e.g., to even out the load from reporting from several sensors), in response to a triggering event (e.g., an alert is sent when moisture is detected), in response to a request (e.g., a user-initiated request), or a continuous stream (e.g., a live video feed of a patient).

As another example, a UE comprises an actuator, a motor, or a switch related to a communication interface configured to receive wireless input from a network node via a wireless connection. In response to the received wireless input, the states of the actuator, the motor, or the switch may change. For example, the UE may comprise a motor that adjusts the control surfaces or rotors of a drone in flight, or a robotic arm performing a medical procedure, according to the received input.

A UE, when in the form of an Internet of Things (IoT) device, may be a device for use in one or more application domains, these domains comprising, but not limited to, city and wearable technology, extended industrial applications, and healthcare.
Non-limiting examples of such an IoT device are a device which is, or which is embedded in, a connected refrigerator or freezer, a TV, a connected lighting device, an electricity meter, a robot vacuum cleaner, a voice-controlled smart speaker, a home security camera, a motion detector, a thermostat, a smoke detector, a door/window sensor, a flood/moisture sensor, an electrical door lock, a connected doorbell, an air conditioning system like a heat pump, an autonomous vehicle, a surveillance system, a weather monitoring device, a vehicle parking monitoring device, an electric vehicle charging station, a smart watch, a fitness tracker, a head-mounted display for Augmented Reality (AR) or Virtual Reality (VR), a wearable for tactile augmentation or sensory enhancement, a water sprinkler, an animal- or item-tracking device, a sensor for monitoring a plant or animal, an industrial robot, an Unmanned Aerial Vehicle (UAV), and any kind of medical device, like a heart rate monitor or a remote-controlled surgical robot. A UE in the form of an IoT device comprises circuitry and/or software in dependence on the intended application of the IoT device, in addition to the other components described in relation to the UE 1500 shown in Figure 15.

As yet another specific example, in an IoT scenario, a UE may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another UE and/or a network node. The UE may in this case be an M2M device, which may in a 3GPP context be referred to as an MTC device. As one particular example, the UE may implement the 3GPP NB-IoT standard. In other scenarios, a UE may represent a vehicle, such as a car, a bus, a truck, a ship, or an airplane, or other equipment that is capable of monitoring and/or reporting on its operational status or other functions associated with its operation.

In practice, any number of UEs may be used together with respect to a single use case. For example, a first UE might be or be integrated in a drone and provide the drone's speed information (obtained through a speed sensor) to a second UE that is a remote controller operating the drone. When the user makes changes from the remote controller, the first UE may adjust the throttle on the drone (e.g., by controlling an actuator) to increase or decrease the drone's speed. The first and/or the second UE can also include more than one of the functionalities described above. For example, a UE might comprise the sensor and the actuator, and handle communication of data for both the speed sensor and the actuator.

Figure 16 shows a network node 1600 in accordance with some embodiments. As used herein, network node refers to equipment capable, configured, arranged, and/or operable to communicate directly or indirectly with a UE and/or with other network nodes or equipment in a telecommunication network. Examples of network nodes include, but are not limited to, access points (APs) (e.g., radio access points) and base stations (BSs) (e.g., radio base stations, Node Bs, evolved Node Bs (eNBs), and NR NodeBs (gNBs)). Base stations may be categorized based on the amount of coverage they provide (or, stated differently, their transmit power level) and so, depending on the provided amount of coverage, may be referred to as femto base stations, pico base stations, micro base stations, or macro base stations. A base station may be a relay node or a relay donor node controlling a relay.
A network node may also include one or more (or all) parts of a distributed radio base station, such as centralized digital units and/or remote radio units (RRUs), sometimes referred to as Remote Radio Heads (RRHs). Such remote radio units may or may not be integrated with an antenna as an antenna-integrated radio. Parts of a distributed radio base station may also be referred to as nodes in a distributed antenna system (DAS). Other examples of network nodes include multiple transmission point (multi-TRP) 5G access nodes, multi-standard radio (MSR) equipment such as MSR BSs, network controllers such as radio network controllers (RNCs) or base station controllers (BSCs), base transceiver stations (BTSs), transmission points, transmission nodes, multi-cell/multicast coordination entities (MCEs), Operation and Maintenance (O&M) nodes, Operations Support System (OSS) nodes, Self-Organizing Network (SON) nodes, positioning nodes (e.g., Evolved Serving Mobile Location Centers (E-SMLCs)), and/or Minimization of Drive Tests (MDT) nodes.

The network node 1600 includes processing circuitry 1602, a memory 1604, a communication interface 1606, and a power source 1608. The network node 1600 may be composed of multiple physically separate components (e.g., a NodeB component and an RNC component, or a BTS component and a BSC component, etc.), which may each have their own respective components. In certain scenarios in which the network node 1600 comprises multiple separate components (e.g., BTS and BSC components), one or more of the separate components may be shared among several network nodes. For example, a single RNC may control multiple NodeBs. In such a scenario, each unique NodeB and RNC pair may in some instances be considered a single separate network node. In some embodiments, the network node 1600 may be configured to support multiple radio access technologies (RATs). In such embodiments, some components may be duplicated (e.g., separate memory 1604 for different RATs) and some components may be reused (e.g., a same antenna 1610 may be shared by different RATs). The network node 1600 may also include multiple sets of the various illustrated components for different wireless technologies integrated into network node 1600, for example GSM, WCDMA, LTE, NR, WiFi, Zigbee, Z-Wave, LoRaWAN, Radio Frequency Identification (RFID), or Bluetooth wireless technologies. These wireless technologies may be integrated into the same or different chips or sets of chips and other components within network node 1600.

The processing circuitry 1602 may comprise a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application-specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, software, and/or encoded logic operable to provide, either alone or in conjunction with other network node 1600 components such as the memory 1604, network node 1600 functionality. In some embodiments, the processing circuitry 1602 includes a system on a chip (SOC). In some embodiments, the processing circuitry 1602 includes one or more of radio frequency (RF) transceiver circuitry 1612 and baseband processing circuitry 1614. In some embodiments, the RF transceiver circuitry 1612 and the baseband processing circuitry 1614 may be on separate chips (or sets of chips), boards, or units, such as radio units and digital units.
In alternative embodiments, part or all of the RF transceiver circuitry 1612 and baseband processing circuitry 1614 may be on the same chip or set of chips, boards, or units.

The memory 1604 may comprise any form of volatile or non-volatile computer-readable memory including, without limitation, persistent storage, solid-state memory, remotely mounted memory, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), mass storage media (for example, a hard disk), removable storage media (for example, a flash drive, a Compact Disk (CD), or a Digital Video Disk (DVD)), and/or any other volatile or non-volatile, non-transitory device-readable and/or computer-executable memory devices that store information, data, and/or instructions that may be used by the processing circuitry 1602. The memory 1604 may store any suitable instructions, data, or information, including a computer program, software, an application including one or more of logic, rules, code, tables, and/or other instructions capable of being executed by the processing circuitry 1602 and utilized by the network node 1600. The memory 1604 may be used to store any calculations made by the processing circuitry 1602 and/or any data received via the communication interface 1606. In some embodiments, the processing circuitry 1602 and memory 1604 are integrated.

The communication interface 1606 is used in wired or wireless communication of signaling and/or data between a network node, access network, and/or UE. As illustrated, the communication interface 1606 comprises port(s)/terminal(s) 1616 to send and receive data, for example to and from a network over a wired connection. The communication interface 1606 also includes radio front-end circuitry 1618 that may be coupled to, or in certain embodiments be a part of, the antenna 1610. The radio front-end circuitry 1618 comprises filters 1620 and amplifiers 1622. The radio front-end circuitry 1618 may be connected to the antenna 1610 and the processing circuitry 1602, and may be configured to condition signals communicated between the antenna 1610 and the processing circuitry 1602. The radio front-end circuitry 1618 may receive digital data that is to be sent out to other network nodes or UEs via a wireless connection. The radio front-end circuitry 1618 may convert the digital data into a radio signal having the appropriate channel and bandwidth parameters using a combination of filters 1620 and/or amplifiers 1622. The radio signal may then be transmitted via the antenna 1610. Similarly, when receiving data, the antenna 1610 may collect radio signals which are then converted into digital data by the radio front-end circuitry 1618. The digital data may be passed to the processing circuitry 1602. In other embodiments, the communication interface may comprise different components and/or different combinations of components.

In certain alternative embodiments, the network node 1600 does not include separate radio front-end circuitry 1618; instead, the processing circuitry 1602 includes radio front-end circuitry and is connected to the antenna 1610. Similarly, in some embodiments, all or some of the RF transceiver circuitry 1612 is part of the communication interface 1606.
In still other embodiments, the communication interface 1606 includes one or more ports or terminals 1616, the radio front-end circuitry 1618, and the RF transceiver circuitry 1612 as part of a radio unit (not shown), and the communication interface 1606 communicates with the baseband processing circuitry 1614, which is part of a digital unit (not shown).

The antenna 1610 may include one or more antennas, or antenna arrays, configured to send and/or receive wireless signals. The antenna 1610 may be coupled to the radio front-end circuitry 1618 and may be any type of antenna capable of transmitting and receiving data and/or signals wirelessly. In certain embodiments, the antenna 1610 is separate from the network node 1600 and connectable to the network node 1600 through an interface or port.

The antenna 1610, communication interface 1606, and/or the processing circuitry 1602 may be configured to perform any receiving operations and/or certain obtaining operations described herein as being performed by the network node. Any information, data, and/or signals may be received from a UE, another network node, and/or any other network equipment. Similarly, the antenna 1610, the communication interface 1606, and/or the processing circuitry 1602 may be configured to perform any transmitting operations described herein as being performed by the network node. Any information, data, and/or signals may be transmitted to a UE, another network node, and/or any other network equipment.

The power source 1608 provides power to the various components of the network node 1600 in a form suitable for the respective components (e.g., at a voltage and current level needed for each respective component). The power source 1608 may further comprise, or be coupled to, power management circuitry to supply the components of the network node 1600 with power for performing the functionality described herein. For example, the network node 1600 may be connectable to an external power source (e.g., the power grid, an electricity outlet) via input circuitry or an interface such as an electrical cable, whereby the external power source supplies power to power circuitry of the power source 1608. As a further example, the power source 1608 may comprise a source of power in the form of a battery or battery pack which is connected to, or integrated in, power circuitry. The battery may provide backup power should the external power source fail.

Embodiments of the network node 1600 may include additional components beyond those shown in Figure 16 for providing certain aspects of the network node's functionality, including any of the functionality described herein and/or any functionality necessary to support the subject matter described herein. For example, the network node 1600 may include user interface equipment to allow input of information into the network node 1600 and to allow output of information from the network node 1600. This may allow a user to perform diagnostic, maintenance, repair, and other administrative functions for the network node 1600.

Figure 17 is a block diagram of a host 1700, which may be an embodiment of the host 1416 of Figure 14, in accordance with various aspects described herein. As used herein, the host 1700 may be or comprise various combinations of hardware and/or software, including a standalone server, a blade server, a cloud-implemented server, a distributed server, a virtual machine, a container, or processing resources in a server farm. The host 1700 may provide one or more services to one or more UEs.
The host 1700 includes processing circuitry 1702 that is operatively coupled via a bus 1704 to an input/output interface 1706, a network interface 1708, a power source 1710, and a memory 1712. Other components may be included in other embodiments. Features of these components may be substantially similar to those described with respect to the devices of previous figures, such as Figures 15 and 16, such that the descriptions thereof are generally applicable to the corresponding components of the host 1700. The memory 1712 may include one or more computer programs, including one or more host application programs 1714 and data 1716, which may include user data, e.g., data generated by a UE for the host 1700 or data generated by the host 1700 for a UE. Embodiments of the host 1700 may utilize only a subset or all of the components shown.

The host application programs 1714 may be implemented in a container-based architecture and may provide support for video codecs (e.g., Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG, VP9) and audio codecs (e.g., FLAC, Advanced Audio Coding (AAC), MPEG, G.711), including transcoding for multiple different classes, types, or implementations of UEs (e.g., handsets, desktop computers, wearable display systems, heads-up display systems). The host application programs 1714 may also provide for user authentication and licensing checks and may periodically report health, routes, and content availability to a central node, such as a device in or on the edge of a core network. Accordingly, the host 1700 may select and/or indicate a different host for over-the-top services for a UE. The host application programs 1714 may support various protocols, such as the HTTP Live Streaming (HLS) protocol, Real-Time Messaging Protocol (RTMP), Real-Time Streaming Protocol (RTSP), Dynamic Adaptive Streaming over HTTP (MPEG-DASH), etc.

Figure 18 is a block diagram illustrating a virtualization environment 1800 in which functions implemented by some embodiments may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices, which may include virtualizing hardware platforms, storage devices, and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components. Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1800 hosted by one or more hardware nodes, such as a hardware computing device that operates as a network node, UE, core network node, or host. Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), the node may be entirely virtualized.

Applications 1802 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1800 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein. Hardware 1804 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth.
Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1806 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1808a and 1808b (one or more of which may be generally referred to as VMs 1808), and/or perform any of the functions, features, and/or benefits described in relation to some embodiments described herein. The virtualization layer 1806 may present a virtual operating platform that appears like networking hardware to the VMs 1808. The VMs 1808 comprise virtual processing, virtual memory, virtual networking or interfaces, and virtual storage, and may be run by a corresponding virtualization layer 1806. Different embodiments of the instance of a virtual appliance 1802 may be implemented on one or more of the VMs 1808, and the implementations may be made in different ways.

Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry-standard high-volume server hardware, physical switches, and physical storage, which can be located in data centers and customer premises equipment. In the context of NFV, a VM 1808 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Each of the VMs 1808, together with that part of the hardware 1804 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms a separate virtual network element. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 1808 on top of the hardware 1804, and corresponds to the application 1802.

Hardware 1804 may be implemented in a standalone network node with generic or specific components. Hardware 1804 may implement some functions via virtualization. Alternatively, hardware 1804 may be part of a larger cluster of hardware (e.g., such as in a data center or CPE) where many hardware nodes work together and are managed via management and orchestration 1810, which, among other things, oversees lifecycle management of applications 1802. In some embodiments, hardware 1804 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 1812, which may alternatively be used for communication between hardware nodes and radio units.

Although the computing devices described herein (e.g., UEs, network nodes, hosts) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions, and methods disclosed herein.
Determining, calculating, obtaining, or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and, as a result of said processing, making a determination.

Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware.

In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device, but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally.

REFERENCES
1. Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
2. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
3. Franke, J. K., J. Niehues, and A. H. Waibel. "Robust and Scalable Differentiable Neural Computer for Question Answering." (2018).

List of Figures
2.1 Illustration of the data flow of an RNN. x is the input data, y is the output of the RNN, and h is the hidden state.
2.2 Illustration of the high-level data flows in a DNC with an RNN as the controller. The input to the controller at each time step is the data x_t, the hidden state from the previous time step h_{t-1}, and a read vector from the external memory r_{t-1}. The output of the model is y_t, produced by the predictive layer.
2.3 Illustration of the data flow of the memory interaction component of the DNC.
3.1 Illustration of how the experience is updated when passing a sequence x_1, x_2, ..., x_T through a DNC. At the start of each new sequence, or epoch as depicted in the figure, all of the experience is set to zero.
3.2 Illustration of how the experience is updated when passing a sequence x_1, x_2, ..., x_T through a newly proposed method of training a DNC. At the start of each new sequence, or epoch as depicted in the figure, the experience is passed from the previous sequence or epoch.

3.3 Illustration of how the experience is updated when passing a sequence x_1, x_2, ..., x_T through a newly proposed method of training a DNC. For n = 1, 2, 3, ..., epoch 2n uses the experience from epoch 2n-1 and passes the sequence with the direction feature set to zero in a forward direction. Epoch 2n+1 uses the experience from epoch 2n and passes the sequence with the direction feature set to one in the opposite direction. At n = 0, the experience starts as zeros.
4.1 Plot of RTT time series 1, mapping target RTT against the time step. The unit for the time step is seconds.
4.2 Plot of RTT time series 2, mapping target RTT against the time step. The unit for the time step is seconds.
4.3 Plot of RWL time series 1 and 2 target, mapping target latency against the time step. The unit for the target is ms, and seconds for the time step.
6.1 Output of the LSTM and DNC on the validation set of RTT time series 1 against the true RTT. The values for the vertical axis are not shown due to business-sensitive information.
6.2 RTT time series 1 validation loss plot of the original DNC (blue), keeping memory between epochs (orange), and keeping memory and link matrix between epochs (green).
6.3 RTT time series 2 validation loss plot of the original DNC (blue), keeping memory between epochs (orange), and keeping memory and link matrix between epochs (green).
6.4 RTT time series 1 validation loss plot of the DNC with the original training method (blue), bi-directional method with only passing the memory (orange), and bi-directional method with passing the memory and link matrix (green).
6.5 RTT time series 2 validation loss plot of the DNC with the original training method (blue), bi-directional method with only passing the memory (orange), and bi-directional method with passing the memory and link matrix (green).
6.6 RTT time series 1 validation loss plot of the DNC trained with the method of reusing the memory (blue), method of reusing the memory and link matrix (orange), bi-directional method with passing the memory (green), and bi-directional method with passing the memory and link matrix (red).
6.7 RTT time series 1 validation loss plot of the DNC trained with the method of reusing the memory (blue), method of reusing the memory and link matrix (orange), bi-directional method with passing the memory (green), and bi-directional method with passing the memory and link matrix (red).
6.8 Read validation loss plot on RWL time series 1. Blue is the original method of training a DNC and orange is the bi-directional method.
6.9 Write validation loss plot on RWL time series 1. Blue is the original method of training a DNC and orange is the bi-directional method.
6.10 Read validation loss plot on RWL time series 2. Blue is the original method of training a DNC and orange is the bi-directional method.
6.11 Write validation loss plot on RWL time series 2. Blue is the original method of training a DNC and orange is the bi-directional method.
6.12 RTT time series 1 validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RTT time series 2 (green).
6.13 RTT time series 2 validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RTT time series 1 (green).

6.14 RWL time series 1 read validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 2 (green).
6.15 RWL time series 1 write validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 2 (green).
6.16 RWL time series 2 read validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 1 (green).
6.17 RWL time series 2 write validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 1 (green).

List of Tables
2.1 Table with the naming, notation, and domain of every component in the interface vector.
5.1 Parameters tuned for the LSTM as well as the values tested.
5.2 Parameters tuned for the DNC as well as the values tested.

Chapter 1 Introduction

As the amount of data in the world increases drastically and companies have a strong desire to become more data-driven, time series modeling is an important branch of machine learning to study. Some examples of use cases are predicting the stock market, forecasting the demand for products, and forecasting the weather. In this work, time series modeling is used in the telecom domain to predict the latency and round trip time of signals.

Time series modeling is an active research area, and many different approaches have been proposed to solve related tasks. One of these is the recurrent neural network (RNN), which has shown promising capabilities of capturing time dependencies in data. However, some recurrent neural networks, such as the Long Short-Term Memory (LSTM) network [1], have difficulty finding longer-term dependencies, which can be vital for solving some tasks [2]. Another branch of networks is memory augmented neural networks (MANNs), which can find long-term dependencies better than other RNNs because they have an accessible memory. The MANN studied in this work is the Differentiable Neural Computer (DNC) [3]. It is a neural network with an external memory that it can operate on and manipulate while still being fully differentiable. It has excelled in tasks where long-term dependencies exist, such as question answering tasks and tasks requiring processing of implicit or explicit graphs.

According to the current literature [4][5][6], DNCs are difficult and slow to train due to their external memory. Furthermore, little research has been done on applying the DNC to time series, even though the model brings promising properties to the time series domain. Therefore, this work has focused on improving the training process of the DNC in time series tasks. Specifically, there are three aspects that this work has aimed to improve with regard to DNCs trained on time series:

1. The rate of convergence: after how many epochs has the model converged on the validation data?
2. The performance: how low is the validation loss after convergence?
3. The stability: after convergence, how long does it take before the model overfits?

In this work, three training methods that attempt to improve training based on these aspects are presented:

1. Reusing experience: the memory and the link matrix are kept between epochs, instead of being reset.
2. Bi-directional training of the DNC: an extension of the first method that also adds a bi-directional scheme for the training.
3. Transfer learning via memory: a trained DNC's memory is passed to an untrained DNC (a minimal sketch of this method appears at the end of this chapter).

The empirical experiments conducted in this work showed that all three approaches improve the stability of the training, with the second method performing best. Furthermore, depending on the time series, the second method also improves the performance relative to the original way of training a DNC and can converge faster.

The structure of the report is as follows. After this chapter comes the Background chapter, which goes through the theoretical background needed to understand the work presented in this thesis. It also goes through related work, so that the novelty of the methods presented in this thesis is clear. After the background, the three proposed training methods are presented in detail in the Proposed methods chapter. The chapter after that, Datasets and setup, describes the datasets used and the setup for the experiments conducted. Then comes the Experiments chapter, which describes each of the experiments conducted in this work. This is followed by the Results and discussion chapter, and lastly by future work and conclusions.
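As a rough sketch of the third method, the snippet below copies the learned external-memory contents of a trained DNC into an untrained one before training begins; memory_matrix and link_matrix are hypothetical attribute names, standing in for however a concrete DNC implementation stores these quantities.

```python
import copy

def transfer_memory(trained_dnc, fresh_dnc):
    # Transfer learning via memory: seed an untrained DNC with the
    # memory matrix and temporal link matrix of an already-trained one.
    # Attribute names are illustrative, not from any specific library.
    fresh_dnc.memory_matrix = copy.deepcopy(trained_dnc.memory_matrix)
    fresh_dnc.link_matrix = copy.deepcopy(trained_dnc.link_matrix)
    return fresh_dnc
```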
Then comes the Experiments chapter, which describes each of the experiments conducted in this work. This is followed by a Results and discussion chapter and, lastly, future work and conclusions.

Chapter 2 Background

This chapter lays out the background of the thesis as well as related work. In the background, the focus is on explaining how the DNC works and on different types of transfer learning. The related work presents work that tries to improve the DNC as well as different transfer learning methods.

2.1 Recurrent neural networks

Neural networks are a set of algorithms, inspired by the human brain, that attempt and often excel at finding relations in large amounts of data. Typically, in a supervised setting, the algorithm's goal is to map a data input X to an output target Y. They have shown to be extremely successful in tasks where the mapping is complex, such as in natural language processing, computer vision, audio, and time series. One promising type of neural network is the recurrent neural network (RNN). RNNs have a particular architecture that allows them to find time dependencies in the data, which is extremely useful in tasks involving time series. A generalization of how RNN architectures look is visualized in Figure 2.1. At time step t of a time series, x_1, x_2, x_3, ..., a hidden state, h_{t-1}, and data, x_t, are passed to an RNN cell. This cell takes the input and creates a new representation h_t that is passed to the next time step. It also outputs y_t, which is supposed to fulfill some classification or regression task.

Figure 2.1: Illustration of the data flow of an RNN. x is the input data, y is the output of the RNN, and h is the hidden state.

Long-term dependencies in a time series, for example when the output at time step j depends on the data input x_i where j − i is large, can be difficult for RNNs to learn. The problem is vanishing and exploding gradients when performing backpropagation. The reason is that the connection from the output of the RNN to the input data it depends on is long and requires passing through the RNN cell many times, which can cause the gradient to vanish or explode [2]. One of the most influential RNNs that attempts to solve this problem is the Long Short-Term Memory (LSTM) network [1]. The LSTM allows the gradient to flow better through the cells by means of its forget gate. There are different variants of the LSTM, but for this work, the set of equations 2.1 is what defines it:

$$
\begin{aligned}
i_t^l &= \sigma\big(W_i^l\,[x_t;\; h_{t-1}^l;\; h_t^{l-1}] + b_i^l\big) \\
f_t^l &= \sigma\big(W_f^l\,[x_t;\; h_{t-1}^l;\; h_t^{l-1}] + b_f^l\big) \\
s_t^l &= f_t^l \circ s_{t-1}^l + i_t^l \circ \tanh\big(W_s^l\,[x_t;\; h_{t-1}^l;\; h_t^{l-1}] + b_s^l\big) \\
o_t^l &= \sigma\big(W_o^l\,[x_t;\; h_{t-1}^l;\; h_t^{l-1}] + b_o^l\big) \\
h_t^l &= o_t^l \circ \tanh(s_t^l)
\end{aligned}
\tag{2.1}
$$

where x_t is the input data at time t, σ(·) is the logistic sigmoid function, and i_t^l, f_t^l, s_t^l, o_t^l, and h_t^l are the input gate, forget gate, state, output gate, and hidden activation vectors, respectively, at time t and layer l. Furthermore, the W are the learnable weight matrices and the b are the learnable biases. For each layer, the hidden and state activation vectors start as zeros, and the hidden activation is always zero at the zeroth layer.

2.2 Differentiable neural computer

A. Graves et al. in [3] present their novel model, the differentiable neural computer (DNC). It is an architecture consisting of a neural network and an external memory matrix from which the neural network can read and write. The whole architecture is differentiable, allowing the neural network to learn how to operate on and manipulate the external memory end-to-end with gradient descent. By having an external memory that the neural network can read from and write to, the network can encode the input data and store it in the memory, allowing it to remember data over long timescales. To show this, [3] tested the DNC on implicit and explicit graph problems, which showed that it is capable of representing variables and data structures for memorization. For implicit graphs, they tested the DNC on the bAbI dataset. This dataset contains question-answering tasks that are structured such that the question and the context of the question can be interpreted as a graph. An example of context and question is "John is in the playground. John picked up the football. Where is the football?". "John", "playground", and "football" can be thought of as nodes while "is in" and "picked up" are edges that connect the nodes. The DNC gave a mean test error rate of 3.8% compared to the previous best-performing model's error rate of 7.5% [7]. The explicit-graph experiments in [3] showed that the DNC has good long-term memory. Examples of the tasks are traversal and shortest-path tasks using London's underground subway system. The DNC was trained on randomly generated graphs to learn how to memorize a graph as well as to do a task based on the memorized graph. The graphs are presented to the DNC by letting the inputs be triplets specifying two nodes and the edge connecting them. After all the triplets that make up the graph have been presented to the network, the DNC is asked to do a task such as traversal or finding the shortest path. It is trained with curriculum learning by starting with simpler graphs and then gradually increasing the complexity. After training on randomly generated graphs, it is presented with a real graph it hasn't seen before, such as London's underground subway. It was compared to an LSTM with an extensive hyperparameter search and performed significantly better.
On the traversal task, the LSTM couldn't complete the easiest task from the curriculum learning, reaching 37% accuracy. The DNC reached 98.8% accuracy with only half of the training examples that the LSTM trained on. This experiment showed that DNCs are significantly better at representing and memorizing graphs than LSTMs.

2.2.1 How DNCs work

At a high level, the DNC can be described as containing three components:

1. A controller, which is a feed-forward or recurrent neural network. In this work, the controller is an LSTM.
2. A memory interaction component.
3. A predictive layer.

The flow of the data can be seen in Figure 2.2. First, the input goes through the controller, which outputs a hidden state h_t. This state is sent to the predictive layer as well as to the memory interaction component. Using the hidden state, the DNC decides what should be written to the memory as well as where it should be written. Furthermore, the DNC uses the hidden state to decide what to read from the memory. The writes and reads happen in the memory interaction component and are described in more depth later. From the memory interaction component, what is read is sent to the predictive layer as well as to the controller at the next time step. The predictive layer uses what has been read from memory together with the hidden state to make a prediction.
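To make this flow concrete, the sketch below shows one DNC time step in PyTorch, reduced to a single read head with content addressing only; the class and attribute names (DNCStep, key, predict) are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNCStep(nn.Module):
    """One DNC time step: controller -> memory interaction -> predictive layer.
    A simplified sketch (one read head, content addressing only)."""

    def __init__(self, input_size, hidden_size, cell_size, output_size):
        super().__init__()
        # The controller sees the data and the previous read vector.
        self.controller = nn.LSTMCell(input_size + cell_size, hidden_size)
        self.key = nn.Linear(hidden_size, cell_size)      # read key emitted from h_t
        self.predict = nn.Linear(hidden_size + cell_size, output_size)

    def forward(self, x_t, r_prev, state, memory):
        # x_t: (1, input_size), r_prev: (1, W), memory: (N, W)
        h_t, c_t = self.controller(torch.cat([x_t, r_prev], dim=-1), state)
        # Read by cosine similarity between the emitted key and every memory row.
        w_r = F.softmax(F.cosine_similarity(memory, self.key(h_t), dim=-1), dim=-1)
        r_t = (w_r @ memory).unsqueeze(0)                 # read vector
        y_t = self.predict(torch.cat([h_t, r_t], dim=-1)) # prediction
        return y_t, r_t, (h_t, c_t)
```

Unrolled over a sequence, r_t and (h_t, c_t) from one step become r_prev and state at the next, matching the loop in Figure 2.2.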


Figure 2.2: Illustration of the high-level data flows in a DNC with an RNN as the controller. The input to the controller at each time step is the data x_t, the hidden state from the previous time step h_{t-1}, and a read vector from the external memory r_{t-1}. The output of the model is y_t, produced by the predictive layer.

2.2.2 Memory interaction component

The input to the memory interaction component is the hidden state of the controller, h_t. In the interface creation block it is transformed into an interface vector, illustrated in Figure 2.3. The transformation is done by multiplying the hidden state with a learnable weight matrix, as shown in Eq. 2.3:

$$\xi_t = W_{\xi}\, h_t \tag{2.3}$$

$$\mathrm{oneplus}(x) = 1 + \log(1 + e^{x}) \tag{2.2}$$

The components of the interface vector, which can be found in Table 2.1 and Eq. 2.5, decide how the DNC interacts with the external memory. To make sure that they lie in the correct domains, three different functions are applied to them. The logistic sigmoid function is used to constrain components to [0, 1], the oneplus function (Eq. 2.2) is used to constrain components to [1, ∞), and lastly, softmax is used to constrain vectors to the (N − 1)-dimensional unit simplex (Eq. 2.4).

Figure 2.3: Illustration of the data flow of the memory interaction component of the DNC.

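As a small illustration, the three domain constraints can be written directly in PyTorch; the variable names below are made up for the example.

```python
import torch
import torch.nn.functional as F

def oneplus(x):
    # Eq. 2.2: maps any real value into [1, inf); softplus is the
    # numerically stable form of log(1 + e^x).
    return 1 + F.softplus(x)

raw = torch.randn(5)                    # unconstrained interface components
gate = torch.sigmoid(raw[0])            # scalar gate -> [0, 1]
strength = oneplus(raw[1])              # read/write strength -> [1, inf)
read_mode = F.softmax(raw[2:], dim=-1)  # vector on the unit simplex
```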

The interface vector is then sent to the write and read heads. In the write head, the DNC decides what to write into memory and where to write it. The memory is a matrix of size N × W and is updated by the write head at each time step according to Eq. 2.6:

$$M_t = M_{t-1} \circ \big(J - w_t^{w}\, e_t^{\top}\big) + w_t^{w}\, v_t^{\top} \tag{2.6}$$

where ∘ is element-wise multiplication, J is a matrix of ones with the same size as the memory, and w_t^w is the write weighting. The interpretation of Eq. 2.6 is that the erase vector e_t removes information from the memory and the write vector v_t adds new information. The write weighting w_t^w is computed in Eq. 2.7:

$$w_t^{w} = g_t^{w}\big[g_t^{a}\, a_t + (1 - g_t^{a})\, c_t^{w}\big] \tag{2.7}$$

where a_t and c_t^w are computed by attention mechanisms: a_t by the dynamic memory allocation mechanism and c_t^w by content-based addressing. These attention mechanisms are described in detail in Sections 2.2.3 and 2.2.4.

Table 2.1: Table with naming, notation, and domain of every component in the interface vector.

Continuing with the other components of Eq. 2.7, g_t^w is the write gate, which says how much should be written to the memory; if it is 0, then nothing in memory will change. g_t^a is the allocation gate and says which of the two attention mechanisms is prioritized.
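A direct transcription of Eqs. 2.6 and 2.7 in PyTorch could look as follows; the function names and tensor shapes are assumptions for illustration.

```python
import torch

def write_memory(M_prev, w_w, e, v):
    # Eq. 2.6: erase old content, then add new. M_prev: (N, W),
    # w_w: (N,) write weighting, e: (W,) erase vector, v: (W,) write vector.
    J = torch.ones_like(M_prev)
    return M_prev * (J - torch.outer(w_w, e)) + torch.outer(w_w, v)

def write_weighting(g_w, g_a, a, c_w):
    # Eq. 2.7: the write gate g_w scales an interpolation (allocation gate g_a)
    # between the allocation weighting a and the content weighting c_w.
    return g_w * (g_a * a + (1 - g_a) * c_w)
```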

After writing to memory, the DNC reads from memory in a non-binary fashion by weighting how much is to be read from each memory cell (Eq. 2.8).


$$r_t^{i} = M_t^{\top}\, w_t^{r,i} \tag{2.8}$$

Similar to the write head, the read weight is computed with content-based addressing, but also with another attention mechanism called temporal memory linkage. Eq. 2.9 shows how the read weight is computed:

$$w_t^{r,i} = \pi_t^{i}[1]\, b_t^{i} + \pi_t^{i}[2]\, c_t^{r,i} + \pi_t^{i}[3]\, f_t^{i} \tag{2.9}$$

where f_t^i and b_t^i come from the temporal memory linkage while c_t^{r,i} comes from content-based addressing. The read mode π_t^i causes the read weight to be an interpolation between them by weighing their importance.
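In code, the read side of Eqs. 2.8 and 2.9 is a weighted sum over memory rows; again the names are illustrative, and the read mode is 0-indexed here where the equation uses 1..3.

```python
import torch

def read_weighting(pi, b, c_r, f):
    # Eq. 2.9: interpolate backward, content, and forward weightings
    # by the read mode pi (a point on the 3-way unit simplex).
    return pi[0] * b + pi[1] * c_r + pi[2] * f

def read_vector(M, w_r):
    # Eq. 2.8: non-binary read; a convex combination of the memory rows.
    return M.t() @ w_r    # M: (N, W), w_r: (N,) -> (W,)
```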

2.2.3 Content-based addressing

Arguably the most important attention mechanism is content-based addressing, which can be thought of as the mechanism that allows the controller to communicate directly what content to read from and write to the memory. From the interface vector, there is a query key for each read and write head, which is compared to the content of each location in the memory. The more similar the key is to the content, the more that content is read or overwritten. More concretely, the comparison is done by measuring the cosine similarity as in Eq. 2.10, which is then normalized over all the similarities through Eq. 2.11:

$$D(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert} \tag{2.10}$$

$$\mathcal{C}(M, k, \beta)[i] = \frac{\exp\big(\beta\, D(k, M[i,\cdot])\big)}{\sum_j \exp\big(\beta\, D(k, M[j,\cdot])\big)} \tag{2.11}$$

Here k is the query key and β determines the strength of the normalization. The larger β is, the closer the normalized similarities are to binary values, which would closely mimic a normal computer fetching content from memory (either you read from an address or you don't).
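A minimal sketch of Eqs. 2.10-2.11, assuming M has shape (N, W) and k shape (W,):

```python
import torch
import torch.nn.functional as F

def content_addressing(M, k, beta):
    # Eq. 2.10: cosine similarity between the key and every memory row;
    # Eq. 2.11: sharpened by beta and normalized with softmax.
    sim = F.cosine_similarity(M, k.unsqueeze(0), dim=-1)   # (N,)
    return F.softmax(beta * sim, dim=-1)                   # near one-hot for large beta
```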

2.2.4 Dynamic memory allocation

The attention mechanism that controls where to free and allocate memory is called dynamic memory allocation. It works by having a usage counter for each location of the memory. When freeing and allocating memory, the locations with the lower usage counters are used more. The counter is increased if the location is written to and optionally decreased when it is read. The usage vector u_t, which contains the usage counters, is defined and updated as follows:

$$u_t = \big(u_{t-1} + w_{t-1}^{w} - u_{t-1} \circ w_{t-1}^{w}\big) \circ \psi_t \tag{2.13}$$

ψ_t is defined in Eq. 2.12 and is referred to as the memory retention vector, as it decides whether the usage counter of a memory cell is kept or removed, i.e. when ψ_t[i] = 1 or ψ_t[i] = 0 respectively:

$$\psi_t = \prod_{i=1}^{R} \big(1 - f_t^{i}\, w_{t-1}^{r,i}\big) \tag{2.12}$$

Assuming ψ_t[i] = 1, the usage counter of a memory cell is high if either the usage was high in the past time step, u_{t-1}[i], or the cell was written to in the past time step, w_{t-1}^w[i]. Finally, after the usage vector has been computed, the allocation weighting a_t is determined; it says where in the memory the new information is to be allocated. To compute this weighting, the indices of the usage vector are first stored in φ_t, sorted in ascending order by each index's usage counter. The index with the lowest usage is then φ_t[1]. Using φ_t, the allocation weighting is computed as in Eq. 2.14:

$$a_t[\phi_t[j]] = \big(1 - u_t[\phi_t[j]]\big) \prod_{i=1}^{j-1} u_t[\phi_t[i]] \tag{2.14}$$
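The whole allocation step (Eqs. 2.12-2.14) fits in a few lines; the exclusive cumulative product implements the sorted free-list, with shapes assumed as commented.

```python
import torch

def allocation_weighting(u_prev, w_w_prev, w_r_prev, free_gates):
    # u_prev, w_w_prev: (N,); w_r_prev: (R, N) read weightings; free_gates: (R,)
    # Memory retention (Eq. 2.12): product over the R read heads.
    psi = torch.prod(1 - free_gates.unsqueeze(1) * w_r_prev, dim=0)
    # Usage update (Eq. 2.13): raised by writes, lowered by freed reads.
    u = (u_prev + w_w_prev - u_prev * w_w_prev) * psi
    # Allocation (Eq. 2.14): lowest-usage cells are allocated first.
    sorted_u, phi = torch.sort(u)                    # ascending usage
    prod = torch.cumprod(sorted_u, dim=0)
    prod = torch.cat([torch.ones(1), prod[:-1]])     # exclusive cumulative product
    a = torch.zeros_like(u)
    a[phi] = (1 - sorted_u) * prod
    return a, u
```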

2.2.5 Temporal memory linkage

The last of the three attention mechanisms, temporal memory linkage, tries to remember in which order data is written into the external memory. This mechanism is useful when the DNC needs to retrieve data from memory in the correct order. With this mechanism, the DNC can retrieve data in order or reverse order, but the DNC still has the option to ignore the order completely.


The network remembers the order in which data has been written to memory with a temporal link matrix L_t, where L_t[i, j] is the degree to which cell i of the memory was written to after cell j. To compute L_t, a precedence weighting p_t is needed, where p_t[i] represents the degree to which cell i was the last one written to. p_t is computed as in Eq. 2.15:

$$p_t = \Big(1 - \sum_i w_t^{w}[i]\Big)\, p_{t-1} + w_t^{w} \tag{2.15}$$

Using the precedence weighting, L_t is updated as follows:

$$L_t[i, j] = \big(1 - w_t^{w}[i] - w_t^{w}[j]\big)\, L_{t-1}[i, j] + w_t^{w}[i]\, p_{t-1}[j] \tag{2.16}$$

The motivation behind Eq. 2.16 is for the link matrix to remove i and j's old links if they are being updated, and to create new ones if i is written to after j.

To make use of L_t, the following equations are used:

$$f_t^{i} = L_t\, w_{t-1}^{r,i} \tag{2.17}$$

$$b_t^{i} = L_t^{\top}\, w_{t-1}^{r,i} \tag{2.18}$$

where f_t^i and b_t^i become the forward and backward weightings used in Eq. 2.9.

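The linkage updates (Eqs. 2.15-2.18) in the same sketch style, with assumed shapes in the comments:

```python
import torch

def update_linkage(L_prev, p_prev, w_w):
    # L_prev: (N, N) link matrix; p_prev, w_w: (N,)
    # Precedence weighting (Eq. 2.15).
    p = (1 - w_w.sum()) * p_prev + w_w
    # Link matrix (Eq. 2.16): fade old links of written cells, add new ones.
    wi = w_w.unsqueeze(1)                    # column: w_w[i]
    wj = w_w.unsqueeze(0)                    # row:    w_w[j]
    L = (1 - wi - wj) * L_prev + wi * p_prev.unsqueeze(0)
    L.fill_diagonal_(0)                      # a cell never links to itself
    return L, p

def directional_weightings(L, w_r_prev):
    # Eqs. 2.17-2.18: forward and backward read weightings.
    return L @ w_r_prev, L.t() @ w_r_prev
```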

2.3 Transfer learning

The field of transfer learning attempts to transfer knowledge between different but similar domains. One could, for example, train a model on a task A and then transfer what it has learned to solve another task B. By taking advantage of what has been learned from domains similar to the target domain, increased performance, better generalization, and a reduced need for target-domain data can be achieved. The types of transfer learning problems can be categorized into four categories according to [8]:

Transductive transfer learning is when the target task is the same as the source task, but the domains are different.

Inductive transfer learning refers to the problem of transferring knowledge between tasks that are different from each other. The domains may be the same or different, but the source task is different from the target task.

Unsupervised transfer learning is similar to inductive transfer learning in that the target and source tasks are different. However, the difference is that unsupervised transfer learning focuses on unsupervised learning tasks, i.e. when the tasks don't have any labels.

Homogeneous transfer learning is when the domain and task of the source and target are both the same. If both domain and task are different, then it is called heterogeneous transfer learning.

The above categorizes the different problems in transfer learning. According to the survey on transfer learning [9], there are four different approaches to solving the problems.

The instance-based approach assumes the source domain is similar enough to the target domain that the source data can be reused and trained on together with the target data. Examples of techniques are instance reweighting and importance sampling.

The feature-representation-based approach attempts to learn good feature representations from the source data and then apply the learned representations to the target data. The assumption is that the way of representing data can contain useful information for the target task.

The parameter-based approach assumes that there is knowledge in the hyperparameters or weights of a model trained on the source task that can be transferred to a model that solves the target task.

The relational-based approach tries to handle data that is not independent and identically distributed (not IID). The knowledge transferred is the relationship between the data points. An example of data where this could be useful is social networks.

The work in this thesis focuses on the transductive transfer learning problem, using feature-representation-based and instance-based approaches.

2.4 Related work area

In this section, work that is similar or relevant to this thesis is presented: in particular, work on improving the DNC as well as work on transfer learning that solves problems similar to the methods presented in this thesis. A lot of the work that tries to improve the DNC focuses on increasing the speed and reducing the complexity of training it; however, this still remains a problem. For example, A. Yadav and K. Pasupa in [4] claim that training a complex DNC with large memory matrices is slow, and V. Knippenberg et al. in [5] say that curriculum learning is required (for their task) and that training on large-scale instances is inefficient. Y. Tao and Z. Zhengya state that "the enhanced performance of DNC comes at a high computational cost, complex memory operations, and specifically history-based attention mechanisms".

2.4.1 Existing improvements on the DNC

Robert Csordas and Juergen Schmidhuber in [10] presented three limitations of the DNC architecture which they improved upon. The first criticism was about how content lookup attempts to read and write data. In this mechanism, the controller emits a read key that contains partial information about what the network wants to remember, and the external memory serves the purpose of completing the partial memory. The problem is that the content-based addressing will use the irrelevant information of the key (the information that should be completed by the external memory) to fetch the contents of the memory. This problem was solved by masking the irrelevant portions of the query key and memory. A mask vector is emitted by the controller which is multiplied by the query key and the memory content before they are compared.

Another improvement they make is to the sharpness of the temporal link distributions. Since the interaction with the external memory is entirely continuous, the write weighting, w_t^w, will often not be one-hot and will likely contain some noise, which is transferred to the link matrix, L_t. If there are multiple consecutive reads based on the temporal linkage, i.e. when the backward or forward read mode in Eq. 2.9 is large, the noise becomes exponentially worse. They alleviate this problem, although they don't remove it completely, by applying a sharpness enhancement step to the generation of the temporal link distribution.

Content-based addressing [3] has no key-value separation, meaning contents in the memory act both as the value and as the key. Other papers, such as [4] and [11], separate the key-value pair so that the key and value are two separate vectors. [11] presents an extension of the Neural Turing Machine (NTM) that makes the model's interaction with the external memory trainable, i.e. how the model reads from and writes to memory, similar to the DNC. They name this method the Dynamic Neural Turing Machine (D-NTM). In each memory cell of the D-NTM, there are two vectors: a content vector (value) and an address vector (key). The address vectors of the memory are learnable parameters of the model, which is different from how the DNC learns how to read and write to memory. This means that during training the D-NTM learns what the addresses should be, and during inference they stay constant. The content part of the memory, however, is reset to a zero matrix after training and is updated during inference based on the data processed by the controller.

[4] also introduced a key-value pair separation, but instead of being an extension of the NTM like the D-NTM, this was an extension of the DNC. The controller in the original DNC emits a look-up vector (key) that decides what information is read from the memory by how similar it is to the whole memory cells. By reading memory cells that are already similar to what the controller can produce, not a lot of useful new information is provided for making a prediction. What [4] introduced was to have only part of each memory cell be compared to the look-up vector, with the rest of the memory cell being what is written to or read from the memory. The benefit of this is that new information is gained that the controller might not have been able to produce. Additionally, the look-up keys are smaller, which reduces computational time.

Jorg Franke et al. in [12] extend the DNC by creating a bidirectional architecture, enabling it to make an inference based on information in the future. They argue that the unidirectional architecture of the DNC can be limiting in certain NLP tasks. The primary example they give is a question-answering task where the question is posed in the middle of the text. In that case, having a bidirectional architecture that can take later context into account can be very useful. Furthermore, they also extend the DNC by making the training more robust and the memory slimmer. Their model gave state-of-the-art performance on the bAbI task at the time their paper was released.

2.4.2 Transfer learning in time series

Although not explored as much as in computer vision and natural language processing, research has been done on how to do transfer learning in time series. H.I. Fawaz et al. [13] did extensive research on testing transfer learning on time series. They had 85 datasets, all of which were time series, pre-trained one neural network for each dataset, and then fine-tuned the models on the other datasets, creating 7140 different models in total. They found that transfer learning can improve the models but also degrade them. This led them to use Dynamic Time Warping, an algorithm measuring how similar two time series are. Using this algorithm, they were able to choose which datasets are better to use as the source for transferring knowledge to the target.

Rui Ye and Qun Dai [14] propose a transfer learning framework which they term DTr-CNN. In their work, they first select datasets using Dynamic Time Warping [13] and Jensen-Shannon divergence [15]. Then, by defining a loss where the objective is to minimize the dissimilarities between the source and target domains, knowledge from the source domain can be transferred to the target task.

2.4.3 Feature representation transfer learning and domain adaptation

In deep neural networks, typically the early layers’ output features are general while the deeper features along the network become more specific to the task. Mingsheng Long et al. in [16] claim that the transferability in the deeper layers reduces with an increase in domain discrepancy. They proposed Deep Adaptation Network (DAN) that aims to improve feature transferability in the deeper layers. The method reduces the domain discrepancy by embedding the hidden representations at the deeper levels of the network to a reproducing kernel Hilbert space. DAN learns transferable features that can improve the target task where little to no data is labeled.

Baochen Sun and Kate Saenko in [17] suggest an unsupervised domain adaptation algorithm, Deep CORAL, an extension of CORAL [18]. The original CORAL aligns the source's and target's second-order statistics. However, it relies on a linear transformation, which Deep CORAL improves upon. The Deep CORAL method works by adding a loss function whose objective is to minimize the difference in correlations between the source and target. Specifically, the target data and source data are passed through a network, which is then used to calculate the covariance matrices. The objective is to minimize the difference between the source covariance matrix and the target covariance matrix.

2.4.4 Instance-based transfer learning

A commonly researched transfer learning method among instance-based methods is importance weighting. This method attempts to solve the problem of having different input distributions between the source and the target data. It does this by weighing the samples by the ratio between the target and source input densities, i.e. the weight for a sample x is

$$w(x) = \frac{p_{\text{target}}(x)}{p_{\text{source}}(x)}$$

where p_target and p_source are the target's and source's respective input densities. One of the key problems is estimating this ratio. Sugiyama et al. [19] estimate the ratio with a method they devised called the Kullback-Leibler Importance Estimation Procedure (KLIEP). The method estimates the weights directly and doesn't have to estimate the densities. The idea of their method is to find the estimate ŵ(x) that minimizes the Kullback-Leibler divergence from the target's input density to its estimate ŵ(x) p_source(x). The algorithm they propose solves this optimization problem without having to model the input densities.

Dai et al. [20] took another approach, focusing on how to adapt the optimization to alleviate the problem of having different input densities. They propose the transfer learning framework TrAdaBoost, which is based on AdaBoost [21]. With this framework, the source and target data are combined and trained on together. The framework then changes the weights of instances based on their impact on the learner, taking into account the dissimilarities in distribution.


Chapter 3

Proposed methods

In this chapter, three different methods are presented that in some way take advantage of a subset of what will be called the experience. The experience is defined as what the DNC produces in the external memory after a sequence has been passed through: specifically, the DNC's memory M_t, link matrix L_t, precedence vector p_t, read weightings w_t^{r,i}, write weighting w_t^w, and usage vector u_t. The first method, Section 3.1, reuses the experience to better mimic the DNC's practical application to time series; the second method, Section 3.2, builds upon the first method but produces a more meaningful experience by running the DNC in both directions; and lastly, the method in Section 3.3 introduces a novel and versatile way of performing transfer learning.
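For the sketches in the rest of this chapter, it is convenient to collect the experience in one container; this Experience dataclass and its zero initializer are hypothetical bookkeeping helpers, not the thesis code.

```python
import torch
from dataclasses import dataclass

@dataclass
class Experience:
    """What the DNC leaves in its external memory after a pass (Chapter 2)."""
    M: torch.Tensor    # memory matrix, (N, W)
    L: torch.Tensor    # temporal link matrix, (N, N)
    p: torch.Tensor    # precedence vector, (N,)
    w_r: torch.Tensor  # read weightings, (R, N)
    w_w: torch.Tensor  # write weighting, (N,)
    u: torch.Tensor    # usage vector, (N,)

def zero_experience(N, W, R):
    z = torch.zeros
    return Experience(z(N, W), z(N, N), z(N), z(R, N), z(N), z(N))
```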

3.1 Reusing experience

Normally, when training a DNC, the experience is reset to zeros for each new sequence. For time series modeling where there is only one sequence, the experience would be reset every epoch, as shown in Figure 3.1. The method proposed in this section is to reuse a subset of the experience instead of resetting it, as illustrated in Figure 3.2. For a time series that is T steps long, the subset of the experience produced at epoch n is passed on: in the next epoch, this subset is used at time step 1 and then updated normally as the DNC moves along the time series. The subset of the experience that is passed on is the memory, M_t, and the link matrix, L_t.
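A minimal sketch of this scheme, assuming a helper dnc_forward(model, series, exp) that unrolls the DNC over the whole series and returns the output and the new experience (both names are assumptions), plus the Experience helpers above:

```python
import torch
import torch.nn.functional as F

def train_reuse_experience(model, dnc_forward, series, target, optimizer,
                           num_epochs, N=20, W=64, R=1):
    # Section 3.1: memory M and link matrix L survive between epochs
    # (detached from the previous graph); the rest of the experience resets.
    exp = zero_experience(N, W, R)
    for _ in range(num_epochs):
        carried = zero_experience(N, W, R)
        carried.M = exp.M.detach()
        carried.L = exp.L.detach()
        output, exp = dnc_forward(model, series, carried)
        loss = F.mse_loss(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```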

The motivation for this method is twofold. Firstly, when doing inference on time series in the real world, ideally the model would be trained up until time step t and then make an inference at time step t + 1. This also includes transferring the experience from time step t to make the most accurate prediction at time step t + 1. What is meant by this is that when the network is applied, validated, or tested, it will always make an inference based on an experience. Therefore, it is important that the model is also trained in such a manner. However, looking at the original way of training DNCs, in the first time step there is no experience, and the experience in the consecutive steps will also be limited. When trained this way, these early time steps will encourage the network to learn how to make decisions based on an empty memory, which will not be the case during inference. In the proposed method, the memory will never be empty, and the network will therefore be encouraged to learn how to interact with it more than in the original method. Hence, the proposed method could benefit training.

Figure 3.1: Illustration of how the experience is updated when passing a sequence, X_1, X_2, ..., X_T, through a DNC. At the start of each new sequence, or epoch as depicted in the figure, all of the experience is set to zero.


Figure 3.2: Illustration of how the experience is updated when passing a sequence, X_1, X_2, ..., X_T, through a newly proposed method of training a DNC. At the start of each new sequence, or epoch as depicted in the figure, the experience is passed from the previous sequence or epoch.

3.2 Bi-directional training of DNC

A problem that could arise with the reusing-experience method is when there are no long-term dependencies in the time series. Say there is a time series of 100 steps and the output at time step i is only directly dependent on the data at time step j if |i − j| ≤ 20. Say also that the memory has fewer than 20 memory cells. When training on this time series, the DNC is encouraged to store only the 20 latest time steps in the memory. This means that when passing the experience to the next epoch, the memory only contains information about the last 20 steps of the time series, which will be useless for the early steps. If the memory is irrelevant to making the correct prediction, the DNC will not be encouraged to learn how to read from memory.

The training method presented in this section extends the previous method by adding bi-directionality to the training. The motivation for training the DNC in a bi-directional manner when keeping the experience is that the DNC will always keep the memory relevant. The flow of the DNC, including the experience, of this novel method is illustrated in Figure 3.3. With this method, in every odd epoch the DNC passes through the time series in the forward direction, while in every even epoch the time series is passed backward. For the DNC to understand in which direction the time series is passed, an additional feature is added that is zero in the forward direction and one in the backward direction. Between every epoch, a subset of the experience is passed. If the link matrix is included in this subset, it is transposed, because the DNC will pass the time series in the reverse order.

Other than always keeping the memory relevant, a benefit of reversing the direction at each epoch is that the memory will contain dependencies of the opposite direction, which could encourage the DNC to learn how to read the memory. The way reading from memory works is that the key emitted by the controller contains partial information about what is supposed to be read, and the memory's role is to complement the controller with information that the key doesn't have. If the memory doesn't contain any new information that the controller can't produce by itself, then there is no reason for the DNC to learn how to interact with the memory. By reversing the direction of the DNC at each epoch, it is more likely that the memory will contain information that the controller can't produce by itself. This is because the memory will contain information about data dependencies opposite to the direction the DNC is going. For example, when passing the time series in the forward direction, the controller at time step t will emit a key based on the data at time steps t, t − 1, t − 2, ..., while the memory will contain information based on the data at t, t + 1, t + 2, ..., unless it has been overwritten. Therefore, with this bi-directional training scheme, the memory will contain information that the controller can't produce by itself, and the DNC will be encouraged to learn how to read from the memory. This could improve training convergence.
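Under the same assumptions as the previous sketch, the bi-directional scheme only adds the direction flip, the 0/1 direction feature, and the transposed link matrix:

```python
import torch
import torch.nn.functional as F

def train_bidirectional(model, dnc_forward, series, target, optimizer,
                        num_epochs, N=20, W=64, R=1):
    # Section 3.2: alternate direction every epoch; carry M, and L transposed
    # whenever the temporal order reverses.
    exp = zero_experience(N, W, R)
    for epoch in range(num_epochs):
        backward = epoch % 2 == 1                    # every second pass is reversed
        x = torch.flip(series, dims=[0]) if backward else series
        y = torch.flip(target, dims=[0]) if backward else target
        flag = torch.full((x.shape[0], 1), float(backward))
        x = torch.cat([x, flag], dim=-1)             # direction feature: 0 fwd, 1 bwd
        carried = zero_experience(N, W, R)
        carried.M = exp.M.detach()
        carried.L = exp.L.detach().t().contiguous()  # reverse the temporal links
        output, exp = dnc_forward(model, x, carried)
        loss = F.mse_loss(output, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```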

No prior work has been found that does something similar to the bi-directional training method. The most similar work would be the rsDNC presented by Jorg Franke et al. in [12], which has a bidirectional architecture. Although the method in our work also has bi-directionality, their work is fundamentally different both in what the bi-directionality refers to and in the application of the method. Their model's bi-directionality refers to the architecture of the model and therefore also to how the model does inference. The method in this work refers to the training scheme, to improve the training of the original DNC. The application of their model is question-answering tasks in which "future" information is accessible: the context of a question can come after the question is asked. This is why their bidirectional architecture can be useful. However, the application of the DNC in this work is time series where no future information is accessible, for example forecasting round trip time.

Figure 3.3: Illustration of how the experience is updated when passing a sequence, X_1, X_2, ..., X_T, through a newly proposed method of training a DNC. For n = 1, 2, 3, ..., epoch 2n uses the experience from epoch 2n − 1 and passes the sequence with the direction feature set to zero in the forward direction. Epoch 2n + 1 uses the experience from epoch 2n and passes the sequence with the direction feature set to one in the opposite direction. At n = 0, the experience starts as zeros.

3.3 Transfer learning

In this section, a novel approach to transfer learning is presented. The method is to train a model on the source data and then transfer a subset of its experience to the untrained target model. In this work, the part of the experience that is transferred is the memory and the link matrix. At each epoch of training the target DNC, the source memory and link matrix are passed to the model at the first time step of the time series. The target model is then allowed to update the experience by itself in the consecutive time steps.
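As a sketch, the per-epoch hand-over from a trained source model reuses the same helpers; saved_source_exp is assumed to be the experience stored at the source model's best validation epoch.

```python
import torch

def transferred_experience(saved_source_exp, N=20, W=64, R=1):
    # Section 3.3: the target starts each epoch from the source's memory and
    # link matrix; everything else is zero and evolves freely afterwards.
    carried = zero_experience(N, W, R)
    carried.M = saved_source_exp.M.detach().clone()
    carried.L = saved_source_exp.L.detach().clone()
    return carried

# Inside the target's ordinary training loop, per epoch:
#   carried = transferred_experience(saved_source_exp)
#   output, _ = dnc_forward(target_model, target_series, carried)
```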

There are multiple benefits to being able to do transfer learning via the memory of the DNC, because of how flexible the method is. It doesn't rely on the dimensions of the input features of the source and target being the same. This is very useful, for example, when you want to add additional features to the target data but can't do so for the source. Furthermore, the method is flexible because it doesn't need to use the same controller: for example, the source DNC could use a feed-forward neural network while the target DNC uses a recurrent neural network. The potential of this transfer method compared to training the DNC normally is that it helps the DNC learn how to interact with the external memory. This is motivated by two reasons. Firstly, by passing a memory that contains useful information, the controller in the DNC will learn how to interact with a "good" memory from the first epoch. In the original method, the early epochs will likely not have a useful memory because the DNC hasn't learned how to interact with the memory yet. With a useful memory, the controller will be encouraged to emit an encoding that is similar to the contents of the memory in order to read information with content-based addressing. In this way, the controller could learn how to encode the data similarly to how the source model encoded it. Secondly, after reading from the memory, the predictive layer needs to learn how to make a prediction based on what has been read. For these two aspects, this transfer learning method encourages the target to learn how to interact with the memory and could potentially improve the convergence rate. Another benefit of this method is that new information is brought from the source to the target. The transferred memory is, more or less, many encodings of the source data. This memory then contains information that could be useful for the target model so that it can generalize better. This could occur, for example, because the predictive layer will possibly encounter a more diverse set of read vectors by reading from both the source memory and the target memory created after it has been updated with the target time series. No previous work has been found that presents anything similar to this transfer learning method. Also, the method doesn't fit well into only one of the transfer learning approaches defined in the survey [9], but could rather be argued to fit multiple approaches. From one perspective, the experience contains encodings of the source data, which are passed to the target model. This is similar to the instance-based approach in that source data is transferred; however, it is the encodings of the time series and not the original instances. What is also different with this method compared to many instance-based approaches is that instance-based approaches usually consider the dissimilarities between the source and target distributions. The method presented in this work does not consider the data distributions. The method can also be compared to feature-representation-based approaches that try to apply learned representations from the source to the target data. Transferring the memory, as stated previously, could encourage the target controller to learn how to represent the time series similarly to how the source controller represents the data. Lastly, the memory could be argued to be a matrix of parameters that the trained model learns as it passes through the time series.
This would then be considered a parameter-based approach.

Chapter 4 Datasets and setup

4.1 Datasets

In this study, the chosen datasets are multivariate time series from the telecom industry. Specifically, two datasets are for predicting round trip time (RTT) and two are for predicting read and write latency (RWL). The RTT datasets are private internal Ericsson datasets and are described in detail in [22] and [23]. The RWL datasets are public and found in [24].

4.1.1 Round trip time

The two round trip time (RTT) datasets, RTT 1 and RTT 2, are multivariate time series gathered from Ericsson's 5G smart-factory testbed. In the testbed, the data has been gathered from four different sources that make up the 5G network: a user, an eNB, an Evolved Packet Core (EPC), and the internet/cloud. A message transmission is sent, and logs and traces are generated at different levels of the path. These logs measure different KPIs and can be grouped into two categories: end-to-end metrics, which are logs from the user equipment (UE) level such as RTT, and radio metrics, which are logs from the radio interface and core network levels such as signal quality and power. In this study, the purpose of the datasets is to predict the end-to-end latency of a 5G network, that is, the time it takes to send a signal to a device and then back to the original transmitter. RTT 1 and RTT 2 contain 636 and 495 features respectively, including the end-to-end metrics and radio metrics. Both datasets have been gathered over 10 minutes, drawing measurements each second, producing a 600-time-step multivariate time series each. The targets of RTT time series 1 and 2 are visualized in Figures 4.1 and 4.2; however, the vertical axis does not show the values because they are business-sensitive information. It is important to note that user equipment (UE) performance as illustrated in the figures may be affected by limited radio coverage, competing traffic on the wireless channel, exogenous processes in the laptop or UE modems, protocol selection for measurement, network configurations, and other interactions [25]. That is, the many spikes and other RTT trends may have their origin outside of the base station.

Figure 4.1: Plot of RTT time series 1, mapping target RTT against the time step. The unit for the time step is seconds.

Figure 4.2: Plot of RTT time series 2, mapping target RTT against the time step. The unit for the time step is seconds.

4.1.2 Read and write latency

The read and write latency (RWL) datasets are two multivariate time series from [24], RWL 1 and RWL 2. The data is gathered from three different sources: a cluster of servers, a client machine, and a network connecting the two. Each device records statistics that are used in this dataset. The statistics of the cluster of servers and the network are the features of the datasets, while the statistics of the client machine are the target. Specifically, the read and write latencies are the statistics that will be predicted, and their time series are shown in Figure 4.3. The difference between the RWL time series is that their features have different values; the target time series is the same. The length of each time series is 24225 and both datasets have 197 features recorded from the network and cluster of servers. More information about the datasets can be found in [24].


Figure 4.3: Plot of RWL time series 1 and 2 targets, mapping target latency against the time step. The unit for the target is ms, and seconds for the time step.

4.2 Preprocessing and split of data

For the RTT datasets, the features are all normalized with the l2 norm. Furthermore, the data split for each sequence is done by using the first 75% of the data points for training and the remaining 25% for validation. For the two RWL datasets, the features are standardized and the data is split such that the first 60% of the time steps are used for training and the remaining 40% for testing.

4.3 Model settings

The DNCs were trained with different configurations depending on which dataset they were trained on. The settings stated in this section hold unless stated otherwise in the Experiments chapter later in the thesis. The DNCs trained on the RTT datasets had an LSTM controller with a hidden state size of 64. The memory had 1 read head, 20 memory cells, and a cell size of 64, like the hidden state. The models were trained to reduce the MSE between the model output and the true target with an Adam optimizer and a learning rate of 0.01. During training, a dropout of 0.6 was also used. These configurations were chosen based on the hyperparameter tuning in Experiment 5.1. Due to the size of the RWL time series and the time constraint, a smaller DNC had to be trained on them. It had the same settings as the DNCs trained on the RTT data, but the hidden size and memory cell size were 32.

4.4 Hardware and software

The programming language used for this project was Python 3.8 with the library PyTorch, version 1.10.2. For hardware, every model was trained locally on an Intel Core i7-10610U CPU at 1.80 GHz.

Chapter 5 Experiments

5.1 Validating the DNC on multivariate time series

The purpose of this experiment was twofold. Firstly, only a limited number of studies have tested the DNC's performance on time series. Therefore, before doing other experiments on the RTT data, it is important to validate that the DNC is able to perform such a task. The second purpose is to train a DNC to benchmark against the suggested novel methods presented in the method chapter. Before searching for the optimal parameters for the DNC, an LSTM without an external memory was hyperparameter tuned. The parameter search consisted of testing the parameters found in Table 5.1, which also states what values were tested. Every single parameter combination was tested. After the optimal LSTM architecture was found, the hyperparameter tuning was conducted for the DNC using the optimal LSTM architecture as its controller.

Table 5.1: Parameters tuned for the LSTM as well as the values tested.

Table 5.2: Parameters tuned for the DNC as well as the values tested.

The parameters tuned for the DNC are found in Table 5.2. The models were trained and tested on RTT time series 1. They were trained for 2000 epochs and the MSE loss on the validation set was measured at each epoch. The best-performing DNC's loss plot was analyzed in two regards: does it converge, and what is its performance compared to the best-performing LSTM?

5.2 Reusing experience

This experiment tests the method proposed in Section 3.1. Three different models were compared. The first model was the benchmark model from the previous experiment and is illustrated in Figure 3.1. The second model has the same architecture as the first model, but the memory M_t is not reset between epochs. The third model is the same as the second model but also does not reset the link matrix, L_t. The three models were evaluated on RTT time series 1 and 2. For each dataset, each model was trained five times with different seeds and the average MSE validation loss was computed at each epoch. Through the loss plots, the three models' convergence rates were compared as well as their lowest MSE errors.

5.3 Bi-directional training of DNC

This experiment tests the method proposed in Section 3.2. Firstly, the method was tested on RTT time series 1 and 2. As in the previous experiment, keeping two subsets of the experience was tested: keeping only the memory, and keeping both the memory and the link matrix. To evaluate these methods, five models trained with different seeds were compared to the original method of resetting the experience and to the method of Section 3.1. The comparison was made by analyzing the MSE validation loss plots as in the previous experiment. The method that did best on the RTT time series, which happened to be keeping both the memory and the link matrix, was further tested on both RWL time series. The same tests were conducted, comparing the bi-directional training of the DNC with the original way of training the DNC.

5.4 Transfer learning

This experiment tests the method proposed in Section 3.3. The method of transferring the experience was tested in two scenarios: firstly, when the source model and target model are trained on the same dataset; secondly, when the source model and target model are trained on two different datasets. The experiments were conducted with the RTT and RWL time series. The transfer learning was done between the two RTT time series and between the two RWL time series.
Five DNC models were trained on every time series, each model with a different seed. At the epoch with the lowest MSE loss on the validation set, the memory and link matrix were saved and transferred to the target model. Each saved experience was passed to two target models, one that was trained on the same time series and one that was trained on the other time series. The new models were compared to normally trained DNCs by looking at the convergence rate and lowest MSE.

Chapter 6 Results and Discussion

In this chapter, the results and discussion for each experiment are presented. It is structured such that each experiment's results and discussion are presented one at a time.

6.1 Validating the DNC on multivariate time series

Results

The results in this section show that the DNC is able to perform satisfactorily on RTT time series 1. The hyperparameter tuning showed that the best parameters for the LSTM on RTT time series 1 were a hidden size of 64, a dropout of 0.3, and a learning rate of 0.05. For the DNC, the best combination of parameters was a dropout of 0.6 and a learning rate of 0.01. These models' performances are compared to naive solutions in Table 6.1. It shows that the DNC has a lower MSE than the naive solution of taking the average of the training set and applying it to the validation set. It also has a lower MSE than taking the average of the validation set. It does not perform as well as the best LSTM model, but it is not far off. In Figure 6.1, the outputs of the LSTM and DNC on the validation set are shown in comparison to the true target.

Figure 6.1: Output of the LSTM and DNC on the validation set of RTT time series 1 against the true RTT. The values for the vertical axis are not shown due to business-sensitive information.

Discussion

As the results show, the DNC performs better than the naive solutions. Since the DNC performs better than taking the average of the validation set, it can be argued that it is able to find some of the variations in the data. It doesn't perform as well as the best-performing LSTM; however, it is good enough to be considered viable for the dataset. Note that the goal of this experiment is not to show that the DNC is better than the LSTM, just that it works on time series. The hyperparameter-tuned DNC from this experiment is therefore used as a benchmark for the remaining experiments.

6.2 Reusing experience

Results

These results show whether keeping the experience between epochs can improve the training of the DNC. The convergence of the models on RTT time series 1 and 2 is shown in Figures 6.2 and 6.3 respectively. In both figures, it can be seen that the original model, "Reset experience" in the plots, improves faster than the other models for the first number of epochs. In Figure 6.2, it does better from the first epoch until the 100th epoch, while in Figure 6.3 it only has an advantage until the 20th epoch. In the later epochs, the models that keep their experience have a lower MSE, especially for RTT time series 1. Comparing keeping only the memory with keeping the memory and the link matrix, the results are similar.


Figure 6.2: RTT time series 1 validation loss plot of the original DNC (blue), keeping memory between epochs (orange), and keeping memory and link matrix between epochs (green).


Figure 6.3: RTT time series 2 validation loss plot of the original DNC (blue), keeping memory between epochs (orange), and keeping memory and link matrix between epochs (green).

Discussion

There are multiple interesting takeaways from this experiment. Firstly, keeping the experience instead of resetting it between epochs does not improve the convergence rate. The main reason for this is likely that the memory can't contain important information if the DNC has not yet learned how to store useful information. When the DNC doesn't have this capability, the experience that is passed to the next epoch will contain useless information that can disturb the DNC. The way it would disturb the DNC is that, in the forward propagation, the DNC will recall information from memory that doesn't represent the RTT time series well. This information is sent to the predictive layer, which in turn will make a misinformed prediction. Subsequently, in the backpropagation, the DNC updates its weights based on misinformed predictions, which is of course not optimal. The DNC is affected negatively in the early epochs by being passed an experience that is misleading for the network, causing it to converge slower, but it does better in later epochs as the experience gets better. This raises the question: would it perform even better if it had a trained experience from the start? This motivates doing transfer learning, where a trained network passes its experience to a new model. That experiment is described in Section 5.4.

In later epochs, for both RTT time series, keeping the experience gives better performance by not overfitting to the training data as much. A reason for this could be that in the original method the experience always starts the same, with zeros. In the proposed method, the experience will be different each time the weights are updated. The always-updated experience will cause the DNC to deal with new information in each forward propagation. In this way, the DNC will continuously update its weights based on different conditions, potentially improving generalization. This could be a form of regularization, similar to how noise can be added to the data to improve generalization [26].

Another reason why keeping the experience between epochs could be beneficial is that the DNC will learn how to handle having an experience from the first time step, like in the validation stage. When the DNC is tested on the validation set, the experience from the training set is passed, so one could argue that the DNC should be trained in a similar fashion. When resetting the memory, however, the DNC has to learn how to start from zero, which won't happen in the validation stage. Hence, keeping the experience could be argued for. However, looking at the results, resetting the experience found a lower minimum than keeping the experience, which doesn't support this argument.

Comparing keeping the memory with keeping the memory and the link matrix, the results are similar. In both time series, also keeping the link matrix can improve the training slightly in some stages. It doesn't seem to improve early in the training nor at the end of the 2000 epochs; however, it is slightly better between epochs 250 and 350 on RTT time series 1 and between epochs 300 and 1000 on RTT time series 2. The reason for this slight improvement is unclear, but it could be explained by a significant time dependency existing in the time series.
6.3 Bi-directional training of DNC

Results

The results in this section show whether the bi-directional training scheme for the DNC can improve convergence. The loss plots of the bi-directional training scheme compared to the original training method on RTT time series 1 and 2 are found in Figures 6.4 and 6.5. In these plots, it can be seen that the bi-directional method caused the model to improve slower in the early epochs, similar to the results of the reuse-experience models in Section 6.2. However, after a few epochs on RTT time series 2, the models with the bi-directional training scheme converged significantly faster than the original DNC. On RTT time series 1, the bi-directional method has a slightly lower minimum than the original method, but on RTT time series 2 the methods' lowest MSEs are similar. After the minimum is reached, the bi-directional models perform significantly better on both time series compared to the original models. Furthermore, passing both the memory and the link matrix causes less overfitting than only passing the memory. Figures 6.6 and 6.7 show the bi-directional method compared to the reuse-experience method (Section 3.1). The models perform similarly in the early epochs. For RTT time series 1, they converge at a similar rate until the 100th epoch, while for RTT time series 2 they only converge at a similar rate until the 20th epoch. After that, the bi-directional model reaches a better optimal performance as well as being less prone to overfitting when the model is overtrained.

Figure 6.4: RTT time series 1 validation loss plot of the DNC with the original training method (blue), the bi-directional method with only passing the memory (orange), and the bi-directional method with passing the memory and link matrix (green).


Figure 6.5: RTT time series 2 validation loss plot of the DNC with the original training method (blue), the bi-directional method with only passing the memory (orange), and the bi-directional method with passing the memory and link matrix (green).


Figure 6.6: RTT time series 1 validation loss plot of the DNC trained with the method of reusing the memory (blue), the method of reusing the memory and link matrix (orange), the bi-directional method with passing the memory (green), and the bi-directional method with passing the memory and link matrix (red).


Figure 6.7: RTT time series 2 validation loss plot of the DNC trained with the method of reusing the memory (blue), the method of reusing the memory and link matrix (orange), the bi-directional method with passing the memory (green), and the bi-directional method with passing the memory and link matrix (red).

Figure 6.7: RTT time series 1 validation loss plot of the DNC trained with the method of reusing the memory (blue), method of reusing the memory and link matrix (orange), bi-directional method with passing the memory (green), and bi-directional method with passing the memory and link matrix (red). 44 I Results and Discussion To further verify the bi-directional method works, it was also applied to the RWL time series 1 and 2. For the RWL time series 1, the read and write validation loss during training are shown in Figures 6.8 and 6.9 respectively. The original method caused the DNC to overfit significantly on the read target by the 100th epoch and forward while the bi-directional method never overfitted. On the write target, both methods perform similarly. The read and write loss plots for the other dataset, RWL time series 2, are shown in Figures 6.10 and 6.11. The bi-directional method for the read target performs significantly better, while for the write target they are similar. However, the difference for the write target is that the bi-directional method converges slightly slower.


Figure 6.8: Read validation loss plot on RWL time series 1. Blue is the original method of training a DNC and orange is the bi-directional method.


Figure 6.9: Write validation loss plot on RWL time series 1. Blue is the original method of training a DNC and orange is the bi-directional method.

Figure 6.10: Read validation loss plot on RWL time series 2. Blue is the original method of training a DNC and orange is the bi-directional method.

Figure 6.11: Write validation loss plot on RWL time series 2. Blue is the original method of training a DNC and orange is the bi-directional method.

Discussion

The bi-directional models improve at a similar rate to the reuse experience models in the early epochs. As explained in Section 6.2, the DNC can have a difficult time learning if the experience contains irrelevant information. What is likely happening with the bi-directional method is that, in the early epochs, the model has not yet learned to produce a meaningful experience, and passing this experience on disturbs the backpropagation in the next epoch, just as with the reuse experience model. However, for RTT time series 2, the DNC actually converges faster than the original DNC.

When the models are overtrained, the bi-directional models perform extremely well compared to both the original DNC and the reuse experience models. For RTT time series 2, the bi-directional models have very stable convergence and start overfitting long after the other methods. On RTT time series 1, they start overfitting sooner than on RTT time series 2, but still later than the other methods. An explanation could be that the bi-directional training method causes the model to generalize better because it gets more varied data to train on. Firstly, as with the reuse model, the experience differs at each epoch, which could act as a regularisation technique. Secondly, as the experience is generated from two different directions, there is even more variety in the experience presented to the DNC. Thirdly, the dependencies in a time series can differ between the forward and backward directions; by also allowing the DNC to learn the backward dependencies, it could generalize better.

Beyond adding variance to the training, the bi-directional method could pass a more purposeful experience than the reuse method does. In the reuse method, the experience passed could be heavily biased towards the later time steps of the time series. If there are no long-term dependencies in the time series, an experience based on the end of the time series is not useful for the early time steps. Furthermore, by the time the DNC circles back to the end of the time series, where the past experience came from, the memory will have been overwritten with new information. This means that the experience passed with the reuse method is no longer useful. With the bi-directional training method, however, the experience stays useful, as described in method section 3.2. This could be why the method did better than the reuse method on both RTT datasets.

In the bi-directional training, the results show that keeping the link matrix in the experience is important, and it appears to matter more for the bi-directional method than for the reuse method. An explanation could be that if the experience does not contain useful information, as can happen in the reuse method, then having a link matrix makes no difference; the benefit of passing the link matrix is therefore more visible in the bi-directional method. Overall, other than in the early epochs, the bi-directional method produces results at least as good as the original method and, on some time series, significantly better, such as on RTT time series 2 and the read target of RWL time series 2.
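The distinction between the two bi-directional variants can be stated precisely. The following hedged helper illustrates what "keeping the link matrix in the experience" means: the memory matrix is always carried over, while the temporal link matrix is either carried over too or reset like the rest of the DNC state. The attribute names are assumptions about the implementation, not the thesis code.

    import torch

    def carry_experience(dnc, experience, keep_link_matrix):
        with torch.no_grad():
            dnc.memory.copy_(experience["memory"])
            if keep_link_matrix:
                # The link matrix records the order in which memory rows
                # were written; dropping it discards those temporal
                # relationships even though the memory contents survive.
                dnc.link_matrix.copy_(experience["link_matrix"])
            else:
                dnc.link_matrix.zero_()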
The bi-directional method is particularly good when there is a risk of overtraining the model, as it overfits less than the other methods. A potentially optimal way of training a DNC could be to first train it normally for a few epochs to converge faster, and then transition to the bi-directional method to reduce the risk of overfitting. Testing this idea is left for future work.

6.4 Transfer learning

Results

The results in this section show whether transferring a trained model's experience to an untrained model can help the untrained model converge faster. The results for RTT time series 1 and 2 are shown in Figures 6.12 and 6.13, respectively. These figures show the validation loss of the original DNC compared to models receiving an experience from a model trained on the same dataset and from a model trained on a different dataset. All models converge at a similar rate; however, when overtrained, the model that received the experience from another dataset kept the lowest MSE.


Figure 6.12: RTT time series 1 validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RTT time series 2 (green).

Figure 6.13: RTT time series 2 validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RTT time series 1 (green).

The results for the RWL time series were slightly different. Beginning with RWL time series 1, the read and write loss plots are shown in Figures 6.14 and 6.15, respectively. On the read target, all methods performed similarly until around epoch 100, when the original method started overfitting more than the transfer methods. On the write target, all methods performed similarly. The results for RWL time series 2 are shown in Figures 6.16 and 6.17. On the write target, transferring the experience does not make a significant difference. On the read target, the transfer learning methods keep a lower MSE than not transferring up until epoch 100; after that, all methods perform similarly.


Figure 6.14: RWL time series 1 read validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 2 (green).

Figure 6.15: RWL time series 1 write validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 2 (green).

Figure 6.16: RWL time series 2 read validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 1 (green).

Figure 6.17: RWL time series 2 write validation loss plot of the original DNC (blue), transferring memory and link matrix from a model trained on the same data set (orange), and transferring memory and link matrix from a model trained on RWL time series 1 (green).

Discussion

Transferring the experience does not seem to increase the rate of convergence on either of the datasets. This could be because the experience is biased towards the later time steps and forgets the early steps, as explained in previous sections. Another reason could be the datasets themselves. All four time series are relatively simple, in the sense that a neural network would have no problem encoding the data in a meaningful way. This means that the transfer method loses one of the potential benefits described in method section 3.3: transferring the memory could encourage the target model to learn to encode the data similarly to the source model. If creating meaningful encodings of the data for the memory is simple, the transfer method loses much of its value. For future work, it would be interesting to handle more complex data such as video.

An interesting observation from this experiment is that transferring an experience based on a different dataset, with a different number of features, can be better than transferring the experience from the same dataset. What is specifically interesting is that RTT time series 1 and 2 have different numbers of features. This means that the controller of the DNC, the LSTM, has a slightly different architecture, which in turn leads to different ways of encoding the data and different experiences. Still, the results show that it is possible to transfer the experience, meaning that the DNC can adapt to use encodings it might not have produced itself. The transfer method does perform better on some of the time series, such as RTT time series 1 and the read target for both RWL time series. The reason could be similar to one of the reasons the reuse method works: the DNC learns how to interact with the memory from the start of the time series, which better prepares it for the real application of the DNC and the validation test.
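As a hedged illustration of the transfer discussed here, the sketch below seeds an untrained DNC with the memory state of a trained one, under the assumption that source and target share the same memory dimensions even when their LSTM controllers differ (as with RTT time series 1 and 2, whose feature counts differ). The names are illustrative, not the thesis codebase.

    import torch

    def transfer_experience(source_dnc, target_dnc):
        """Seed an untrained DNC with a trained model's memory state."""
        with torch.no_grad():
            target_dnc.memory.copy_(source_dnc.memory)
            target_dnc.link_matrix.copy_(source_dnc.link_matrix)
        # Controller weights are deliberately not copied: only the memory
        # contents move, so the target must adapt to read encodings that
        # it did not itself write.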

Chapter 7
Future Work

As the results for the reuse experience and bi-directional methods show, the DNC does not converge as fast as with the original method in the early epochs of training. Later on, however, these methods perform very well. This is likely because the DNC has not yet learned how to interact with the external memory in the early stages of training. An interesting idea is therefore to first train the DNC with the original method until it is able to interact with the memory, and then switch to one of the proposed methods. The difficulty is knowing when the DNC is able to interact with the memory purposefully; that is left for future work.

The experiments testing the proposed methods only tried transferring or keeping the memory and the link matrix; however, there are other parts of the experience that could be useful to transfer or keep as well. The parts of the experience not experimented with are the precedence vector, read weights, write weights, and usage vector. It is possible that some of these components could improve one of the three methods (a sketch of the full experience is given at the end of this chapter).

For the transfer learning method, one of the potential benefits is that it encourages the target model to produce useful encodings of the data. If the data is simple, however, useful encodings might not be difficult to produce, and the benefit of the method might not show. In future work, it would therefore be interesting to see whether the transfer learning method has a bigger advantage when the task is more complex.

Due to the time constraints of the project, it was not feasible to do a hyperparameter search for each method on each time series, nor to study how different parameters affect the different methods. An example of what would be interesting to study is whether the size of the memory or the number of read heads has an effect.
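For reference, the sketch below enumerates the full experience a DNC accumulates, following the usual DNC state conventions (N memory rows of width W, R read heads); only the first two fields were carried over in this work, and the remaining fields are the untested candidates named above. The field names and shapes are this sketch's own assumptions, not the thesis code.

    from dataclasses import dataclass
    import torch

    @dataclass
    class DNCExperience:
        memory: torch.Tensor        # (N, W) memory matrix
        link_matrix: torch.Tensor   # (N, N) temporal link matrix
        precedence: torch.Tensor    # (N,)  precedence weighting vector
        read_weights: torch.Tensor  # (R, N) one weighting per read head
        write_weights: torch.Tensor # (N,)  write weighting vector
        usage: torch.Tensor         # (N,)  usage vector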

Chapter 8
Conclusion

The work in this thesis had the goal of improving the training of the DNC when applied to time series. It resulted in three new training methods for the DNC: reusing experience, bi-directional training, and transfer learning via memory. The most promising of the three is the bi-directional training method, which empirical experiments on four different time series have shown to have three advantages over the original method: it provides more stable convergence, it converges faster on some time series, and it can produce a better-performing DNC. The other two methods also showed promising results, although only in terms of overfitting the model less when training for too long. In the end, this work has presented methods that make it easier to train the DNC on time series.