Title:
A DISTRIBUTED NETWORK TRAFFIC DATA DECOMPOSITION METHOD
Document Type and Number:
WIPO Patent Application WO/2021/186158
Kind Code:
A1
Abstract:
To be able to adequately provide desired services over a 5G mobile service network, 5G communication infrastructures require a much-improved flexibility in resource management. Network operators are expected to deploy network slicing, by isolating dedicated resources and providing customised logical instances of the physical infrastructure to each service. A critical operation in performing management and orchestration of network resources is the anticipatory provisioning of isolated capacity to each network slice. Accordingly, it is necessary to obtain an estimate of service level demands. However, the estimation of such service level demands is typically obtained via deep packet inspection, which is a resource intensive and time-consuming process. Therefore, it is typically not possible to provide updated, accurate estimates at a frequency suitable for use in accurate prediction of a future per-service traffic consumption, without an undesirable level of computational and time resources being required. The present invention provides a distributed network traffic data decomposition method which makes use of a neural network to provide an accurate future per-service traffic consumption prediction without deep packet inspection or another resource intensive analysis method.

Inventors:
ZHANG CHAOYUN (GB)
PATRAS PAUL (GB)
FIORE MARCO (GB)
Application Number:
PCT/GB2021/050644
Publication Date:
September 23, 2021
Filing Date:
March 16, 2021
Assignee:
UNIV COURT UNIV OF EDINBURGH (GB)
International Classes:
H04L12/24
Other References:
CHAOYUN ZHANG ET AL: "CloudLSTM: A Recurrent Neural Model for Spatiotemporal Point-cloud Stream Forecasting", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 July 2019 (2019-07-29), XP081497406
OUYANG WENLI ET AL: "Interpretable Spatial-Temporal Attention Graph Convolution Network for Service Part Hierarchical Demand Forecast", 30 September 2019, 12TH EUROPEAN CONFERENCE ON COMPUTER VISION, ECCV 2012; [LECTURE NOTES IN COMPUTER SCIENCE], SPRINGER BERLIN HEIDELBERG, BERLIN GERMANY, PAGE(S) 575 - 586, ISBN: 978-3-642-04427-4, ISSN: 0302-9743, XP047521767
ZHANG, C.; FIORE, M.; ZIEMLICKI, C.; PATRAS, P.: "Microscope: Mobile Service Traffic Decomposition for Network Slicing as a Service", The 26th Annual International Conference on Mobile Computing and Networking (MobiCom '20), ACM, 2020
Attorney, Agent or Firm:
MURGITROYD & COMPANY (GB)
Claims

1. A distributed network traffic data decomposition method comprising the steps of: receiving input data comprising aggregate network traffic data from a plurality of distributed source locations, wherein the aggregate data includes traffic data corresponding to a plurality of services operating over the network; converting the input data into a format suitable for further analysis by re-arranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern and separating the aggregate network traffic data into a time-dependent sequence of snapshots; analysing the converted data with a neural network, comprising a plurality of neural layers, to extract, in a final neural layer of the plurality of neural layers, a plurality of outputs from the converted data, wherein each output corresponds to decomposed traffic volume of one service of the plurality of services operating over the network; and predicting, based on the plurality of outputs, a future per-service traffic consumption.

2. The distributed network traffic data decomposition method of claim 1, wherein the step of analysing the converted data with a neural network includes: determining a fraction of traffic that belongs to each of the plurality of services at each of the plurality of sources; employing 3D deformable convolutions to: at least partially mitigate spatial displacements introduced during the conversion of the input data; and determine at least one intermediate output from the converted data; determining spatiotemporal correlations from the determined at least one intermediate output; and employing 2D convolutions to extract a plurality of outputs from the determined spatiotemporal correlations.

3. The distributed network traffic data decomposition method of claim 1 or claim 2, further comprising the step of retraining the neural network by: comparing per-service traffic consumption measured at a predetermined time with a predicted per-service traffic consumption corresponding to said predetermined time; determining a prediction error by calculating a difference between the measured per-service traffic consumption and the predicted per-service traffic consumption; and amending at least one function of the neural network such that the prediction error is reduced.

4. The distributed network traffic data decomposition method of claim 3, wherein the step of amending at least one function of the neural network such that the prediction error is reduced is only undertaken if the prediction error is above a predetermined prediction error threshold value.

5. The distributed network traffic data decomposition method of claim 1 or claim 2, further comprising the step of retraining the neural network by: calculating a maximum likelihood estimation between a measured per-service traffic consumption for at least one of the plurality of distributed source locations and for at least one of the plurality of services operating over the network and the predicted future per-service traffic consumption; and amending at least one function of the neural network to increase the maximum likelihood estimation.

6. The distributed network traffic data decomposition method of claim 5, wherein the step of amending at least one function of the neural network to increase the maximum likelihood estimation is only undertaken if the calculated maximum likelihood estimation is below a predetermined maximum likelihood estimation threshold value.

7. The distributed network traffic data decomposition method of claim 5 or claim 6, wherein the step of retraining the neural network comprises calculating a maximum likelihood estimation between a measured per-service traffic consumption for each of the plurality of distributed source locations and for each of the plurality of services operating over the network and the predicted future per-service traffic consumption.

8. The distributed network traffic data decomposition method of any one of claims 3 to 7, wherein the step of retraining the neural network is based upon a subset of input data collected at only a portion of the plurality of distributed source locations.

9. The distributed network traffic data decomposition method of any one of claims 3 to 8, wherein the neural network is trained with a cross-entropy function.

10. The distributed network traffic data decomposition method of any preceding claim, wherein the measured per-service mobile traffic data is obtained from deep packet inspection.

11. The distributed network traffic data decomposition method of any preceding claim, wherein the step of employing 3D deformable convolutions to at least partially mitigate spatial displacements introduced during the conversion of the input data includes rearranging the converted data.

12. The distributed network traffic data decomposition method of any preceding claim, further comprising the step of allocating network resources based on the predicted future service-wise traffic consumption.

13. The distributed network traffic data decomposition method of claim 12, further comprising the step of reallocating network resources based on a further predicted future service-wise traffic consumption.

14. The distributed network traffic data decomposition method of any preceding claim, wherein the aggregate network traffic data is encrypted.

15. The distributed network traffic data decomposition method of any preceding claim, further comprising the step of collecting the input data.

16. The distributed network traffic data decomposition method of claim 15, wherein the step of collecting the input data comprises data collection without deep packet inspection.

17. The distributed network traffic data decomposition method of any preceding claim, further comprising the step of performing adaptive weighting by assigning a weight to at least one snapshot, wherein the weight applied is dependent on a time of capture of data included in said snapshot.

18. The distributed network traffic data decomposition method of claim 17, wherein a first snapshot captured at a first time is assigned a different weight when compared to a second snapshot captured at a second time, wherein the second time is more recent than the first time.

19. The distributed network traffic data decomposition method of any preceding claim, wherein the plurality of outputs extracted from the converted data comprises a plurality of feature maps, wherein each feature map corresponds to decomposed traffic volume of one service of the plurality of services operating over the network, and wherein the prediction of a future per-service traffic consumption is based on the plurality of feature maps.

20. The distributed network traffic data decomposition method of any preceding claim, wherein the network is one of a WiFi network, a mobile telecoms service network, a broadband network, an Internet of Things sensor and actuator network, a distributed electricity distribution grid, a network of electricity consumption sensors, roadways, airways, shipping lanes, a network of air quality sensors, a network of household water or gas consumption meters, or a social network.

21. The distributed network traffic data decomposition method of any preceding claim, wherein the step of converting the input data into a format suitable for further analysis by re-arranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern comprises constructing a regular grid including a number of grid points equal to the number of the plurality of distributed source locations, and performing a one-to-one source location to grid point association, such that a single grid point relates to a single source location.

22. The distributed network traffic data decomposition method of claim 21, wherein the one-to-one source location to grid point association is performed in such a manner as to minimise an average spatial displacement of a portion or all of the source locations when they are associated to a respective grid point.

23. The distributed network traffic data decomposition method of claim 21 or claim 22, wherein the one-to-one source location to grid point association is performed using the Hungarian Algorithm.

24. The distributed network traffic data decomposition method of any preceding claim, wherein the at least one intermediate output comprises at least one spatiotemporal pattern.

25. An apparatus comprising at least one processor, wherein the at least one processor is configured to be operable to execute the method of any preceding claim.

Description:
A DISTRIBUTED NETWORK TRAFFIC DATA DECOMPOSITION METHOD

Field of the Invention

The present invention relates to a method and apparatus for decomposing traffic data from a distributed network and finds particular, although not exclusive, utility in providing a method for predicting a future per-service traffic consumption of services operating over the distributed network.

Background to the Invention

The next generation of mobile networks will become a dominant General Purpose Technology and enable new services to operate. As such, 5G networks must fulfil a growing variety of quality of service requirements, ranging from extreme mobile broadband for ultra-high-definition streaming, to ultra-reliable low-latency communication for autonomous driving. To be able to adequately provide these services, 5G mobile communication infrastructures require a much-improved flexibility in resource management.

Typically, this will be realised via virtualisation of network functions, including dynamic spectrum allocation, baseband processing, scheduling or task containerisation including allocation to one or more tenants. Furthermore, operators are expected to deploy network slicing, by isolating dedicated resources and providing customised logical instances of the physical infrastructure to each tenant. Tenants may obtain full control of the resources and functions allocated within the slice they hold, driven by their precise knowledge of end-to-end service performance. In such scenarios, network operators remain in charge of performing management and orchestration of resources dedicated to each slice. A critical operation in performing management and orchestration of resources is the anticipatory provisioning of isolated capacity to each slice. Accordingly, it is necessary to obtain at least an estimate of service level demands.

However, the estimation of such service level demands is typically obtained via deep packet inspection, which is a resource intensive and time-consuming process. Therefore, it is typically not possible to provide updated accurate estimates at a frequency suitable for use in accurate prediction of a future per-service traffic consumption, without an undesirable level of computational and time resources being required.

Objects and aspects of the present invention seek to alleviate at least these problems with prior known prediction methods.

Summary of the Invention

According to a first aspect of the present invention, there is provided a distributed network traffic data decomposition method comprising the steps of: receiving input data comprising aggregate network traffic data from a plurality of distributed source locations, wherein the aggregate data includes traffic data corresponding to a plurality of services operating over the network; converting the input data into a format suitable for further analysis by re-arranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern and separating the aggregate network traffic data into a time-dependent sequence of snapshots; analysing the converted data with a neural network, comprising a plurality of neural layers, to extract, in a final neural layer of the plurality of neural layers, a plurality of outputs from the converted data, wherein each output corresponds to decomposed traffic volume of one service of the plurality of services operating over the network; and predicting, based on the plurality of outputs, a future per-service traffic consumption.

A key advantage of the present invention is that an accurate prediction of future per-service traffic consumption is obtainable without requiring relatively resource intensive analysis, such as deep packet inspection or similar. The use of the method of the first aspect of the present invention has been found via experimentation to result in estimation errors, relative to the traffic variance interval, below 1.2%.

The step of analysing the converted data with a neural network may include any or each of: determining a fraction of traffic that belongs to each of the plurality of services at each of the plurality of sources; employing 3D deformable convolutions to: at least partially mitigate spatial displacements introduced during the conversion of the input data; and determine at least one intermediate output from the converted data; determining spatiotemporal correlations from the determined at least one intermediate output; and employing 2D convolutions to extract a plurality of outputs from the determined spatiotemporal correlations.

The distributed network may be a system of interconnected nodes, wherein the nodes are spaced apart. The nodes may be spaced apart by distances in the order of metres, or kilometres. For example, the nodes may be antennae for use in a mobile telecoms network, and the antennae may be distributed across an urban area. Each node may be connected to each other node in the network. Each node may be connected to a single, central node, such as a data centre. Connected may mean physically connected, such as with a wire, a data bus, and/or any other known apparatus. Alternatively, connected may mean wirelessly connected. The connectivity may make use of any known wired or wireless protocol.

The network may be one of a WiFi network, a mobile telecoms service network, a broadband network, an Internet of Things sensor and actuator network, a distributed electricity distribution grid, a network of electricity consumption sensors, roadways, airways, shipping lanes, a network of air quality sensors, a network of household water or gas consumption meters, or a social network.

Traffic may be anything which moves around a network, from node to node. For example, in a mobile telecoms service network, traffic may be data packets moving between mobile devices. As a further example, in a roadway network, traffic may be vehicles traversing the roadways.

Traffic data may be data related to the number, frequency, speed, size, or any other characteristic of the traffic.

Decomposition may mean taking a single, aggregate, data stream and converting it into a plurality of constituent data streams.

Receiving input data may mean that the input data is collected and conveyed to apparatus configured to carry out the method. The method may further comprise the step of collecting the input data. The step of collecting the input data may comprise data collection without deep packet inspection. In this way, the input data may be collected, rather than received.

Aggregate network traffic data may be a summation of network traffic data over the network. The aggregate network traffic may comprise a plurality of constituent network data traffic portions.

The plurality of services may be services operating over the network. For example, the network may be a mobile service network, and the plurality of services may include mobile games, messaging apps, social media, video streaming platforms and audio streaming platforms.

The neural network may comprise a plurality of layers, wherein each layer is responsible for completing a portion of the overall task or problem. Typically, one layer provides its output as an input to another layer and a final layer will output the desired final output of the neural network.

The method may further comprise the step of retraining the neural network. The step of retraining the neural network may comprise comparing per-service traffic consumption measured at a predetermined time with a predicted per-service traffic consumption corresponding to said predetermined time. The step of retraining the neural network may further comprise determining a prediction error by calculating a difference between the measured per-service traffic consumption and the predicted per-service traffic consumption. The step of retraining the neural network may further comprise amending at least one function of the neural network such that the prediction error is reduced. In this way, the neural network may be retrained such that the method provides a more accurate prediction.

The step of amending at least one function of the neural network such that the prediction error is reduced may only be undertaken if the prediction error is above a predetermined prediction error threshold value. In this way, retraining of the neural network may only be undertaken if the accuracy of the prediction falls outside of a predetermined acceptable error limit.

The method may further comprise the step of retraining the neural network. The step of retraining the neural network may include periodic retraining of the neural network to refine the neural model following changes in the network data landscape. The step of retraining the neural network may comprise calculating a maximum likelihood estimation between a measured per-service traffic consumption and the predicted future per-service traffic consumption. The maximum likelihood estimation may be calculated for at least one of the plurality of distributed source locations. Alternatively, or additionally, the maximum likelihood estimation may be calculated for at least one of the plurality of services operating over the network. The step of retraining the neural network may further comprise amending at least one function of the neural network to increase the maximum likelihood estimation. In this way, the neural network may be retrained such that the method provides a more accurate prediction.
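By way of a non-limiting illustration, the threshold-gated retraining described above may be sketched as follows. The function and variable names, the use of mean absolute error, and the `retrain_fn` callback are all illustrative assumptions, not part of the claimed method:

```python
def maybe_retrain(measured, predicted, threshold, retrain_fn):
    """Amend the neural network only when the prediction error exceeds
    a predetermined threshold (illustrative sketch)."""
    # Mean absolute difference between measured and predicted
    # per-service traffic consumption at the same predetermined time.
    error = sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)
    if error > threshold:
        retrain_fn(error)   # stand-in for the actual model update step
        return True, error
    return False, error

# Example: error of (|10-12| + |5-5| + |8-7|) / 3 = 1.0, above threshold.
triggered, err = maybe_retrain([10.0, 5.0, 8.0], [12.0, 5.0, 7.0],
                               threshold=0.5, retrain_fn=lambda e: None)
```

The threshold check avoids retraining cost while the model still tracks the measured consumption acceptably well.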

The step of amending at least one function of the neural network to increase the maximum likelihood estimation may only be undertaken if the calculated maximum likelihood estimation is below a predetermined maximum likelihood estimation threshold value. In this way, retraining of the neural network may only be undertaken if the accuracy of the prediction falls outside of a predetermined acceptable error limit.

The step of retraining the neural network may comprise calculating a maximum likelihood estimation between a measured per-service traffic consumption for each of the plurality of distributed source locations and the predicted future per-service traffic consumption. Alternatively, or additionally, the step of retraining the neural network may comprise calculating a maximum likelihood estimation between a measured per-service traffic consumption for each of the plurality of services operating over the network and the predicted future per-service traffic consumption. In this way, the entirety of the data may be taken into account to allow for a more accurate comparison between the predicted and measured per-service traffic consumptions.

The step of retraining the neural network may be based upon a subset of input data collected at only a portion of the plurality of distributed source locations. In this way, the retraining may be less resource intensive when compared to a retraining based on the entirety of the input data. The neural network may be trained with a cross-entropy function or a mean square error loss function.
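One plausible reading of the cross-entropy training option is to apply the loss to per-service traffic fractions at a source location, comparing measured shares (for example obtained periodically via deep packet inspection) against the decomposed estimates. The sketch below assumes this interpretation; the patent does not fix the exact form of the loss:

```python
import math

def cross_entropy(measured_fractions, predicted_fractions, eps=1e-12):
    """Cross-entropy between measured and predicted per-service traffic
    fractions at one source location (illustrative sketch).
    `eps` guards against log(0) for services with zero predicted share."""
    return -sum(m * math.log(p + eps)
                for m, p in zip(measured_fractions, predicted_fractions))

measured = [0.5, 0.3, 0.2]                  # ground-truth service shares
perfect = cross_entropy(measured, measured)  # loss at an exact match
off = cross_entropy(measured, [0.3, 0.3, 0.4])
# The loss is minimised when the predicted shares match the measured ones.
```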

The measured per-service mobile traffic data may be obtained from deep packet inspection. Accordingly, the predicted per-service traffic consumption may be compared with a measured per-service traffic consumption which has been obtained via an accurate method. Deep packet inspection may be undertaken periodically in order to reduce the resource requirements.

The step of employing 3D deformable convolutions to at least partially mitigate spatial displacements introduced during the conversion of the input data may include rearranging the converted data. In this way, the method may at least partially compensate for errors introduced during the conversion of the input data into a form suitable for further analysis.

The method may further comprise the step of allocating network resources based on the predicted future service-wise traffic consumption. For example, for a 5G mobile telecoms network, the predicted future service-wise traffic consumption may be used during future network slicing. In this way, the future network slicing may be better tailored to the future network traffic, which may mean the users of the network experience a better connection to the network, including fewer interruptions and lower latency.

The method may further comprise the step of reallocating network resources based on a further predicted future service-wise traffic consumption. For example, a first prediction may be made and 5G network slicing undertaken accordingly, and then a second prediction may be made at a later time or date and the 5G network slicing may be undertaken a second time such that the network slicing is better suited to the second prediction.

The aggregate network traffic data may be encrypted. In this way, the privacy and/or security of the user data, and the confidentiality of the flow-level traffic generated by each service provider, may be maintained.

The method may further comprise the step of performing adaptive weighting by assigning a weight to at least one snapshot. The weight applied may be dependent on a time of capture of data included in said snapshot. A first snapshot captured at a first time may be assigned a different weight when compared to a second snapshot captured at a second time, wherein the second time is more recent than the first time. The weight assigned to the first snapshot may be a lesser weight. Accordingly, snapshots including data collected more recently may be assigned a greater weight, when compared to snapshots including data collected less recently. In this way, the method may apply a greater weight to traffic data collected or received most recently. Therefore, the prediction may be based more heavily on the most recent traffic data, and may not be skewed by historical data which may no longer be as relevant as recent data.
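The adaptive weighting may be sketched as below. Exponential decay with capture time is only one plausible scheme; the method merely requires that more recently captured snapshots receive greater weight:

```python
def recency_weights(timestamps, decay=0.5):
    """Assign larger weights to more recently captured snapshots
    (illustrative sketch; the decay factor is an assumption)."""
    latest = max(timestamps)
    raw = [decay ** (latest - t) for t in timestamps]  # newest -> 1.0
    total = sum(raw)
    return [w / total for w in raw]                    # normalise to sum to 1

# Three snapshots captured at times 0, 1 and 2; t=2 is the most recent.
w = recency_weights([0, 1, 2])
```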

The plurality of outputs extracted from the converted data may comprise a plurality of feature maps. Each feature map may correspond to decomposed traffic volume of one service of the plurality of services operating over the network. The prediction of a future per-service traffic consumption may be based on the plurality of feature maps.

The step of converting the input data into a format suitable for further analysis by rearranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern may comprise constructing a regular grid including a number of grid points equal to the number of the plurality of distributed source locations. The step of converting the input data into a format suitable for further analysis by re-arranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern may further comprise performing a one-to-one source location to grid point association, such that a single grid point relates to a single source location. In this way, an irregularly distributed plurality of source locations may be mapped onto a regular grid or matrix, which may then be manipulated to provide the predictions.

The one-to-one source location to grid point association may be performed in such a manner as to minimise an average spatial displacement of the source locations when they are associated to a respective grid point. In this way, spatial distortions in the converted data may be minimised. The one-to-one source location to grid point association may be performed in such a manner as to minimise an average spatial displacement of a portion of the source locations when they are associated to a respective grid point. In this way, spatial distortions of a main or most relevant portion in the converted data may be minimised. For example, the spatial distortion of source locations in a centre of an urban area of interest may be minimised, whilst peripheral source locations may be more heavily distorted.

The one-to-one source location to grid point association may be performed using a combinatorial optimisation algorithm. The one-to-one source location to grid point association may be performed using the Hungarian Algorithm.
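A minimal, non-limiting illustration of the one-to-one association follows. For clarity it brute-forces all permutations to minimise the total (hence average) spatial displacement, which is only tractable for a handful of source locations; at realistic antenna counts the Hungarian Algorithm solves the same assignment problem in polynomial time. Coordinates and names are assumptions:

```python
import itertools
import math

def best_assignment(sources, grid_points):
    """Find the one-to-one source-to-grid-point association minimising
    total spatial displacement (exhaustive search, illustration only)."""
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(grid_points))):
        # Sum of Euclidean displacements when source i maps to grid point perm[i].
        cost = sum(math.dist(sources[i], grid_points[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost

# Three irregularly placed antennae mapped onto a regular 1x3 grid.
sources = [(0.2, 0.1), (1.3, -0.1), (2.1, 0.2)]
grid = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
mapping, cost = best_assignment(sources, grid)
```

Here each antenna is associated to the nearest grid point, minimising the distortion introduced by the conversion.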

The at least one intermediate output may comprise at least one spatiotemporal pattern. The spatiotemporal correlations and/or spatiotemporal patterns may be abstract and/or hidden in the input data.

When making the prediction, the method may take, as a further input, further information that is unrelated to the data contained in the snapshots. For example, the further information may be a date and a time. It may be expected that traffic in a mobile telecoms network may increase at midnight on New Year's Eve, due to users of the network sending goodwill messages or the like. Such an increase in the traffic may not be discernible from previous data alone. Accordingly, providing the date and time may allow the method to make a more accurate prediction. Traffic data from previous years may be taken into account to allow the method to more accurately predict future per-service traffic consumption.

According to a second aspect of the present invention, there is provided an apparatus comprising at least one processor, wherein the at least one processor is configured to be operable to execute the method of the first aspect of the present invention. The apparatus may further comprise a memory operatively coupled to the at least one processor. The apparatus may further comprise at least one receiver configured to receive the input data.

Detailed Description

Figure 1 is a schematic diagram outlining the steps of a method of decomposing data.

Figure 1 is a schematic diagram 100 outlining the steps of a method of decomposing data. The first step 110 is to receive input data. Alternatively, the method may include a step prior to the first step 110 of collecting the input data from the distributed source locations. The input data typically comprises aggregate network traffic data collected from a plurality of distributed source locations. The aggregate data typically includes traffic data corresponding to a plurality of services operating over the network. The network may be a mobile service network, such as a 4G or 5G network. The aggregate data may therefore comprise all data moving about the network, and may include traffic data corresponding to a plurality of services operating over the mobile service network, such as mobile gaming and video streaming, among others.

The second step 120 is to convert the input data into a format suitable for further analysis. The conversion typically involves re-arranging and mapping the locations of the plurality of source locations such that the source locations are arranged in a regular grid pattern. The source locations may be the locations of a plurality of antennae which form part of the physical infrastructure of the mobile service network. The antennae may be distributed over an urban area, and may be separated by distances in the order of metres or kilometres. It is clear that the positions of the antennae will not be in a regular, grid-like, arrangement because their positioning is based on local demand and topography. Accordingly, to convert the input data into a regular, grid-like, arrangement suitable for further processing, the positions must be re-arranged. The conversion also typically involves separating the aggregate network traffic data into a time-dependent sequence of snapshots. Accordingly, the data may be given a timestamp and organised into a sequential order.
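The separation into a time-dependent sequence of snapshots may be sketched as follows. The record format `(timestamp, antenna_id, traffic_volume)` and the `antenna_to_cell` mapping (the output of the grid association described above) are illustrative assumptions:

```python
from collections import defaultdict

def to_snapshots(records, grid_shape, antenna_to_cell):
    """Organise aggregate traffic records into a time-ordered sequence
    of grid snapshots (illustrative sketch)."""
    rows, cols = grid_shape
    # One rows x cols matrix of traffic volumes per timestamp.
    frames = defaultdict(lambda: [[0.0] * cols for _ in range(rows)])
    for t, antenna, volume in records:
        r, c = antenna_to_cell[antenna]
        frames[t][r][c] += volume  # aggregate traffic observed at that cell
    # Return snapshots in sequential (timestamp) order.
    return [frames[t] for t in sorted(frames)]

records = [(0, "a1", 3.0), (0, "a2", 1.0), (1, "a1", 2.0)]
snaps = to_snapshots(records, (1, 2), {"a1": (0, 0), "a2": (0, 1)})
```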

The third step 130 is to analyse the converted data with a neural network. The neural network typically includes a plurality of neural layers. The analysis is typically used to extract, in a final neural layer of the plurality of neural layers, a plurality of outputs from the converted data. Each output typically corresponds to decomposed traffic volume of one service of the plurality of services operating over the network. For example, each output may correspond to one service operating over the mobile service network. As such, a single output may be extracted for each of the services, such as mobile gaming and video streaming.

The analysis at the third step 130 typically includes several analysis steps, which may be carried out in any order.

One analysis step is to determine a fraction of traffic that belongs to each of the plurality of services at each of the plurality of sources. Accordingly, a fraction of traffic belonging to each of the plurality of services for the past collected data may be determined. For example, the fraction of the aggregate data corresponding to each of the plurality of services in a timeframe immediately preceding the present time may be determined.
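The fraction determination may be illustrated as below for a single source location. Once the fractions are known (or estimated by the neural network), multiplying them by the aggregate volume yields the decomposed per-service traffic; the service names are hypothetical:

```python
def service_fractions(per_service_volumes):
    """Fraction of traffic belonging to each service at one source
    location (illustrative sketch; input maps service -> volume)."""
    total = sum(per_service_volumes.values())
    if total == 0:
        return {s: 0.0 for s in per_service_volumes}
    return {s: v / total for s, v in per_service_volumes.items()}

fractions = service_fractions({"video": 6.0, "gaming": 3.0, "messaging": 1.0})
# The fractions sum to one across the plurality of services.
```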

Another analysis step is to employ 3D deformable convolutions to at least partially mitigate spatial displacements introduced during the conversion of the input data and determine at least one intermediate output from the converted data. The at least one intermediate output may be abstract and may not have any significance without being processed further.

Another analysis step is to determine spatiotemporal correlations from the determined at least one intermediate output. 2D convolutions may then be employed to extract a plurality of outputs from the determined spatiotemporal correlations.

The fourth step 140 is typically predicting, based on the plurality of outputs obtained via the analysis in the third step 130, a future per-service traffic consumption. The predicted future per-service traffic consumption may reflect an expected future traffic in or on the network.

Although the network described herein is a mobile service network, it is to be understood that the method described herein is equally applicable to other networks, such as a WiFi network, a mobile telecoms service network, a broadband network, an Internet of Things sensor and actuator network, a distributed electricity distribution grid, a network of electricity consumption sensors, roadways, airways, shipping lanes, a network of air quality sensors, a network of household water or gas consumption meters, or a social network. Furthermore, any or each of the other steps described herein may be incorporated into the method. In particular, the predicted future per-service traffic consumption may be used to allocate network resources. It is also to be understood that alternative analysis methods, other than a neural network, 3D deformable convolutions and 2D convolutions as described herein, may be provided. Additionally, it is to be understood that the neural network, 3D deformable convolutions and 2D convolutions as described herein may function or operate in a different manner yet still output and/or obtain the same result, which is ultimately a predicted future per-service traffic consumption.

The disclosure of the present invention may be better understood with reference to the following paragraphs from the research paper entitled “Mobile Service Traffic Decomposition for Network Slicing Using Deep Learning” which is incorporated herein in its entirety, and is not limiting on the scope of the claimed invention which is set out in the claims included herein.

Abstract - Microscope: Mobile Service Traffic Decomposition for Network Slicing as a Service

The growing diversification of mobile services imposes requirements on network performance that are ever more stringent and heterogeneous. Network slicing aligns mobile network operation to this context, by enabling operators to isolate and customize network resources on a per-service basis. A key input for provisioning resources to slices is real-time information about the traffic demands generated by individual services. Acquiring such knowledge is however challenging, as legacy approaches based on in-depth inspection of traffic streams have high computational costs, which inflate with the widening adoption of encryption over data and control traffic. In this paper, we present a new approach to service-level demand estimation for slicing, which hinges on decomposition, the inference of per-service demands from traffic aggregates. By operating on total traffic volumes only, our approach overcomes the complexity and limitations of legacy traffic classification techniques, and provides a suitable input to recent 'Network Slice as a Service' (NSaaS) models. We implement decomposition through MICROSCOPE, a novel framework that uses deep learning to infer individual service demands from complex spatiotemporal features hidden in traffic aggregates. MICROSCOPE (i) transforms traffic data collected in irregular radio access deployments in a format suitable for convolutional learning, and (ii) can accommodate a variety of neural network architectures, including original 3D Deformable Convolutional Neural Networks (3D-DefCNNs) that we explicitly design for decomposition. Experiments with measurement data collected in an operational network demonstrate that MICROSCOPE accurately estimates per-service traffic demands with relative errors below 1.2%. Further, tests in practical NSaaS management use cases show that resource allocations informed by decomposition yield affordable costs for the mobile network operator.
Further information may be found in Zhang, C., Fiore, M., Ziemlicki, C. and Patras, P., 2020, 'Microscope: Mobile Service Traffic Decomposition for Network Slicing as a Service', in The 26th Annual International Conference on Mobile Computing and Networking (MobiCom '20), ACM, Online, United Kingdom, 21 September 2020.

Microscope: Mobile Service Traffic Decomposition for Network Slicing as a Service

1 Introduction

Next-generation mobile networks are expected to become a dominant General Purpose Technology (GPT) and enable new services with a trillion-dollar economic output [1], by fulfilling a growing variety of Quality of Service (QoS) needs that range from extreme mobile broadband (eMBB) for ultra-high-definition streaming, to ultra-reliable low-latency communication (uRLLC) for autonomous driving. In fact, strong service differentiation necessities are already emerging in current deployments, where live video streaming must coexist with on-line gaming or shared cloud services. An important instrument for mobile communication infrastructures to answer such diverse necessities is the flexibility in resource management granted by the increasing virtualization of network functions, including dynamic spectrum allocation [2], baseband processing [3], scheduling [4], or task containerization [5].

Network slicing and MANO. On top of these technologies, operators are foreseen to deploy network slicing, by isolating dedicated resources and providing customized logical instances of the physical infrastructure to each tenant [6]. Under emerging ‘Network Slice as a Service’ (NSaaS) models, tenants will obtain full control of the resources and functions allocated within the slices they hold. NSaaS will allow tenants to take advantage of their precise knowledge of end-to-end service performance (application-level Quality of Experience indicators for individual users) to drive fine-grained network slice configurations (flow-level traffic engineering) [7, 8]. In such scenarios, network operators remain in charge of performing the management and orchestration (MANO) of resources dedicated to each slice [9]. Critical to MANO is the anticipatory provisioning of isolated capacity (spectrum, computation, storage, or transport) to each slice. This requires knowledge of the total traffic demand generated by each mobile service at a time granularity of minutes, the resource reallocation periodicity supported by state-of-the-art Virtual Infrastructure Manager (VIM) and Network Function Virtualization (NFV) architectures [10].

Traffic classification for NSaaS. However, operators do not have the direct visibility of the service-level traffic required to estimate such demands, and have to resort to inference techniques. Current common practices for extracting this information combine two steps: (i) Deep Packet Inspection (DPI) to collect packet header metadata, by sniffing on the GPRS Tunneling Protocol user plane (GTP-U) via probes tapping into the interfaces of the Packet Data Network Gateway (P-GW) in 4G systems [11]; (ii) flow-level classification on such metadata to identify the associated service.

Yet, running DPI at line rate and at scale is computationally expensive, while the surge in mobile data traffic and rapid growth of transport link speeds beyond the Terabit-per-second barrier exacerbate the problem. Software-only DPI methods incur substantial latency, as they have to go through a demanding process of fetching each packet from the interface, buffering it, passing it to the CPU, waiting for the operating system task scheduler, and finally parsing the multiple protocol headers. This factually limits packet capture to 70% of the line rate, thereby disregarding a substantial fraction of traffic [12]. On the other hand, recent hardware-based or hybrid DPI solutions can operate close to the line rate [13], yet they come at a very high economic cost for the operator: dedicated FPGA hardware must be deployed at every collection point, and expensive equipment updates are required whenever new classification rules for emerging traffic categories are necessary.

Flow-level service identification on DPI-collected data also yields significant challenges. Due to the increasing adoption of traffic encryption, modern traffic classifiers rely on cleartext hostnames in DNS queries, or revealing fields in the TLS handshake, like the Server Name Indication (SNI) [14]. However, recent proposals for DNS over TLS [15] or encrypted SNI in TLS 1.3 [16] will make classifiers based on DNS and TLS ineffective, and oblige the development of more complex fingerprinting techniques [17]. Not to mention that the design of future flow-level classification solutions will be further challenged by privacy concerns, and will have to abide by emerging strict regulations on personal data protection [18].

Overall, there is a concrete risk that current trends in mobile data usage, network architecture performance, security and privacy will render scalable on-line flow-level classification an even more tangled and resource-intensive problem than it is already today.

Mobile traffic decomposition. In this paper, we propose an alternative approach to demand estimation for sliced network MANO. Abiding by the requirements of NSaaS, we target the inference of service-level demands that are required for capacity provisioning to individual slices. Our proposal builds on the novel concept of mobile traffic decomposition (MTD), the process of breaking down aggregate traffic time series into individual service demands, as illustrated by the several examples in the figures.

An accurate MTD can be a viable complement to legacy approaches for service-level demand inference in emerging NSaaS network management models. As exemplified in the figures, operators could persistently rely on traffic estimates from MTD to drive service-level capacity provisioning, as it is a lightweight solution based on straightforward monitoring of total traffic volumes. Expensive DPI-based classifiers would only be needed to collect the ground truth needed to validate (and possibly update) the MTD models in presence of evolution in the consumption of mobile applications. Hence, DPI-based solutions would be only run on-demand, sporadically and asynchronously across services, with substantially reduced resource requirements and operation costs.

Contributions. MTD is technically challenging, for multiple reasons: (i) the decomposition of a single signal into multiple time series yields inherent ambiguity among a multitude of feasible solutions; (ii) helpful complex spatial and temporal correlations exist in mobile traffic [19, 20], but capturing these to resolve the ambiguity is not trivial; (iii) traditional decomposition techniques used in other domains, including factorial hidden Markov models [21] or neural networks [22], work on single time series, whereas in our case multiple input time series must be concurrently decomposed at different network locations (diverse edge datacenters).

To tackle these challenges and achieve effective and scalable MTD, we design a dedicated deep learning framework. We rely on deep learning due to its demonstrated effectiveness in discovering knowledge from time series under spatial correlations [23], and in operating on large-scale mobile traffic in real time [24]. This leads to the following contributions:

I. We introduce the original concept of mobile network traffic decomposition (MTD), and prove that it can be a low-cost yet effective solution for the real-time inference of the demands generated by individual mobile services.

II. We propose MICROSCOPE, a framework that solves the MTD problem effectively by feeding suitably transformed mobile network traffic to interchangeable deep neural network architectures, including a new class of deformable convolutional neural networks that is explicitly designed for decomposition.

III. We experiment with metropolitan-scale measurement data collected in a production network serving a major European city, and show that MICROSCOPE can infer per-service traffic demands with relative errors below 1.2%.

IV. We provide a quantitative analysis of how MICROSCOPE would affect practical network operations, showing that resource allocations based on per-service demands estimated via MTD entail costs that are on par with those incurred when perfect knowledge of per-service traffic is available.

2 Mobile Traffic Decomposition

We provide a formal definition of the MTD problem in Sec. 2.1. Then, we outline the high-level structure of the framework we propose to solve such a problem in Sec. 2.2.

2.1 Problem Formulation

Let us consider a geographical region where mobile network coverage is provided by a set A of antennas,1 which accommodate traffic generated by a set S of mobile services. We denote by d_a^s(t) the traffic demand (expressed in Mbps) accommodated by antenna a ∈ A for a specific service s ∈ S at time t. The sum of demands over all services gives the aggregate demand d_a(t) = Σ_{s ∈ S} d_a^s(t) at antenna a and time t. A mobile network traffic snapshot is the set of demands recorded at all antennas in the target region at a specific time. This applies both to individual services, leading to a service snapshot D_s(t) = {d_a^s(t) | a ∈ A}, and to traffic aggregates over all services, obtaining an aggregate snapshot D(t) = {d_a(t) | a ∈ A}.

1 An eNodeB may serve users present in one or multiple sectors, each covered by a co-located antenna with a different azimuth. We refer to a single radio front-end unit as one 'antenna' hereafter.

The MTD problem is formally defined as that of inferring the service snapshots D_s(t) of all services s ∈ S at current time t, by only knowing the aggregate snapshots up to T previous time instants, D(t−T+1), …, D(t). If we denote by D_S(t) = {D_s(t) | s ∈ S} the set of current service snapshots, the solution to the MTD problem can be expressed as

D̂_S(t) = arg max p( D_S(t) | D(t−T+1), …, D(t) ),   (1)

where D̂_S(t) denotes the estimated current traffic demands disaggregated over the service set S for all antennas in A, and p(·) is the probability of the argument. The MTD results are restricted by two obvious constraints in the system:

d̂_a^s(t) ≥ 0, ∀ a ∈ A, s ∈ S,   (2)

Σ_{s ∈ S} d̂_a^s(t) = d_a(t), ∀ a ∈ A.   (3)

The expression in (2) enforces that all estimated traffic demands are positive, while (3) implies that the sum of the per-service traffic must be equal to the aggregate traffic.
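The two constraints are mechanical to verify. The following hedged sketch (function and service names are illustrative) validates a candidate decomposition at a single antenna and time instant:

```python
def satisfies_mtd_constraints(estimates, aggregate, tol=1e-6):
    """Check constraints (2) and (3) on a decomposition at one antenna.

    estimates: dict mapping service name -> estimated demand (Mbps)
    aggregate: measured aggregate demand at the same antenna and time.
    tol: numerical tolerance for the conservation check (assumption).
    """
    nonneg = all(v >= 0 for v in estimates.values())              # constraint (2)
    conserves = abs(sum(estimates.values()) - aggregate) <= tol   # constraint (3)
    return nonneg and conserves

print(satisfies_mtd_constraints({"video": 7.5, "gaming": 2.5}, 10.0))   # True
print(satisfies_mtd_constraints({"video": 11.0, "gaming": -1.0}, 10.0)) # False
```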

We stress that MTD is a fundamentally different problem from prediction, as it assumes knowledge of the aggregate measurement data at the current time instant t, and aims at inferring information relative to that same time instant. In fact, MTD is a one-to-many problem that seeks to decompose one aggregate traffic measurement into the underlying per-service snapshots: D_S(t) includes |S| snapshots, each featuring the same cardinality as D(t). Iteratively solving this problem over time ultimately leads to the reconstruction of per-service demand time series at each antenna from the aggregate traffic, as originally illustrated in the figures.

2.2 MICROSCOPE in a Nutshell

MICROSCOPE is a novel machine learning framework that is specifically designed for MTD. Here, we provide an overview of the framework, discussing the functionality and integration of the modules it comprises; the following Sec. 3-5 give full details about the implementation of each component.

As shown in the figures, there are three main elements in MICROSCOPE. The first is a traffic snapshot transformation block, which receives the current aggregate mobile network traffic measurement data D(t) and converts them into a format that is suitable for the following analysis. Specifically, this component limits the spatial distortion of antenna locations as they are fed to the neural network, by solving an opportune association problem. This transformation is critical to generalizing the framework, since it allows MICROSCOPE to accommodate any antenna deployment layout with minimum loss of geographical information. Details are in Sec. 3.

The second component implements a deep neural network model, whose goal is learning abstract spatiotemporal correlations that are unique to mobile traffic, as needed to solve the MTD problem. To this end, the model takes as input the transformed measurement data corresponding to the aggregate traffic recorded during the most recent T time instants. The framework can accommodate different neural network architectures, and we test a number of variants, including a novel 3D deformable convolutional neural network (3D-DefCNN) specifically designed for decomposition. Details are in Sec. 4.

The third component is concerned with the loss function used to drive the learning process. Also in this case, we consider and compare different options with properties that include (i) suitable output normalization, (ii) preservation of the total traffic across all service demands, and (iii) overall accuracy in solving the MTD problem. The final outputs are the estimated service snapshots as per (1). Details are in Sec. 5.

3 Mobile Traffic Transformation

MICROSCOPE relies on deep neural network models, and can accommodate a variety of architectures. Among those, convolutional neural networks (CNNs) are especially adapted to MTD, as they can take advantage of spatiotemporal correlations in mobile network traffic to improve the decomposition accuracy. Indeed, previous studies have repeatedly demonstrated the existence of spatial and temporal interrelationships between the demands generated by mobile users [25], which have proven essential in performing data-driven networking tasks. For instance, cell load forecasting is enhanced by harnessing information from adjacent sites, exploiting the fact that the mean of spatial conditional entropy varies with the number of considered adjacent cells [26]. Similarly, traffic super-resolution is improved when taking advantage of long-timescale correlations present in temporal sequences of traffic consumption snapshots [27]. In order to leverage these properties for MTD, the majority of models in Sec. 4 use convolutional entry layers.

However, convolutional layers require an input in matricial form. In our case, the input matrix must describe an aggregate snapshot D(t) of mobile network traffic, as explained in Sec. 2.1. The constraint on the input format creates the problem of mapping antennas to matrix elements. Radio access infrastructure deployments typically have irregular spatial distributions that are driven by the area topography and varied density of subscriber presence, hence are not easily matched to a matrix. This is illustrated in the figures, which portray the real-world antenna arrangement in the metropolitan-scale network we consider in our experiments. We present our strategy to address this problem in Sec. 3.1, while we discuss its performance and alternative options in Sec. 3.2.

3.1 Minimum Displacement Grid Mapping

We solve the mapping problem by constructing a regular grid that has the same number of points as the number of antennas, and performing a one-to-one antenna-to-point association that minimizes the displacement of the original antenna locations. The rationale for this design is twofold: (i) it produces a regular grid, an inherent matricial form; and, (ii) it aims at preserving spatial correlations in traffic that convolutional layers can take advantage of, by linking geographically close antennas to adjacent points of the grid.

3.1.1 Regular Grid Design

To preserve spatial correlations within mobile network traffic, the grid should (i) overlap with the coverage of the antennas as much as possible, and (ii) help limit spatial displacements after the mapping procedure. Given the target set of antennas, our grid design follows the logic below.

1. We project the locations of all antennas to a Euclidean space, determine the extreme values x_min, x_max, y_min, y_max on the x and y axes, and compute the aspect ratio of the overall coverage region as r = (x_max − x_min)/(y_max − y_min);

2. We dimension the grid so that it best reflects such an aspect ratio, choosing the numbers of rows and columns such that their ratio approximates r while the total number of grid points matches the number of antennas;

3. We assign geographical coordinates to the regular grid, by superposing the grid to the antenna positions, and then scaling distances by a factor κ to account for the heterogeneous antenna density typical of urban areas.

The center plot of the figures portrays the regular grid obtained with the technique above, in the case of the antenna deployment in the left plot of the same figure. In this scenario, we employed κ = 0.25 after extensive tests, and obtained a 33 × 24 grid.
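The dimensioning logic can be illustrated with a small sketch. The exact rounding rule below is an assumption (the text does not specify one); it is chosen so that, for 792 antennas (consistent with a one-to-one mapping onto the 33 × 24 grid quoted above), it reproduces those grid dimensions:

```python
import math

def grid_dimensions(n_antennas, x_extent, y_extent):
    """One plausible grid dimensioning rule (assumption): pick a row count
    near sqrt(n * r), where r = y_extent / x_extent is the coverage aspect
    ratio, then derive columns so the grid holds at least n points."""
    r = y_extent / x_extent                       # aspect ratio of the region
    rows = max(1, round(math.sqrt(n_antennas * r)))
    cols = math.ceil(n_antennas / rows)
    return rows, cols

rows, cols = grid_dimensions(792, 24.0, 33.0)
print(rows, cols)  # 33 24
```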

3.1.2 Antenna-to-Point Mapping

We aim to minimize the overall displacement when performing the one-to-one association between antennas and grid points. We define the displacement c_{a,p} as the Euclidean distance between the geographical positions of antenna a and grid point p. We construct a cost matrix C = (c_{a,p}) to represent the distances between antennas and grid points. We further define a binary matrix X = (x_{a,p}), where x_{a,p} = 1 if and only if the antenna a is assigned to the grid point p. Hence, we formulate the association problem as:

min_X Σ_a Σ_p c_{a,p} · x_{a,p},   (4)

s.t. Σ_p x_{a,p} = 1, ∀ a; Σ_a x_{a,p} = 1, ∀ p; x_{a,p} ∈ {0, 1},   (5)

where constraints enforce that one antenna will be assigned to only one grid point, and vice versa.

The expressions in (4)-(5) define an assignment problem that is efficiently solved via the Hungarian algorithm [28]. The algorithm has a polynomial complexity O(n^3), hence runs efficiently in practical cases. For the antenna deployment used in our experiments, in the figures, it returns the mapping in the right plot. We observe that only antennas on the outskirts of the urban area are subject to significant spatial shifts, and movements are otherwise small.
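For illustration, the assignment problem (4)-(5) can be solved exhaustively on a toy instance. A real deployment would use the Hungarian algorithm (e.g. SciPy's linear_sum_assignment) rather than this factorial-time enumeration; coordinates and sizes below are purely illustrative:

```python
import itertools
import math

def min_displacement_mapping(antennas, grid_points):
    """Brute-force solution of (4)-(5) for tiny instances only.

    antennas, grid_points: equal-length lists of (x, y) coordinates.
    Returns (mapping, cost) where mapping[i] is the grid point index
    assigned to antenna i, minimizing total Euclidean displacement.
    """
    def dist(a, p):
        return math.hypot(a[0] - p[0], a[1] - p[1])

    best_cost, best_map = float("inf"), None
    # Enumerate all one-to-one assignments; each permutation satisfies (5).
    for perm in itertools.permutations(range(len(grid_points))):
        cost = sum(dist(a, grid_points[j]) for a, j in zip(antennas, perm))
        if cost < best_cost:
            best_cost, best_map = cost, perm
    return best_map, best_cost

antennas = [(0.1, 0.0), (1.2, 0.1), (0.9, 1.1)]
grid = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
mapping, cost = min_displacement_mapping(antennas, grid)
print(mapping)  # (0, 1, 2): each antenna lands on its nearest grid point
```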

3.2 Comparison with Other Mapping Strategies

We provide an in-depth view of the proposed minimum displacement grid mapping's performance in the figures. The heatmaps illustrate the displacement incurred by antennas in the network considered for our experiments, upon traffic transformation. In the left plot, the displacement information is associated with the original antenna positions: this highlights how antennas at the conurbation borders may be shifted by 2 to 6 km, while displacements are below 1 km otherwise. The right plot associates displacement values to the final grid points, and offers an even clearer outlook of the mapping quality: the vast majority of antennas undergo shifts below 500 m, hence preserving substantial spatial relationships in the traffic.

Although it works reasonably well, our transformation is not the only possible approach to the generation of input traffic matrices. An alternative design could replace the regular grid tessellation of space with a different one, based on the triangular or hexagonal tiling that is traditionally adopted in ideal network deployment models. Such tessellations, however, are not immediately translated to a matricial form, and introduce one additional step in the transformation process, thus incurring higher displacement and weakened spatial correlations. For instance, we experiment with a Voronoi tessellation commonly adopted to mimic the coverage areas of antenna sites [29, 30]. To this end, we move antennas to the barycenter of their corresponding Voronoi polygon, and then employ the approach in Sec. 3.1.2. This design yields a very similar traffic matrix to that obtained with our method: 88% of the antennas are mapped to the exact same grid point, and differing elements are 1.66 cells apart (shifted to a neighboring grid point) on average. Still, the Voronoi design above leads to an average displacement of antenna original locations increased to 1.29 km from 1.22 km in our transformation.

A different solution to reconciling real-world antenna deployments with the input requirement of CNNs could be spatial supersampling [31]. This involves approximating the coverage area of each antenna (via the Voronoi tessellation above), assuming the traffic recorded at each antenna to be uniformly distributed within the associated coverage area, and superposing a dense grid to the resulting continuous traffic map. Then, each matrix element can be easily filled with the traffic in the underlying map. However, this option has significant shortcomings, since (i) it introduces strong unrealistic assumptions about the coverage and spatial distribution of users, and (ii) in the case of dense grids, it artificially increases the size of the matrices and the computational cost of the CNN.

4 Deep Neural Network Models

The core of MICROSCOPE is a deep learning architecture. The framework is flexible, and can accommodate a variety of neural network models. We experiment with a number of different architectures proposed in the machine learning literature, which are summarized in Tab. 1, and detailed in Sec. 4.1. In addition, we design a novel 3D deformable convolutional neural network (3D-DefCNN) that is tailored to MTD in especially complex scenarios: it compensates for the spatial displacement in network traffic snapshots, discovers spatiotemporal correlations in aggregate traffic, and exploits them for decomposition, as thoroughly discussed in Sec. 4.2.

We remark that the openness to various deep learning architectures is an important feature of the framework. Indeed, our performance evaluation results, presented later in Sec. 6, indicate that there is not a single neural network that works best for all decomposition tasks, but different models should be adopted depending on the target network management scenario.

4.1 Legacy Architectures

We consider a comprehensive set of five legacy deep learning models among the many options available in the vast machine learning literature. (i) The multi-layer perceptron (MLP) is the simplest neural network class [32] and we consider it as a baseline solution. (ii) Convolutional neural networks (CNNs) are commonly used for imaging applications [33]. (iii) Long Short-Term Memory (LSTM) is frequently exploited for modelling sequential data [34], such as speech signals and natural language. (iv) Convolutional LSTM (ConvLSTM) is a dedicated model for spatiotemporal data forecasting [35]. (v) Deep zipper networks (ZipNets) were originally proposed for mobile traffic super-resolution tasks with remarkable results [27].

A sixth variant for the neural network model integrated in MICROSCOPE is the (vi) deformable convolutional neural network (DefCNN) [36], originally proposed for computer vision applications, such as image classification [37], object detection and semantic segmentation [38]. Whilst classic CNNs apply fixed geometric transformations to a 2D space, DefCNN architectures perform deformable convolutions over the same input. This allows flexible transformations that can compensate for distortions in the 2D input space.

Table 1: Legacy neural networks configuration.

As discussed in Sec. 3, we adopt a traffic snapshot transformation that mitigates but cannot completely remove the displacement of antenna locations into a regular grid. Therefore, aggregate snapshots recorded in the 2D geographical space are distorted, with a risk that spatial correlations in the original data are misrepresented in the input matrix. DefCNN can reduce such a risk, by enabling convolutional filters to access any element of the input matrix. These connections are dynamically tailored to the input, and can be learned jointly with the other synapses via gradient descent. The approach is equivalent to letting the neural network re-organize the 2D spatial structure of the input data. In our context, this allows identifying and exploiting spatial correlations in the mobile network traffic that the transformation may weaken.

4.2 3D-DefCNN

We propose an additional (vii) 3D-DefCNN model, which is an enhancement of the DefCNN. The 3D-DefCNN design stems from an original 3D deformable convolution operation, which is detailed next, along with the overall network architecture.

4.2.1 3D Deformable Convolution

Legacy DefCNNs perform deformable convolution over 2D input matrices. In MTD, the data fed to the network includes time, thus it is three-dimensional. Specifically, the input consists of the aggregate snapshot set (Sec. 2.1), which is transformed into T subsequent 2D matrices (Sec. 3.1.2). Hence, we extend the deformable convolution operation to the temporal dimension. In doing so, we account for the fact that not all of the last T input snapshots have the same relevance to MTD at the current time instant, and weight the different 2D input matrices in an adaptive manner.

To achieve our objective, we combine the DefCNN model with 3D convolution, a technique previously used for action recognition [39], which operates over both time and a bidimensional space. Essentially, 3D convolution performs summation over an input map x weighted by a filter matrix w, which results in an output map y.2 Each convolutional filter w has a receptive field G, which defines the spatiotemporal extent of connectivity between different locations in the input. As an example, the receptive field of a 3 × 3 × 3 convolutional filter can be defined as G = {(−1, −1, −1), (−1, −1, 0), …, (1, 1, 1)}. For each location p_y of the output y, the 3D convolution performs the following calculation:

y(p_y) = Σ_{p_G ∈ G} w(p_G) · x(p_y + p_G),   (6)

where p_G is a 3D vector in G. For instance, if p_y = (0, 1, 2) and p_G = (0, 2, 1), then x(p_y + p_G) is the value at (0, 3, 3).
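The summation in (6) can be sketched directly. A sparse dict stands in for the input tensor, and out-of-range samples are assumed to read zero (a zero-padding assumption the text does not state):

```python
def conv3d_at(x, w, p_y, G):
    """Value of the 3D convolution output y at location p_y, per (6):
    y(p_y) = sum over p_G in G of w(p_G) * x(p_y + p_G).

    x: dict mapping integer (t, i, j) locations -> values.
    w: filter weights, aligned element-wise with the receptive field G.
    Missing (out-of-range) input samples read 0.0 (padding assumption).
    """
    total = 0.0
    for p_G, weight in zip(G, w):
        q = (p_y[0] + p_G[0], p_y[1] + p_G[1], p_y[2] + p_G[2])
        total += weight * x.get(q, 0.0)
    return total

# A tiny 1x1x2 filter that sums a sample and its right-hand neighbour.
G = [(0, 0, 0), (0, 0, 1)]
w = [1.0, 1.0]
x = {(0, 1, 2): 3.0, (0, 1, 3): 4.0}
print(conv3d_at(x, w, (0, 1, 2), G))  # 7.0
```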

Our proposed 3D-DefCNN extends (6) above with deformability, by adding an offset Δp_G to each p_G:

y(p_y) = Σ_{p_G ∈ G} w(p_G) · x(p_y + p_G + Δp_G).   (7)

Note that: (i) for each location p_y a different set of offsets may be applied; (ii) for each location, we have |G| three-dimensional offsets, which can be globally seen as maps of offsets on the input grid, or one map for each (spatial or temporal) dimension; (iii) all offset maps are learned by an additional 3D-CNN layer that takes x as input.

We illustrate the principle of 3D deformable convolution in the figures. As mentioned above, we first apply a 3D-CNN structure (3 layers) onto the original convolutional filter (which is compact) to predict the offset maps Δp_G. These offset maps essentially seek to alter the position of the filter elements, in order to scan, at each step, locations in the input that are not necessarily adjacent. The offsets are learned and shared across the different input channels. Subsequently, we apply these offsets to the original convolutional filter, to construct a deformed convolutional filter. Finally, by performing convolution between input and deformed filter, we obtain the output at location p_y via (7). Since 3D deformable convolution operations are fully differentiable, the offsets to be applied can be learned through standard back-propagation. As such, 3D-DefCNNs grant convolutional filters complete freedom to query any location in the input maps. This significantly improves the model flexibility and enables adaptation to any spatial displacement (spatial deformation) and diverse importance of historical data (temporal deformation).

A complication introduced by equation (7) is that the sample positions can be fractional. Indeed, while p_y and p_G are indices of y and x, and thus integer values, Δp_G may not be integer, as it is estimated by a CNN. To solve the issue, we compute the deformed input x(p) in (7), with p = p_y + p_G + Δp_G, as:

x(p) = Σ_{q ∈ Q(p)} Tri(q, p) · x(q),   (8)

where Q(p) are the locations of the eight input samples around p, and Tri(·) denotes the trilinear interpolation operator [40], whose principle we show in the figures.

2 In action recognition tasks, x and y are four-dimensional tensors: the first three are the spatiotemporal dimensions; the fourth is the RGB channel dimension. This allows defining dedicated filters on each channel. In our case, we employ a single filter shared across all channels of the input. Our neural network will still produce multiple channels throughout the hidden layers, which do not have a direct physical meaning, but rather provide intermediate outputs that aid extracting abstract features. The final layer making predictions has however one output channel for each mobile service.
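The trilinear read-out in (8) can be sketched as follows. The sparse-dict input representation and the zero value for missing samples are illustrative assumptions:

```python
import math

def trilinear_sample(x, p):
    """Read x at a fractional 3D location p by trilinear interpolation,
    as in (8): a weighted sum over the eight surrounding integer samples.

    x: dict mapping integer (t, i, j) -> value; missing samples read 0.0.
    """
    base = tuple(math.floor(c) for c in p)
    frac = tuple(c - b for c, b in zip(p, base))
    total = 0.0
    for dt in (0, 1):
        for di in (0, 1):
            for dj in (0, 1):
                q = (base[0] + dt, base[1] + di, base[2] + dj)
                # Corner weight shrinks linearly with the per-axis distance.
                weight = ((frac[0] if dt else 1 - frac[0])
                          * (frac[1] if di else 1 - frac[1])
                          * (frac[2] if dj else 1 - frac[2]))
                total += weight * x.get(q, 0.0)
    return total

x = {(0, 0, 0): 0.0, (0, 0, 1): 2.0}
print(trilinear_sample(x, (0.0, 0.0, 0.5)))  # 1.0, halfway between 0 and 2
```

Because every corner weight is a linear function of p, the operator is differentiable with respect to the fractional position, which is what allows the offsets to be trained by back-propagation.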

4.2.2 Overall 3D-DefCNN Structure

We embed the 3D deformable convolutional operations above in the complete 3D-DefCNN structure shown in the figures. Our architecture design encompasses three major components: (i) 3D deformable convolutional blocks, (ii) zipper convolutional blocks, and (iii) 2D convolutional blocks.

The 3D deformable convolutional blocks consist of stacks of 3D deformable convolutional layers, batch normalization (BN) layers [41], and leaky rectified linear unit (LReLU) activation layers [42]. As previously detailed, 3D deformable convolutions are employed to mitigate spatial displacements and perform adaptive weighting over historical observations; in addition, they extract important spatiotemporal patterns in mobile network traffic. BN layers normalize the output of each layer over a batch. This effectively reduces the variance of the output and can significantly accelerate model training. LReLUs act as activation functions. They improve the model's non-linearity and representational power, which enables the model to extract even more abstract features.
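To make the block layout concrete, the following sketch chains the three operations of one block. This is a hypothetical plain-NumPy illustration (the real model implements these as TensorFlow layers, and `conv_fn` is a stand-in for the 3D deformable convolution itself).

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature over the batch axis, as BN does at training time
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def leaky_relu(x, alpha=0.2):
    # LReLU keeps a small slope for negative inputs instead of zeroing them
    return np.where(x > 0, x, alpha * x)

def def_conv_block(x, conv_fn):
    """One 3D deformable convolutional block: convolution -> BN -> LReLU.
    conv_fn is a placeholder for the 3D deformable convolution."""
    return leaky_relu(batch_norm(conv_fn(x)))
```

Stacking several such blocks yields the first component of the 3D-DefCNN architecture.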

The zipper convolutional blocks receive the output of the 3D deformable convolutional blocks and are responsible for feature extraction. The structure of these blocks is inspired by that of deep zipper networks (ZipNets), originally proposed for mobile data traffic super-resolution [27], which work demonstrably well in extracting spatiotemporal correlations hidden in this type of measurement data. More precisely, global and multiple skip connections are employed within these blocks, to perform effective residual learning [43], which is known to make the model more robust by constructing an ensemble of neural networks with different depths [44]. Skip connections also significantly smooth the loss surface, which enables faster convergence of model training [45].

Upon processing by the zipper convolutional blocks, the mobile network traffic data is transformed into highly abstracted representations, ready for the final MTD inference. This is performed by standard 2D convolutional blocks. Compared to the previous blocks, these are configured with a larger number of feature maps, so as to provide sufficient information for the inference process. The last layer of this block has one output channel, i.e., one feature map, per mobile service, each corresponding to the decomposed traffic volume of an individual service.

5 Driving the learning process

We explore three different methods to train the neural network so that it solves the MTD problem in (1). Namely, we test (i) regression, (ii) ratio prediction trained with Mean Square Error (MSE), and (iii) ratio prediction trained with Cross-Entropy (CE).

1. The Regression method trains neural networks with a loss function over the traffic volume that aims at minimizing the difference between the estimated service demand d̂_s(t) and the ground truth d_s(t):

L_Reg = (1/T) Σ_{t=1..T} Σ_{s∈S} ‖d̂_s(t) − d_s(t)‖²   (9)

Due to model imperfections, the output obtained with this approach may violate the constraints (2) and (3).

2. The ratio prediction method based on Mean Square Error (MSE) seeks to infer the fraction r_s(t) of traffic consumed by each service relative to the corresponding aggregate, i.e., r_s(t) = d_s(t) / Σ_{k∈S} d_k(t). Clearly, Σ_{s∈S} r_s(t) = 1. A softmax function allows an equivalent transformation for the estimated service demand:

r̂_s(t) = exp(z_s(t)) / Σ_{k∈S} exp(z_k(t))   (10)

Here z_s(t) denotes the intermediate (unnormalized) output of the neural network. In essence, the above maps a vector output with elements of arbitrary values onto a probability vector whose elements are in the (0,1) range [46]. The expression of r̂_s(t) in (10) satisfies both (2) and (3). The network is trained to minimize the MSE between r̂_s(t) and r_s(t), with a loss function:

L_MSE = (1/T) Σ_{t=1..T} Σ_{s∈S} (r̂_s(t) − r_s(t))²   (11)

Note that the actual mobile service demand can be easily retrieved from the output ratio as d̂_s(t) = r̂_s(t) · Σ_{k∈S} d_k(t). We refer to this approach as MSE in the following.

3. The cross-entropy (CE) loss function is formally:

L_CE(t) = − Σ_{s∈S} r_s(t) log r̂_s(t)   (12)

for the service snapshot estimated at time t. The overall CE for a given time span can then be computed as the average over all time instants t. This expression is minimized when the estimated and actual traffic demand ratios match, i.e., r̂_s(t) = r_s(t) for all s. This maps to minimizing the Kullback-Leibler (KL) divergence between the model and target distributions, under the assumption that the ratio follows a multinomial distribution [47]. Training a neural network with CE usually finds a better optimum than other loss functions when the output is normalized [48], as in our case. This will be later confirmed by comparative evaluations.
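The three training objectives above reduce to a few lines of code. The following NumPy fragment is an illustrative sketch with hypothetical variable names (the actual training operates on TensorFlow tensors and batches): z is the raw network output, r the true ratio vector, d the true demand vector.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def regression_loss(d_hat, d):
    # (i) plain MSE on traffic volumes; estimates may violate (2) and (3)
    return np.mean((d_hat - d) ** 2)

def ratio_mse_loss(z, r):
    # (ii) MSE between softmax-normalized outputs and true per-service ratios
    return np.mean((softmax(z) - r) ** 2)

def ratio_ce_loss(z, r, eps=1e-12):
    # (iii) cross-entropy, minimized when predicted ratios match the true ones
    return -np.sum(r * np.log(softmax(z) + eps))
```

Note that when the predicted ratios equal the true ones, the CE loss equals the entropy of the true ratio distribution, its minimum value.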

6 Experiments

We implement our framework using the open-source Python libraries TensorFlow [49] and TensorLayer [50]. We train and evaluate the framework on a high-performance computing cluster with two NVIDIA Tesla K40M GPUs (2880 cores each). The model parameters are optimized using the popular Adam stochastic gradient descent-based optimizer [51].

The evaluation is organized as follows: we first present in Sec. 6.1 the reference scenario used to run our experiments, and then introduce in Sec. 6.2 the metrics used to assess the accuracy of our framework. We perform in Sec. 6.3 a comprehensive comparative evaluation of MTD performance on mobile network traffic recorded at the antenna level, which corresponds to end-to-end NSaaS settings where each network slice enjoys dedicated spectrum [52]. In Sec. 6.4, we investigate how such performance varies at different network levels, including MEC facilities, and C-RAN or core network datacenters, where network slices are also to be implemented. Finally, in Sec. 6.5 we comment on the complexity of the different models used in our framework, discuss accuracy-complexity trade-offs, and investigate the importance of the temporal deformation operation for the overall system performance.

6.1 Reference Scenario

We conduct our experiments on real-world 3G/4G mobile network traffic data collected by Orange, a major European mobile operator, in a large metropolitan area during 85 consecutive days. Each mobile network traffic snapshot consists of the total demand accumulated over a 5-minute time interval at 792 different antenna sectors, for different mobile services separately.

The measurement data is collected via DPI at the P-GW, and proprietary traffic classifiers are used to associate flows to specific services. Due to confidentiality constraints, we do not disclose the target urban region, or operation details of the classifiers. However, internal performance assessments by the operator report a typical flow-level classification accuracy at around 90%. We remark that all measurements were carried out in compliance with applicable local and European regulations, under the supervision of the competent national privacy agency. In particular, the dataset employed in our study only contains information on mobile service traffic accumulated at the antenna level, and does not hold any personal information about individual subscribers.

The set of services S considered in our analysis includes mobile games (Clash Royale and Clash of Clans, grouped under the Supercell label), messaging apps (Snapchat and WhatsApp), social media (Facebook, Twitter, Instagram), video (YouTube) and audio (Spotify) streaming platforms, as well as Google services. The figures illustrate the fraction of total traffic induced by each service. The rationale for selecting these nine services is that they are reasonable candidates for the allocation of dedicated network slices. Indeed, they have clear Quality of Service (QoS) requirements, and they are heavy hitters that generate sizeable amounts of network traffic: owing to the well-known Zipfian distribution of the demand across services [11], these services are in fact responsible for more than 42% of the total mobile data traffic in the target region. In addition, the set S encompasses a variety of application types, whose demands yield strongly dissimilar temporal dynamics [11]; as such, these services provide a challenging but realistic ground for MTD.

The neural network is trained and validated on data collected by the operator during the first 68 days (80%), and tested on the traffic observed during the last 17 days (20%). In these conditions, the training process converges at the tenth epoch, taking around 48 hours in total. Performing MTD inference in the considered large-scale scenario requires less than 1 second per instance.

6.2 Performance Metrics and Benchmarks

We evaluate the performance of our framework by means of two complementary metrics, the mean absolute error (MAE) and the Normalized MAE (NMAE). MAE is a commonly employed measure of prediction accuracy, and is known to be stable and mostly insensitive to large error values [53]. It is formally defined as:

MAE = (1/(|S| · |L| · T)) Σ_{s∈S} Σ_{l∈L} Σ_{t=1..T} |d̂_s(l,t) − d_s(l,t)|   (13)

In addition, we capture the relative significance of the error via the NMAE, which normalizes the MAE by the range of the ground-truth traffic measurements. Formally, it is expressed as:

NMAE = MAE / (max_{s,l,t} d_s(l,t) − min_{s,l,t} d_s(l,t))   (14)

Note that in both (13) and (14), the set of antennas L shall be replaced by the set of facilities or datacenters, depending on the network level at which MTD is performed.
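Both metrics reduce to a couple of lines of NumPy. The sketch below is illustrative (names are ours) and operates on flattened arrays of estimated and true demands:

```python
import numpy as np

def mae(d_hat, d):
    # mean absolute error over all services, locations and time instants
    return np.mean(np.abs(d_hat - d))

def nmae(d_hat, d):
    # MAE normalized by the range of the ground-truth measurements
    return mae(d_hat, d) / (d.max() - d.min())
```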

6.3 MTD at radio access

At radio access, network slicing can provide high QoS guarantees by isolating spectrum or even dedicated antenna sites for specific mobile services [54, 55]. To fulfill this vision, resource management decisions need to be made as close as possible to the user, which in turn calls for fine-grained spatial estimates of mobile service traffic. Our first scenario for the evaluation of MTD is thus one where decomposition is carried out at the antenna sector level.

6.3.1 Comparative analysis

As MTD is a completely novel problem, there is no previous solution (neither based on deep learning, nor on other approaches) that we can directly use as a benchmark. Our modus operandi for a comparative evaluation is thus to assess how the different neural network models in the literature listed in Sec. 4 perform once integrated in our framework.

The figures give an overview of such a comparative performance evaluation in the RAN setting, under all training methods listed in Sec. 5. We emphasize in bold the error yielded by the best model with each loss function, while the overall best performance is highlighted in red. Overall, the results are very promising. Under all configurations, our framework decomposes the total traffic recorded at one antenna into individual service-level demands with a 0.5% relative error, or lower. The corresponding MAE is always below 0.37 Mbps, which is a very reasonable inaccuracy for spectrum resource allocation to slices.

A closer look at the performance of individual schemes reveals that training with the CE loss function on traffic demand ratios leads to better estimates than using Regression or MSE. While this holds in the vast majority of cases, the improvement brought by CE over the benchmark loss functions is especially consistent for the 3D-DefCNN architecture, where it allows performance gains of up to 10% in terms of MAE over the competing loss functions. This confirms that the normalization and its combination with CE indeed improve the MTD accuracy of our framework.

As far as neural network architectures are concerned, the MLP delivers the poorest MTD performance among all approaches considered, as it lacks blocks capable of extracting spatial features. As such, the added estimation error introduced by the MLP with respect to 3D-DefCNN can be as high as 141%. In contrast, the LSTM can effectively model the temporal correlations inherent to mobile traffic, which leads to much improved results. Smaller but further improvements are then granted by convolutional architectures that can leverage spatial correlations. The performance of CNN, ConvLSTM, ZipNet, DefCNN and 3D-DefCNN is fairly aligned. Still, 3D-DefCNN yields error reductions over all other convolutional approaches that range between 7% and 33%.

The results indicate that: (i) both spatial and temporal features are important to solve the MTD problem; (ii) the spatial deformation operations help reorganize the geographical displacements in the aggregate snapshots provided as input; (iii) incorporating temporal deformation into the model enables scanning historical data with frequencies that are tailored to their importance, akin to an attention-like mechanism [56] that samples relevant information more frequently; (iv) a 3D-DefCNN deep learning architecture trained using a CE loss function attains the highest accuracy, although performance only slightly worsens, and stays very good, under the other schemes.

6.3.2 Service-level performance of our framework

We now focus on the best-performing combination of 3D-DefCNN and the CE loss function, and shift our attention to the MTD performance on a per-service basis. The figures offer a breakdown of MAE (left bars) and NMAE (right bars) across the nine services in our reference set S. There exists some apparent variability in the estimation quality across services. For instance, streaming services (YouTube and Spotify), which consume a large fraction of the total traffic, are also subject to higher MAE. However, this does not necessarily lead to high relative error: in fact, their NMAE is as low as 0.61% and 0.29%, respectively. In contrast, MTD works very well for gaming traffic (Supercell), due to the smaller volume of data generated by such apps. The associated NMAE is the highest recorded among all services, yet it stays at a very reasonable 1.2%. Overall, the MTD performance of our framework is again remarkable, as the inference errors are well below 1% for high-demand services.

An illustrative example of the MTD quality granted by our framework is in the figures, which show the inferred time series of the demand for three representative services, Spotify, Facebook, and Snapchat, based on the sole input provided by the aggregate traffic (top left subplot). The result focuses on one test week at a single, randomly chosen antenna. Observe that although the Spotify demand exhibits frequent fluctuations, our framework still captures the overall traffic profile well (top right). As for Facebook and Snapchat, the predicted traffic is very close to the ground truth. This confirms that our framework yields precise MTD, irrespective of the service type.

6.4 MTD at network datacenters

Network slicing heavily relies on the capability of the operator to dynamically orchestrate virtualized functions at the network edge and core [6]. Similarly to the radio access case, this requires efficient data-driven resource orchestration, fueled by service-level demands. In order to assess the flexibility of our framework in helping slice management in heterogeneous edge and core network scenarios, we consider three use cases: (i) fifty MEC facilities deployed at the edge, each aggregating traffic from 10-20 antennas; (ii) thirty C-RAN datacenters, each providing MAC-layer functionalities for 20-40 antennas; (iii) ten core network datacenters implementing Serving Gateway (S-GW) functions [57], each accommodating traffic generated by over 50 antennas. To decide on the associations between antennas and MEC facilities, C-RAN and core datacenters, we run the balanced graph κ-partitioning algorithm proposed in [52] over the Delaunay triangulation graph [58] of the 792 antenna locations. This creates a fixed number κ of antenna clusters (50, 30 or 10, for MEC, C-RAN and core datacenters, respectively) that serve comparable traffic loads, while minimizing the latency (distance) with respect to the associated antennas. Examples of partitions returned by the algorithm via the Karlsruhe Fast Flow Partitioning (KaFFPa) heuristic [59] are provided in the figures for the core and C-RAN datacenter scenarios.
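The first step of this partitioning, building the Delaunay triangulation graph of the antenna sites, can be sketched with SciPy. This is an illustrative fragment under our own naming; the balanced κ-partitioning itself is performed by the external KaFFPa tool and is not reproduced here.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the edge set of the Delaunay triangulation of 2D antenna
    locations; this is the graph a balanced k-partitioner would cut."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:        # each simplex is a triangle
        for a in range(3):
            for b in range(a + 1, 3):
                edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))
    return edges
```

The resulting edge list, together with per-antenna traffic loads as vertex weights, forms the input to the graph partitioner.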

Although the architectural settings above are arbitrary, and do not necessarily reflect the future organization of the NFV-compliant mobile network in the target region, they provide a reasonable approximation of such next-generation deployments, and allow us to investigate the performance of our framework in diverse virtualized network scenarios. In all these cases, the geographic locations of the datacenters (used by our framework for spatial correlation inference) are the cluster centroids. We also consider different time resolutions for the management of NSaaS, from 5 minutes to 1 hour. The rationale is that Virtual Network Functions (VNFs) and their associated resources at the datacenter level are likely to be reconfigured over longer timescales than at radio access. Hence, varying the temporal granularity offers a more complete analysis of the system.

Table 2: MAE (top, in MB per NSaaS management interval) and NMAE (bottom, in %) returned by all models over NSaaS management intervals from 5 minutes to 1 hour. Each element reports the mean MTD error at core datacenters (left), C-RAN datacenters (middle) and MEC facilities (right).

Results are summarized in Tab. 2, using a CE loss function for all models. The minimum recorded error for each combination of network level and time resolution is highlighted in bold. The key takeaway is that our framework still performs very well, and MTD allows reconstructing service-level demands with best relative errors well below 2% in all NSaaS network management settings.

Again, 3D-DefCNN outperforms all other architectures as long as MTD is run at high frequency (5 to 30 minutes) and closer to the user (at MEC facilities and C-RAN datacenters). We remark that these are the most challenging conditions for the estimation of mobile service traffic. Instead, when mobile traffic is accumulated in large volumes, by considering hourly aggregates or demands at network core datacenters, the LSTM yields the lowest errors. In this case, coarse temporal granularities lead to time series that are more regular, and generally easier to decompose. Similarly, considering traffic combined at a relatively small number of core datacenters diminishes the impact of spatial correlations between locations. Under these settings, the complexity of 3D-DefCNN becomes unnecessary, whereas the LSTM thrives by avoiding the search for complicated interactions that are in fact absent from the input data. For the same reason, the performance of all CNN-based models improves as we move from the core to the edge of the network, where the impact of spatial correlations increases.

Overall, our analysis indicates that: (i) MTD retains high accuracy and is a viable approach to service-level demand estimation also for NSaaS at the mobile network edge and core; (ii) different neural network models shall be adopted within our framework to ensure the best performance at different network locations, as architectures capable of extracting spatial features are important close to the radio access, whereas temporal correlations become more critical in the network core and at longer management timescales.

6.5 Complexity Analysis

We also evaluate the complexity of the models in terms of floating point operations (FLOPs) per inference instance, a metric frequently employed with neural networks [60]. The number of FLOPs is computed by counting the number of mathematical operations or assignments that involve floating-point numbers. As shown in the figures, all models that include convolution operations (CNN, ConvLSTM, ZipNet, DefCNN, and 3D-DefCNN) are sensitive to the spatial granularity of the input data, hence their complexity varies significantly with the mobile network level. In contrast, the MLP and LSTM yield a complexity that is not affected by the input size.
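To illustrate why convolutional models are sensitive to the input size, the standard FLOP estimate for a dense 3D convolutional layer scales linearly with the spatial extent of the output. The helper below is a hypothetical back-of-the-envelope sketch (the factor 2 accounts for one multiply plus one add per filter tap), not the exact counting procedure used in our evaluation.

```python
def conv3d_flops(c_in, c_out, kt, kh, kw, t, h, w):
    """Estimate FLOPs for one dense 3D convolution: multiply-adds per
    output element, times the number of output elements."""
    return 2 * c_in * kt * kh * kw * c_out * t * h * w
```

Doubling the spatial resolution of the input (and hence of the output) quadruples the estimate, while MLP and LSTM costs depend only on their fixed layer widths.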

Interestingly, although the LSTM yields the best accuracy at core network datacenters, as indicated in Tab. 2, it also entails the highest complexity in that scenario, exceeding that of 3D-DefCNN by more than one order of magnitude. The LSTM also has very high complexity when MTD is run at C-RAN datacenters or MEC facilities. Conversely, the computational requirements of CNN-based models surpass those of the LSTM only for antenna-level MTD.

Moreover, the figures make it clear that deformable operations do not introduce significant additional complexity compared to a plain CNN, despite the important advantage in terms of accuracy in the most testing scenarios, which require fast management of NSaaS resources located close to the user. The reason is that a 3D deformable convolution effectively enables the scanning of historical data with frequencies that depend on their importance, akin to an attention-like mechanism that samples relevant information more frequently [56]. We illustrate the effect in the figures, which compare the spatiotemporal distribution of the elements of one tensor input (a set of consecutive aggregate traffic snapshots) visited by a legacy 3D convolutional ingress layer (employed by ZipNet and DefCNN) and by our novel 3D deformable convolutional ingress layer (adopted by 3D-DefCNN). The plots also show the density of the sampled points projected onto the temporal dimension, as black solid lines subtending a shaded area.

The structure shown on the left is not flexible, leading to a regular sampling in space and time, which results in a uniform distribution of points over all input matrices. Instead, the 3D deformable convolutional layer learns which positions of the input tensor are the most relevant: as a result, it samples elements in each input matrix non-uniformly, with a higher density of samples in recent input matrices. This way, the layer can better exploit the more relevant information in fresher snapshots. While the figures just provide a visual example for one specific tensor input, the same behavior is observed at the input layer of 3D-DefCNNs trained in all network scenarios and at all NSaaS management timescales. As a result, 3D-DefCNN models sample the input better, not more often, which explains the improved performance at equivalent complexity.

Further details on the 3D-DefCNN computational cost are provided in the figures, which break down the complexity associated with each block of the model. The cost ratio of the three components remains fairly comparable across all network settings, proving that a 3D deformable convolution does not entail complexity surges in any scenario, and thus corroborating the considerations above.

Overall, 3D-DefCNN requires around 4 × 10⁹ FLOPs per inference instance for antenna-level MTD, which is easily handled by modern CPUs: an Intel Core i7-980X can execute 1.076 × 10¹¹ FLOPs per second, and can thus fully support real-time MTD.

7 NSaaS Management Use Cases

We complete our evaluation of the performance of our framework by assessing the viability of MTD in practical case studies of NSaaS resource management. Specifically, we take the perspective of the mobile network operator, and assess the costs incurred when our framework is used to determine the capacity allocated to each network slice, solely based on aggregate traffic information.

7.1 Datacenter resource management

At network datacenters, resource management costs directly stem from estimation errors in the decomposed demands, which may entail Service-Level Agreement (SLA) violations (if the per-service traffic estimate is below the reference) or the overprovisioning of unnecessary capacity to specific slices (if the same estimate is above the reference). To assess the quality of MTD in this context, we re-train our framework with a recently proposed loss function that specifically aims at minimizing monetary costs for network operators [61], hence fully integrating our framework into NSaaS management operations. The loss function is:

(15) where

(16)

Here, α controls the cost of an SLA violation, and ε is a small constant.

Table 3: Total SLA violation cost and overprovisioning cost determined by MTD at different network levels.

The expression accounts for the cost of SLA violations (a fixed fee, paid whenever the estimate falls below the actual demand) and of overprovisioning (growing as additional resources are erroneously reserved, whenever the estimate exceeds the actual demand), so that our framework can perform MTD by trying to balance the two. We configure α = 1 and ε = 0.01, as suggested in [61].
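The asymmetric cost structure can be sketched as follows. This is a rough piecewise NumPy illustration of the idea in [61], not the exact loss (15)-(16) used for training (which is a smoothed, differentiable version); the scaling of the SLA fee by the peak demand follows the accounting described in Sec. 7.1 below and the variable names are ours.

```python
import numpy as np

def slice_cost(d_hat, d, alpha=1.0):
    """Cost of allocating capacity d_hat against true per-slice demand d:
    a fixed fee per interval with an SLA violation (estimate below demand),
    plus the total capacity over-provisioned (estimate above demand)."""
    err = d_hat - d
    sla_fee = alpha * d.max()                    # fee, scaled by peak demand
    violations = int(np.sum(err < 0))            # intervals breaching the SLA
    overprov = float(np.sum(np.where(err > 0.0, err, 0.0)))
    return sla_fee * violations + overprov
```

A differentiable surrogate of this piecewise function (with the small constant ε smoothing the transition at zero error) is what the network is actually trained against.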

We test our framework with this configuration in three case studies: (i) MTD at MEC facilities on traffic aggregated every 30 minutes; (ii) MTD at C-RAN datacenters on traffic aggregated every 30 minutes; and (iii) MTD at core datacenters on traffic aggregated every hour. Exclusively in the last scenario, we choose the LSTM as the neural network architecture, according to the results in Sec. 6.

The costs incurred by the operator in the datacenter scenarios (i)-(iii) above are listed in Tab. 3. The table expresses costs in MB/s, which can then be translated into actual monetary values based on the price of the technology implementing such capacity at each network level. For overprovisioning, the cost maps to the additional MB/s of allocated capacity beyond what is strictly required; for SLA violations, each infringement has the same cost as allocating additional capacity to cover α times the peak demand, as per (15). In both cases, results are reported as the total over all MEC (respectively, C-RAN and network core) nodes in the target region.

At C-RAN and core datacenters, our framework carries percentage costs in the range from 8% to 58%, computed with respect to the true demand. These are in fact comparable to the equivalent costs for capacity allocation to network slices at the same network levels, incurred when the operator has perfect knowledge of the traffic demand associated with each mobile service. Indeed, using a state-of-the-art one-step predictor in these conditions yields costs of up to 30% for SLA violations and up to 18% for overprovisioning [61]. When pushing MTD to MEC facilities, performance degrades appreciably: strong fluctuations in the traffic make an MTD-based estimation of the per-service demand less suitable for capacity allocation.

7.2 Radio access resource management

When MTD is performed at the antenna level, a relevant metric is the extra subcarriers/spectrum resources required to support the unnecessary capacity overprovisioned at each antenna for every slice [62]. We can also quantify those resources in terms of CPU time, based on experimental models obtained with open LTE stacks [63]. Tab. 4 reports the results under such models. When MTD is performed at the antenna level, just 6 Mbps of additional throughput are needed per antenna, which yields a 3 MHz spectrum cost and requires 7.5% additional CPU time with respect to the case where perfect knowledge of the service traffic is available.

Table 4: Mean additional costs of antenna-level MTD.

These results, jointly with those for network datacenter use cases above, let us conclude that MTD can be a viable low-cost approach to service-level demand estimation in practical NSaaS management cases, where it enables effective network resource allocations.

8 Related Work

Relevant to our study are works on (i) mobile traffic analysis, and (ii) machine learning solutions for time series decomposition.

Mobile traffic analytics are becoming increasingly vital to mobile operators, and a number of directions are being explored. For instance, Shafiq et al. carried out seminal work on the geospatial correlations of traffic volume and application usage in cellular networks [64, 65]. Wang et al. proposed models that combine location information, the time dimension, and the traffic frequency spectrum, to extract traffic patterns in urban settings [66]. Furno et al. investigated traffic signatures to classify mobile demands across 10 different cities [25]. Marquez et al. revealed strong heterogeneity in the demand for mobile services, by employing correlation and clustering [11]. Traffic demand in narrowly localized regions was inferred from coarse aggregates using a neural network in [27]. However, to the best of our knowledge, the problem of mobile network traffic decomposition (MTD) that we tackle in this work has not been addressed to date.

Time series decomposition is traditionally posed as a single-channel blind source separation problem and solved using Independent Component Analysis (ICA) [67]. However, this approach only works on short sequences. Machine learning tools implementing additive factorial hidden Markov models (AFHMMs) circumvent this problem in the context of energy consumption disaggregation [21]. More recently, CNNs were proposed as alternatives that perform sequence-to-point learning using a sliding window approach on very long time series [22]. Both these approaches are limited to single time series decomposition and, unlike the 3D-DefCNN that we propose for use with our framework, they do not exploit spatiotemporal correlations in the input data.

9 Conclusions

We introduced a dedicated framework for aggregate Mobile Traffic Decomposition (MTD) into service-level demands, intended to assist resource allocation to network slices in NSaaS environments. The framework feeds suitably transformed traffic data to a flexible deep learning model, whose architecture can be adapted to the NSaaS management location or timescale. Performance evaluations with measurement data demonstrate that our framework provides accurate traffic inference in real time, and we show that a resource allocation based on decomposition yields affordable costs for the operator. As a result, MTD via our framework provides a means to complement and limit the need for extensive deep packet inspection (DPI) on traffic collected at different levels of the network. As such, our approach has the potential to replace computationally intensive traffic analytics, which is essential to the agile provisioning of resources in 5G mobile networks.

Acknowledgments

The authors would like to thank the HPC@POLITO academic computing initiative (http://hpc. polito.it) for providing the computational resources that supported this research. Paul Patras acknowledges the support received from Cisco Systems, Inc. through the University Research Program Fund, gift no. 2019-197006. This work was supported by the ANR CANCAN project (ANR-18-CE25-0011). The authors are grateful for the reviewers' constructive feedback, and for the shepherd's guidance during the revision process.