
Vulnerability forecasting models help us predict the number of vulnerabilities that may occur in the future for a given Operating System (OS). The few existing models that quantify future vulnerabilities do so without considering the trend, level, seasonality, and non-linear components of the vulnerability data. Unlike these traditional models, we propose a vulnerability analytic prediction model based on linear and non-linear approaches via time series analysis. We have developed models based on Autoregressive Integrated Moving Average (ARIMA), Artificial Neural Network (ANN), and Support Vector Machine (SVM) settings. The model that provides the minimum error rate is selected for predicting future vulnerabilities. Utilizing the time series approach, this study has developed a predictive analytic model for three popular desktop Operating Systems, namely, Windows 7, Mac OS X, and the Linux kernel, using their vulnerabilities reported in the National Vulnerability Database (NVD). Based on these reported vulnerabilities, we predict their future behavior so that OS companies can make strategic and operational decisions such as secure deployment of the OS, backup provisioning, disaster recovery, diversity planning, and maintenance scheduling. Similarly, the model helps in assessing current security risks, estimating the resources needed for handling potential security breaches, and foreseeing future releases of security patches. The proposed non-linear analytic models produce very good prediction results in comparison to the linear time series models.

A computer system is a collection of hardware and software components working together as a unified whole to perform a well-defined objective. One of the core software components of the computer system is the Operating System (OS). An OS is a resource manager, a complex interactive software system that enables higher-level application software to communicate with the hardware and memory. Vulnerabilities always exist in such software and cause tremendous security risks to software companies, developers, and individual users. Once an attacker compromises an Operating System via any vulnerability, the whole computer system is logically under the control of the hacker. If the computer system itself is controlled by unauthorized people, very significant consequences follow, including tremendous financial losses, among other serious damages.

It is well known that the overall rate of software vulnerabilities is increasing extensively [

Some recent analytical studies and modeling of general vulnerabilities can be found in [

Microsoft and Apple are the two giant companies developing proprietary desktop Operating Systems. Similarly, the Linux community develops a non-proprietary desktop Operating System referred to as the Linux kernel. According to Netmarketshare, up to July 2016, [

The schematic network of Desktop Operating Systems, given by

In the present research our objective is to develop a high-quality analytical forecasting model utilizing both linear and non-linear methods to predict the number of vulnerabilities of a given operating system. In addition, we perform statistical evaluations of other models that perform the same prediction task. The selected model provides the overall trend and behavior of the OS ahead of time, so OS companies can make strategic and operational decisions such as secure deployment of the OS, backup provisioning, disaster recovery, diversity planning, and maintenance scheduling. Similarly, all stakeholders of the OS can assess the current security risk, estimate the resources needed for handling potential security breaches, and foresee the future releases of security patches. Finally, the predictive results can be used for competitive analysis of the product, which significantly helps in developing marketing strategies for the respective OS.

Our research paper is organized as follows: Section 2 presents the literature review. Section 3 describes the datasets and the major methods employed in our study. In Section 4, the core analysis and the predictive power of the developed models are discussed. Finally, the usefulness of our analytical model is presented along with our future research goals.

Over the past years, scientists and researchers have devoted tremendous amounts of time and effort to developing vulnerability forecasting models that predict the future vulnerabilities of an OS based on its historical reported data. These proposed models can be characterized into two categories.

1) Code Characteristics Based Models: These models rely on finding the relationship between attributes of the code and its corresponding vulnerabilities. Rahimi and Zargham, [

2) Statistical Density Based Models: In this category, vulnerability forecasting models are based on the historical data of the Operating System. To fulfill this objective, various kinds of models have been developed, mainly the Alhazmi-Malaiya Logistic (AML) model with different versions, [

For code characteristics based models, we need the source code of the given software to develop statistical models. In reality, the source code of a commercial OS is not available to the public. Moreover, new vulnerabilities come into existence every day and must be handled, so the software is continuously updated, which implies that the source code changes over the life cycle of the software. A company always updates its software to fulfill the demands of current and potential customers, and hence the source code must be updated regularly; the process of producing source code is thus dynamic in nature. A prominent question then arises: "how can we forecast future vulnerabilities by utilizing static source code?", knowing very well that no OS has zero vulnerabilities and that this will continue in the future. On the other hand, the statistical density based models that have been developed rest on a series of underlying assumptions and criteria that may or may not be applicable. For instance, the Rescorla Linear (RL) model attempts to fit vulnerability finding rates linearly with time, but real situations exhibit non-linear behavior. Because of the limitations in both categories, we identify an alternative approach to forecast the future vulnerabilities of an OS using time series analysis. Our model considers trend, level, and seasonality components if they exist. Similarly, to analyze the non-linear behavior of the number of vulnerabilities, we implemented ANN and SVM methodology.

We have directly extracted the vulnerability data from the National Vulnerability Database (NVD). It is the U.S. government's repository that integrates publicly available vulnerability resources and provides common references to industry resources. NVD is a product of the National Institute of Standards and Technology (NIST), Computer Security Division, and is sponsored by the Department of Homeland Security's National Cyber Security Division. It contains reported vulnerabilities identified by their Common Vulnerabilities and Exposures (CVE) identifier. The total number of vulnerabilities per month is the fundamental quantitative input for our analysis and modeling. The schematic diagram in

For each Operating System we collected the vulnerabilities from the earliest available data on NVD through December 2015 as training data, while the whole of the 2016 data is used as testing data to validate our model. We summed the total vulnerabilities over a monthly period. Linear and non-linear time series methods are implemented to select the best model with minimum forecasting error for each OS.
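As a sketch of this aggregation step (the records and column names below are hypothetical, not the actual NVD export format), the monthly totals can be computed with pandas:

```python
import pandas as pd

# Hypothetical CVE records: one row per reported vulnerability.
records = pd.DataFrame({
    "cve_id": ["CVE-2015-0001", "CVE-2015-0002", "CVE-2015-0003"],
    "published": pd.to_datetime(["2015-01-10", "2015-01-25", "2015-02-03"]),
})

# Sum the total vulnerabilities over a monthly period (month-start bins).
monthly = records.set_index("published").resample("MS").size()
```

Training rows would then be the months through December 2015, with the 2016 months held out for testing.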

The following

Our analysis begins by investigating the total number of vulnerabilities accumulated by month for three OS,

Inspecting

Operating System | Collection Period | Total Vulnerabilities | Monthly Averages
---|---|---|---
Mac OS X | Jan. 2002-Dec. 2015 | 1441 | 102.93
Windows 7 | Jan. 2009-Dec. 2015 | 508 | 72.57
Linux Kernel | Jan. 2001-Dec. 2015 | 1292 | 86.13
Total | Jan. 2001-Dec. 2015 | 3241 | 261.68

than in previous years. One of the prominent reasons is the rapid market share gains of Mac OS X, which lead to a growing attack surface for sensitive data. Several malicious malware programs were introduced in 2015, for instance, XcodeGhost, which inserts malicious components into applications made with Xcode [

To incorporate the sharp random fluctuations of the number of vulnerabilities in each year, we initially believe that non-linear time series methods are suitable for building the analytical forecasting model. If the IT manager had a good forecast of a large number of vulnerabilities, the subject of our study, he/she could take appropriate action to address this critical issue.

The overall trend of the number of vulnerabilities of Linux Kernel OS is demonstrated by

double the number of vulnerabilities of the previous year. Years 2014 and 2015 were difficult for Linux OS from a security perspective: for example, "Heartbleed" is the severe vulnerability detected in OpenSSL that left a large number of cryptographic keys and private data from important sites and services on the Internet open to hackers. Similarly, Shellshock is a vulnerability in the Linux OS command-line shell, also called Bash or GNU Bourne-Again Shell, that left the door open for a hacker to launch malicious attacks.

All three graphs mentioned above, Figures 4-6, provide a pictorial comparison of the number of vulnerabilities of the three OS, Mac OS X, Windows 7, and the Linux kernel, over their respective calendar periods on a monthly basis. No obvious seasonality component is visible, but random fluctuations have a significant influence in each case. Each OS shows some sort of increasing or decreasing trend for a specific period of time, and then sudden spikes appear and change the behavior of the signal. To incorporate all of these facts, we have employed linear and non-linear techniques to build the best analytic forecasting model. The following section provides a brief explanation of the techniques employed in this study.

Autoregressive Integrated Moving Average (ARIMA) models are commonly used linear models for univariate time series analysis. Constructing an ARIMA model to forecast the vulnerabilities requires three steps. Before the first step, it is necessary to check whether the vulnerability data is stationary; this implies that the number of vulnerabilities associated with each operating system shows no trend over the monthly observations. We have implemented the Dickey-Fuller and Phillips-Perron [

y_t = c + (α_1 y_{t−1} + ⋯ + α_p y_{t−p}) + (β_1 ε_{t−1} + ⋯ + β_q ε_{t−q}) + ε_t (1)

where,

y_t = the differenced series

c = a constant

α, β = coefficients or weights

p = order of the AR term

q = order of the MA term

ε_t = residual (error) at time t

The second step in constructing the ARIMA model is to identify the number of parameters to include, which is a function of the order of the model, and to obtain estimates of the parameters that drive the model. We implemented both a graphical and a statistical approach to find the parameters used to forecast the vulnerabilities. For the graphical method, the autocorrelation function (ACF) and partial autocorrelation function (PACF) are used. Estimation of the required parameters, on the other hand, requires a complicated iterative procedure using maximum likelihood or non-linear least squares estimation methods. The final step of ARIMA modeling is diagnostic checking and forecasting the vulnerabilities of the OS. The complete model fitting process is governed by the principle of parsimony, where the best possible model is the simplest one that accurately forecasts the vulnerabilities of a given OS.

Utilizing the ARIMA model building procedure, namely model formulation, model estimation, and model checking or verification, we developed three models for Mac OS X, Windows 7, and the Linux kernel, as shown by Equations (2), (3), and (4), respectively:

Mac OS X (ARIMA(1,1,3)):

y_t = 0.0203 − 0.8190 y_{t−1} − 0.3626 ε_{t−1} − 0.8124 ε_{t−2} + 0.4432 ε_{t−3} + ε_t (2)

Windows 7 (ARIMA(2,1,1)):

y_t = 0.0197 − 0.1956 y_{t−1} − 0.3350 y_{t−2} − 0.8533 ε_{t−1} + ε_t (3)

Linux Kernel (ARIMA(2,0,3)):

y_t = 1.3367 + 0.0217 y_{t−1} + 0.7517 y_{t−2} − 0.0648 ε_{t−1} − 0.6713 ε_{t−2} + 0.3317 ε_{t−3} + ε_t (4)

Artificial Neural Networks (ANN) are useful and popular methods for forecasting with time series data. A wide variety of applications can be found in market prediction, meteorological forecasting, and network traffic forecasting [

y_t = f( w_0^(2) + Σ_{k=1}^{H} w_k^(2) g( w_{0k}^(1) + Σ_{l=1}^{p} w_{lk}^(1) y_{t−l} ) ), (5)

where y_t is the total number of vulnerabilities reported in month t, p is the number of lags (number of vulnerabilities reported in the past p months), H is the number of hidden nodes, and g and f are the activation functions associated with the hidden and output nodes. To obtain good generalization with the ANN model, the network must be configured carefully. In our analysis, we used different numbers of lags and selected the model with the minimum Mean Absolute Error. In addition, we used time series cross-validation (forecast evaluation with a rolling origin) to identify the optimal number of hidden nodes, which reflects on the quality of the forecast for a given OS.

One of the challenges we face when using an ANN for time series prediction is identifying the number of inputs, which is not fixed. We used a procedure to identify the best possible number of lags.
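A sketch of Equation (5) as code, with one hidden layer of H logistic nodes fed by p lagged values; the lag count, node count, and simulated series below are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged(y, p):
    """Build rows of p consecutive lags; the target is the next value."""
    X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
    return X, y[p:]

rng = np.random.default_rng(7)
y = 20 + np.cumsum(rng.normal(0.1, 1.0, 200))  # simulated monthly counts

p, H = 4, 8  # illustrative numbers of lags and hidden nodes
X, target = make_lagged(y, p)

ann = MLPRegressor(hidden_layer_sizes=(H,), activation="logistic",
                   max_iter=5000, random_state=0)
ann.fit(X[:-12], target[:-12])  # hold out the last 12 months
pred = ann.predict(X[-12:])
```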

Traditionally, SVMs are used for classification in pattern recognition applications. These learning algorithms have also been applied to general regression analysis: the estimation of a function by fitting a curve to a set of data points. The application of SVMs to general regression is called Support Vector Regression (SVR) and is vital for many time series prediction applications. SVMs used for time series prediction span many practical application areas, from financial market prediction to electric utility load forecasting to medical and other scientific fields. One advantage of SVMs is that determining the model parameters corresponds to a convex optimization problem, so they can be implemented easily. In using Support Vector (SV) regression, our goal is to find a function f(x) that has at most ϵ deviation from the actually obtained targets y_i for all the training data, and that does not accept any deviation larger than that. Anything beyond the specified ϵ is penalized in proportion to C, the regularization parameter. This can be explained with a linear function of the form

f(x) = w^T φ(x) + b (6)

where our goal is to minimize

(1/2) w^T w + C Σ_{i=1}^{l} (ε_i + ε_i^*), (7)

with respect to the constraints

y_i − f(x_i) ≤ ϵ + ε_i, f(x_i) − y_i ≤ ϵ + ε_i^*, and ε_i, ε_i^* ≥ 0 (8)

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ϵ are tolerated. The support vector machine can be generalized to deal with a non-linear function f(x) by solving the dual problem

min_{α, α^*} (1/2)(α − α^*)^T Q(α − α^*) + ϵ Σ_{i=1}^{l} (α_i + α_i^*) − Σ_{i=1}^{l} y_i (α_i − α_i^*) (9)

such that 0 ≤ α_i, α_i^* ≤ C and Σ_{i=1}^{l} (α_i − α_i^*) = 0, where α_i, α_i^* are the Lagrange multipliers, Q is an l × l positive semidefinite matrix with Q_ij ≡ K(x_i, x_j), and K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel. However, in SVR we have no control over how many data vectors from the dataset become support vectors, and the correct choice of kernel parameters is crucial for obtaining desirable results [
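In practice the dual problem (9) is solved by a library; a sketch with scikit-learn's SVR and a radial basis function kernel follows, where the C, epsilon, and gamma values are illustrative rather than the paper's tuned settings:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (150, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 150)

# epsilon sets the insensitive tube; C penalizes deviations beyond it.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)

# Only points on or outside the epsilon-tube become support vectors.
n_sv = len(svr.support_)
```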

We began our analysis by dividing the vulnerability dataset into two groups: training and testing. The testing data set consists of the vulnerabilities reported in 2016. We then normalized the data using the min-max normalization method. Our analysis with ANN and SV regression assumes that the number of future vulnerabilities depends on the vulnerabilities identified in the present and past months (lags). The number of significant lags in the partial autocorrelation function was used initially to determine the optimal number of lags. We then carried out further analysis by varying the number of lags from 2 to 10.
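The split-and-normalize step can be sketched as follows; the counts are invented for illustration, and the scaler is fitted on the training months only and then applied to the held-out 2016 months:

```python
import numpy as np

def min_max_scale(train, test):
    """Min-max normalization with bounds taken from the training data only."""
    lo, hi = train.min(), train.max()
    return (train - lo) / (hi - lo), (test - lo) / (hi - lo)

# Illustrative monthly counts: 12 training months, then 12 testing months.
counts = np.array([12, 30, 8, 25, 40, 18, 22, 35, 28, 15, 33, 20,
                   27, 31, 19, 24, 38, 29, 21, 26, 34, 17, 23, 36], float)
train, test = counts[:12], counts[12:]
train_n, test_n = min_max_scale(train, test)
```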

The radial basis function kernel is used to develop the SV regression models, fine-tuning two parameters: gamma and the regularization parameter C. In developing the ANN model we used a 10-fold cross-validation method for time series. With this technique we incrementally extended the training data, shifting the training window forward one step at a time, and repeated the process for different numbers of hidden nodes. The optimal analytical model is selected based on the average mean absolute error (MAE). Finally, the selected analytical model is used to make predictions on the testing data set.
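The rolling-origin selection of hidden nodes can be sketched with scikit-learn's TimeSeriesSplit, which grows the training window fold by fold; the candidate node counts and simulated series are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(5)
y = 20 + np.cumsum(rng.normal(0.1, 1.0, 150))  # simulated monthly counts
p = 3
X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
t = y[p:]

best = None
for H in (2, 4, 8):  # candidate hidden-node counts
    maes = []
    for tr, te in TimeSeriesSplit(n_splits=5).split(X):
        m = MLPRegressor(hidden_layer_sizes=(H,), max_iter=3000,
                         random_state=0).fit(X[tr], t[tr])
        maes.append(mean_absolute_error(t[te], m.predict(X[te])))
    avg = float(np.mean(maes))
    if best is None or avg < best[1]:
        best = (H, avg)  # keep the node count with the lowest average MAE
```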

Our statistical analysis follows the process that we have introduced in Section 3, where we described the overall time series trend of each OS. We need further investigation on each signal to see if any trend, cycles, and seasonality exists. Usually time series data consist of a specific trend, cycles, and seasonality. To identify the best analytical forecasting model, we will first proceed to identify the time series pattern in the data, and then select an appropriate method that will capture the patterns effectively.

In

In

We plotted the vulnerabilities against the individual months in which the data are observed. Similarly, plots were developed in which data from each month are overlapped. These graphs allow us to conclude that there is no specific seasonal or cyclical pattern on a monthly or yearly basis. We found a large jump in vulnerabilities in specific years; the remaining years exhibit fluctuations in the number of vulnerabilities but no obvious seasonal or cyclic patterns. Inspecting the signal of the number of vulnerabilities for each OS, we found that trend, level, and random fluctuations are the major ingredients for building the forecasting model. Incorporating these facts, we utilized ANN, SVM, and ARIMA models to forecast the future level of vulnerabilities for the three OS.

Predictive Capability of Models

One of the most important criteria for evaluating forecasting accuracy is the error (residuals) generated on the testing data sets. The optimal model is selected based on how accurately it forecasts our testing data. We computed the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Symmetric Mean Absolute Percentage Error (SMAPE) for each model to assist in selecting the best model. For each error estimate, lower values are preferred.

Prediction accuracy is one of the most important criteria for evaluating model performance and reliability. In addition to RMSE and MAE, we utilized an error analysis based on the Symmetric Mean Absolute Percentage Error (SMAPE) rather than the Mean Absolute Percentage Error (MAPE) to confirm the validity of our model. Although SMAPE is based on MAPE, it accommodates data containing zero and non-zero values that could otherwise skew the error rate. It has a lower bound of 0% and an upper bound of 200%, which reduces the impact of zero and non-zero values in our data sets. The error is computed based on the analytical form defined by the equation below:

SMAPE = (2/N) Σ_{i=1}^{N} |P_i − A_i| / (P_i + A_i), (10)

where N is the total number of prediction intervals, P_i is the predicted number of vulnerabilities, and A_i is the actual number of vulnerabilities. Once we employed the ANN, SVM, and ARIMA models on our testing data set, the following optimal models were selected based on our error measurement criteria in
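Equation (10) translates directly into code; the convention below, that a month with zero predicted and zero actual vulnerabilities contributes zero error, is our assumption for handling an otherwise undefined term:

```python
import numpy as np

def smape(pred, actual):
    """Symmetric MAPE per Equation (10); ranges from 0 (0%) to 2 (200%)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    denom = pred + actual
    terms = np.zeros_like(denom)
    mask = denom != 0  # assumption: a 0/0 month contributes zero error
    terms[mask] = np.abs(pred[mask] - actual[mask]) / denom[mask]
    return 2.0 * terms.mean()
```

For example, a single month with P = 1 and A = 3 gives 2·|1−3|/(1+3) = 1.0, i.e. 100%.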

Our ANN model evaluation results were quite good despite the fact that we did not have enough data to improve the training of our model. We believe that as more information on the subject matter becomes available, the ANN model will be easier to implement and will achieve higher accuracy in predicting the number of vulnerabilities of the present OS in the marketplace. For Windows 7 and the Linux kernel, the selected analytical model is the SVM driven by the final Equation (9). With reference to the

From

Criteria | ARIMA (Mac OS X) | ANN (Mac OS X) | SVM (Mac OS X) | ARIMA (Windows 7) | ANN (Windows 7) | SVM (Windows 7) | ARIMA (Linux Kernel) | ANN (Linux Kernel) | SVM (Linux Kernel)
---|---|---|---|---|---|---|---|---|---
RMSE | 19.6456 | 28.5637 | 24.6749 | 21.5971 | 9.5533 | 3.5819 | 22.9 | 4.08 | 3.99
MAE | 16.1739 | 22.0606 | 19.9257 | 21.2726 | 8.9114 | 3.1504 | 24.2 | 3.41 | 3.28
SMAPE | 0.3125 | 1.2257 | 0.9509 | 0.9926 | 1.2842 | 0.1245 | 1.57 | 0.73 | 0.141

OS | Best Model | SMAPE
---|---|---
Mac OS X | ARIMA | 0.3125
Windows 7 | SVM | 0.1245
Linux Kernel | SVM | 0.141

Linux kernel, but prediction accuracy varies. In terms of forecasting, the Linux kernel has a convincing SMAPE of 14.1%, while Mac OS X is reasonably accurate with a SMAPE of 31.25%. One possible reason for the higher percentage error may be components missing from our analysis, such as the OS development process, patch cycles, differences in security enforcement criteria, and the market share and popularity of the OS.

After the selection of the best model with minimum error rate, our study revealed the fact that the developed model provides a good fit for the OS datasets and can be used to forecast the future vulnerabilities. Fitting time series models to the vulnerability database is demonstrated via the graph as shown below.

All of the above plots 9, 10, and 11 show a good fit for each OS but with different degrees of prediction accuracy. From a careful reading of the fitted plots, we can conclude that the best fitted model may not produce the best forecasting accuracy, and vice versa. In the case of Mac OS X, the forecasted vulnerabilities do not match the quality of the fit to the data. Unlike Mac OS X, Windows 7 has quite a good fit, and its forecasted vulnerabilities are far better than those of Mac OS X. We eventually used our models to forecast the future vulnerabilities of these OS; the recommended choice for predicting monthly vulnerabilities is summarized by

Initially, we split our data sets into training and testing sets. The collection period of the training data set for each OS is given in

Our study revealed the fact that seasonality and trends are not the major

Forecasted Vulnerabilities

OS | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. | Total
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Mac OS X | 28 | 24 | 32 | 26 | 32 | 34 | 31 | 26 | 27 | 36 | 38 | 37 | 371
Windows 7 | 11 | 10 | 9 | 11 | 9 | 11 | 11 | 10 | 10 | 11 | 11 | 13 | 127
Linux Kernel | 7 | 16 | 7 | 18 | 34 | 8 | 7 | 24 | 7 | 7 | 12 | 18 | 165

OS | Actual 2016 Total Vulnerabilities | Forecasted 2016 Total Vulnerabilities
---|---|---
Mac OS X | 396 | 371
Windows 7 | 134 | 127
Linux Kernel | 230 | 165

components of the forecasting models; rather, the level of the time series is the only significant component for building the model. This suggests that it is difficult to predict vulnerabilities based on monthly seasonal patterns or trends. Further investigation is needed into whether weekly, quarterly, or annual patterns might produce notable trend or seasonal components, but such data is not publicly available to improve the quality of the model.

The ANN model did not perform well in forecasting the vulnerabilities because we did not have enough data to improve the training process so as to improve its forecasting accuracy. With more vulnerability data we believe that the ANN model will be very competitive in forecasting vulnerabilities of the OS.

We have developed effective linear and non-linear analytic models to forecast future vulnerabilities by utilizing the vulnerability datasets of three major OS, namely Windows 7, the Linux kernel, and Mac OS X. We found no influential trend or seasonality components in the time series data. The proposed forecasting models reveal that non-linear time series models predict quite well, with a minimum level of error. The developed models can be used by developers, the user community, and individual organizations to predict the vulnerability level of their OS. Developers can examine OS readiness by predicting the future vulnerability trend. Based on the projected vulnerabilities, they can allocate security maintenance resources to detect upcoming vulnerabilities and implement a proper software security patch plan. Users can obtain useful information to compare different OS in terms of the risk associated with their vulnerabilities, and they can also assess the risk before patches are applied. Similarly, every organization has its own customized security policies that require the allocation of time and resources; the predictive vulnerability models we have developed can be used to quantitatively guide such policies.

Finally, the developed forecasting models of the three OS that dominate the global market can be used to predict their future vulnerabilities. The predicted vulnerabilities can be used to identify the risks associated with the forecast for each of the three OS. IT managers can incorporate the forecasted vulnerabilities into their operating strategies and contingency plans. Based on their predictions, each of the three companies can make one of the following decisions: live with the risk, make minor research and development design changes, or make major modifications. The predictive results can also be used for the competitive analysis of the three OS companies, which is essential to their marketing strategies.

Pokhrel, N.R., Rodrigo, H. and Tsokos, C.P. (2017) Cybersecurity: Time Series Predictive Modeling of Vulnerabilities of Desktop Operating System Using Linear and Non-Linear Approach. Journal of Information Security, 8, 362-382. https://doi.org/10.4236/jis.2017.84023