Title:
DETECTING VULNERABLE SOFTWARE SYSTEMS
Document Type and Number:
WIPO Patent Application WO/2020/114920
Kind Code:
A1
Abstract:
A computer implemented method of detecting an increased vulnerability of a software system including a plurality of software components, the method comprising: generating a vector representation of each software component derived from a neural network trained using training data defined from known vulnerabilities of the software components in the software system; aggregating the vector representations for the software component to an aggregate vector representation for a particular time; repeating the generating and aggregating steps for a plurality of points in time to generate multiple generations of aggregate vector representations; comparing the multiple generations of aggregate vector representations to detect a change in an aggregate vector representation exceeding a maximum threshold degree of change as an indication of an increased vulnerability of the software system.

Inventors:
HERCOCK ROBERT (GB)
GIACONI GIULIO (GB)
Application Number:
PCT/EP2019/083203
Publication Date:
June 11, 2020
Filing Date:
December 01, 2019
Assignee:
BRITISH TELECOMM (GB)
International Classes:
G06F21/56; G06F21/57; G06N3/04; G06N3/08; H04L29/06
Foreign References:
US20180004948A1 2018-01-04
Other References:
ZHEN LI ET AL: "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection", 5 January 2018 (2018-01-05), XP055556046, Retrieved from the Internet [retrieved on 20190213], DOI: 10.14722/ndss.2018.23158
RING ET AL.: "IP2Vec: Learning Similarities between IP Addresses", IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, 2017
MIKOLOV ET AL.: "Efficient Estimation of Word Representations in Vector Space", ARXIV, CORR (COMPUTING RESEARCH REPOSITORY), 2013
D. E. RUMELHART, G. E. HINTON, R. J. WILLIAMS: "Institute for Cognitive Science Report 8506", September 1985, UNIVERSITY OF CALIFORNIA, article "Learning internal representations by backpropagating errors"
R. ZHANG, J. GUO, Y. LAN, J. XU, X. CHENG: "Aggregating Neural Word Embeddings for Document Representation", 2018, SPRINGER
Attorney, Agent or Firm:
BRITISH TELECOMMUNICATIONS PUBLIC LIMITED COMPANY, INTELLECTUAL PROPERTY DEPARTMENT (GB)
Claims:
CLAIMS

1. A computer implemented method of detecting increased vulnerability of a software system, the method comprising:

accessing data records each corresponding to a known software vulnerability at a particular time, each data record including an identification of software affected by the vulnerability;

generating, for each of at least a subset of the data records, one or more training data items for a neural network, each training data item associating a vulnerability of the data record with affected software identified by the data record, the neural network having input units corresponding to items in a corpus of all software and output units corresponding to items in a corpus of all vulnerabilities;

training the neural network using the training data so as to define a vector representation for each software in the corpus of all software based on weights in the neural network for an input unit corresponding to the software;

aggregating, for a subset of software in the corpus corresponding to software in the software system, vector representations for each software in the subset to an aggregate vector representation for the software system for the particular time;

repeating the accessing, generating, augmenting, training and aggregating steps at subsequent times to generate multiple generations of aggregate vector representations for the software system, each generation corresponding to data records accessed at a different time; and

comparing the multiple generations of aggregate vector representations for the software system to identify a change in one or more aggregate vector representation as an indication of an increased vulnerability of the software system.

2. The method of claim 1 further comprising, responsive to the indication of an increased vulnerability of the software system, implementing protective measures for the software system including one or more of: deploying and/or configuring a firewall to protect the software system; deploying and/or configuring an anti-malware facility to protect the software system; deploying and/or configuring an antivirus facility to protect the software system; adjusting a sensitivity and/or level of monitoring of a security facility associated with the software system; terminating execution of the software system; forcing an update to the software system; selectively replacing one or more software components in the software system; and selectively disconnecting one or more computer systems executing the software system from a computer network.

3. The method of any preceding claim wherein a new neural network is trained at each repetition of the training step.

4. The method of any preceding claim wherein the neural network has a single layer of hidden units, the number of which is smaller than each of: a number of input units of the neural network; and a number of output units of the neural network.

5. The method of any preceding claim wherein the identification of software in each data record includes an identification of at least one software component associated with a class of software components.

6. The method of claim 5 wherein training data items are generated for a data record to associate a vulnerability of the data record with one or more of each of: a class of software component identified by the data record; and a software component identified by the record.

7. The method of claim 5 wherein the neural network is a first neural network, the input units of the first neural network corresponding to items in a corpus of all classes of software component, and wherein the first neural network is trained using training data items associating vulnerabilities with classes of software component, and the method further comprises:

training a second neural network, the second neural network having input units corresponding to items in a corpus of all software components and output units corresponding to items in a corpus of all vulnerabilities, the second neural network being trained using training data items associating vulnerabilities with software components so as to define a vector representation for each software component in a corpus of all software components based on weights in the second neural network for an input unit corresponding to the software component.

8. The method of claim 7 wherein the second neural network has a single layer of hidden units in number smaller than a number of input units and smaller than a number of output units of the second neural network.

9. The method of any preceding claim wherein comparing multiple generations of aggregate vector representations includes performing a vector similarity function on the aggregate vector representations to determine a degree of similarity.

10. The method of claim 9 wherein the identification of a change in an aggregate vector representation indicative of an increased vulnerability of the software system includes detecting a vector similarity below a predetermined threshold degree of similarity.

11. The method of claim 7 wherein comparing multiple generations of aggregate vector representations includes, for each of one or more software components in the corpus of all software components and/or for each of one or more classes of software component in the corpus of all classes of software component, training a recurrent neural network based on the multiple generations of aggregate vector representations such that the trained recurrent neural network is suitable for classifying a subsequent aggregate vector representation as indicative of increased vulnerability in relation to multiple generations of aggregate vector representations.

12. A computer implemented method of detecting an increased vulnerability of a software system including a plurality of software components, the method comprising:

generating a vector representation of each software component derived from a neural network trained using training data defined from known vulnerabilities of the software components in the software system;

aggregating the vector representations for the software component to an aggregate vector representation for a particular time;

repeating the generating and aggregating steps for a plurality of points in time to generate multiple generations of aggregate vector representations;

comparing the multiple generations of aggregate vector representations to detect a change in an aggregate vector representation exceeding a maximum threshold degree of change as an indication of an increased vulnerability of the software system.

13. A computer system including a processor and memory storing computer program code for performing the steps of the method of any preceding claim.

14. A computer program element comprising computer program code to, when loaded into a computer system and executed thereon, cause the computer to perform the steps of a method as claimed in any of claims 1 to 12.

Description:
Detecting Vulnerable Software Systems

The present invention relates to the detection of software systems vulnerable to attack.

Network connected computer systems, whether physical and/or virtual computer systems connected via one or more physical and/or virtual network communication mechanisms, can be susceptible to malicious attack. For example, one or more computer systems can become infected with malicious software such as botnet agents or the like, and such infected systems can instigate malicious communication with other systems such as communications intended to propagate such infections and/or communications intended to affect the operation of target computer systems (e.g. denial of service attacks, hijacking or the like). Such attacks can involve the exploitation of vulnerabilities within software systems executing on a computer system. Vulnerabilities can arise due to design flaws, inadequate or flawed programming or logic, oversight, erroneous or mis-configuration, use of software outside its design parameters, flaws or errors in the way software responds to other interfacing software or requests, and other sources and/or causes of vulnerability as will be apparent to those skilled in the art.

It is a longstanding desire to effectively detect and remediate vulnerabilities in software systems.

The present invention accordingly provides, in a first aspect, a computer implemented method of detecting increased vulnerability of a software system, the method comprising: accessing data records each corresponding to a known software vulnerability at a particular time, each data record including an identification of software affected by the vulnerability; generating, for each of at least a subset of the data records, one or more training data items for a neural network, each training data item associating a vulnerability of the data record with affected software identified by the data record, the neural network having input units corresponding to items in a corpus of all software and output units corresponding to items in a corpus of all vulnerabilities; training the neural network using the training data so as to define a vector representation for each software in the corpus of all software based on weights in the neural network for an input unit corresponding to the software; aggregating, for a subset of software in the corpus corresponding to software in the software system, vector representations for each software in the subset to an aggregate vector representation for the software system for the particular time; repeating the accessing, generating, augmenting, training and aggregating steps at subsequent times to generate multiple generations of aggregate vector representations for the software system, each generation corresponding to data records accessed at a different time; and comparing the multiple generations of aggregate vector representations for the software system to identify a change in one or more aggregate vector representation as an indication of an increased vulnerability of the software system.

Preferably, the method further comprises, responsive to the indication of an increased vulnerability of the software system, implementing protective measures for the software system including one or more of: deploying and/or configuring a firewall to protect the software system; deploying and/or configuring an anti-malware facility to protect the software system; deploying and/or configuring an antivirus facility to protect the software system; adjusting a sensitivity and/or level of monitoring of a security facility associated with the software system; terminating execution of the software system; forcing an update to the software system; selectively replacing one or more software components in the software system; and selectively disconnecting one or more computer systems executing the software system from a computer network.

Preferably, a new neural network is trained at each repetition of the training step. Preferably, the neural network has a single layer of hidden units, the number of which is smaller than each of: a number of input units of the neural network; and a number of output units of the neural network.

Preferably, the identification of software in each data record includes an identification of at least one software component associated with a class of software components. Preferably, training data items are generated for a data record to associate a vulnerability of the data record with one or more of each of: a class of software component identified by the data record; and a software component identified by the record.

Preferably, the neural network is a first neural network, the input units of the first neural network corresponding to items in a corpus of all classes of software component, and wherein the first neural network is trained using training data items associating vulnerabilities with classes of software component, and the method further comprises: training a second neural network, the second neural network having input units corresponding to items in a corpus of all software components and output units corresponding to items in a corpus of all vulnerabilities, the second neural network being trained using training data items associating vulnerabilities with software components so as to define a vector representation for each software component in a corpus of all software components based on weights in the second neural network for an input unit corresponding to the software component. Preferably, the second neural network has a single layer of hidden units in number smaller than a number of input units and smaller than a number of output units of the second neural network.

Preferably, comparing multiple generations of aggregate vector representations includes performing a vector similarity function on the aggregate vector representations to determine a degree of similarity.

Preferably, the identification of a change in an aggregate vector representation indicative of an increased vulnerability of the software system includes detecting a vector similarity below a predetermined threshold degree of similarity. Preferably, comparing multiple generations of aggregate vector representations includes, for each of one or more software components in the corpus of all software components and/or for each of one or more classes of software component in the corpus of all classes of software component, training a recurrent neural network based on the multiple generations of aggregate vector representations such that the trained recurrent neural network is suitable for classifying a subsequent aggregate vector representation as indicative of increased vulnerability in relation to multiple generations of aggregate vector representations.
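The similarity-based comparison described above can be illustrated with a minimal sketch. It assumes cosine similarity as the vector similarity function and compares each generation with the one immediately before it; the function names, the threshold value of 0.9 and the sample vectors are hypothetical and are not taken from the specification.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def flag_changes(generations, threshold=0.9):
    """Return the indices of generations whose similarity to the
    preceding generation falls below the threshold (an assumed value)."""
    flagged = []
    for i in range(1, len(generations)):
        if cosine_similarity(generations[i - 1], generations[i]) < threshold:
            flagged.append(i)
    return flagged


# Hypothetical aggregate vectors for three points in time.
generations = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
flagged = flag_changes(generations)
print(flagged)
```

With these sample vectors only the third generation is flagged, because its similarity to the second generation is well below the assumed threshold. Comparing against a fixed baseline generation instead of the previous one would be an equally plausible reading.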

The present invention accordingly provides, in a second aspect, a computer implemented method of detecting an increased vulnerability of a software system including a plurality of software components, the method comprising: generating a vector representation of each software component derived from a neural network trained using training data defined from known vulnerabilities of the software components in the software system; aggregating the vector representations for the software component to an aggregate vector representation for a particular time; repeating the generating and aggregating steps for a plurality of points in time to generate multiple generations of aggregate vector representations; comparing the multiple generations of aggregate vector representations to detect a change in an aggregate vector representation exceeding a maximum threshold degree of change as an indication of an increased vulnerability of the software system.

The present invention accordingly provides, in a third aspect, a computer system including a processor and memory storing computer program code for performing the steps of the method set out above.

The present invention accordingly provides, in a fourth aspect, a computer program element comprising computer program code to, when loaded into a computer system and executed thereon, cause the computer to perform the steps of the method set out above.

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of a computer system suitable for the operation of embodiments of the present invention;

Figure 2 is a component diagram of a computer implemented arrangement for detecting increased vulnerability of a software system in accordance with embodiments of the present invention;

Figure 3 is a depiction of an exemplary vulnerability database in accordance with embodiments of the present invention;

Figure 4 depicts a partial exemplary neural network suitable for operation in embodiments of the present invention;

Figure 5 is a component diagram of a computer implemented arrangement for detecting increased vulnerability of a software system in accordance with embodiments of the present invention;

Figure 6 is an alternative arrangement of the embedding generator of Figure 1 in accordance with embodiments of the present invention;

Figure 7 is a component diagram of a computer implemented arrangement for detecting and remediating increased vulnerability of a software system in accordance with embodiments of the present invention;

Figure 8 is a flowchart of a computer implemented method of detecting increased vulnerability of a software system in accordance with embodiments of the present invention;

Figure 9 is a flowchart of an alternative computer implemented method of detecting increased vulnerability of a software system in accordance with embodiments of the present invention; and

Figure 10 is a flowchart of a computer implemented method of detecting and remediating increased vulnerability of a software system in accordance with embodiments of the present invention.

Embodiments of the present invention address the challenge of detecting vulnerabilities in a computer software system and, in particular, monitoring vulnerabilities over time to identify changes in vulnerabilities affecting the software system. Changes in vulnerabilities can usefully trigger protective, corrective or remediative measures where remediation can include reducing an extent of vulnerability of a software system to an acceptable level. It is to be acknowledged that software systems cannot always be rendered devoid of all vulnerabilities. This state of affairs arises due to the complexity of modern software, the interfaces between software systems and the scope of potential use of software systems that can extend beyond contemplated design constraints. Nonetheless, a preference for a maximum acceptable degree of vulnerability can be defined for a software system such that, for example, certain vulnerabilities are acknowledged and tolerated while other vulnerabilities are intolerable and require mitigation or remediation. Mitigation of a vulnerability can include introduction, deployment or configuration of protective measures. For example, a software system vulnerable to a malware infection or attack can be complemented by antimalware facilities to mitigate the effect of such attacks. Remediation measures for a vulnerability can include removing a known vulnerability altogether such as by patching, updating or replacing a known flawed software component or the like.

Software systems are suites of software components arranged together to perform a function or to provide a service or application. Software components of potentially multiple different classes (types) can be brought together in a software system. For example, classes of software component can include, inter alia: operating systems; databases; middleware; transaction handling services; applications software; communications libraries; user interface libraries; and other classes of software as will be apparent to those skilled in the art.

Software components within these classes can be combined or otherwise brought together to constitute a software system. For example, a particular operating system can be combined with a particular user application and a particular database to provide a software system. Such particular software are known as software components.

Embodiments of the present invention generate representations of a software system in terms of vulnerabilities affecting the software system, such representations being generated over time for comparison to detect changes in the representations that indicate an increased vulnerability of the software system. When detected, increased vulnerability can trigger protective measures and/or remediation measures for the software system. The representations of vulnerabilities are known as embeddings in which a software system is represented as a vector representation derived based on attributes of vulnerabilities affecting the software system. Preferably, individual software components and/or classes of software component for a software system are represented in a vector representation and aggregated to provide an aggregate vector representation comparable over time.
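The aggregation of component vectors into a single vector for the software system can be sketched as follows. The text at this point does not fix an aggregation function, so an element-wise mean is assumed purely for illustration; the function name `aggregate`, the two-dimensional embedding values and the component name "Database Y" are hypothetical.

```python
def aggregate(component_vectors):
    """Element-wise mean of equal-length vectors (an assumed aggregation)."""
    if not component_vectors:
        raise ValueError("need at least one component vector")
    dim = len(component_vectors[0])
    if any(len(v) != dim for v in component_vectors):
        raise ValueError("all component vectors must have the same length")
    count = len(component_vectors)
    return [sum(v[i] for v in component_vectors) / count for i in range(dim)]


# Hypothetical embeddings for every software item in the corpus.
embeddings = {
    "OS A": [0.2, 0.8],
    "Browser X": [0.6, 0.4],
    "Database Y": [0.1, 0.3],
}

# The software system uses only a subset of the corpus.
system_components = ["OS A", "Browser X"]
system_vector = aggregate([embeddings[name] for name in system_components])
print(system_vector)
```

Other aggregations, such as a sum or a weighted mean, would fit the same interface; the only requirement for later comparison is that every generation is aggregated in the same way.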

The generation of the vector representations as embeddings is achieved by training a fully connected neural network with a single hidden layer (the hidden layer being preferably smaller than input and output layers). Each unit in the input layer corresponds to a software component or class of components. Each unit in the output layer corresponds to a vulnerability. There is an input unit for each item in a corpus of all software, such corpus including, for example, all software components and all classes of software component. Similarly, there is an output unit for each vulnerability in a corpus of all vulnerabilities, such as a corpus of vulnerabilities stored in a vulnerability database or register.
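The network shape described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes a softmax over the output units (as in the word-embedding approaches the specification builds on), uses untrained random weights, and takes the row of input-to-hidden weights for an input unit as that item's vector representation. All variable names and the hidden size of 2 are hypothetical.

```python
import math
import random

# Input units: one per item in the corpus of software (components or classes).
software = ["OS A", "OS B", "Browser X"]
# Output units: one per item in the corpus of vulnerabilities.
vulnerabilities = ["Overflow", "Execute Code", "SQL Injection", "Denial of Service"]
# Single hidden layer, smaller than both the input and output layers.
hidden_size = 2

rng = random.Random(0)
# w_in[i][h]: weight from input unit i to hidden unit h.
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in software]
# w_out[h][v]: weight from hidden unit h to output unit v.
w_out = [[rng.uniform(-0.5, 0.5) for _ in vulnerabilities] for _ in range(hidden_size)]


def forward(software_index):
    """Forward pass for a one-hot input: returns a distribution over vulnerabilities."""
    # A one-hot input simply selects one row of the input-to-hidden weights.
    hidden = w_in[software_index]
    scores = [
        sum(hidden[h] * w_out[h][v] for h in range(hidden_size))
        for v in range(len(vulnerabilities))
    ]
    # Numerically stable softmax (an assumed output activation).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


index = software.index("OS A")
probabilities = forward(index)
embedding = w_in[index]  # vector representation for "OS A"
print(probabilities, embedding)
```

Because the input is one-hot, the hidden activations equal the selected weight row, which is why those weights can double as the embedding once the network has been trained.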

The neural network is trained using training data derived from records of vulnerabilities stored in, for example, a vulnerability database. Such an approach to the generation of embeddings is described in detail in “IP2Vec: Learning Similarities between IP Addresses” (Ring et al., 2017 IEEE International Conference on Data Mining Workshops) which itself builds upon the approach described in detail in “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al., ArXiv, CoRR (Computing Research Repository), 2013). Both Ring et al. and Mikolov et al. are specifically cited here for their respective disclosures which, combined with the present specification, are to be read to provide sufficient disclosure of the present invention.

Embodiments of the present invention improve upon the approach of Ring et al. and Mikolov et al. by, inter alia, the application of such embedding generation to vulnerability data over time for comparison to detect increased vulnerability of a software system. Further, embodiments of the present invention provide improvements including, inter alia, the triggering of mitigation and/or remediation measures. Yet further, embodiments of the present invention provide improvements including, inter alia, the use of pre-processing of training data to enhance the characterisation of vulnerabilities in the generated embeddings.

Figure 1 is a block diagram of a computer system suitable for the operation of embodiments of the present invention. A central processor unit (CPU) 102 is communicatively connected to a storage 104 and an input/output (I/O) interface 106 via a data bus 108. The storage 104 can be any read/write storage device such as a random-access memory (RAM) or a non-volatile storage device. An example of a non-volatile storage device includes a disk or tape storage device. The I/O interface 106 is an interface to devices for the input or output of data, or for both input and output of data. Examples of I/O devices connectable to I/O interface 106 include a keyboard, a mouse, a display (such as a monitor) and a network connection.

Figure 2 is a component diagram of a computer implemented arrangement for detecting increased vulnerability of a software system 200 in accordance with embodiments of the present invention. The software system 200 is an arrangement, aggregation and/or deployable specification of particular software components 220, each software component belonging to a class of software component. An embedding generator 202 is provided as a hardware, software, firmware or combination component for generating embeddings 214 as vector representations of the software system 200. The embedding generator 202 receives data records from a vulnerability database 222. The vulnerability database 222 is a data store storing information about known vulnerabilities affecting software as will be described in detail below with respect to Figure 3. While indicated and described as a database it will be appreciated that the vulnerability database 222 can equally be provided as a simple repository, website, data dump, document store or other suitable data store for storing vulnerability information. Similarly, while this description refers to records stored within, accessed within and/or received from the vulnerability database 222, it will be appreciated by those skilled in the art that such data records are items, structures, reports, documents or other units of data storing vulnerability information as will be described below.

Figure 3 is a depiction of an exemplary vulnerability database in accordance with embodiments of the present invention. The vulnerability database 222 of Figure 3 includes a plurality of vulnerability records 342 each including information about a known vulnerability. Each vulnerability will be identified in the vulnerability database 222 such as by a unique identifier, a unique name or similar (not illustrated). In the exemplary record 342 the following exemplary data fields are provided. It will be appreciated that the exemplary data fields are not necessarily mandatory or exhaustive:

• A score 344 for the vulnerability 342, such as a score characterising the characteristics and severity of the vulnerability, for example a score under the Common Vulnerability Scoring System (CVSS) as defined in “Common Vulnerability Scoring System v3.0: Specification Document” (FIRST.org, available at “www.first.org/cvss/cvss-v3-guide.pdf”, accessed 2nd December 2018).

• A confidentiality impact 346 for the vulnerability 342 indicating, for example, a degree of information disclosure possible as a result of exploitation of the vulnerability 342. The confidentiality impact 346 can be provided as a discrete (enumerated) indication or a continuous score.

• An integrity impact 348 for the vulnerability 342 indicating, for example, an extent to which exploitation of the vulnerability 342 might compromise the integrity of a software system containing the vulnerability 342. The integrity impact 348 can be provided as a discrete (enumerated) indication or a continuous score.

• An availability impact 350 for the vulnerability 342 indicating, for example, an extent to which exploitation of the vulnerability 342 might compromise the availability of a software system containing the vulnerability 342, such as by way of reduced performance, access to the software service, denial of service, resource availability and the like. The availability impact 350 can be provided as a discrete (enumerated) indication or a continuous score.

• An access complexity 352 of the vulnerability 342 indicating, for example, how readily a vulnerability can be exploited such as whether special preconditions must be satisfied and the like. The access complexity 352 can be provided as a discrete (enumerated) indication or a continuous score.

• A set of vulnerability types 354 for the vulnerability 342 including indications of one or more vulnerability types 356 each corresponding to a class of vulnerability. Such classes can include, for example, inter alia: denial of service vulnerabilities; ability to execute code vulnerabilities; data access or data leak vulnerabilities; memory overflow vulnerabilities; memory corruption vulnerabilities; and other classes of vulnerability as will be apparent to those skilled in the art.

• A set of one or more software components 358 known to be affected by the vulnerability 342. For example, each software component 360 in the set 358 includes an indication of its class of software component.

• Other fields and/or attributes of the vulnerability 342, whether objectively or subjectively defined, as will be apparent to those skilled in the art.
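The fields listed above could be held in a simple record structure such as the following sketch. The class and field names, the use of strings for the enumerated impacts and the sample values are all assumptions made for illustration; the specification does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AffectedComponent:
    """A software component (360) together with its class of component."""
    name: str
    component_class: str


@dataclass
class VulnerabilityRecord:
    """One vulnerability record (342); field names are hypothetical."""
    identifier: str
    score: float                      # score 344, e.g. a CVSS base score
    confidentiality_impact: str       # 346, enumerated here; could be a float
    integrity_impact: str             # 348
    availability_impact: str          # 350
    access_complexity: str            # 352
    vulnerability_types: List[str] = field(default_factory=list)          # 354
    affected_components: List[AffectedComponent] = field(default_factory=list)  # 358

    def affected_classes(self) -> List[str]:
        """Distinct classes of the affected components, in sorted order."""
        return sorted({c.component_class for c in self.affected_components})


# A made-up record used only to exercise the structure.
record = VulnerabilityRecord(
    identifier="VULN-0001",
    score=7.5,
    confidentiality_impact="partial",
    integrity_impact="none",
    availability_impact="complete",
    access_complexity="low",
    vulnerability_types=["Denial of Service", "Overflow"],
    affected_components=[
        AffectedComponent("OS A", "Operating System"),
        AffectedComponent("OS B", "Operating System"),
        AffectedComponent("Browser X", "Browser"),
    ],
)
print(record.affected_classes())
```

Keeping the class alongside each component makes it straightforward to derive both component-level and class-level training items from the same record.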

Thus, the vulnerability database 222 constitutes a source of information regarding vulnerabilities including an indication of affected software components 358 and, preferably, classes of software components 362. While a single vulnerability database 222 is illustrated it will be appreciated that a composite of a plurality of vulnerability databases 222 such as complementary databases or other data stores can be used. An example of a vulnerability database 222 is the Common Vulnerabilities and Exposures (CVE) repository provided by the Mitre Corporation at www.cvedetails.com and cve.mitre.org (accessed on 2nd December 2018). The CVE repository is provided by way of webpages containing entries for each of a plurality of known cybersecurity vulnerabilities, each vulnerability including substantially all of the fields illustrated in the vulnerability record 342 of Figure 3. Additionally or alternatively, databases of vulnerability information can be obtained via, inter alia: exploit-db.com (provided by Offensive Security, accessed on 2nd December 2018); and the National Vulnerability Database (NVD), a U.S. government repository of standards-based vulnerability management data represented using the Security Content Automation Protocol (SCAP) (NVD is available at nvd.nist.gov, accessed on 2nd December 2018).

Returning to Figure 2, the embedding generator 202 receives a plurality of data records such as record 342 for vulnerabilities from the vulnerability database 222. The records received by the embedding generator 202 can constitute all records in the database 222 or some select subset of records in the database 222 such as records relating to vulnerabilities affecting a certain subset of classes of software component, vulnerabilities affecting certain software components, or vulnerabilities satisfying any number of criteria defined in terms of characteristics of vulnerabilities including those characteristics indicated in Figure 3 and other characteristics not indicated. The embedding generator 202 includes a training data generator 206 as a software, hardware, firmware or combination component arranged to access data records from the vulnerability database 222 and generate training data for a neural network 212. Preferably, the training data is generated according to the approach of Mikolov et al. and Ring et al., save that the data utilised for training the neural network 212 is software and vulnerability information (see, for example, section IV of Ring et al. and the example illustrated in Figure 3 of Ring et al.). Each vulnerability database 222 record is used to generate one or more training data items.

In one embodiment, the neural network 212 is trained using tuples each consisting of a class of software component and a vulnerability (i.e. (class, vulnerability) pairs). Additionally or alternatively, the neural network 212 is trained using tuples each consisting of a particular software component within a particular class and a vulnerability (i.e. (software component of class c, vulnerability) pairs). Further additionally or alternatively, the neural network 212 is trained using tuples each consisting of a particular software component (of any class) and a vulnerability (i.e. (software component, vulnerability) pairs). For provision to the neural network 212 during training, each element of a tuple can be encoded as a one-hot vector. In preferred embodiments, tuples are provided for each software (e.g. each software class or each software component) in a corpus of all software (e.g. each software component in a corpus of all software components, such as all software components indicated in the vulnerability database 222 or, at least, in a selected subset of records in the vulnerability database 222). Similarly, in preferred embodiments, tuples are provided for each vulnerability in a corpus of all vulnerabilities, such as all vulnerabilities included in the vulnerability database 222 or, at least, in a selected subset of records in the vulnerability database 222. Thus, a first element in a training data item tuple is used as an input for the neural network 212 (encoded as a one-hot vector) and a second element in a training data item tuple is used as a desired output for the neural network 212 (encoded as a one-hot vector). A backpropagation training algorithm can be applied to train the neural network using such pairs of values (see, for example, “Learning internal representations by backpropagating errors”, D. E. Rumelhart, G. E. Hinton, R. J. Williams, September 1985, Institute for Cognitive Science Report 8506, University of California, San Diego).
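As a minimal sketch of the one-hot encoding described above, the following assumes that the corpus of software and the corpus of vulnerabilities are simply the distinct values appearing in the training pairs, indexed in sorted order. The function names and the sample pairs are illustrative assumptions.

```python
def build_index(values):
    """Map each distinct value in a corpus to a position (sorted for determinism)."""
    return {value: position for position, value in enumerate(sorted(set(values)))}


def one_hot(position, size):
    """Return a vector of zeros with a single 1.0 at the given position."""
    vector = [0.0] * size
    vector[position] = 1.0
    return vector


pairs = [
    ("Browser X", "Denial of Service"),
    ("OS A", "Overflow"),
    ("Browser X", "Execute Code"),
]
software_index = build_index(s for s, _ in pairs)
vulnerability_index = build_index(v for _, v in pairs)

# Each training item becomes (input vector, desired output vector).
encoded = [
    (one_hot(software_index[s], len(software_index)),
     one_hot(vulnerability_index[v], len(vulnerability_index)))
    for s, v in pairs
]
```

The first element of each encoded item is presented at the input units and the second is the desired activation of the output units.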
Preferably, the neural network 212 is trained using stochastic gradient descent and backpropagation. Once trained, the neural network 212 serves as a basis for defining a plurality of vector representations for each software component and/or class in the corpus according to the input data of the neural network. According to Ring et al., for a particular input unit corresponding to a software component or class, a vector constituted by the values of the weights of all connections in the neural network 212 from the input unit to each hidden unit can constitute a vectoral embedding for the software component or class.

Figure 4 depicts a partial exemplary neural network 212 suitable for operation in embodiments of the present invention. The neural network 212 of Figure 4 is a partial representation because it shows only the connections between a single input unit and the hidden units, and between the hidden units and an output unit. The neural network 212 of Figure 4 depicts a set of input units 420 each corresponding to a software component (e.g. “OS A”, “OS B”, “Browser X”) or a class of software components (e.g. “Operating System”, “Browser”, “Database”). Output units 424 are also provided, each being associated with a vulnerability including: “Overflow”; “Execute Code”; “SQL Injection”; “Denial of Service”; “Memory Corruption”; and “Obtain Information”. Figure 4 also depicts weights w1, w2, w3 and w4 of connections between one input unit for software component “Browser X” and a set of hidden units 422. In Figure 4, the neural network 212 is trained with the sample {“Browser X”, “Denial of Service”}, as indicated by the weighted connections between the input unit for “Browser X” and the hidden units, and between the hidden units and the output unit for “Denial of Service”. A vector with the components w1, w2, w3 and w4 thus constitutes a vector representation of the software component “Browser X” and is an embedding for software component “Browser X”.
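The training and extraction just described can be sketched in plain Python as follows. The sketch assumes a single linear hidden layer of four units with a softmax output, in the style of Mikolov et al.; the description does not fix the activation functions, layer sizes, learning rate or number of epochs, so all of these, together with the function name `train_embeddings`, are assumptions. Because the input is one-hot, the hidden activations equal the row of input-to-hidden weights for the active input unit, and that row is the embedding.

```python
import math
import random


def train_embeddings(samples, n_inputs, n_outputs, n_hidden=4,
                     epochs=200, learning_rate=0.1, seed=0):
    """Train on (input position, output position) samples with SGD and
    backpropagation; return the input-to-hidden weights, one row per input unit."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_inputs)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_outputs)] for _ in range(n_hidden)]
    for _ in range(epochs):
        for i, target in samples:
            hidden = list(w_in[i])  # one-hot input selects row i of w_in
            logits = [sum(hidden[h] * w_out[h][o] for h in range(n_hidden))
                      for o in range(n_outputs)]
            peak = max(logits)
            exps = [math.exp(z - peak) for z in logits]
            total = sum(exps)
            # Gradient of the cross-entropy loss with respect to each logit.
            error = [exps[o] / total - (1.0 if o == target else 0.0)
                     for o in range(n_outputs)]
            # Backpropagate to the hidden layer before changing w_out.
            grad_hidden = [sum(error[o] * w_out[h][o] for o in range(n_outputs))
                           for h in range(n_hidden)]
            for h in range(n_hidden):
                for o in range(n_outputs):
                    w_out[h][o] -= learning_rate * hidden[h] * error[o]
                w_in[i][h] -= learning_rate * grad_hidden[h]
    return w_in


software = ["OS A", "OS B", "Browser X"]
vulnerabilities = ["Overflow", "Denial of Service", "SQL Injection"]
samples = [(2, 1), (0, 0), (1, 0)]  # e.g. (2, 1) is {"Browser X", "Denial of Service"}
weights = train_embeddings(samples, len(software), len(vulnerabilities))
browser_x_embedding = weights[software.index("Browser X")]  # [w1, w2, w3, w4]
```

The four values in `browser_x_embedding` play the role of w1 to w4 in Figure 4.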

Thus, for each vulnerability in the vulnerability database 222 multiple embeddings can be generated and stored as weights in the trained neural network 212. Additionally and optionally, embedding(s) for a software component or class can be extracted from the trained neural network 212 and stored separately in a data store or memory associated with or accessible to the embedding generator 202 (not shown). Subsequently, a set of embeddings for the software system 200 is generated as an aggregation of embeddings 214 corresponding to one or more of the software components 220 of the software system 200 and the classes of software components for components 220 in the software system 200. In one embodiment, all software components 220 and/or classes of the software system 200 are used. Alternatively, at least a subset of software components and/or classes of the software system 200 are used. The aggregation of embeddings 214 thus constitutes an aggregation of multiple vectors corresponding to the embeddings being aggregated. Several aggregation techniques can be employed for this purpose, ranging from computing the average embeddings difference of all the embeddings for the software system 200 to using more advanced techniques, such as those disclosed in “Aggregating Neural Word Embeddings for Document Representation” (R. Zhang, J. Guo, Y. Lan, J. Xu and X. Cheng, Springer, 2018) and/or “Aggregating Continuous Word Embeddings for Information Retrieval” (S. Clinchant and F. Perronnin, in Proc. Workshop on Continuous Vector Space Models and their Compositionality, Sofia, Bulgaria, 2013). Such techniques have been applied in information retrieval settings and for determining the content of documents or paragraphs of text. Thus, the embedding generator 202 generates an aggregate embedding 214 for the software system.
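A simple aggregation can be sketched as an element-wise mean of the embeddings of the components and/or classes of the software system. The plain mean is an assumption chosen here as the simplest averaging option; the more advanced techniques cited above are not shown, and the function name and sample vectors are illustrative.

```python
def aggregate(embeddings):
    """Element-wise mean of equal-length embedding vectors."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dimension = len(embeddings[0])
    if any(len(vector) != dimension for vector in embeddings):
        raise ValueError("all embeddings must have the same length")
    return [sum(vector[d] for vector in embeddings) / len(embeddings)
            for d in range(dimension)]


# Hypothetical embeddings for two components of a software system.
component_embeddings = {
    "OS A": [1.0, 2.0, 0.0, 4.0],
    "Browser X": [3.0, 4.0, 2.0, 0.0],
}
system_embedding = aggregate(list(component_embeddings.values()))
```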

Notably, the embedding generator 202 is operable temporally on the basis of the vulnerability database 222 as constituted at a point in time t. Accordingly, an embedding 214 for the software system 200 generated by the embedding generator 202 on the basis of the vulnerability database 222 at time t is referred to as embedding t. The vulnerability database 222 will change over time according to the addition, deletion and amendment of records therein as new vulnerabilities are exposed, existing vulnerabilities are better understood, more information is made available, more analysis of vulnerabilities is conducted and so on.

According to embodiments of the present invention, multiple such embeddings 214 are generated, one for each of multiple different times t (points in time) using the vulnerability database 222 at that point in time (vulnerability database t) such that each embedding 214 is an aggregation of vectors for software components and/or classes derived from the neural network 212 as trained using training data generated from data records in the vulnerability database 222 at that point in time t. In this way, multiple generations of vector representations for the software system 200 are generated, each generation corresponding to data records in the vulnerability database 222 at a different time t. The neural network 212 is trained using training data for the time t and embedding 214 is an aggregate of vector representations of weights in the trained neural network 212 for at least a subset of the software components and/or classes represented by the input units 420 for the software system 200. The aggregate vector representation for the software system 200 for a time t is then stored as embedding 214 associated with the time t, where time t changes for each aggregate thus: embedding t=1, embedding t=2 ... embedding t=n.

Figure 2 also includes a comparator 216 as a software, hardware, firmware or combination component operable to receive and compare a plurality of embeddings 214, each for a different time t. In particular, the comparator 216 performs a similarity analysis on a pair of vector representations for the software system 200, each vector representation corresponding to an embedding 214 generated for a different time t. In this way, differences between embeddings 214 at different times t for the software system 200 can be discerned and, where such differences meet or exceed a maximum threshold degree of difference, a change to the vulnerability of the software system 200 can be identified. In particular, where a series of vector representations (embeddings) 214 remain or become more similar for a series of times t, such similarity is indicative of a degree of consistency of a level of vulnerability of the software system 200 over time. On the other hand, where a series of vector representations (embeddings) 214 become more dissimilar for a series of times t, such dissimilarity is indicative of a change to a level of vulnerability of the software system 200 over time and indicates an increased vulnerability of the software system 200.

The comparator 216 can make the vector comparison using a vector similarity function such as a cosine similarity function for comparing vectors as is known in the art. Sufficiency of dissimilarity (or similarity) can be predetermined in terms of a degree of difference characterised in dependence on the particular vector similarity function employed - such as an angular difference, an extent of vectoral magnitude difference or a combination or other such characterisations of difference as will be apparent to those skilled in the art.
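A comparison of this kind can be sketched with a cosine similarity function and a threshold expressed as an angular difference. The threshold of 15 degrees, the function names and the sample vectors are assumptions for illustration; a suitable threshold would be predetermined for the particular software system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def exceeds_threshold(earlier, later, max_angle_degrees=15.0):
    """True when the angular difference between two embeddings exceeds the maximum."""
    similarity = max(-1.0, min(1.0, cosine_similarity(earlier, later)))  # guard rounding
    return math.degrees(math.acos(similarity)) > max_angle_degrees


stable = exceeds_threshold([1.0, 0.0], [1.0, 0.01])   # angle well under one degree
changed = exceeds_threshold([1.0, 0.0], [0.0, 1.0])   # angle of 90 degrees
```

Cosine similarity ignores vector magnitude, so an embodiment that also treats a magnitude difference as significant would need an additional check on the vector norms.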

In response to a determination of increased vulnerability of the software system 200, protective measures 218 can be implemented to protect the software system 200 and/or reduce a vulnerability of the software system 200. Such protective measures can be mitigating measures to mitigate against exploitation of vulnerabilities of the software system 200, such mitigation measures being determined by reference to, for example, the vulnerability database 222 to identify the vulnerabilities of the software system 200. Such mitigating measures can include deployable software, hardware, firmware or combination facilities or features that can include, for example, the deployment of firewalls, new security measures, additional authentication or authorisation checks, execution or updating of antimalware services, preventing communication with the software system 200, increasing a level of monitoring, tracing or logging and other protective measures as will be apparent to those skilled in the art. Thus, in use, the embedding generator 202 coupled with the comparator 216 provide for the effective characterisation of the software system 200 in terms of vulnerabilities of the software system 200 as vector representations (embeddings 214) for each of a plurality of times t such that changes detected between vector representations 214 can trigger the deployment of protective measures 218.

In one embodiment, a new neural network 212 is trained afresh for each different time t for which the embedding generator 202 operates. Thus, the neural network 212 for a time t is trained using training data 208 derived from vulnerability database 222 data records at that time, and each embedding 214 is defined as an aggregate of vector representations derived from the neural network 212 accordingly. In accordance with an alternative embodiment, a single neural network 212 is used for all times t such that the same neural network 212 is trained initially for a first time t=1 and is further trained (constructively) for each subsequent time t=2 to t=n for n times. In such an approach the embeddings for each time, embedding t=1, embedding t=2 ... embedding t=n, constitute a development of an embedding for a preceding time. Consequently, a series of embeddings arising from repeatedly training the same neural network 212 constitute a temporal sequence of embeddings suitable for training a further neural network as a recurrent neural network. Recurrent neural networks can be used to analyse sequential data due to their ability to take multiple inputs one after the other and save state information between inputs. Such recurrent neural networks can be trained in an unsupervised way by making the target output at each step the embeddings for a next step (prediction) or by training a sequence-to-sequence model to reconstruct the entire sequence (autoencoder). Prediction or reconstruction errors derived from a recurrent neural network can then be used by the comparator 216 to indicate how likely a given sequence of embeddings 214 is to be diverging (or becoming increasingly dissimilar) in comparison to “normal” sequences that can be defined during a training time of operation when the vulnerabilities of the software system 200 are known to be consistent.
Thus, the comparator 216 can be adapted to compare multiple generations of vector representations by training a recurrent neural network for the software system 200 based on the multiple generations of vector representation. In this way, the trained recurrent neural network is suitable for classifying a subsequent vector representation as changed or sufficiently different in relation to multiple generations of vector representations.

Figure 5 is a component diagram of a computer implemented arrangement for detecting increased vulnerability of a software system 200 in accordance with embodiments of the present invention. Many of the elements of Figure 5 are identical to those described above with respect to Figure 2 and these will not be repeated here. In Figure 5 the distinction between software components and classes of software component is clearly indicated and a separate neural network is trained for each of software components and classes. In this way a plurality of embeddings 514a over times t are generated for the software system 200 based on vector representations of classes of software component, and a plurality of embeddings 514b over times t are generated for the software system 200 based on vector representations of software components. Accordingly, the embedding generator 202 receives data records from the vulnerability database 222 including vulnerability information for classes of software components 568 and vulnerability information for software components 570. The training data generator 206 thus generates two separate sets of training data. A first set of training data 572a is generated based on software component class information from vulnerability database 222 records. A second set of training data 572b is generated based on software component information from vulnerability database 222 records.

Subsequently, trainers 510a and 510b (which could be provided as a single trainer) each train a neural network 512a and 512b such that neural network 512a is trained for classes of software component and neural network 512b is trained for software components. Thus vector representations for each class of software component in a corpus of all classes can be extracted from class neural network 512a. Similarly, vector representations for each software component in a corpus of all software components can be extracted from component neural network 512b. Subsequently, class embeddings 514a are generated as aggregates of class vector representations corresponding to software component classes for software system 200. Further, component embeddings 514b are generated as aggregates of software component vector representations corresponding to software components for software system 200. According to such an embodiment, the comparator is thus arranged to compare a series of class embeddings 514a and a series of component embeddings 514b for each of different times t in order to identify changes to embeddings over time as a trigger for protective measures 218.

Figure 6 is an alternative arrangement of the embedding generator of Figure 2 in accordance with embodiments of the present invention. Many of the elements of Figure 6 are identical to those described above with respect to Figure 2 and these will not be repeated here. Further, the embedding generator 602 of Figure 6 is equally applicable to the arrangement of Figure 5 with appropriate modifications as will be apparent to those skilled in the art. The embedding generator 602 of Figure 6 is adapted to enhance the training data 680 by augmenting it. Thus, the embedding generator 602 includes an augmenter 682 as a software, hardware, firmware or combination component arranged to receive or access each item of the training data 680 and the data records received from the vulnerability database 222 (not shown in Figure 6) on the basis of which the training data 680 was generated. The augmenter 682 generates augmented training data 684 corresponding to the training data 680 with augmentations. The augmented training data 684 is then used by the trainer 210 to train the neural network 212 as previously described.

Specifically, the augmenter 682 performs augmentation of the training data 680 by replicating training data items so that they appear more than once in the augmented training data 684. A determination of whether to replicate a training data item is made based on one or more attributes of a vulnerability database 222 data record corresponding to the training data item. The replication may involve mere duplication of one, more than one or all training data items generated based on a vulnerability database 222 data record, or repetition of such training data items multiple times in the augmented training data 684. In this way, characteristics of vulnerability database 222 data records deemed significant (based on one or more attributes) are emphasised in the augmented training data 684 by replication of training data items. In one embodiment, such significance is determined based on a value of one or more attributes in a vulnerability database 222 data record, such as attributes corresponding to a score, impact or complexity of a vulnerability. In this way, more significant vulnerabilities are emphasised in the augmented training data 684 by replication of training data items.

Similarly, characteristics of the software component and/or class indicated in a vulnerability database 222 data record can lead to replication. For example, the augmenter 682 can augment training data 680 to augmented training data 684 by replicating training data items in response to a determination that a value of a score attribute of a vulnerability corresponding to a training data item exceeds a predetermined threshold.
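The score-based replication described above can be sketched as follows. The pairing of each training data item with the score of its originating record, the threshold of 7.0 and the number of copies are assumptions made for illustration.

```python
def augment(scored_items, score_threshold=7.0, copies=2):
    """Replicate each training data item whose record score exceeds the threshold.

    scored_items: list of (item, score) where item is a (software, vulnerability)
    pair and score is taken from the originating database record.
    """
    augmented = []
    for item, score in scored_items:
        repetitions = copies if score > score_threshold else 1
        augmented.extend([item] * repetitions)
    return augmented


training_data = [
    (("Browser X", "Denial of Service"), 9.3),  # high score: emphasised
    (("OS A", "Overflow"), 4.0),                # low score: kept once
]
augmented_training_data = augment(training_data)
```

Items that appear more often in the augmented training data contribute more weight updates during training, which is how the more significant vulnerabilities come to be emphasised in the resulting embeddings.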

Figure 7 is a component diagram of a computer implemented arrangement for detecting and remediating increased vulnerability of a software system 200 in accordance with embodiments of the present invention. Some of the features of Figure 7 are identical to those described above with respect to Figures 2, 5 and 6 and these will not be repeated here. According to the arrangement of Figure 7 the comparator informs an increased vulnerability determiner 718 as a software, hardware, firmware or combination component adapted to determine whether a vulnerability for software system 200 has increased sufficiently to trigger remediation of the software system 200 so as to reduce a vulnerability of the system 200. Notably, remediation differs from mitigation in that an objective of remediation is to affect a vulnerability of the software system 200 such that a level or degree of vulnerability of the software system 200 is within permitted levels. Permitted levels of vulnerability can include a threshold degree of vulnerability indicated by a degree of difference (or change) between embeddings for the software system between two points in time t and t+u. For example, an acceptable or permitted level of vulnerability can be a level of vulnerability that is known to be effectively mitigated by mitigation measures such as those previously described. Where a degree or level of vulnerability exceeds such a permitted or acceptable level, remediation by way of improving the level of vulnerability (by reducing vulnerability such as by eliminating one or more vulnerabilities) is preferred.

Accordingly, where the increased vulnerability determiner 718 determines that a level of vulnerability exceeds an acceptable threshold (or meets a predetermined level), a software modifier 790 is triggered to effect modifications to the software system 200 so as to effect a change to a degree of vulnerability of the software system 200 that is intended to reduce a degree of difference between an embedding generated for the software system 200 so changed and embeddings for previous times t. Notably, in such a situation, an embedding known to exceed a maximum threshold degree of difference from previous embeddings should be excluded from comparison with an embedding for a new modified version of the software system 200.

According to embodiments of the present invention in accordance with Figure 7, software components in the software system 200 are iteratively adjusted and, at each iteration, a new aggregate vector representation (embedding) 214 for the software system 200 so adjusted is generated. Each regenerated embedding 214 is compared with the multiple previous embeddings to identify one or more software component adjustments leading to a change in the embedding not exceeding a maximum threshold degree of change. In this way, the vulnerability of the software system 200 is reduced.

Adjustment to the software system 200 can include, inter alia: replacing one or more software components with alternative compatible software components; replacing one or more software components with alternative versions of the software components; upgrading or updating one or more software components in the software system 200; and patching one or more software components in the software system with an available software patch intended to reduce a vulnerability of the software components.
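The iterative adjustment can be sketched as a loop that applies candidate adjustments one at a time, regenerates the aggregate embedding and stops when the change relative to the accepted earlier embeddings is within the maximum. The sketch makes several assumptions: adjustments are modelled as replacement of one named component by another and accumulate across iterations, per-component vectors are already available, the aggregate is an element-wise mean, and the degree of change is measured as Euclidean distance. All names and values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def remediate(components, adjustments, component_vectors, accepted_embeddings, max_change):
    """Apply adjustments cumulatively until the regenerated embedding is within
    max_change of every accepted earlier embedding.

    Returns (adjusted components, new embedding), or (None, None) if no
    sequence of the candidate adjustments achieves this.
    """
    current = list(components)
    for old, new in adjustments:
        current = [new if name == old else name for name in current]
        embedding = mean_vector([component_vectors[name] for name in current])
        if all(math.dist(embedding, past) <= max_change for past in accepted_embeddings):
            return current, embedding
    return None, None


component_vectors = {
    "OS A": [0.0, 0.0],
    "Browser X 1.0": [4.0, 0.0],   # version whose vector has drifted
    "Browser X 1.1": [0.2, 0.0],   # patched version
}
# Earlier generations, excluding the generation that exceeded the threshold.
accepted = [[0.0, 0.0], [0.1, 0.0]]
adjusted, new_embedding = remediate(
    ["OS A", "Browser X 1.0"],
    [("Browser X 1.0", "Browser X 1.1")],
    component_vectors, accepted, max_change=0.5,
)
```

In practice the vectors for an adjusted system would be regenerated by the embedding generator rather than looked up from a fixed table.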

Figure 8 is a flowchart of a computer implemented method of detecting increased vulnerability of a software system 200 in accordance with embodiments of the present invention. Initially, at step 802, an iteration through each of a series of times t is initiated. At step 804 a nested iteration is initiated through each of at least a subset of data records in the vulnerability database 222. At step 806 one or more training data items are generated for a current data record from the database 222. At step 808 the method iterates until all records are processed. At step 810 the method trains a neural network 212 using the training data generated at step 806. At step 812 vector representations (embeddings) for each of the software components and/or classes of software component are generated from the trained neural network 212. At step 814 an aggregate vector representation (embedding) 214 is generated for the software system 200 by aggregating vector representations for software components and classes of the software system 200. At step 816 the method iterates through all times t. At step 818 the comparator compares the multiple generations (for each time t) of embeddings 214 for the software system 200 and, where a change between embeddings 214 exceeds a maximum threshold change, the method implements protective measures at step 822. Notably, in some embodiments, step 822 triggers the implementation of remediations as described above with respect to Figure 7 and below with respect to Figure 10.

Figure 9 is a flowchart of an alternative computer implemented method of detecting increased vulnerability of a software system 200 in accordance with embodiments of the present invention. The steps of Figure 9 are identical to those described above with respect to Figure 8 except that, at step 909, training data items are augmented before training the neural network 212 at step 810 as previously described with respect to Figure 6.
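The control flow of Figure 8 can be sketched as follows, with the step numbers noted in comments. To keep the sketch short and self-contained, the neural network training of steps 810 and 812 is replaced by a stand-in that builds a per-component vector of vulnerability counts; this stand-in is not the embedding technique described above and serves only to make the loop runnable. The snapshot data, function names and the 20 degree threshold are likewise assumptions.

```python
import math
from collections import Counter

VULNERABILITIES = ["Overflow", "Denial of Service"]


def count_vectors(items):
    """Stand-in for steps 810-812: one count vector per software component."""
    counts = Counter(items)
    return {s: [float(counts[(s, v)]) for v in VULNERABILITIES] for s, _ in items}


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def angle_exceeds(a, b, max_degrees=20.0):
    cosine = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine)))) > max_degrees


def generate_embeddings(snapshots, system_components):
    generations = []
    for records in snapshots:                        # steps 802 and 816: each time t
        items = [tuple(record) for record in records]    # steps 804-808
        vectors = count_vectors(items)                   # steps 810-812 (stand-in)
        generations.append(mean_vector(                  # step 814
            [vectors[c] for c in system_components if c in vectors]))
    return generations


def first_change(generations):
    """Step 818: index of the first generation that differs from its predecessor."""
    for t in range(1, len(generations)):
        if angle_exceeds(generations[t - 1], generations[t]):
            return t
    return None


base = [("OS A", "Overflow"), ("Browser X", "Overflow")]
snapshots = [base, base, base + [("Browser X", "Denial of Service")] * 2]
generations = generate_embeddings(snapshots, ["OS A", "Browser X"])
changed_at = first_change(generations)  # a non-None value would trigger step 822
```

For the alternative method of Figure 9, an augmentation step would be applied to `items` before the training stand-in is called.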

Figure 10 is a flowchart of a computer implemented method of detecting and remediating increased vulnerability of a software system 200 in accordance with embodiments of the present invention. Initially, at step 1066, the method generates multiple generations of aggregate vectors for the software system 200 using the method of, for example, Figure 8 or Figure 9. At step 818 the comparator compares the multiple generations (for each time t) of embeddings 214 for the software system 200 and, where a change between embeddings 214 exceeds a maximum threshold change, the method proceeds to step 1068. At step 1068 the method commences an iteration of steps 1070 to 1074. At step 1070 the software components of the software system 200 are adjusted as previously described. At step 1072 a new aggregate vector representation (embedding) is generated for the software system 200 so adjusted. At step 1074 the generations of vector embeddings generated at step 1066 are compared with the new embedding generated at step 1072. Preferably, a generation of embedding generated at step 1066 that caused a change exceeding a maximum threshold degree of change at step 820 is excluded from the comparison at step 1074 as it represents an unacceptable degree of change. At step 1076, if the degree of change determined by the comparison of step 1074 exceeds the maximum acceptable degree, the method iterates to step 1068. Otherwise the method concludes as the adjusted software system 200 has a suitable reduction in a degree or level of vulnerability.

Figure 5 is a flowchart of an exemplary method of detecting anomalous behaviour within a computer network in accordance with embodiments of the present invention. Initially, at step 502, the method commences an iteration through a series of predetermined or determinable time periods t. At step 504 a set of network communication data records 204 for the time period are accessed. At step 506 an iteration is commenced for each data record in the set of accessed network communication data records 204. At step 508 the method generates one or more training data items for the current network communication data record 204. At step 510 the iteration continues for all network communication data records 204. According to some embodiments, the method subsequently augments, at step 512, the set of all training data items generated at step 508 for the current time period t by the augmentor 430 as previously described. At step 514 the neural network 212 is trained using the training data 208 (or, where augmented, the augmented training data 432). At step 516 the vector embeddings 214 for each value in the corpus of attribute values are stored for the current time period t. At step 518 the method iterates for all time periods. Subsequently, at step 520, the anomaly detector 216 compares generations of vector representation embeddings 214, e.g. using a vector similarity function such as cosine similarity. At step 522 detected anomalies lead to step 524 at which protective measures are deployed.

Insofar as embodiments of the invention described are implementable, at least in part, using a software-controlled programmable processing device, such as a microprocessor, digital signal processor or other processing device, data processing apparatus or system, it will be appreciated that a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention. The computer program may be embodied as source code or undergo compilation for implementation on a processing device, apparatus or system or may be embodied as object code, for example.

Suitably, the computer program is stored on a carrier medium in machine or device readable form, for example in solid-state memory, magnetic memory such as disk or tape, optically or magneto-optically readable memory such as compact disk or digital versatile disk etc., and the processing device utilises the program or a part thereof to configure it for operation. The computer program may be supplied from a remote source embodied in a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave. Such carrier media are also envisaged as aspects of the present invention.

It will be understood by those skilled in the art that, although the present invention has been described in relation to the above described example embodiments, the invention is not limited thereto and that there are many possible variations and modifications which fall within the scope of the invention.

The scope of the present invention includes any novel features or combination of features disclosed herein. The applicant hereby gives notice that new claims may be formulated to such features or combination of features during prosecution of this application or of any such further applications derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.