

Title:
CREATION AND VERIFICATION OF BEHAVIORAL BASELINES FOR THE DETECTION OF CYBERSECURITY ANOMALIES USING MACHINE LEARNING TECHNIQUES
Document Type and Number:
WIPO Patent Application WO/2019/220363
Kind Code:
A1
Abstract:
A system (10) for the creation and verification of behavioral baselines, comprising a central processing device (12) which comprises a control unit (14) and enriched data storage means (22) and which is connected to and communicates with a plurality of target apparatuses (36) and with an Identity & Access Management (IAM) apparatus (38). The central processing device (12) comprises: - an IAM state collection module (18) configured to generate a real-time synchronized copy of data on the IAM state which are recorded by the IAM apparatus (38), minimizing the overhead on said IAM apparatus (38); - a data enrichment module (20) configured to identify an entity in real time; - a Markovian module (24), configured to build a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity; - a baseline module (26), configured to calculate a plurality of individual score values, one for each individual activity/entity pair, and a plurality of collective score values, one for each individual activity/time window pair; - a log anomaly verification module (28) configured to assess the presence of a behavioral anomaly of the entity with respect to an individual space, on the basis of the plurality of individual score values; - a peer anomaly verification module (30), configured to assess behaviors of similar peer entities; and - a noise reduction module (32), configured to reduce the number of false positives on the basis of the assessment of the behavior of the similar peer entities.

Inventors:
MARTINELLI GIUSEPPE (IT)
VALENTINI GIANLUCA (IT)
TOMASSI ANDREA (IT)
ZACCARIA ANDREA (IT)
LUCCI CRISTIAN (IT)
Application Number:
PCT/IB2019/054027
Publication Date:
November 21, 2019
Filing Date:
May 15, 2019
Assignee:
SHARELOCK S R L (IT)
International Classes:
G06F21/55; G06F21/31; H04L29/06
Foreign References:
US9516053B1 (2016-12-06)
US20120066763A1 (2012-03-15)
US20150235152A1 (2015-08-20)
Attorney, Agent or Firm:
MODIANO, Micaela (IT)
Claims:
CLAIMS

1. A system (10) for the creation and verification of behavioral baselines, comprising a central processing device (12) which comprises a control unit (14) and enriched data storage means (22), said central processing device (12) being connected to and communicating with a plurality of target apparatuses (36) and with an Identity & Access Management (IAM) apparatus (38), characterized in that said central processing device (12) comprises:

- an IAM state collection module (18) configured to generate a real-time synchronized copy of data on the IAM state recorded by said IAM apparatus (38), minimizing the overhead on said IAM apparatus (38);

- a data enrichment module (20) configured to identify an entity in real time starting from: a plurality of accounts corresponding to a target apparatus (36), or from the detected source and destination IP addresses, or from the hostname, or from the MAC address, or from a combination of at least the detected source and destination IP addresses, the hostname and the MAC address;

- a Markovian module (24), configured to build a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity, both of said activities being defined by enriched data and being performed by said entity on said target apparatuses (36);

- a baseline module (26), configured to calculate a plurality of individual score values, one for each individual activity/entity pair, and a plurality of collective score values, one for each individual activity/time window pair, said individual and collective score values representing the deviation from a mean behavioral baseline and providing a quick and informative indication of the presence of any anomalies that can be simply translated in probabilistic terms;

- a log anomaly verification module (28) configured to assess the presence of a behavioral anomaly of said entity with respect to an individual space, i.e., with respect to the history of past activities of said entity, on the basis of said plurality of individual score values;

- a peer anomaly verification module (30), configured to assess i) the presence of any similar behaviors between similar peer entities by means of unsupervised machine learning techniques, i.e., clustering techniques in order to determine any peers; and ii) the presence of a behavioral anomaly of said entity with respect to a collective space, i.e., with respect to the current activities of other, similar peer entities, on the basis of said plurality of collective score values; and

- a noise reduction module (32), configured to reduce the number of false positives on the basis of the assessment of the behavior of said similar peer entities.

2. The system (10) for the creation and verification of behavioral baselines according to claim 1, characterized in that said IAM state collection module (18) of said central processing device (12) is further configured to collect said data on the IAM state which are currently recorded by said IAM apparatus (38), obtaining an updated mapping of all the accounts of said entity on said target apparatuses (36).

3. The system (10) for the creation and verification of behavioral baselines according to claim 1 or 2, characterized in that said central processing device (12) further comprises a raw event collection module (16) configured to collect and normalize raw data on the events recorded by said target apparatuses (36).

4. The system (10) for the creation and verification of behavioral baselines according to claim 3, characterized in that said data enrichment module (20) of said central processing device (12) is further configured to cross-reference said raw data on the events, collected by said raw event collection module (16), and said data on the IAM state, collected by said IAM state collection module (18), identifying and grouping raw events associated with said entity independently of the number of accounts of which said entity is the owner.

5. A method for the creation and verification of behavioral baselines, by means of a central processing device (12) which comprises a control unit (14) and enriched data storage means (22), which is connected to and communicates with a plurality of target apparatuses (36) and with an Identity & Access Management (IAM) apparatus (38), which comprises the steps of:

- building (76) a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity, by means of a Markovian module (24) comprised in said central processing device (12), both of said activities being defined by said enriched data and being performed by an entity on said target apparatuses (36);

- calculating (78) a plurality of individual score values, one for each individual activity/entity pair, and a plurality of collective score values, one for each individual activity/time window pair, by means of a baseline module (26) comprised in said central processing device (12);

- assessing (80) the presence of a behavioral anomaly of said entity with respect to an individual space, i.e., with respect to the history of past activities of said entity, on the basis of said plurality of individual score values, by means of a log anomaly verification module (28) comprised in said central processing device (12); and

- assessing (82) the presence of a behavioral anomaly of said entity with respect to a collective space, i.e., with respect to the current activities of other, similar peer entities, on the basis of said plurality of collective score values, by means of a peer anomaly verification module (30) comprised in said central processing device (12).

6. The method for the creation and verification of behavioral baselines according to claim 5, characterized in that it comprises the step of collecting and normalizing (70) raw data on the events recorded by said target apparatuses (36), by means of a raw event collection module (16) comprised in said central processing device (12).

7. The method for the creation and verification of behavioral baselines according to claim 5 or 6, characterized in that it comprises the step of collecting (72) data on the current state which are recorded by said IAM apparatus (38), by means of an IAM state collection module (18) comprised in said central processing device (12), obtaining an updated mapping of all the accounts of said entity on said target apparatuses (36).

8. The method for the creation and verification of behavioral baselines according to claims 6 and 7, characterized in that it comprises the step of cross-referencing (74) said data on the events, collected by said raw event collection module (16), and said data on the IAM state, collected by said IAM state collection module (18), by means of a data enrichment module (20) comprised in said central processing device (12), identifying and grouping events associated with said entity independently of the number of accounts of which said entity is the owner.

Description:
CREATION AND VERIFICATION OF BEHAVIORAL BASELINES FOR THE DETECTION OF CYBERSECURITY ANOMALIES USING MACHINE LEARNING TECHNIQUES

The present invention relates to a system and a method for the creation and verification of behavioral baselines, particularly but not exclusively useful and practical in the field of business or infrastructure IT security.

1. Definitions

In the present description, the term “user” is understood to refer to a real-life physical person, who uses one or more IT systems comprised in a more or less complex business information system and who is the owner of one or more accounts, while the term “account” is understood to refer to an entity that is recorded and authorized to access one of the IT systems, typically defined by username and password.

In the present description, the expression “behavioral baseline” is understood to refer to the behavioral model or profile of at least one entity (hereinafter, users will be considered as the entities to be observed, although the present invention can be generalized to consider as an entity any apparatus generating emissions broadly speaking), constituted by the set of actions and operations that that same user or apparatus has performed and therefore would perform in a normal situation (i.e., in a “base” situation).

2. Background of the invention

Currently, company systems and networks are particularly difficult to protect from IT threats due to a series of factors, including for example: the ability of aggressors to operate from any part of the world, the connections between the internet and company systems, and the difficulty of reducing vulnerabilities in complex IT networks.

Furthermore, IT threats to critical infrastructures (such as, for example, electricity power stations, airports, hospitals, and so forth) are increasingly worrying, since such infrastructures are subject to an increasing risk due to increasingly sophisticated intrusion attempts. Since Information Technology (IT) is becoming increasingly integrated with the functionality of critical infrastructures, there is a consequently increasing risk of large-scale or high-impact events that could cause damage or interrupt services on which the global economy and the daily lives of millions of people depend.

In view of the high risk and of the potential consequences of IT attacks, strengthening security and resilience in the IT sector has become an important security mission for all companies and governments.

For companies, the rapid detection of threats to IT security is essential in order to prevent their IT systems from being compromised and their data from being stolen. Much of this data consists of very important commercial and/or personal information, or even industrial secrets, not intended for disclosure or public viewing; any exposure, theft or manipulation of this data could cause economic or reputational damage to the organization and individuals concerned. A large number of these IT attacks, as reported by the media, have involved fraud or violations of data, intellectual property, or national security.

Companies typically use a layered and compartmentalized network topology to separate the internal network from the internet. Workstations and servers are generally protected from direct access via the internet or other external networks by a proxy server; internet traffic is generally terminated in so-called “demilitarized zones” of the company network, and incoming traffic is filtered through one or more firewalls.

Normally, attackers concentrate their attacks on the elements that are exposed toward the external perimeter of the company network, and many solutions exist that ensure perimeter security. However, the borders of modern company networks are no longer so well-defined; the increase in the use of cloud applications and the use of mobile devices have made the company network more fluid and dynamic, with borders that are difficult to identify. Once the attackers have breached the perimeter and have entered the company network, in general they operate with the appearance of an internal user, by stealing the account of an existing user or creating a new one. They use legitimate accounts or trusted systems and can move freely around the company network by taking advantage of the lack of effective supervision of the internal network.

Currently, various security solutions are known which have the ability to detect potential malicious activities by a user. Most security solutions of the known type use a static and reactive approach, looking for signatures of known attacks in order to identify and alert to the presence of similar attacks.

3. Drawbacks of the background art

However, these solutions of the known type, which utilize a signature-based threat detection approach, are not free from drawbacks, which include the fact that the development of signatures for new threats requires an in-depth analysis of an infected system, thus wasting considerable time and resources, which are in any case insufficient to face rapidly evolving threats. Furthermore, signatures do not adapt to the very rapid changes that threat vectors undergo. Finally, signature-based approaches are ineffective against what are known as “zero-day” attacks, which exploit vulnerabilities that are unknown and for which, therefore, no signatures are available for detecting threats.

An evolution of this signature-based threat detection approach consists in identifying internal attacks by means of manually building various profiles of“normal” user behaviors, detecting deviations from these profiles, and assessing the threat risk of these anomalies.

One widespread approach is to build a baseline of the behavior of the user (or, more frequently, of a user account or of one of the IP addresses used by the user, thus incurring the limitations described above), so as to “learn” the operations that the user normally performs. In general, a time window of a certain preset length is fixed. After the learning phase, systems based on machine learning algorithms consider deviations from this baseline as anomalies.

The flaw in this approach is that the order in which the operations are performed within the window is completely ignored. As long as the operations have already been performed in the past, and as long as their frequency remains within a certain mean and variance, the behavior of the user will appear to be consistent with his or her baseline in the analyzed window.
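Purely by way of non-limiting illustration (the activity names below are hypothetical), the following sketch shows why a frequency-only baseline cannot distinguish two windows containing the same operations in a different order, while a sequence-aware comparison can:

```python
from collections import Counter

# Two hypothetical activity windows for the same user: identical
# operations and frequencies, but performed in a different order.
normal_window = ["login", "read_file", "read_file", "write_file", "logout"]
suspect_window = ["login", "write_file", "read_file", "logout", "read_file"]

# A frequency-only baseline compares counts within the window...
assert Counter(normal_window) == Counter(suspect_window)

# ...so both windows look identical to it, even though the order of
# operations differs; only a sequence-aware model (such as a Markov
# transition matrix) can tell them apart.
transitions_a = set(zip(normal_window, normal_window[1:]))
transitions_b = set(zip(suspect_window, suspect_window[1:]))
assert transitions_a != transitions_b
```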

However, these solutions of a known type, which utilize a threat detection approach based on normal behavior profiles, are not free from drawbacks, which include the fact that building and configuring profiles that precisely characterize the normal behavior of a user is very difficult, since human behavior is extremely changeable and highly variable.

Furthermore, the use of such profiles for detecting behavioral anomalies can produce incorrect results and lead to many false positives that overwhelm IT security analysts. The balance between an excessively permissive detection method, with the risk of missing a real threat, and a close-meshed detection method, which floods security analysts with warnings, is a difficult compromise.

To this must be added the further drawback introduced by the currently very widespread practice of using only partial information on the activities of the user as input data. In particular, it is usually not possible to attribute to the same user the emissions that originate from his or her various different accounts, with the result that the capacity for correlation with the totality of the user’s activities is reduced. If in fact it is not known that two activities are performed by the same user, it is evidently impossible to detect effectively all possible fraud patterns.

4. Aim and objects of the invention

The aim of the present invention is to overcome the above mentioned limitations of the background art, by devising a system and a method for the creation and verification of behavioral baselines based on activities or sequences of activities performed by the entities (comprising both users and apparatuses) that operate within a company network, for the purposes of identifying any behavioral deviations that could constitute harmful activities and IT threats.

Within this aim, an object of the present invention is to conceive a system and a method for the creation and verification of behavioral baselines that enable a dynamic, adaptive and proactive IT threat detection method, for the purpose of combating external and internal threats, which are constantly evolving and a priori unknown.

Another object of the present invention is to devise a system and a method for the creation and verification of behavioral baselines that enable a method for the detection of behavioral anomalies that is not driven by signatures or by preset policies, but which adapts to the behavior of the entities and to the use of the company information systems.

Another object of the present invention is to conceive a system and a method for the creation and verification of behavioral baselines that minimize false positives, thus increasing the effectiveness of the response actions to an IT attack or threat.

A further object of the present invention is to devise a system and a method for the creation and verification of behavioral baselines that are not limited to making correlations on the activities of the accounts, but make correlations on the activities of the individual physical user, independently of the number of accounts assigned to him or her.

Another object of the present invention is to provide a system and a method for the creation and verification of behavioral baselines that are highly reliable, relatively easy to provide, and low cost when compared with the background art.

5. Definition of the invention

The above described aim and objects are achieved by a system for the creation and verification of behavioral baselines, comprising a central processing device which comprises a control unit and enriched data storage means, said central processing device being connected to and communicating with a plurality of target apparatuses and with an Identity & Access Management (IAM) apparatus, characterized in that said central processing device comprises:

- an IAM state collection module configured to generate a real-time synchronized copy of data on the IAM state recorded by said IAM apparatus, minimizing the overhead on said IAM apparatus;

- a data enrichment module configured to identify an entity in real time starting from: a plurality of accounts corresponding to a target apparatus, or from the detected source and destination IP addresses, or from the hostname, or from the MAC address, or from a combination of at least the detected source and destination IP addresses, the hostname and the MAC address;

- a Markovian module, configured to build a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity, both of said activities being defined by enriched data and being performed by said entity on said target apparatuses;

- a baseline module, configured to calculate a plurality of individual score values, one for each individual activity/entity pair, and a plurality of collective score values, one for each individual activity/time window pair, said individual and collective score values representing the deviation from a mean behavioral baseline and providing a quick and informative indication of the presence of any anomalies that can be simply translated in probabilistic terms;

- a log anomaly verification module configured to assess the presence of a behavioral anomaly of said entity with respect to an individual space, i.e., with respect to the history of past activities of said entity, on the basis of said plurality of individual score values;

- a peer anomaly verification module, configured to assess i) the presence of any similar behaviors between similar peer entities by means of unsupervised machine learning techniques, i.e., clustering techniques in order to determine any peers; and ii) the presence of a behavioral anomaly of said entity with respect to a collective space, i.e., with respect to the current activities of other, similar peer entities, on the basis of said plurality of collective score values; and

- a noise reduction module, configured to reduce the number of false positives on the basis of the assessment of the behavior of said similar peer entities.

This aim and these objects are also achieved by a system for the creation and verification of behavioral baselines, comprising a central processing device which comprises a control unit and enriched data storage means, which is connected to and communicates with a plurality of target apparatuses and with an Identity & Access Management or IAM apparatus, characterized in that said central processing device comprises:

- a Markovian module, configured to build a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity, both of said activities being defined by enriched data and being performed by an entity on said target apparatuses;

- a baseline module, configured to calculate a plurality of individual z-score values, one for each individual activity/entity pair, and a plurality of collective z-score values, one for each individual activity/time window pair;

- a log anomaly verification module, configured to assess the presence of a behavioral anomaly of said entity with respect to an individual space, i.e., with respect to the history of past activities of said entity, on the basis of said plurality of individual z-score values; and

- a peer anomaly verification module, configured to assess the presence of a behavioral anomaly of said entity with respect to a collective space, i.e., with respect to the current activities of other, similar peer entities, on the basis of said plurality of collective z-score values.
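Purely by way of non-limiting illustration, the Markov transition matrix and the z-score values described above can be sketched as follows (the states, activity sequences and per-window counts are hypothetical, and this is not presented as the patented implementation):

```python
import numpy as np

def transition_matrix(activities, states):
    """Row-stochastic Markov transition matrix estimated from a sequence
    of activities: entry (i, j) is the observed probability of moving
    from activity i to the temporally subsequent activity j."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(activities, activities[1:]):
        counts[idx[a], idx[b]] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def z_scores(observed, history):
    """Per-activity deviation of the observed counts from the mean
    behavioral baseline, expressed in standard deviations."""
    history = np.asarray(history, dtype=float)
    mean, std = history.mean(axis=0), history.std(axis=0)
    return (np.asarray(observed, dtype=float) - mean) / np.where(std > 0, std, 1.0)

states = ["login", "read", "write", "logout"]
M = transition_matrix(["login", "read", "read", "write", "logout"], states)
assert np.allclose(M[states.index("read")].sum(), 1.0)  # rows are probabilities

# Hypothetical per-window activity counts: three past windows versus the
# current one, in which "read" is abnormally frequent.
z = z_scores([4, 30, 2, 4], history=[[3, 10, 2, 3], [5, 12, 1, 5], [4, 11, 3, 4]])
assert z[states.index("read")] > 3  # large deviation: a candidate anomaly
```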

The intended aim and objects are also achieved by a method for the creation and verification of behavioral baselines, by means of a central processing device which comprises a control unit and enriched data storage means, which is connected to and communicates with a plurality of target apparatuses and with an Identity & Access Management or IAM apparatus, which comprises the steps of:

- building a Markov transition matrix adapted to track the transition from a first activity to a second, temporally subsequent activity, by means of a Markovian module comprised in said central processing device, both of said activities being defined by said enriched data and being performed by an entity on said target apparatuses;

- calculating a plurality of individual score values, one for each individual activity/entity pair, and a plurality of collective score values, one for each individual activity/time window pair, by means of a baseline module comprised in said central processing device;

- assessing the presence of a behavioral anomaly of said entity with respect to an individual space, i.e., with respect to the history of past activities of said entity, on the basis of said plurality of individual score values, by means of a log anomaly verification module comprised in said central processing device; and

- assessing the presence of a behavioral anomaly of said entity with respect to a collective space, i.e., with respect to the current activities of other, similar peer entities, on the basis of said plurality of collective score values, by means of a peer anomaly verification module comprised in said central processing device.

6. Brief description of the drawings

Further characteristics and advantages of the invention will become better apparent from the description of a preferred but not exclusive embodiment of the system and method for the creation and verification of behavioral baselines according to the invention, illustrated by way of non-limiting example with the aid of the accompanying drawings, wherein:

Figure 1 is a block diagram that illustrates schematically an embodiment of the system for the creation and verification of behavioral baselines, according to the present invention;

Figure 2 is a flowchart that illustrates schematically an embodiment of the method for the creation and verification of behavioral baselines, according to the present invention.

7. Detailed description of the system according to the invention

With reference to Figure 1, the system for the creation and verification of behavioral baselines according to the invention, generally designated by the reference numeral 10, comprises a central processing device 12, in short central computer 12, which is connected to and communicates with a plurality of target apparatuses 36 and with an Identity & Access Management apparatus 38, in short IAM apparatus 38, for example via a local area network (LAN).

The target apparatuses 36 are computers (for example of the server type), electronic devices, and network devices, all configured by means of software applications or the like and all comprised in a more or less complex information system, which are capable of detecting, directly or indirectly, the presence of IT attackers and attacks, and which can therefore be sources of emissions that are of interest to collect as input data items.

The input data item is collected by the central computer 12 as emitted by the source, i.e., by the corresponding target apparatus 36, but before it is stored, operations occur which extend it, increasing its value. The original raw input data item always remains available; the additional information “accompanies” it.

A typical operation that occurs in this step of collecting the input data item consists in applying some parsing rules (i.e., the isolation and tagging of notable fields within the individual data item) which are specific to the data source. The output of the parsing consists in extrapolating some specific information from the original input data item, making it available in the form of key/value pairs in which the key is known and standardized within the solution. This makes it possible to “normalize” the data of all the sources and to obtain a consistency and an ease of consultation which are essential for subsequent analyses.

In an embodiment, the central computer 12 is configured to apply the following “normalization”: the value corresponding to the IP address of the target apparatus 36 that generated the input data item is selected (if present) and saved in a key named “source ip”. The same key is used to contain the same information independently of the manner in which this IP address is referenced by the data source, i.e., by the target apparatus 36, that produced it.
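Purely by way of non-limiting illustration, such per-source parsing and normalization can be sketched as follows (the source names, log formats and regular expressions are hypothetical):

```python
import re

# Hypothetical per-source parsing rules: each rule isolates and tags
# notable fields within the raw line, mapping them onto standardized
# keys so that the same information always lands under the same key
# (e.g. "source ip"), however the emitting target apparatus formats it.
PARSERS = {
    "apache": re.compile(r"^(?P<source_ip>\S+) \S+ (?P<account>\S+) "),
    "syslog": re.compile(r"from (?P<source_ip>[\d.]+) user=(?P<account>\w+)"),
}

def normalize(source, raw_line):
    match = PARSERS[source].search(raw_line)
    fields = match.groupdict() if match else {}
    # The original raw input data item is kept intact; the parsed
    # key/value pairs merely "accompany" it.
    return {"raw": raw_line,
            "source ip": fields.get("source_ip"),
            "account": fields.get("account")}

a = normalize("apache", '10.0.0.5 - jdoe "GET /index.html"')
b = normalize("syslog", "sshd: accepted login from 10.0.0.5 user=jdoe")
# Two different source formats normalized under the same keys:
assert a["source ip"] == b["source ip"] == "10.0.0.5"
```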

Another operation that can occur in this input data item collection step is the enrichment of the data item. Using the same technique as “normalization”, i.e., by adding key/value pairs that accompany the original raw input data item before it is stored, some additional information can be associated with the data item, such as for example, but not exclusively, the user associated with the account, the IP address tied to the MAC address, the host associated with the IP address, or the user tied to the IP address and the host.
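Purely by way of non-limiting illustration, such enrichment can be sketched as follows (the lookup tables and values are hypothetical stand-ins for mappings kept in sync from the IAM apparatus and from network inventory):

```python
# Hypothetical enrichment tables: they allow additional key/value
# pairs to be attached to a normalized event before it is stored.
ACCOUNT_TO_USER = {"jdoe": "John Doe", "jd_admin": "John Doe"}
IP_TO_HOST = {"10.0.0.5": "ws-042.corp.local"}

def enrich(event):
    enriched = dict(event)  # the original data item remains available
    if event.get("account") in ACCOUNT_TO_USER:
        enriched["user"] = ACCOUNT_TO_USER[event["account"]]
    if event.get("source ip") in IP_TO_HOST:
        enriched["host"] = IP_TO_HOST[event["source ip"]]
    return enriched

e = enrich({"raw": "<raw line>", "account": "jd_admin", "source ip": "10.0.0.5"})
assert e["user"] == "John Doe" and e["host"] == "ws-042.corp.local"
```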

The IAM (Identity and Access Management) apparatus 38 is an apparatus configured to manage the accounts of the users and the authorizations of these accounts within an information system, such as the one of which the central computer 12 and the target apparatuses 36 form a part.

In particular, the IAM apparatus 38 manages centrally the accounts of the users (typically represented by a username for each account), the security credentials (typically represented by a password or access key associated with the username of the account), and the authorizations that ensure that the account owner users can access all and only the resources that are assigned to them.

The central computer 12 comprises a control unit 14, a raw event collection module 16, an IAM (Identity and Access Management) state collection module 18, a data enrichment module 20, enriched data storage means 22, a Markovian module 24, a baseline module 26, a log anomaly verification module 28, a peer anomaly verification module 30, and a noise reduction module 32.

The control unit 14 is the main functional element of the central computer 12, and for this reason it is connected to and communicates with the other elements comprised in the central computer 12.

The control unit 14 of the central computer 12 is provided with appropriate capabilities for calculation and interfacing with the other elements of the central computer 12, and is configured to control, monitor and coordinate the operation of the elements of the central computer 12 to which it is connected and with which it communicates.

The raw event collection module 16 of the central computer 12 is configured to collect the raw (i.e., non-normalized) data on the events recorded by the target apparatuses 36 which are connected to the central computer 12, considering them as emissions of those target apparatuses 36. The raw event collection module 16 is further configured to normalize these data on the events, clustering similar information under the same semantic name so as to make them groupable.

In practice, each activity of one of the target apparatuses 36 corresponds to one emission. For example, but not exclusively, the emissions of the target apparatuses 36 may include the lines of log files of SAP systems, DBMS audit events, lines of the Apache httpd audit file, syslog events, and so forth.

The IAM state collection module 18 is configured to generate a real-time synchronized copy of the IAM state data recorded by the IAM apparatus 38, minimizing the overhead on said IAM apparatus 38. In an embodiment, maintaining this data copy consists in using queries with low computational cost, with the sole purpose of monitoring any changes of interest in the database of the IAM apparatus 38, and only at that point proceeding with the synchronization, which, again for the purpose of performance optimization, occurs by assessing exclusively the differences with respect to the previous synchronization session.
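Purely by way of non-limiting illustration, this low-overhead synchronization can be sketched as follows (the `fetch_version` and `fetch_changes` callables, and the version counter, are hypothetical stand-ins for the cheap change-detection query and the diff query against the IAM database):

```python
def sync_iam_copy(local_copy, fetch_version, fetch_changes):
    """Cheap change-detection first; the heavier diff query runs only
    when something actually changed since the last session."""
    remote_version = fetch_version()          # low-cost monitoring query
    if remote_version == local_copy["version"]:
        return local_copy                     # nothing changed: no heavy query
    for account, user in fetch_changes(local_copy["version"]):
        local_copy["accounts"][account] = user  # apply only the differences
    local_copy["version"] = remote_version
    return local_copy

copy = {"version": 1, "accounts": {"jdoe": "John Doe"}}
copy = sync_iam_copy(copy,
                     fetch_version=lambda: 2,
                     fetch_changes=lambda since: [("jd_admin", "John Doe")])
assert copy["accounts"] == {"jdoe": "John Doe", "jd_admin": "John Doe"}
assert copy["version"] == 2
```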

Advantageously, the IAM state collection module 18 of the central computer 12 is configured to collect the data on the current state recorded by the IAM apparatus 38, which manages the accounts of the users and the authorizations of those accounts. By doing so, an updated mapping of all the accounts of the user on the target apparatuses 36 is obtained from the IAM apparatus 38.

The data enrichment module 20 of the central computer 12 is configured to cross-reference the raw data on events that arrive from the raw event collection module 16 and the data on the IAM state that arrive from the IAM state collection module 18, identifying and grouping the raw events associated with a specific physical user, or more generally with an entity, independently of the number of accounts of which that same user is the owner.

7.1 Reconstruction of the activity of a physical user starting from his or her accounts

The raw events always contain information on the user who generated the emission. Depending on the target apparatus 36, this information on the user who is the "author" of the emission can comprise a source IP address, a hostname, a username, or a combination of these items of information. In particular, the username can always be traced back to a user by virtue of the constantly updated mapping taken from the IAM apparatus 38.

By integrating all these information items, the data enrichment module 20 traces the user who is operating by virtue of a certain username, who is using a certain IP address or who is traced by means of a certain hostname.
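As an illustration only, the tracing performed by the data enrichment module 20 can be sketched as a series of lookups against the account, IP and hostname assignments; the mapping tables and the `resolve_user` helper below are hypothetical examples, not data or code from the actual system.

```python
# Hypothetical sketch of the identity-resolution lookup performed by the
# data enrichment module 20. All mappings below are illustrative
# assumptions, not data from an actual IAM apparatus 38.

# Account/user assignments drawn from the IAM copy.
USERNAME_TO_USER = {"j.doe": "John Doe", "jdoe_admin": "John Doe"}
# Device assignments drawn from asset management / DHCP records.
IP_TO_USER = {"10.0.0.5": "John Doe"}
HOSTNAME_TO_USER = {"jdoe-laptop": "John Doe"}

def resolve_user(event):
    """Trace a raw event back to the physical user, trying the
    username first, then the IP address, then the hostname."""
    for key, table in (("username", USERNAME_TO_USER),
                       ("ip", IP_TO_USER),
                       ("hostname", HOSTNAME_TO_USER)):
        value = event.get(key)
        if value and value in table:
            return table[value]
    return None  # unknown: the event cannot be enriched

print(resolve_user({"username": "jdoe_admin"}))  # John Doe
print(resolve_user({"ip": "10.0.0.5"}))          # John Doe
```

Note how both usernames resolve to the same physical user, which is precisely the account-to-user consolidation discussed above.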

Enriching the data available before analysis maximizes the correlation capacity. Usually a user, in order to access IT systems, performs a login, an operation by which he or she is identified by the system. The login occurs using personal credentials, typically a username and a password. The login is therefore the operation that allows the user to access his or her own account on the IT system. This account, which is characterized by a unique username, may have been configured by the administrator of the system in order to access all the functionalities offered by the system, or it may be in turn an administration account, or it may even have only a subset of the functionalities offered by the system.

However, a user often has various different accounts distributed on the various company IT systems. It is also possible for a user to have more than one account on the same IT system and therefore he or she can use the same system with different credentials (i.e., different usernames) and therefore different authorization levels.

IT systems, particularly company IT systems, often maintain a log file or in any case a record in which the operations performed by the users on the system are recorded. In the best case, the user is tracked by means of the username with which he or she authenticated himself or herself. As mentioned above, instead, some systems merely report the hostname or the IP address of the device used by the user.

The family of IT systems that tracks the actions of the physical user by means of his or her username is the best case, since if one wishes to trace back to the physical user, starting from the log file or the record of system activities, the way to proceed is relatively simple. One must in fact trace back to the owner of the account whose username has been found in the log file. However, these systems, taken individually, keep track only of the actions performed by the accounts, and cannot trace back to the user who is operating with each account. As mentioned above, this information is present (see IAM apparatus 38), but it is not the province of an individual system.

For example, in the case of a user who has two accounts on the same system, the operations of the accounts would appear in the log file as a transcription of events associated with two different usernames. The fact that in reality the same physical user is operating behind both usernames, and therefore behind both accounts, cannot be deduced directly.

As mentioned before, tracing back to the information that both accounts belong to the same physical user requires further information on the assignment or ownership of the account. This information is contained in the IAM apparatus 38, and consists in assigning the various usernames to the respective physical users.

Therefore, the data enrichment module 20 is configured to cross-reference the account/user information, preferably in real time, in order to "enrich" the data related to the activities of the accounts with "contextual" information tied to the user, in order to then use it in the data correlation step.

This data enrichment operation, to be performed before the analysis, increases drastically the quality of the data item used to identify anomalies, security incidents and finally fraud patterns. The enrichment makes it possible in fact to cross-reference correctly all the events that can be traced back to the same physical user, independently of how many different accounts he or she has used of those available to him or her. This reconstructs a complete user context and maximizes the capacity to correlate the events produced by the user within the company infrastructure.

Hereinafter, therefore, when referring to operations performed by a user, or activities performed by a user, what is intended is indeed that all the accounts of a user have been previously traced back to his or her real and unique identity.

As mentioned above, there are some IT systems that operate in a different manner, not account-based. This family of IT systems, in which there are for example networking-level security apparatuses and probes, record in the log file or in the activity record the hostname or the IP address of the device through which the user is connected to the company network.

In a company context, however, it is possible to trace back to the assignee of the devices connected to the company network by virtue of the existence of an asset management policy and of enforcement measures that limit the possibilities for connection by means of unknown devices. It is thus possible, once again, to consider replacing the physical user with the IP address used by him or her, as well as with the hostname of his or her device. On the other hand, however, a physical user might use different hostnames or different IP addresses, even simultaneously. Once again, being able to perform resolution in real time, thus assigning the events relating to the devices to the respective users before analysis, greatly improves the ability to make correlations and to detect anomalous behaviors in an effective manner.

Consider the example of a user who uses three IP addresses simultaneously. The events generated by these three IP addresses must be considered as sequential operations of the same user. If however one merely performs the simple analysis of the events of the individual IP addresses, and user identification is used only in the event of anomalous behavior of a certain IP address, it is evident that a great part of the information is ignored. In fact, any anomalous behavior that is detectable only by summing the events of the three IP addresses that belong to the same user would be missed if the IP addresses were considered individually.

Also in this case, therefore, attention is drawn to the greater potential offered by performing this cross-referencing of information between IP address/user or hostname/user upstream of the analysis, and not downstream of it.

Users, in using the company systems, thus leave traces in the log files and in the records. As shown, however, in the log files the user does not appear with his or her name and surname, nor can there be a global unique identifier thereof. In the log files we only find references to an account, i.e., a username that the user used on that system. Or, in the case of network apparatuses and probes, we find references to the IP address or the hostname from which the user logged in.

All companies keep these records of events, since strict regulations exist that mandate the collection and retention, for a certain number of years, of these records, so as to facilitate audit and forensic analysis operations (HIPAA, Sarbanes-Oxley, PCI-DSS, FISMA). Otherwise, the company would not be in compliance with these regulations.

These data are absolutely heterogeneous, since there is no standard shared by the many company systems. The activities, i.e., the events determined by the actions of the user on the system, are tracked in a customized and different manner from system to system. There are, however, some common points: for example, in almost all cases we will find, in the individual entry corresponding to the individual recorded event:

- a date of the event/activity;

- an entity that generated the event, or to which the activity refers (identified in a different manner depending on the system; it might be a username, a hostname or an IP address);

- an action, a description of the generated event, of the activity performed, an operation in a set of possible operations that can be performed, or a recipient who is affected by the event.
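The common fields above can be captured in a minimal normalized record; the sketch below assumes a simplified syslog-like line format, and the `NormalizedEvent` type and `parse_syslog_line` helper are illustrative names, not part of the described system.

```python
# Illustrative normalization of a heterogeneous log entry into the three
# common fields discussed above: date, entity and action.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NormalizedEvent:
    timestamp: datetime   # date of the event/activity
    entity: str           # username, hostname or IP address
    action: str           # description of the event or operation

def parse_syslog_line(line):
    """Parse a simplified, space-separated line such as
    '2019-05-15T10:30:00 j.doe LOGIN' (hypothetical format)."""
    ts, entity, action = line.split(" ", 2)
    return NormalizedEvent(datetime.fromisoformat(ts), entity, action)

ev = parse_syslog_line("2019-05-15T10:30:00 j.doe LOGIN")
print(ev.entity, ev.action)  # j.doe LOGIN
```

In a real deployment each target apparatus 36 would need its own parser, but all of them would converge on the same normalized shape, which is what makes the events groupable.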

In any case, since there is a date, or more generally since the log files and the records are sorted according to a time criterion, it is always possible to perform an analysis of the events by following the order in which they were generated by the activity of the user.

The enriched data storage means 22 of the central computer 12, such as for example a database stored on suitably dimensioned storage media, are configured to store the data corresponding to each user which are produced by the data enrichment module 20 according to the procedure described above.

By way of example, the enriched data storage means 22 can store data corresponding to the user John Doe, to the accounts associated with that user and identified by the usernames j.doe@acme.org and johnny@acme.org, and to the activities performed by that user in the various IT systems.

7.2 Identity Resolution enrichment

A very important enrichment of the input data item, performed by the data enrichment module 20 of the central computer 12, is the one that entails Identity Resolution, a possible embodiment of which is described hereinafter. The data item originating from the audit log of a target apparatus 36 of the UNIX type might contain the "userid" of the UNIX account. This userid is recorded by the audit log in order to identify the user who was authenticated and is therefore responsible for the operation being audited. Since it is a company system, accesses to the UNIX systems (i.e., the credentials associated with a certain account identified uniquely by a userid) are managed by an IAM apparatus 38.

The central computer 12 is functionally connected to this further data repository and draws from it via the IAM state collection module 18 in order to find out the assignments of the accounts to the users.

The IAM state collection module 18 of the central computer 12 performs a continuous synchronization with the IAM apparatus 38, ensuring the presence of constantly updated information.

The synchronization operation occurs incrementally. Only modified data are actually read and synchronized.

In order to avoid unnecessary overloading of the IAM apparatus 38 which is present in the company system, the IAM state collection module 18 uses the lightest possible queries on the database. In an embodiment, the technique used with a JDBC database is as follows: the size (number of lines) and the sequence (key generator) of a table of the database are monitored. These queries are scarcely onerous, but they make it possible to notice any creations/deletions of entries. Only if a modification were detected would the database actually be further interrogated in order to synchronize the difference. In this case also, however, the queries are executed taking care to limit the workload on the database and performing a synchronization "on the differences" with respect to the preceding state, and thus without rereading the entire table, but only rereading the most recent entries.
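A minimal sketch of this change-detection technique, here using an in-memory SQLite table in place of the JDBC database of the IAM apparatus 38 (the `accounts` table and the `check_and_sync` helper are hypothetical):

```python
# Sketch of the low-overhead incremental synchronization described above:
# poll only the row count and the highest key, and read rows only when a
# change is detected, and then only the new ones.
import sqlite3

def check_and_sync(conn, last_count, last_max_id):
    """Return the new (count, max_id) plus any rows added since the
    previously seen key; cheap when nothing has changed."""
    cur = conn.execute("SELECT COUNT(*), COALESCE(MAX(id), 0) FROM accounts")
    count, max_id = cur.fetchone()
    new_rows = []
    if count != last_count or max_id != last_max_id:
        # Synchronize "on the differences": reread only the entries
        # beyond the previously seen key, never the whole table.
        new_rows = conn.execute(
            "SELECT id, username, user FROM accounts WHERE id > ?",
            (last_max_id,)).fetchall()
    return count, max_id, new_rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
             " username TEXT, user TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 'j.doe', 'John Doe')")
count, max_id, rows = check_and_sync(conn, 0, 0)
print(count, max_id, rows)  # 1 1 [(1, 'j.doe', 'John Doe')]
```

A deletion would be noticed through the row count even when the key generator does not move, which is why both quantities are monitored.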

It should be noted that the subsequent reads of the data of the IAM occur on the copy of the data item drawn from the IAM apparatus 38 (a copy made by the IAM state collection module 18 for example with the aid of very fast NO-SQL databases) and that therefore the subsequent reads with the purpose of "event enrichment" occur on this copy of the data item, and never on the original IAM apparatus 38, again with the purpose of minimizing the overhead on that apparatus, which typically is present in the company in order to carry out functions other than event enrichment, and which in almost all cases is not designed, installed or configured to support queries at rates in the order of tens of thousands per second.

It is exactly this integration between the central computer 12 and the IAM apparatus 38 that makes it possible therefore to complete this step of enrichment of the raw input data item. When our system identifies a known userid (coming for example from the company IAM system), it associates the corresponding user, who is the owner of the account. In this manner the data item is immediately traceable back to a specific user.

If, as can be expected, a user owns tens of accounts with which he or she operates within the company system, then this type of enrichment makes it possible to trace back all these accounts to the owner user, thus increasing the ability to detect frauds and enabling subsequent wider and more complete correlations. The use of a copy of the IAM data makes it possible to avoid overloading the main IAM apparatus 38 with continuous requests for information on the account-user association. The copy however can be read at rates of the order of tens of thousands of times per second, i.e. for each and every event that transits the system.

7.3 Other types of enrichment

Similarly to the Identity Resolution operation described above, other types of enrichment of the original raw input data item can occur, performed by the data enrichment module 20 of the central computer 12, by cross-referencing the information present in the input event with other data collected by the system. The method according to the present invention comprises, but is not limited to, the following enrichments of the original raw input data item:

- the IP addresses can be accompanied by information on the hostname associated with them;

- the IP addresses and the hostnames can be accompanied by information about the MAC address of the network device; and

- each of the information items (IP, hostname and MAC address) can be accompanied by an item of information on the corresponding asset.

By proceeding with the different enrichments, it is possible to associate the original raw input data item with the user who generated it, starting from an IP address, or from a MAC address, or from a hostname, from an identifier of the asset, or from a different combination of these items of information.

The central computer 12 always operates with the ultimate goal of performing the maximum possible correlation. Since some associations, unlike those deriving from Identity Governance with respect to the account, might be valid only for a limited period of time, the data enrichment module 20 verifies, before using each of these associations, whether it is still valid or can no longer be considered such.

In the positive case, the data enrichment module 20 uses the association for the enrichment of the raw input data item, completing it with as many "derivative" associations as it is capable of adding. In the negative case, i.e., when the association with other interesting values is not present or has become obsolete, enrichment of the raw input data item is not performed.

Together with the additional value, the data enrichment module 20 also assigns a value known as "confidence". This value represents the reliability of the information that has been added, i.e., to what extent the system has multiple "proofs" of the proposed association which increase its reliability.

It should be noted that the operations of normalization and enrichment of the raw input data item described above occur in real time.

The data enrichment module 20 of the central computer 12 during the enrichment steps can highlight interesting situations, which do not require further processing in order to become worthy of attention.

In some embodiments, events of this type, i.e., "attention-worthy" situations, can be: detection of activity from an account not registered in the IAM apparatus 38, detection of an IP address not assigned by DHCP (but present in the address space reserved for DHCP), detection of activity from a blocked account on the IAM apparatus 38, detection of activity from an account that is not assigned to any user on the IAM apparatus 38, or detection of a MAC address identifier associated with an IP address other than the assignment detected by DHCP.
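These detections can be sketched, purely as an illustration, as simple rule checks against the IAM copy and the DHCP leases; all names and data structures below are assumptions, not part of the described embodiment.

```python
# Hypothetical rule checks for the "attention-worthy" situations listed
# above. The dictionaries stand in for the IAM copy and the DHCP leases.
IAM_ACCOUNTS = {"j.doe": {"blocked": False, "user": "John Doe"},
                "old.admin": {"blocked": True, "user": None}}
DHCP_LEASES = {"10.0.0.5": "aa:bb:cc:dd:ee:ff"}  # ip -> assigned MAC

def attention_worthy(event):
    """Return the list of attention-worthy conditions raised by an event."""
    alerts = []
    acct = IAM_ACCOUNTS.get(event.get("username"))
    if event.get("username") and acct is None:
        alerts.append("account not registered in IAM")
    elif acct and acct["blocked"]:
        alerts.append("activity from a blocked account")
    elif acct and acct["user"] is None:
        alerts.append("account not assigned to any user")
    leased_mac = DHCP_LEASES.get(event.get("ip"))
    if event.get("mac") and leased_mac and event["mac"] != leased_mac:
        alerts.append("MAC address differs from the DHCP assignment")
    return alerts

print(attention_worthy({"username": "old.admin"}))
# ['activity from a blocked account']
```

In practice each non-empty result would feed the reactions described below (alert, workflow activation, account blocking).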

Other attention-worthy situations can be configurable using suitable rules.

The above detections can lead to different reactions, such as for example the generation of an alert (email, SMS, instant messaging, etc.), or to the activation of a workflow on an external system, or even the blocking of an involved account.

7.4 Markovian module

7.4.1 Purpose

The Markovian module 24 of the central computer 12 is configured to build a Markov chain, or rather a Markov transition matrix, adapted to track the transition from one emission to the subsequent emission, i.e., from one activity to the temporally subsequent activity, both activities being defined by enriched data and performed by a specific physical user on the target apparatuses 36. Clearly, depending on how many emissions are considered simultaneously, the matrix will have a different size.

The goal of the construction of the Markov transition matrix is to create an operating context within which it is possible to detect behaviors that are anomalous with respect to the previously observed behaviors, i.e., to detect new activities or sequences of activities that deviate significantly from a normality defined by activities or sequences of activities observed during an initial period known as a learning period.

7.4.2 Input data

The data used by the Markovian module 24 to build the Markov transition matrix are called emissions, which correspond to the enriched data stored in the enriched data storage means 22 of the central computer 12.

A company computer infrastructure (centralized in a single data center, distributed geographically in different locations, cloud-based or hybrid) comprises electronic entities, which can be physical or virtual, and which are mutually interconnected by means of one or more communication protocols. Authorized users interact with these entities by means of some specific entities, i.e., end-user devices. Typical examples of entities are servers (local, remote, physical or virtualized), databases, network and security apparatuses, workstations, smartphones, printers, IP cameras and any other device that can interface with the infrastructure. An emission is the individual information record concerning an entity, and it can be produced by the entity itself or by another of the entities connected to the infrastructure. An emission is characterized temporally by a precise date (typically with a precision in the order of seconds or milliseconds). An emission contains at least one reference to an entity as "source", and at least one "event" or one "action", i.e., a more or less rigorous description of the event, of the measurement or of the observation at the indicated time. Depending on the context, the entity referred to as "source" can be the subject that performs the described action or the object of the reported observation.

Typical examples of repositories that are present in company infrastructures and which contain the emissions of the entities that make up the infrastructure are: log files, system registries, audit tables, syslog collectors, WMI collectors and NetFlow collectors.

7.4.3 Operation

In particular, the operation of the Markovian module 24 is as follows. Assume that there is a certain number U of users who perform some activities in a precise time sequence. Assume that there are users indexed by u = 1 ... U, activities a = 1 ... A, and time windows f = 1 ... F. These activities can be any sequence of events, detections or measurements relating to the user. In general, they are connected to the emissions of the user and are considered in chronological order.

The Markov transition matrix describes a sequence, in chronological order, of activities performed by the users within each time window. It should be noted that the term "activities" is used generically: an activity might mean having observed a certain emission, having observed two emissions in a row, or having observed a particular, arbitrarily complex sequence of emissions with which one wishes to associate a specific activity.

The Markov transition matrix is organized in a tensor with three indices, A, the element A_uaf of which defines how many times the user u performs the activity a in the window f. Then, with the activities thus defined, one focuses on the frequencies of the activities.

The time windows f are Δt long (for example a day). The learning period during which the normal behaviors are defined is T = F·Δt long (for example one year).

The activity that one is interested in monitoring is every transition from one emission to the subsequent one. That is to say, given a sequence of events observed by looking at the emissions of a certain entity (for example a user) within a day, the Markov transition matrix is organized so that the tensor A contains the frequencies with which the users u, in each window f, transition from one emission to the subsequent emission, thus performing the activity a. If E is the total number of distinct emissions observed, then at most A can be equal to E², if all the ordered pairs of possible emissions are performed during the learning period.

During the learning period, the Markovian module 24 builds the tensor with three indices, A, the element A_uaf of which defines how many times the user u performs the activity a (i.e., transitions from a certain emission e to another emission e′) in the window indexed by f. These frequencies of activities will constitute the benchmark with respect to which any anomaly will be verified subsequently.
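The construction of the tensor during the learning period can be sketched as follows, with an "activity" defined as an ordered pair of consecutive emissions; the `build_transition_tensor` helper and its data layout are illustrative assumptions.

```python
# Sketch of the learning-period construction of the tensor A, where an
# activity a is the ordered pair (e, e') of consecutive emissions.
from collections import defaultdict

def build_transition_tensor(sessions):
    """sessions: {(user, window): [emission, emission, ...]} with the
    emissions in chronological order.
    Returns A such that A[user][(e, e_next)][window] is the count
    A_uaf of the transitions e -> e_next for that user and window."""
    A = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for (user, window), emissions in sessions.items():
        for e, e_next in zip(emissions, emissions[1:]):
            A[user][(e, e_next)][window] += 1
    return A

sessions = {("u1", 0): ["login", "read", "write", "read", "write"]}
A = build_transition_tensor(sessions)
print(A["u1"][("read", "write")][0])  # 2
```

The counts of these ordered pairs are exactly the frequencies that serve as the benchmark in the baseline computations below.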

7.5 Baseline module: comparison between user activity and normality

The baseline module 26 of the central computer 12 is configured to analyze the transitions between the activities performed by a specific physical user, or more generally by an entity, by means of known machine learning techniques, starting from the Markov transition matrix for each user u and for each time window f (for example a day). In particular, the baseline module 26 is configured to calculate a plurality of individual z-score values, one for each individual activity/entity pair (such as for example a user), and a plurality of collective z-score values, one for each individual activity/time window pair. In practice, the z-score values represent the probability that an activity or a sequence of activities of a specific physical user, or more generally of an entity, is a behavioral anomaly. Therefore, this provides a quantitative index on the series of emissions observed.

This learning approach enables the use, unlike traditional analyses, of a great deal of information that is useful to identify the user, particularly concerning the presence of repeated chronological patterns. In simple terms, instead of merely "learning" that the user usually performs the operations "foo", "bar" and "foobar" during a certain time window, the baseline module 26 learns, for example, that the user always performs "bar" once after performing "foo", and then performs "foobar" once after performing "bar", i.e., the user reproduces the pattern "foo", "bar", "foobar" in this precise order.

In doing so, an attacker who performed, again in the same time interval, "bar", "foo", "foobar" would appear in line with the baseline according to traditional approaches, while the present approach would be able to identify this behavior as anomalous.

To conclude, recording correctly the order with which the user performs the operations makes it possible to build a probabilistic model that is capable of quantifying the probability that a certain pattern might belong to the user or not. This is regardless of the probable presence of individual known elements. In fact, the same operations (for example, the only authorized ones) might be performed using unexplored or little-known patterns, highlighting an anomaly in which the operations do not change, but only their order of execution changes.

In particular, the operation of the baseline module 26 is as follows. Preliminarily, it should be noted that anomalous behaviors can be such with respect to a normality that can be defined in two ways: with respect to the past of the same user or with respect to the present of other users. Therefore, it is possible to use two distinct benchmarks: an individual benchmark, comparing the user with himself or herself in past windows, and a collective benchmark, comparing at a fixed time the user with other users.

In fact, although the use of clustering and anomaly detection algorithms can aid in purging false alarms, if we limited ourselves to taking into consideration only the "personal history" of the user in order to determine an anomalous behavior thereof, there would always be the risk of a "false positive". For this reason, the field of analysis is also extended horizontally, with respect to other users, as well as vertically, on the past history of the analyzed user. By using the data of users "similar" to the one being analyzed, in fact, the baseline module 26 is capable of calculating a z-score value. Based on this z-score value, a certain behavior of the user may be considered "normal" or "abnormal" with respect to the behavior of colleagues, in the same time window.

Let us start with the comparison between the user and his or her past, and thus let us set u = u*. In an embodiment, for each activity a, i.e., for each ordered pair of emissions (e, e′), it is possible to calculate the mean and the standard deviation of the frequencies of the activities observed, according to the following formulas, respectively:

μ_{u*a} = (1/F) · Σ_{f=1..F} A_{u*af}

σ_{u*a} = sqrt( (1/F) · Σ_{f=1..F} (A_{u*af} − μ_{u*a})² )

These two quantities express respectively the normality, i.e., on average how much the user u* has performed the activity a, and the unit of measurement of the distance from said normality that can be tolerated in order to consider a deviation from said behavior as a reasonable statistical fluctuation. In other embodiments, the deviation from normality can and must be quantified in a different manner according to the specific system being examined. It should be noted, for example, that in various apparatuses currently in use a certain event is defined as "rare" if it occurs and had not occurred before, or if it had occurred but with a low frequency with respect to a threshold preset by the user. The present invention, which in an embodiment provides for the use of the z-score, instead evaluates the relative frequency, i.e., the deviation with respect to the baseline. In particular, even an event that occurs many times can constitute and generate an anomaly, if its frequency deviates from the baseline.
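Assuming the z-score embodiment, the two quantities can be computed per user and activity over the F learning windows; the sketch below uses the population standard deviation, which is one possible reading of the formulas above.

```python
# Individual baseline for a fixed user u* and activity a: mean and standard
# deviation of the frequencies A_{u*af} observed over the F learning windows.
from statistics import mean, pstdev

def individual_baseline(freqs):
    """freqs: [A_{u*af} for f = 1..F]. Returns (mu, sigma), i.e., the
    normality and the tolerated unit of distance from it."""
    return mean(freqs), pstdev(freqs)

# Example: the activity was performed 2, 4, 2 and 4 times over four windows.
mu, sigma = individual_baseline([2, 4, 2, 4])
print(mu, sigma)  # 3 1.0
```

A frequency of 3 in a test window would thus sit exactly on the baseline, while a frequency of 9 would sit six standard deviations away.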

In a fully similar manner it is possible to perform the comparison between the user and other users, thus evaluating any anomalies not with respect to the past of the user u* but with respect to the behavior of the other users, thus fixing a certain window f*.

In an embodiment, by determining a mean no longer with a fixed user and varying the windows, but with a fixed window and varying the user, it is possible to obtain an analysis that is no longer of the individual space but of the collective space, in which a comparison is made not between the same user and his or her own past, but between users. The mean and the standard deviation of the frequencies of the activities observed are calculated according to the following formulas, respectively:

μ_{af*} = (1/U) · Σ_{u=1..U} A_{uaf*}

σ_{af*} = sqrt( (1/U) · Σ_{u=1..U} (A_{uaf*} − μ_{af*})² )

In this case also, in other embodiments, the deviation from normality can and must be quantified in a different manner according to the specific system being examined.

Now assume that a certain frequency f̂ for the activity a of the user u* in the time window f* is observed in the test period. Assuming a Gaussian distribution of the frequencies, we can calculate how improbable it is to observe f̂ with respect to the measurements of normality introduced previously. In an embodiment, the z-score can be calculated with respect to the individual space and with respect to the collective space, according to the following formulas, respectively:

z_individual = (f̂ − μ_{u*a}) / σ_{u*a}

z_collective = (f̂ − μ_{af*}) / σ_{af*}

Under the hypotheses of the Central Limit Theorem, the variables z tend to a Gaussian as F or U increases, respectively. For sufficiently large F or U, it is therefore possible to associate a probability with each z, and thus with each observed frequency f̂. This probability is a measurement of the anomaly of the individual frequency observed during the test phase.
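A minimal sketch of this step, under the Gaussian assumption stated above; the two-sided tail probability used here is one possible way of turning a z value into a probability.

```python
# Turning an observed frequency into a z-score and a tail probability,
# under the Gaussian assumption discussed above.
from math import erf, sqrt

def z_score(f_obs, mu, sigma):
    """Distance of the observed frequency from the baseline, in units
    of the tolerated statistical fluctuation."""
    return (f_obs - mu) / sigma

def tail_probability(z):
    """P(|Z| >= |z|) for a standard Gaussian: how improbable the
    observation is with respect to normality."""
    return 1.0 - erf(abs(z) / sqrt(2.0))

# Example with the individual baseline mu = 3, sigma = 1:
z = z_score(9.0, 3.0, 1.0)
print(z)                            # 6.0
print(tail_probability(z) < 1e-6)   # True: an extremely improbable frequency
```

Because the result is a probability rather than a rarity count, it needs no arbitrary threshold and can be composed with the z-scores of other analyses, as discussed below.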

It should be noted that the above can be generalized to sequences of emissions longer than two emissions, at the expense of increasing the dimension of the tensor along the dimension of the activities, i.e., increasing A.

7.6 Anomaly Detection Module

The log anomaly verification module 28 of the central computer 12 is configured to evaluate the presence of any behavioral anomaly of a user u, or more generally of an entity, with respect to the individual space, i.e., with respect to the history of past activities of the user u, on the basis of the individual z-score value calculated previously by the baseline module 26.

The peer anomaly verification module 30 of the central computer 12 is configured to evaluate i) the presence of any similar behaviors among similar users, called peers, by means of unsupervised machine learning techniques, i.e., clustering techniques, in order to determine any peers; and ii) the presence of any behavioral anomaly of a user u, or more generally of an entity, with respect to the collective space, i.e., with respect to the current activities of other, similar users, called peers, on the basis of the collective z-score value calculated previously by the baseline module 26.

In an embodiment, in the evaluation of the presence of an anomaly, the log anomaly verification module 28 and the peer anomaly verification module 30 use known clustering and anomaly detection techniques, which here can be used in order to highlight possible anomalies in these z-scores.
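The peer-grouping idea used by the peer anomaly verification module 30 can be illustrated, very roughly, by treating users with close activity-frequency vectors as peers; a real embodiment would use a proper clustering algorithm, so the distance threshold and the `find_peers` helper below are purely didactic.

```python
# Toy illustration of peer grouping: users whose activity-frequency
# vectors are close (Euclidean distance below a threshold) are treated
# as peers. The threshold is an arbitrary illustrative choice.
from math import dist

def find_peers(target, users, threshold=2.0):
    """users: {name: frequency_vector}. Returns the names of the users
    whose behavior is close enough to that of the target."""
    ref = users[target]
    return [name for name, vec in users.items()
            if name != target and dist(ref, vec) <= threshold]

users = {"u1": [5, 1, 0],   # mostly performs the first activity
         "u2": [5, 2, 0],   # very similar profile: a peer of u1
         "u3": [0, 9, 9]}   # completely different profile
print(find_peers("u1", users))  # ['u2']
```

Restricting the collective baseline to such a homogeneous group is what makes the comparison with the collective space meaningful.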

By way of example, let us suppose that the baseline module 26 has calculated, for the reference time window f*, and after having learned the behavior of a user u* over a sufficiently long previous period, a z-score value that is high enough in modulus to represent an anomaly. We assume, therefore, that the notification of a behavior that deviates in a significant manner from the behavior considered normal has been received from the log anomaly verification module 28. At this point it is possible to highlight and indicate that the user u* has performed one or more activities, within the window f*, that are such as to make his or her behavior anomalous with respect to his or her usual way of operating. A specific anomaly detection algorithm, the specific nature and provision of which depends on the system being examined, can be added to this kind of evaluation.

In some known methods, the degree of anomaly is quantified by counting the number of "rare" events, introducing the need for thresholds which are selected arbitrarily by the user or defined a priori. In the present invention, the use of z-score values or similar measures enables an objective probabilistic interpretation that offers two essential advantages: i) it removes the need to introduce arbitrary thresholds and ii) it enables the composition of multiple z-score values originating from different analyses.

7.7 Noise Reduction Module

Some known methods, with purposes similar to those of the present invention, assume that if there is more than one anomalous behavior with respect to a certain baseline, the indication of anomaly must be increased in any case. It is believed that this assumption has two problems: i) it can entail an increase in false positives, i.e., the assignment of high anomaly values to normal collective behaviors, such as for example a simple variation of the company workflow, and ii) it lacks the generality and adaptability to the system that are decisive if one wishes to apply the present invention to mutually different companies and work contexts.

Hereinafter it is shown how the present invention overcomes the above-cited problems relating to known solutions, by means of the noise reduction module 32 of the central computer 12. In practice, this noise reduction module 32 is configured to reduce the number of false positives on the basis of the assessment of peer behavior.

Suppose now that, for learning, the behavior of all users "similar" to U* is used, but only the behavior which has occurred within the window F*. We focus therefore on the behavior of the group within the window F* alone, evaluating the behavior of our user U* as a function of the behavior of the group to which he or she belongs, instead of by means of his or her past as done previously.

In this case also, the behavior of the user U* can be evaluated, but this time the baseline is constituted by the behavior of all the other users for the same window. These users have been chosen only from peers, if possible by means of a dedicated clustering algorithm, and therefore a certain homogeneity of behaviors is expected with respect to selecting all the users. The result of the anomaly detection algorithm thus indicates to what extent the behavior of the user U* is anomalous for the window F* with respect to what his or her peers have done in the same period.
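The peer-based evaluation can be sketched in the same terms as the individual one, with the baseline now formed by the peers' counts in the same window F*. The peer counts below are hypothetical; note that a user whose volume is extreme with respect to his or her own past may still sit well within the peer group, which is exactly the situation the noise reduction exploits.

```python
import statistics

def peer_z_score(peer_counts, user_count):
    """How far the user's activity in window F* sits from the peer group's, in peer std units."""
    mu = statistics.mean(peer_counts)
    sigma = statistics.stdev(peer_counts)
    return (user_count - mu) / sigma

# hypothetical activity counts of the peers of U* in the same window F*
peers = [12, 15, 13, 14, 16, 13]
z_peer = peer_z_score(peers, user_count=14)
print(round(z_peer, 2))
```

Here a count of 14, anomalous against an individual history of 4-6 events per window, is unremarkable among peers who performed 12-16, so the individual alarm would be mitigated.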

Since the user U*, in the example being considered, has already been indicated as anomalous with respect to his or her past, it is now possible to confirm or contradict this anomaly with respect to the group as well.

Should the user U* appear relatively anomalous for the group as well, this could be a confirmation that indeed the series of activities that led the user U* to be identified as anomalous with respect to his or her own past likewise has no match even in the group of his or her peers.

Therefore, we can exclude the involvement of a new system, a new procedure or an update of an existing company workflow that has suddenly modified the way in which some of the activities within the company are performed. If it were so, in fact, then this change would certainly have been found also in some other peer of the user U*, and therefore the user U* would not have been found to be anomalous also with respect to the group.

Conversely, should the user U* appear to be relatively aligned with the group, this might be a confirmation of the fact that there is a “disturbance” in the IT system, in the company procedures or in the operating methods that has had repercussions in carrying out work activities on the systems, causing widespread “baseline changes”. It is likewise improbable that multiple users have agreed to perform a fraudulent action all simultaneously.

In this respect, the output of the analysis on the collective baseline can have a noise reduction action with respect to what has been detected from the analysis on the individual baseline, particularly with a substantial reduction of false positives.

In an embodiment, the noise reduction module 32 can be activated by the user of the system 10 according to the invention. In another embodiment, the noise reduction module 32 is activated automatically on the basis of a set of evaluations of the characteristics of the system being examined. In particular, it is possible to consider the number of users and accounts, the number of different activities performed by those users, the time horizon of the training and test sets, and so forth. In an embodiment, it is also possible to consider a series of independent random samplings of the users and the clustering resulting from each one of these samplings. The results of these clusterings are then compared with each other. If the results are relatively stable across different samplings, it is possible to deduce a reasonable adequacy of the peer groups and the appropriateness of applying the noise reduction described previously. In an embodiment, the result of the analysis described above can be used to quantify a weight to be assigned to the noise reduction, which therefore will be applied to a greater or lesser extent on the basis of how well defined the peers corresponding to the user being examined are.
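The stability check across random samplings can be sketched as follows. The clustering rule, user names and counts are hypothetical stand-ins (the invention leaves the dedicated clustering algorithm open); the sketch clusters several independent samplings, measures pairwise agreement on the users shared by each pair of samplings with a simple Rand index, and uses the mean agreement as the noise-reduction weight.

```python
import random
from itertools import combinations

def rand_index(labels_a, labels_b, items):
    """Fraction of item pairs on which two clusterings agree (same cluster / different cluster)."""
    agree = total = 0
    for x, y in combinations(items, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b
        total += 1
    return agree / total

def cluster(users, activity_counts):
    """Toy stand-in for the dedicated clustering algorithm: bucket users by activity volume."""
    return {u: activity_counts[u] // 10 for u in users}

random.seed(0)
activity_counts = {f"user{i}": random.randint(0, 29) for i in range(40)}
all_users = list(activity_counts)

# several independent random samplings of the users, each clustered separately
samplings = [random.sample(all_users, 25) for _ in range(5)]
clusterings = [cluster(s, activity_counts) for s in samplings]

scores = []
for a, b in combinations(range(5), 2):
    shared = [u for u in samplings[a] if u in set(samplings[b])]
    scores.append(rand_index(clusterings[a], clusterings[b], shared))

weight = sum(scores) / len(scores)   # clustering stability used as the noise-reduction weight
print(round(weight, 2))
```

Because the toy rule depends only on the counts, it is perfectly stable here and yields a weight of 1; a real clustering algorithm would vary across samplings, producing an intermediate weight that scales how strongly the noise reduction is applied.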

8 Detailed description of the method according to the invention

With reference to Figure 2, the operation of an embodiment of the system for the creation and verification of behavioral baselines, i.e., an embodiment of the method for the creation and verification of behavioral baselines, according to the invention, is described hereinafter.

Initially, in step 70, the raw event collection module 16 of the central computer 12 collects and normalizes the raw data on the events recorded by the target apparatuses 36 connected to the central computer 12, considering them as emissions of those same target apparatuses 36.

In step 72, preferably simultaneously with step 70, the IAM state collection module 18 of the central computer 12 collects data on the current state recorded by the IAM apparatus 38, which manages the accounts of the users and the authorizations of those accounts. In doing so, an updated mapping of all the accounts of the users on the target apparatuses 36 is obtained from the IAM apparatus 38.

In step 74, the data enrichment module 20 of the central computer 12 cross-references the raw data on the events collected in step 70 and the data on the IAM state collected in step 72, identifying and grouping the raw events associated with a specific user, or more generally with an entity, independently of the number of accounts of which that user is the owner. The raw events always contain information on the user that generated the emission. Depending on the target apparatus 36, this information on the "author" of the emission can comprise a source IP address, a hostname, a username, or a combination of these information items. In particular, the username can always be traced back to a user by virtue of the always-updated mapping drawn from the IAM apparatus 38.

By integrating all these pieces of information according to step 74, the data enrichment module 20 traces back to the user who is operating using a certain username, who is using a certain IP address, or who is traced through a certain hostname.
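The enrichment of step 74 can be sketched as a lookup against the IAM-derived mapping. All account names, addresses and hostnames below are hypothetical; the point is that two different accounts of the same physical user resolve to that one user, whichever "author" field the raw event carries.

```python
# Hypothetical IAM-derived mappings: accounts, source IPs and hostnames traced to physical users
iam_accounts = {"jdoe_admin": "John Doe", "jdoe": "John Doe", "asmith": "Alice Smith"}
ip_to_user = {"10.0.0.5": "John Doe"}
host_to_user = {"ws-042": "Alice Smith"}

def enrich(event):
    """Attach the physical user to a raw event using whichever author field is present."""
    user = (iam_accounts.get(event.get("username"))
            or ip_to_user.get(event.get("src_ip"))
            or host_to_user.get(event.get("hostname")))
    return {**event, "user": user}

raw_events = [
    {"username": "jdoe_admin", "action": "login"},
    {"src_ip": "10.0.0.5", "action": "file_read"},
    {"hostname": "ws-042", "action": "logout"},
]
enriched = [enrich(e) for e in raw_events]
print([e["user"] for e in enriched])
```

Both the `jdoe_admin` login and the event from IP `10.0.0.5` are attributed to the same user, so his or her activities can be analyzed as a single behavioral stream regardless of the account used.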

In step 76, the Markovian module 24 of the central computer 12 builds a Markov chain, or rather a Markov transition matrix, adapted to track the transition from one emission to the subsequent emission, i.e., from one activity to the temporally subsequent activity, both activities being defined by the enriched data and performed by a specific physical user on the target apparatuses 36. Clearly, depending on how many emissions are considered simultaneously, the matrix will have a different size.

The data that the Markovian module 24 uses to build the Markov transition matrix are called emissions, which correspond to the enriched data stored in the enriched data storage means 22 of the central computer 12.
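The construction of step 76 can be sketched as follows, with a hypothetical emission sequence: transitions between temporally consecutive emissions are counted and each row is normalized to obtain the transition probabilities of the Markov transition matrix.

```python
from collections import defaultdict

def transition_matrix(emissions):
    """Build a row-normalized Markov transition matrix from a user's emission sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(emissions, emissions[1:]):
        counts[prev][nxt] += 1
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        matrix[prev] = {nxt: c / total for nxt, c in row.items()}
    return matrix

# hypothetical enriched emissions of one user, in temporal order
emissions = ["login", "read", "read", "write", "logout", "login", "read", "logout"]
m = transition_matrix(emissions)
print(m["login"], m["read"])
```

Each row sums to one, so each entry is the observed probability of moving from one activity to the temporally subsequent activity; considering longer tuples of emissions simultaneously would enlarge the matrix accordingly, as noted above.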

The operation of the Markovian module 24 has already been described in the present description, and reference is made to the corresponding paragraphs for greater detail.

In step 78, the baseline module 26 analyzes the transitions between the activities performed by a specific physical user, or more generally by an entity, by means of known machine learning techniques, starting from the Markov transition matrix for each user u and for each time window (for example, a day). In particular, the baseline module 26 is configured to calculate a plurality of individual z-score values, one for each individual activity/entity pair (such as for example a user), and a plurality of collective z-score values, one for each individual activity/time window pair. In practice, the z-score values represent the probability that an activity or a sequence of activities of a specific physical user, or more generally of an entity, is a behavioral anomaly. This therefore provides a quantitative index of the series of emissions observed.
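One way step 78 can be sketched is to score each window's transitions under the user's learned transition matrix and standardize against past windows. The matrix, the per-window history and the probability floor for unseen transitions are all hypothetical illustrations, not the invention's exact formulation.

```python
import math
import statistics

def window_log_likelihood(matrix, emissions, floor=1e-6):
    """Mean log-probability of a window's transitions under a user's transition matrix."""
    logs = []
    for prev, nxt in zip(emissions, emissions[1:]):
        p = matrix.get(prev, {}).get(nxt, floor)   # floor for never-observed transitions
        logs.append(math.log(p))
    return sum(logs) / len(logs)

# hypothetical matrix learned for user u, and mean log-likelihoods of past training windows
matrix = {"login": {"read": 0.9, "write": 0.1},
          "read": {"read": 0.5, "write": 0.3, "logout": 0.2},
          "write": {"logout": 1.0}}
history = [-0.7, -0.8, -0.6, -0.75, -0.65]

current = window_log_likelihood(matrix, ["login", "write", "logout"])
z = (current - statistics.mean(history)) / statistics.stdev(history)
print(round(current, 2), round(z, 2))
```

A strongly negative z-score indicates a window whose sequence of transitions is much less probable than usual for this user, i.e., a candidate behavioral anomaly.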

The operation of the baseline module 26 has already been described in the present description, and reference is made to the corresponding paragraphs for greater detail.

In step 80, the log anomaly verification module 28 evaluates the presence of any behavioral anomaly of a user u, or more generally of an entity, with respect to the individual space, i.e., with respect to the history of past activities of the user u, on the basis of the individual z-score value previously calculated in step 78.

The individual z-score value is used in order to determine whether the behavior of the user u is to be considered anomalous with respect to the history of past activities of the user u.

In step 82, the peer anomaly verification module 30 evaluates the presence of any behavioral anomaly of a user u, or more generally of an entity, with respect to the collective space, i.e., with respect to the current activities of other, similar users u, called peers, on the basis of the collective z-score value previously calculated in step 78.

The collective z-score value is used to determine whether the behavior of the user u is to be considered anomalous with respect to the current activities of the peers.

In an embodiment, in the evaluation of the presence of an anomaly, the log anomaly verification module 28 and the peer anomaly verification module 30 use known clustering and anomaly detection techniques, which highlight the anomalies in these z-scores.

A user, or more generally an entity, might or might not be anomalous with respect to his or her past behaviors, downstream of the analysis according to step 78. In the event of an anomaly, an anomaly index would also be associated with the user. Furthermore, the user might be the only anomalous user in the time window considered, or there might be others in the group of his or her peers (users similar to him or her). This generates four possible outcomes, depending on how the results of the evaluation according to steps 80 and 82 are combined.

In the first outcome 84, the activity of the user is aligned with respect to his or her past and also with respect to the peers (also not anomalous) in the current window. Therefore, in this case, the activity appears absolutely normal and in agreement with the expected result.

In the second outcome 85, the activity of the user is aligned with respect to his or her past but is anomalous with respect to the peers (which however are anomalous) in the current window. Therefore, in this case, many users have modified their usual behavior, but it is improbable that many users are simultaneously performing a fraudulent activity. It is more probable, instead, that a change of procedures or an application update has caused a widespread modification of the functionality in the use of the target apparatuses 36, or of a specific target apparatus 36. The selected user, who is the only one that is not anomalous, probably has not adjusted to the change imposed on the others. It is possible to use this information in order to improve the active and prompt participation of users in procedure changes and thus improve security.

In the third outcome 86, the activity of the user is anomalous with respect to his or her past, but is aligned with respect to the peers (also anomalous) in the current window. Therefore, in this case, many users have modified their own usual behavior. However, reporting the user anomaly might lead to a false positive, since it is improbable that many similar users are performing a fraudulent activity simultaneously. It is more probable that, instead, a change of procedures or an application update has caused a widespread modification of the functionality in the use of the target apparatuses 36, or of a specific target apparatus 36. With the third outcome 86, the user anomaly is considered mitigated by virtue of a noise reduction.

Finally, in the fourth outcome 87, the activity of the user is anomalous with respect to his or her past and also with respect to his or her own peers (which instead are not anomalous) in the current window. Therefore, the user observed has performed anomalous activities with respect to those he or she performs usually, while the users similar to him or her have not modified their own activity. This is the scenario that most probably deserves being investigated and reported, since it might be a fraudulent behavior or otherwise dangerous for the IT system.
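The combination of the two evaluations into the four outcomes 84-87 can be sketched as a simple decision on two flags: whether the user is anomalous with respect to his or her own past (steps 78-80) and whether his or her peers are anomalous in the same window (step 82). The short labels are paraphrases of the outcomes described above, not the invention's wording.

```python
def combine(user_anomalous, peers_anomalous):
    """Map the two evaluations onto the four outcomes 84-87 described above."""
    if not user_anomalous:
        # outcome 84: fully normal; outcome 85: user alone did not follow a widespread change
        return "84: normal" if not peers_anomalous else "85: user missed a widespread change"
    # outcome 86: individual anomaly shared by peers, hence mitigated; outcome 87: user alone
    return "86: mitigated by noise reduction" if peers_anomalous else "87: investigate and report"

print(combine(True, False))
```

Only the fourth combination, a user anomalous with respect to both his or her past and his or her non-anomalous peers, is escalated; the third is the noise-reduction case.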

9 Conclusions

In practice it has been found that the invention fully achieves the intended aim and objects. In particular, it has been shown that the system and the method for the creation and verification of behavioral baselines thus conceived make it possible to overcome the qualitative limitations of the background art, since they make it possible to create and verify behavioral baselines based on activities or sequences of activities performed by the entities (comprising both users and apparatuses) that operate within a company network, in order to detect any behavioral deviations that might represent dangerous activities and IT threats.

An advantage of the system and method for the creation and verification of behavioral baselines according to the present invention consists in that they make possible a dynamic, adaptive and proactive method for detecting IT threats in order to combat external and internal threats that are constantly evolving and unknown a priori.

Another advantage of the system and method for the creation and verification of behavioral baselines according to the present invention consists in that they make possible a method for the detection of behavioral anomalies that is not guided by preset signatures or policies, but which adapts to the behavior of the entities and to the use of the company information systems.

Another advantage of the system and the method for the creation and verification of behavioral baselines according to the present invention consists in that they minimize false positives, thus increasing the effectiveness of the actions in response to an IT attack or threat.

Another advantage of the system and method for the creation and verification of behavioral baselines according to the present invention consists in that they do not merely make correlations on the activities of the accounts, but they also make correlations on the activities of the individual physical user, regardless of the number of accounts assigned to him or her.

Although the system and method for the creation and verification of behavioral baselines according to the present invention have been conceived particularly to increase the IT security of company or infrastructural networks and systems, they can be used in any case, more generally, in order to increase the IT security of systems and networks of any entity of medium or large size.

The invention thus conceived is susceptible of numerous modifications and variations, all of which are within the scope of the appended claims. All the details may furthermore be replaced with other technically equivalent elements.

In practice, the materials used, so long as they are compatible with the specific use, as well as the contingent shapes and dimensions, may be any according to the requirements and the state of the art.

To conclude, the scope of the protection of the claims must not be limited by the illustrations or by the preferred embodiments illustrated in the description as examples, but rather the claims must comprise all the characteristics of patentable novelty that reside in the present invention, including all the characteristics that would be treated as equivalents by the person skilled in the art. The disclosures in Italian Patent Application no. 102018000005412, from which this application claims priority, are incorporated herein by reference.

Where technical features mentioned in any claim are followed by reference signs, those reference signs have been included for the sole purpose of increasing the intelligibility of the claims and accordingly such reference signs do not have any limiting effect on the interpretation of each element identified by way of example by such reference signs.