Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
DETECTION OF MALICIOUS BEHAVIOUR OF A COMPUTER PROGRAM
Document Type and Number:
WIPO Patent Application WO/2020/145885
Kind Code:
A1
Abstract:
Method for determining real-time malicious behavior of a computer program, such as on Android systems. A first sequence of APIs from a total sequence of intercepted APIs generated by the computer program are saved and converted into vector representation and comprise inputs, together with statistical information about API's in the first sequence and APIs in the total sequence, for determining whether the behavior of the computer program constitutes abnormal behavior of the computer program.Determining uses pre-trained dataset and model in various types of machine learning.

Inventors:
LI YINGJIU (SG)
WANG DAIBIN (SG)
Application Number:
PCT/SG2019/050016
Publication Date:
July 16, 2020
Filing Date:
January 10, 2019
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HUAWEI INT PTE LTD (SG)
SINGAPORE MANAGEMENT UNIV (SG)
International Classes:
G06F21/56; G06F21/52; G06F21/55
Foreign References:
US20160057159A12016-02-25
Other References:
V CHANDOLA ET AL: "Anomaly Detection for Discrete Sequences: A Survey", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1 January 2012 (2012-01-01), New York, pages 823 - 839, XP055170010, Retrieved from the Internet DOI: 10.1109/TKDE.2010.235
C. LUEG., 8,400 NEW ANDROID MALWARE SAMPLES EVERY DAY, vol. 8, 2017, pages 400
D. ARP; H. GASCON; M. HIIBNER; K. RIECK; M. SPREITZENBARTH: "Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS'14", February 2014, INTERNET SOCIETY, article "Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket"
N. MCLAUGHLIN; J. M. DEL RINCON; B. KANG; S. YERIMA; P. MILLER; S. SEZER; Y. SAFAEI; E. TRICKEL; Z. ZHAO; A. DOUPE: "Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy (CODASPY '17", article "Deep Android Malware Detection", pages: 301 - 308
Attorney, Agent or Firm:
ALLEN & GLEDHILL LLP (SG)
Download PDF:
Claims:
CLAIMS

1. A method of monitoring a computer program, comprising

- Obtaining, at time tO, a second number N2 of monitoring units invoked by the computer program from a first number N1 of monitoring units invoked by the computer program, the second number N2 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units, and a monitoring unit corresponds to an action performed by the computer program;

- Obtaining statistical information corresponding to the first number N 1 of monitoring units;

- Converting monitoring units of the second number N2 of monitoring units into a vector representation of the second number N2 of monitoring units, so that monitoring units of the second number N2 of monitoring units have a corresponding respective vector representation; and

- Determining, according to the vector representation of the second number N2 of monitoring units and according to the statistical information, whether a behavior of the computer program constitutes abnormal behavior of the computer program.

2. The method of monitoring a computer program according to claim 1, wherein obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program further comprises the monitoring units in the second number N2 and / or first number N1 of monitoring units are arranged in a sequence of chronological order, such as a chronological order of invoking by the computer program. 3. The method of monitoring a computer program according to claim 1 or 2, wherein the monitoring units in the second number N2 are monitoring units which have been more recently obtained compared to monitoring units in the first number. 4. The method of monitoring a computer program according to any of claims 1-3, wherein the obtaining, at time tO, the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

obtaining the second number N2 of monitoring units such that an end monitoring unit in the second number N2 of monitoring units is, out of a set comprising the first number N1 of monitoring units, a most recent monitoring unit, and the other end monitoring unit in the second number N2 of monitoring units is a more recent monitoring unit compared to the monitoring units in the first number N1 of monitoring units.

5. The method of monitoring a computer program according to any of claims 1-4, wherein the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises converting into the statistical information used in the determining step by representing the obtained statistical information as a vector representation of the statistical information.

6. The method of monitoring a computer program according to any of claims 1-5, wherein the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises obtaining the second number N2 of monitoring units from the first number N1 of monitoring units using a sliding window of preconfigured size, such that at any given time t, monitoring units in the second number N2 of monitoring units comprise the most recent monitoring units from a chronological sequence in the first number N1 of monitoring units.

7. The method of monitoring a computer program according to any of claims 1-6, comprising

obtaining the preconfigured size of the sliding window by

subtracting an index value of the sliding window from the quantity of the first number N1 of monitoring units, wherein the index value corresponds to the earliest monitoring unit in the second number N2 of monitoring units.

8. The method of monitoring a computer program according to any of claims 1-7, wherein the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

storing the second number of N2 of monitoring units. 9. The method of monitoring a computer program according to any of claims 1-8, wherein the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

not saving monitoring units which are in the first number N1 of monitoring units but not in the second number N2 of monitoring units.

10. The method of monitoring a computer program according to any of claims 1-9, wherein the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a rate of occurrence of a monitoring unit based on a number of times of occurrence of the said monitoring unit in the second number N2 of monitoring units relative to a number of times of occurrence of the said monitoring unit in the first number N1 of monitoring units. 11. The method of monitoring a computer program according to any of claims 1-9, wherein the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a count value of monitoring unit.

12. The method of monitoring a computer program according to any of claims 1-9, wherein the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a rate bias of a monitoring unit in the second number N2 of monitoring units compared to a mean rate of monitoring unit in the sliding window.

13. The method of monitoring a computer program according to any of claims 1-12, wherein the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program is further according to information about one or more sets of fourth numbers N4 of monitoring units invoked at time t-1, where t-1 is earlier than time tO, the size of each set of fourth numbers N4 is equal to a size of the second number N2 of monitoring units.

14. The method of monitoring a computer program according to any of claims 1-13 wherein the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program is further according to stored statistical information about monitoring units.

15. The method of monitoring a computer program according to any of claims 1-14, wherein the monitoring unit is an Application Programming Interface, API, or a system call.

16. The method of monitoring a computer program according to any of claims 1-15, wherein the converting into vector representation of the monitoring units and/or statistical information comprises converting into one-hot representation or converting based on a word-embedding technique.

17. The method of monitoring a computer program according to any of claims 1-16, wherein the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program comprises:

- inputting the vector representation of the second number N2 of monitoring units and the statistical information into a determining module which has been pre-trained, and

- determining, by the determining module, whether the behavior of the computer program constitutes abnormal behavior of the computer program.

18. The method of monitoring a computer program according to any of claims 1-17, wherein the method further comprises: - Obtaining, at time tl which is greater than the time tO, a third number N3 of monitoring units invoked by the computer program from the first number N 1 of monitoring units invoked by the computer program, wherein the first number N1 of monitoring units includes the second number N2 of monitoring units, and the third number N3 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units; - Converting each of the third number N3 of monitoring units into a vector representation of the third number N3 of monitoring units, so that each monitoring unit of the third number N3 of monitoring units has a corresponding respective vector representation;

- Obtaining statistical information corresponding to the first number N 1 and third number N3 of monitoring units; and

- Determining, according to the vector representation of the third number N3 of monitoring units and according to the statistical information, whether a behavior of the computer program constitutes abnormal behavior of the computer program.

19. An apparatus for monitoring a computer program based on monitoring units invoked by the computer program, comprising

- a statistical information obtaining module, which is configured to obtain statistical information corresponding to a first number N1 of monitoring units;

- a vector representation conversion unit, which is configured to convert monitoring units in a second number N2 of monitoring units into corresponding respective vector representations, wherein the second number

N2 of monitoring units are a subset of a set comprising the first number N 1 of monitoring units invoked by the computer program, the second number N2 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units; and - a data storage unit, which is configured to store the statistical information obtained by the statistical information obtaining module.

20. The apparatus according to claim 19, further comprising

an obtaining module which is configured to obtain, the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program,

wherein the obtaining module is configured to select the second number N2 of monitoring units from the first number N1 of monitoring units by using a sliding window of pre-configured size, which is configured to select the second number N2 of monitoring units to be more recent compared to other monitoring units in the set of the first number N1 of monitoring units.

21. The apparatus according to claim 19 or 20, further comprising

a detection module which is configured to, according to inputs comprising the vector representation of the second number N2 of monitoring units and the statistical information, determine whether a behavior of the computer program constitutes abnormal behavior of the computer program.

22. The apparatus according to claim 19 or 20 or 21, further comprising a notification module, which is configured, when the computer behavior is determined to not be normal behavior, to transmit a notification message concerning said not normal behavior.

Description:
DESCRIPTION

DETECTION OF MALICIOUS BEHAVIOUR OF A COMPUTER PROGRAM

Technical Field

The present disclosure relates to computer science, and in particular to real-time detection of whether a computer program’s behavior is abnormal. Decisions are based on inputs which comprise: 1. a recent subset of system-type calls taken from a larger set of intercepted system-type calls of the computer program, and 2. statistical information about the system-type calls in both the subset and the larger set. A pre-trained determining module uses machine learning to determine whether the inputs indicate the behavior is abnormal. For reducing computational and storage requirements, such as on mobile devices, inputs are converted into vector representation and only a recent subset is saved, with comparatively older information expressed by the statistical information.

Background Art

According to a security firm G Data’s report (C. Lueg.“8,400 new Android malware samples every day”) in 2017, 8,400 Android malwares are discovered every day, i.e., a new one is found every 10 seconds. In order to combat the evolving Android malware, in recent years, machine learning-based app-vet systems have been developed for automatic Android malware detection. While these static- or dynamic-analysis systems deployed on market-side or device-side can detect most of malware, they suffer from some limitations, e.g., code-obfuscation technology, anti-emulator technology.

The Drebin system, published by D. Arp, H. Gascon, M. Hiibner, K. Rieck, and M. Spreitzenbarth as“Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket” in Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS’14), San Diego, California, USA, February, Internet Society, 2014, is a lightweight system designed to detect Android malware directly on the smartphone at install time. Drebin performs a static analysis on Android app and collects 8 feature sets from both the manifest file and disassembled code. These features are mapped into 545,000-dimension vector space using one-hot vector representation. Then, a linear Support Vector Machine model is trained based on malware class and benign class and it is deployed on the smartphone for detecting malicious applications.

In N. McLaughlin, J. M. del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupe, and G. J. Ahn. “Deep Android Malware Detection” published in: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy (CODASPY Ί7), New York, NY, USA, 301-308, a deep android malware detection system is disclosed. A deep convolutional neural network (CNN)-based Android malware detection which takes the disassembled opcode sequence as input. These disassembled opcodes are encoded as one-hot 218-dimension vectors.

The above two systems focus on static malware analysis of Android applications. Therefore, they suffer from the shortcomings of static analysis, e.g., hard to defeat code-obfuscation techniques, native code. The inventors are aware that third party learning-based real-time detection methods exist, but public information on these is limited. Further, the inventors’ analysis has revealed that such methods suffer from performing limited statistical analysis on monitoring units, whilst having comparatively high storage and computational requirements arising from the size of sequences of monitoring units.

Summary

Embodiments provide a method for determining real-time malicious behavior of a computer program, such as in Android systems. A first sequence of system-type calls, such as APIs, from a larger (such as a total) sequence of intercepted system-type calls generated by the computer program are saved and converted into vector representation. The vector representations comprise inputs, together with statistical information about API’s in the first sequence and APIs in the larger sequence, for determining whether the behavior of the computer program constitutes abnormal behavior of the computer program. In embodiments, the determining uses a pre-trained dataset and model in various types of machine learning. Other embodiments provide an apparatus and computer program for implementing the methods of the present disclosure.

To achieve the foregoing objective, the following technical solutions are used in the embodiments.

The first aspect of an embodiment provides a method of monitoring a computer program, comprising

- Obtaining, at time tO, a second number N2 of monitoring units invoked by the computer program from a first number N1 of monitoring units invoked by the computer program, the second number N2 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units, and a monitoring unit corresponds to an action performed by the computer program;

- Obtaining statistical information corresponding to the first number N 1 of monitoring units;

- Converting the second number N2 of monitoring units into a vector representation of the second number N2 of monitoring units, so that monitoring units of the second number N2 of monitoring units have a corresponding respective vector representation; and

- Determining, according to the vector representation of the second number N2 of monitoring units and according to the statistical information, whether a behavior of the computer program constitutes abnormal behavior of the computer program.

The method may be employed in any computing environment in which there is a possibility of malware such as from third party applications. Preferably, the method is employed in a computing environment which uses an operating system such as the Android operating system.

As used in the present disclosure,“preferably” will be understood to indicate a possible but non-limiting implementation form, but not necessarily the only implementation, i.e. other implementations may be possible.

Further, it will be understood that although the present disclosure relates primarily to the Android system, it will not be limited to that.

Further, the term“and/or” will be construed as representing three possibilities, namely either of the terms joined by“and/or” by themselves, or both together.

Preferably, the method is employed in an end-user device, such as a portable device e.g. a terminal or phone, which has comparatively limited processing capability and memory.

The method may be part of a detection system.

The computer program may be pre-installed or installed later. The computer program may be a third party application, including hacked or malware versions of apparently official computer programs. It is well known that hacked or malware versions may behave in a detrimental way, such as causing instability, data loss or theft or stealing processing cycles.

The computer program is preferably running at the time of obtaining the monitoring units, such that the method is performing real-time monitoring of the computer program based on actions emanating from the computer program.

Obtaining a monitoring unit may comprise intercepting or recording or temporarily saving a monitoring unit. For example, sensitive APIs (e.g., sendSMS()) can be hooked and thus intercepted by modifying the OS framework. It may not be possible to obtain all monitoring units invoked by the computer program. A monitoring unit may comprise a system call from the kernel level, or an Application Programming Interface from the Android framework level. These allow the computer program to access system resources and services. Thus each monitoring unit corresponds to an action of the computer program when running. A monitoring unit may be invoked or generated by the computer program whilst it is running. Typically when a computer program is running, it generates a sequence of monitoring units which, as the computer program continues to run, is incremented (i.e. the sequence increases in length or size) over time. There may be millions of monitoring units invoked by a computer program.

The first number N1 of monitoring units may comprise any number of monitoring units. It may comprise the total number of monitoring units invoked by the computer program whilst running this time. For example, if at time t the monitoring units invoked by the computer program comprise API1, API2, API3 and API4, then the first number N1 of monitoring units is 4. The first number N1 of monitoring units preferably starts with a first monitoring unit invoked by the computer program. Preferably, the first number N1 of monitoring units are obtained and arranged in a chronological sequence corresponding to their order of invoking by the computer program. For example, where the monitoring unit is APIs, the total sequence may be API1, API2, API3, API4 at time t, arranged in sequential order.

The second number N2 of monitoring units may comprise any number of monitoring units that is less than the first number N1 of monitoring units. The second number N2 of monitoring units may comprise a subset, preferably in chronological order, taken from the first number N1 of monitoring units. The second number N2 of monitoring units is preconfigured (fixed), such as according to operating system requirements and capabilities. The preconfigured size is chosen as a balance between precision of detection and performance cost. The larger the preconfigured size, the more precise the detection but at more storage cost. In an example, where the monitoring unit is APIs, the second number N2 of monitoring units may be API3 and API4 at time t, arranged in sequential order of invoking. Thus the second number N2 of monitoring units may represent comparatively more current API sequence which is a subset of the first number N 1.

Statistical information corresponding to the first number N1 of monitoring units may preferably relate to all, or optionally some of the first number N1 of monitoring units. For example, the sampling may be performed on the first number N1 of monitoring units and the statistical information may correspond to that. The statistical information may correspond to monitoring units remaining in the first number N1 of monitoring units which are not in the second number N2 of monitoring units. The statistical information corresponding to the first number N1 of monitoring units may also correspond to the monitoring units in the second number N2 of monitoring units and their interrelation. The statistical information may relate to frequency of occurrence (rate), count, or bias rate of each API in the second number N2 of monitoring units compared to the first number N1 of monitoring units. The statistical information may relate to a total API sequence, which after conversion into vector representation can be seen as a compressed vector representing the total API sequence.

By taking account in the statistical information of monitoring units which are not in the second number N2 of the monitoring units, detection precision is improved. The skilled reader will be aware of several ways of measuring detection precision for detecting malware, including accuracy, recall, precision, FI -score which can be used to measure the detection. Preferably, in the present disclosure, accuracy is used to measure the binary classification. A formula for quantifying binary accuracy is Accuracy = (TP + TN)/ (TP+TN+FP+FN), where TP=True positive, FP=False positive, TN=True negative, FN=False negative.

By not saving or including a full representation of each of the monitoring units which are not in the second number N2 of the monitoring units in the determining, computational requirements are reduced and speed is increased.

Preferably, each monitoring unit of the second number N2 may be converted into a corresponding respective vector representation for convenient analysis or input into machine learning models. The vector representation may be a numerical vector representation. Monitoring units of the first number N1 may also be converted into corresponding respective vector representations, but may not be individually stored.

Whether a behavior of the computer program is abnormal or normal is determined according to the inputs of the vector representations of the second number N2 of monitoring units and according to the statistical information. Preferably this is performed by a detection model, such as Support Vector Machine, Random Forrest and Deep Neural Network. The detection model may have been pre-trained from a suitable dataset.

Depending on the determination result, the method may take further actions. For example, when the determination result is that the behavior is abnormal, then the method may inform the operating system or end user, delete or quarantine the computer program. By performing real-time detection and enabling subsequent actions, damage to the device’s resources or data on the device may be minimized. If the behaviour is not determined to be abnormal, the method may continue to perform monitoring by repeating the steps of the method, such as at predetermined time intervals, e.g. every 2 minutes. The order of the various steps of the method is not fixed but may be performed in any order, including that shown. In an example, the order of obtaining statistical information corresponding to the first number N1 of monitoring units; and converting the second number N2 of monitoring units into a vector representation may be interchanged with respect to each other.

In a possible design, the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program further comprises

the monitoring units in the second number N2 and / or first number N1 of monitoring units are arranged in a sequence of chronological order, for example such as a chronological order of invoking by the computer program.

In a possible design, the monitoring units in the second number N2 are monitoring units which have been more recently obtained compared to monitoring units in the first number.

Whether a monitoring unit is recently obtained may be determined with reference to an invoking time of that monitoring unit.

By only including the monitoring units which are more recent in the second number N2 of monitoring units, storage and processing requirements are comparatively lower. For example, where the monitoring unit is APIs, the total sequence may comprise 4 API, being API1, API2, API3, API4 at time t, arranged in sequential order. API4 is more recent than API3, API3 is more recent than API2 and API2 is more recent than API1. If, for example, the second number N2 of monitoring units is preconfigured to be 2, which is smaller than 4, then the second number N2 may comprise API3 and API4. At time t+1, when for example another most recent API, API5, has been generated, the total sequence is API1, API2, API3, API4, API5 and the second number N2 is now API4, API5.

In a possible design the obtaining, at time tO, a second number N2 of monitoring units invoked by the computer program from a first number N1 of monitoring units invoked by the computer program comprises obtaining the second number N2 of monitoring units such that an end monitoring unit in the second number N2 of monitoring units is, out of a set comprising the first number N1 of monitoring units, a most recent monitoring unit, and the other end monitoring unit in the second number N2 of monitoring units is a more recent monitoring unit compared to the monitoring units in the first number N1 of monitoring units.

In a possible design, the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

converting into the statistical information used in the determining step by representing the obtained statistical information as a vector representation of the statistical information.

The statistical information used in the determining step is preferably in vector representation format for convenient processing in the machine learning, especially when relating to large numbers of monitoring units. Preferably, for simplicity of computation, the vector representation of the statistical information is combined with vector representations of the second number N2 of monitoring units prior to determining. In a possible design, the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

obtaining the second number N2 of monitoring units from the first number N1 of monitoring units using a sliding window of preconfigured size, such that at any given time t, monitoring units in the second number N2 of monitoring units comprise the most recent monitoring units from a chronological sequence in the first number N1 of monitoring units.

Each computer program may have a respective corresponding preconfigured size of sliding window. The sliding window is a sampling tool of fixed size when the computer program is running. The sliding window size may be adjusted but not when the computer program is running. The sliding window moves over or filters an input sequence of monitoring units, such that the most recently added monitoring unit to the second number N2 of monitoring units is preferably the latest (i.e. most recent) monitoring unit in the input sequence. The last sequential unit in or covered by the sliding window is the oldest covered by the sliding window, and when the sliding window advances by one monitoring unit to the most recent unit, said last sequential unit will be removed from the second number N2 of monitoring units.

In a possible design, the method of monitoring a computer program comprises obtaining the preconfigured size of the sliding window by subtracting an index value of the sliding window from the quantity of the first number N1 of monitoring units, wherein the index value corresponds to the earliest monitoring unit in the second number N2 of monitoring units.

At a stage of API sampling using the sliding window, fix-size sliding window may be used such that at runtime, only the APIs in the sliding window are required to be stored n be the size of API sequence and m be the index of sliding window’s header.

The API sequence can be represented as S = {xi , i e [1, n] }, where xi is an API intercepted by detection system. Preferably the size of set S is greater than 1 and may include several million corresponding to APIs at run time. The sliding window can be represented as W = {x j , j e [m, n] }, where x j is an API belong to S. Thus, the size of sliding window is n-m and the size is preconfigured. The size may be fixed when the OS is designed, and is empirically determined. In the present disclosure, the size ranges of sliding windows is preferably [200, 500].

In a possible design the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

storing the second number of N2 of monitoring units.

This saves storage space.

In a possible design, the obtaining the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program comprises

not saving monitoring units which are in the first number N1 of monitoring units but not in the second number N2 of monitoring units. This saves storage space. Only statistical information which is relevant is kept.

Different types of statistical information may be used. In the present disclosure, three types of statistical information are disclosed by way of non-limiting example.

In a possible design, the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a rate of occurrence of a monitoring unit based on a number of times of occurrence of the said monitoring unit in the second number N2 of monitoring units relative to a number of times of occurrence of the said monitoring unit in the first number N1 of monitoring units.

In this possible design, a definition of rate may be: for the API x j e W, its rate is rate_X j = the size of set {xi, xi = x j and i e [1, j] } / the size of set {xi, i e [1, j] }.

In an example, let the size of the sliding window be 2 and the total sequence be <API_1, API_2, API_3, API_2, API_1>. Thus, the API sequence in the sliding window is <API_2, API_1>. For the API_2 in the sliding window, its rate is 2/4 = 0.5. For the API_lin the sliding window, its rate is 2/5 = 0.4.

Then, the API with its obtained rate are stored such as by a data storage unit. The data storage unit stores data, which may include a list of recent API sequences whose size is equal to the sliding window’s size, and a list of common or all API rates, which can be used as an input to the determining about the computer program behaviour. A key value can be used to retrieve each API rate. The data storage unit can use different types of storage, such as HashMap structure and shared memory, to store these data. In a possible design, the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a count value of monitoring unit.

In this possible design, a count value of mounting unit refers to quantity

(number) of invoked APIs. The value means the frequency of an API. In an example, let the size of sliding window be 2 and the total sequence be <API_1, API_2, API_3, API_2, API_1 >. Thus, the API sequence in sliding window is <API_2, API_1 >. For the API_2 in the sliding window, its API count is 2. For the API_1 in the sliding window, its API count is also 2.

The method of the present disclosure may be repeatedly performed from time to time, e.g. every 2 minutes, each time point being a stage of detection. The number of detection is a value representing how many times the stage of detection is called. For example, for the above example, at 2 min, the number of detection is 2 and at 4 min, the number of detection is 3. As a computer program runs, each API count would increase. Optionally, by dividing the count value by the number of detection stage, the statistical information can represent whether an API is frequently invoked.

In a possible design, the obtaining the statistical information corresponding to the first number N1 of monitoring units comprises

obtaining a rate bias of a monitoring unit in the second number N2 of monitoring units compared to a mean rate of monitoring unit in the sliding window.

In this possible design, the statistical information may be defined as: for the API x j e W, the bias value is rate_x j - å ratc_x k / (n-m), where rate_x is the API rate and x k e W. Several levels may be pre-defined for the bias such as high large, low large, equal, low small, and high small. High/low large level means that the value of bias is high/low positive number whereas high/low small level means the value of bias is high/low negative number. Preferably, three levels are defined being high, equal, and small. High level means that the rate for an API in sliding window is larger than the mean of API rate in the sliding window. Low level means that the rate for an API in sliding window is less than the mean of API rate in the sliding window. Equal means that the rate is the same as the said mean. The bias can represent the frequency of API calls.

In a possible design, the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program is further according to stored information about one or more sets of fourth numbers N4 of monitoring units invoked at time t-1, where t-1 is earlier than time tO, the size of each set of fourth numbers N4 is equal to a size of the second number N2 of monitoring units.

For example, the stored information may comprise the list of recent API sequences whose size is equal to sliding window’s size, and / or the list of API rates as disclosed above.

In a possible design, the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program is further according to stored statistical information about monitoring units.

For example, the stored information may comprise the list of recent API sequences whose size is equal to sliding window’s size, and / or the list of API rates as disclosed above. In a possible design, the monitoring unit is an Application Programming Interface, API, or a system call.

In a possible design, the converting into vector representation of the monitoring units and/or statistical information comprises converting into one-hot representation or converting based on a word-embedding technique.

In an example, there are 3 different APIs (i.e., API_1, API_2, API_3), for one-hot representation. They may be represented in one-hot representation as:

API_1 = [1, 0, 0],

API_2 = [0, 1, 0],

API_3 = [0, 0, 1]

Preferably, word2vec (a type of word embedding technique) is used, as it is efficient.

In a possible design, the determining whether a behavior of the computer program constitutes abnormal behavior of the computer program comprises:

- inputting the vector representation of the second number N2 of monitoring units and the statistical information into a determining module which has been pre-trained, and

- determining, by the determining module, whether the behavior of the computer program constitutes abnormal behavior of the computer program.

The determining module may be pre-trained according to a suitable dataset, which can be obtained from various sources including for example from apks from Google Play store. The determining module may implement any kind of machine learning. In a possible design, the method further comprises:

- Obtaining, at time tl which is greater than the time tO, a third number N3 of monitoring units invoked by the computer program from the first number N 1 of monitoring units invoked by the computer program, wherein the first number N1 of monitoring units includes the second number N2 of monitoring units, and the third number N3 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units; - Converting each of the third number N3 of monitoring units into a vector representation of the third number N3 of monitoring units, so that each monitoring unit of the third number N3 of monitoring units has a corresponding respective vector representation;

- Obtaining statistical information corresponding to the first number N 1 and third number N3 of monitoring units; and

- Determining, according to the vector representation of the third number N3 of monitoring units and according to the statistical information, whether a behavior of the computer program constitutes abnormal behavior of the computer program.

The method may be iterated, such that at each subsequent time point a new set of more recent monitoring units is obtained from a larger sequence of monitoring units. The various possible designs disclosed herein may be adapted and applied at each subsequent time point. Time tl may be a subsequent time to time to, such as a regular interval of 2 minutes as disclosed above i.e. time tO, tl, tl+n. The second aspect of an embodiment provides an apparatus for monitoring a computer program based on monitoring units invoked by the computer program, comprising

- a statistical information obtaining module, which is configured to obtain statistical information corresponding to a first number N1 of monitoring units;

- a vector representation conversion unit, which is configured to convert monitoring units in a second number N2 of monitoring units into corresponding respective vector representations, wherein the second number N2 of monitoring units are a subset of a set comprising the first number N 1 of monitoring units invoked by the computer program, the second number N2 of monitoring units is preconfigured to be smaller than the first number N1 of monitoring units; and

- a data storage unit, which is configured to store the statistical information obtained by the statistical information obtaining module.

The vector representation conversion unit is responsible for mapping the APIs in a current sliding window and their rates into numerical vectors. The vector

representation conversion unit can retrieve the APIs in the sliding window from memory such as in the data storage unit where they may be stored. Optionally, the data storage unit may also be pre-configured or store additional information such as the list of recent API sequences whose size is equal to sliding window’s size, and / or the list of API rates.

In a possible design, the apparatus further comprises an obtaining module which is configured to obtain, the second number N2 of monitoring units invoked by the computer program from the first number N1 of monitoring units invoked by the computer program,

wherein the obtaining module is configured to select the second number N2 of monitoring units from the first number N1 of monitoring units by using a sliding window of pre-configured size, which is configured to select the second number N2 of monitoring units to be more recent compared to other monitoring units in the set of the first number N1 of monitoring units.

The obtaining module may obtain the second number N2 of monitoring units, or they may be obtained and provided by other means to the statistical information obtaining module.

In a possible design, the apparatus further comprises

a detection module which is configured to, according to inputs comprising the vector representation of the second number N2 of monitoring units and the statistical information, determine whether a behavior of the computer program constitutes abnormal behavior of the computer program.

In a possible design, the apparatus further comprises a notification module, which is configured, when the computer behavior is determined to be abnormal behavior, to transmit a notification message concerning said abnormal behavior. Equally, a notification may be transmitted if the behavior is determined to be normal. The apparatus, or the relevant corresponding functional modules thereof, may be configured to perform any of the methods or steps of methods of the present disclosure.

The present disclosure further provides that the method may be implemented in computer program instructions, such as computer program code. The computer program code may be stored in a storage device connected to a processor. The storage device may be a memory device, such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), an optical disc, a magneto-optical disk, or a semiconductor memory. The storage device may also be a non-transitory computer-readable storage medium where a computer program for controlling the operation of a mobile device is stored.

A processor may read the computer program stored in the storage device and store the computer program in a RAM, and control the operation of a mobile device 10 according to the computer program read from the RAM. It should be noted that the computer program for controlling the operation of the mobile device may be stored in advance in the ROM, or may be downloaded over a network by using the communication capability of the mobile device.

For the purpose of clarity, any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure. These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.

The embodiments of the present disclosure provide at least the following advantages:

Low performance cost: instead of storing the whole API sequence at runtime, a fix-size sliding window stores the APIs in the sliding window, thus performance cost (e.g., storage) is smaller;

High detection precision: instead of ignoring the early APIs that are not in the sliding window, the vector representation of the embodiments takes the API’s statistical information into consideration. The API’s statistical information is reserved, i.e. the information of those APIs that are not in the sliding window is not discarded. This improves the detection precision. Experiments using the methods of the present disclosure show that the true positive rate is increased by about 5 percent compared to the conventional solutions.

Brief Description of Drawings

Fig. 1 is a diagram illustrating an overview of the steps performed by the method of monitoring the computer program according to an embodiment of the present disclosure.

Fig. 2 is a block diagram for describing an example of configuration and workflow of functional units of an apparatus according to an embodiment of the present disclosure.

Fig. 3 is a diagram for illustrating functional modules of the apparatus provided on a mobile device according to an embodiment of the present disclosure.

Fig. 4 is a block diagram for describing an example of a hardware configuration of a mobile device according to an embodiment of the present disclosure.

Detailed Description The following describes technical solutions of the embodiments, referring to the accompanying drawings. It will be understood that the embodiments described below are not all but just some of embodiments relating to the present disclosure. It is to be noted that all other embodiments which may be derived by a person skilled in the art based on the embodiments described below without creative efforts shall fall within the protection scope of the present disclosure.

With reference to Fig. 1, explanation will be hereinafter provided for a manner of monitoring a computer program on an operating system.

When the computer program is running, it generates an API sequence, which increases as the computer program continues to run, as illustrated by the arrow to the right of the API sequence. An intercepted API sequence 10, being the first number N1 of monitoring units, is shown, and comprises a sequence of eight APIs 1 2 3 4 5 6 7 8 at given time t when the detection is performed.

Out of the eight APIs in the intercepted API sequence, the four most recent APIs 5 6 7 and 8 fall inside a sliding window 20 of fixed size of four APIs, which performs sampling to obtain these four most recent APIs. These four most recent APIs are the second number N2 of monitoring units in this embodiment and form a sequence in chronological order, wherein the higher the number the more recent the API.

The four most recent APIs are converted 200 into vector representations 50 60 70, of which there is a corresponding respective vector representation for each of the four APIs, although only three are shown for simplicity. Thus vector representation 50 corresponds to API 5, 60 corresponds to API 6 and 70 corresponds to API 7.

Statistical information on the APIs in the intercepted sequence is obtained and converted 300 into a vector representation, which is shown as a compact vector 55, 65, 75 below the dotted line for each of the vector representations of API. In this embodiment, the statistical information corresponds to both the APIs in the first number N1 and second number N2. Thus there is corresponding statistical information for each of the four most recent APIs, although only three are shown. Vector representation 55 of statistical information corresponds to vector representation 55, vector representation 65 of statistical information corresponds to vector representation 60, and vector representation 75 of statistical information corresponds to vector representation 70.

The vector representations of statistical information is combined in this embodiment with the corresponding respective vector representation.

The combined vectors, each comprising a vector representation of an API in the second number N2, and its corresponding vector representation of said APIs statistical information are input 400 into a learning-based and pre-trained model 30 which determines 500 whether the behavior of the computer program is abnormal or otherwise i.e. malware or benign.

Fig. 2 provides an illustration of the functional modules corresponding to the actions in Fig. 1 and their interconnection. The functional modules may be provided by hardware, software or a combination thereof. The functional modules may be logical entities or library functions, for example, that are configured to perform specific functions.

An obtaining module 2000, data processing module 3000, detection module 4000 and notification module 5000 are provided. The data processing module comprises sub modules which are a statistical information obtaining module 3005, a vector representation conversion unit 3015 and a data storage unit 3010. It will be understood that the sub modules may equally be provided outside the data processing module.

The obtaining module 2000 is responsible for obtaining the whole framework APIs (i.e. all APIs generated by the computer program whilst running) or a specific set of APIs with their information, such as parameters, PID and UID. The obtaining module 2000 intercepts the APIs. The intercepted APIs thus form an API sequence 10. The obtaining module is also responsible for obtaining the second number N2 of APIs using a selection function such as a fixed size sliding window such as that in Fig. 1, which thus form another sequence.

Once an API is intercepted by the obtaining module, the statistical information obtaining module will obtain statistical information about the or each sequence. In this embodiment, the statistical information in this embodiment is API rate. The API sequence can be represented as S = {xi , i e [1, n] }, where xi is an API intercepted. The sliding window can be represented as W = {x j , j e [m, n] }, where x j is an API belong to S. The definition of API rate is: for the API x j e W, its rate is rate_x j = the size of set {xi, xi= X j and i e [1, j] } / the size of set {xi, i e [1, j] }.

Then, each API with its rate is stored by the data storage unit 3010. The data storage unit 3010 stores a list of recent API sequences whose size is equal to sliding window’s size, and a list of API rates. The data storage unit 3010 can use different types of storage, such as HashMap and shared memory, to store these data.

The vector representation conversion unit 3015 is responsible for mapping the APIs in the current sliding window and their rates into numerical vectors.

Then, these vectors are fed into a pre-trained model provided by the detection module 4000.

Finally, the notification module 5000 is responsible for notifying end users the details if a malicious behavior is detected.

In another embodiment (not shown), instead of API rate, the statistical information obtained by the statistical information obtaining unit is the value of API count as disclosed herein. Other functions are unchanged in this embodiment and are performed similarly to those in other embodiments.

In another embodiment (not shown), instead of API rate, the statistical information obtained by the statistical information obtaining unit is API rate bias against the mean of API rate in the sliding window. The statistical information can be defined as: for the API x j e W, the bias value is rate_x j - å ratc_x k / (n-m), where rate_x is the API rate and X e W. Various levels, as disclosed herein may be defined to improve the detection. Other functions are unchanged in this embodiment and are performed similarly to those in other embodiments.

Fig. 3 and Fig.4 relate to hardware configurations of a mobile device.

In Fig. 3 the mobile device 6000 is provided with a data processing module 3000 as disclosed herein for performing the monitoring of the computer program 8000, which may be running in memory such as RAM 7400 (Fig. 4).

In Fig. 4 the mobile device 6000 is shown, wherein the functions of the data processing module, and optionally other modules as disclosed herein, are provided by programming stored in ROM 7200 and/or RAM 7400 which instructs over the bus the CPU 7000 to perform the corresponding actions.

APIs may be intercepted via the hooking means 7600, which obtains information about the intercepted APIs.

A notification module may provide a warning output which is displayed on a screen of the mobile device 6000.

Further, a storage device 8000 is optionally provided which may be configured with historical information about API sequences and common rates of API. The storage device may also store or obtain training dataset information which is used in the detection module.

The foregoing disclosure merely discloses exemplary embodiments, and is not intended to limit the protection scope of the present invention. It will be appreciated by those skilled in the art that the foregoing embodiments and all or some of other embodiments and modifications which may be derived based on the scope of claims of the present invention will of course fall within the scope of the present invention.