SYSTEM, METHOD AND DEVICE FOR DETECTING EXCESSIVE DATA EXPOSURES

Title:

SYSTEM, METHOD AND DEVICE FOR DETECTING EXCESSIVE DATA EXPOSURES

Document Type and Number:

WIPO Patent Application WO/2024/086877

Abstract:

A method comprising: receiving, by a fuzzing module of a computing device, data, the data being existing data for input into an application for testing; modifying, by the fuzzing module, the existing data to generate test data; communicating the existing data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and providing a notification of any excessive data exposure identified.

Inventors:

PAN LIANGLU (AU)
PHAM VAN (AU)
MURRAY TOBIAS (AU)
COHNEY SHAANAN (AU)

Application Number:

PCT/AU2023/051060

Publication Date:

May 02, 2024

Filing Date:

October 24, 2023

Assignee:

UNIV MELBOURNE (AU)

International Classes:

G06F21/57; G06F9/54; G06F21/55; G06F21/56; G06F40/143; H04L9/40; H04L43/50; H04L43/55

Attorney, Agent or Firm:

PHILLIPS ORMONDE FITZPATRICK (AU)

Claims:

The claims defining the invention are as follows:

1. A method comprising: receiving, by a fuzzing module of a computing device, data, the data being existing data for input into an application for testing; modifying, by the fuzzing module, the existing data to generate test data; communicating the existing data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and providing a notification of any excessive data exposure identified.

2. The method of claim 1, wherein determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the existing data and the test data.

3. The method of claim 1 or 2, wherein the change corresponds to a change in content displayed in association with the application.

4. The method of any one of claims 1 to 3, wherein the change corresponds to a change in an attribute of a markup language document.

5. The method of claim 4, wherein the markup language is one of hypertext markup language, extensible hypertext markup language, extensible markup language, and cascading stylesheets (CSS).

6. The method of claim 4 or 5, wherein the markup language document represents a document object model (DOM).

7. The method of any one of claims 1 to 6, wherein the fuzzing module is implemented in a network proxy.

8. The method of claim 7, wherein the network proxy intercepts the data over a network.

9. The method of claim 8, wherein the fuzzing module further comprises forwarding, by the network proxy, the test data to the application.

10. The method of any one of claims 1 to 9, wherein the fuzzing is implemented via an Application Programming Interface (API) or socket.

11. The method of any one of claims 1 to 10, wherein the application is accessible using a uniform resource locator (URL).

12. A system comprising: one or more processors; and memory, communicatively coupled to the one or more processors, for storing: a fuzzing module configured to: receive data, the data being existing data for input into an application for testing; modify the data to generate test data; communicate the existing data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and providing a notification of any excessive data exposure identified.

13. The system of claim 12, wherein determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the existing data and the test data.

14. The system of claim 12 or 13, wherein the change corresponds to a change in content displayed in association with the application.

15. The system of any one of claims 12 to 14, wherein the change corresponds to a change in a part of a markup language document.

16. The system of claim 15, wherein the markup language is one of hypertext markup language, extensible hypertext markup language, extensible markup language, and cascading stylesheets (CSS).

17. The system of claim 15 or 16, wherein the markup language document represents a document object model (DOM).

18. The system of any one of claims 12 to 17, wherein the fuzzing module is implemented in a network proxy.

19. The system of claim 18, wherein the network proxy intercepts the data over a network.

20. The system of claim 19, wherein the fuzzing module further comprises forwarding, by the network proxy, the test data to the application.

21. The system of any one of claims 12 to 20, wherein the fuzzing is implemented via an Application Programming Interface (API) or socket.

22. The system of any one of claims 12 to 21, wherein the application is accessible using a uniform resource locator (URL).

23. A network proxy computing device comprising: a fuzzing module including one or more processors; and a memory communicatively coupled with and readable by the one or more processors and having stored therein processor-readable instructions which, when executed by the one or more processors, cause the one or more processors to: intercept an application response communicated over a network to an application; parse the application response to identify a set of valid input data for input into the application; generate a set of test data from characteristics of the valid input data; communicate the valid input data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and provide a notification of any excessive data exposure identified.

24. The device of claim 23, wherein determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the valid data and the test data.

25. The device of claim 23 or 24, wherein the change corresponds to a change in content displayed in association with the application.

26. The device of any one of claims 23 to 25, wherein the change corresponds to a change in an attribute of a markup language document.

27. The device of claim 26, wherein the markup language is one of hypertext markup language, extensible hypertext markup language, extensible markup language, and cascading stylesheets (CSS).

28. The device of claim 26 or 27, wherein the markup language document represents a document object model (DOM).

29. The device of any one of claims 23 to 28, wherein the fuzzing is implemented via an Application Programming Interface (API) or socket.

30. The device of any one of claims 23 to 29, wherein the application is accessible using a uniform resource locator (URL).

31. The device of any one of claims 23 to 30, wherein parsing the application response to identify a set of valid input data for input into the application includes identifying data attributes that may be subject to excessive data exposure by a user or by using user defined rules.

32. A method comprising: receiving, by a fuzzing module of a computing device, data, the data being existing data for input into an application for testing; modifying, by the fuzzing module, the data to generate test data; communicating the existing data to the application and recording a first output associated with the application; communicating the test data to the application and recording a second output associated with the application; and comparing the first and second outputs associated with the application to identify excessive data exposure; determining that the test data is subject to excessive data exposure by the absence of one or more differences between the first output and the second; and providing a notification of any excessive data exposure identified.

33. A system comprising: one or more processors; and memory, communicatively coupled to the one or more processors, for storing: a fuzzing module configured to: receive data, the data being existing data for input into an application for testing; modify the data to generate test data; communicate the existing data to the application and record a first output associated with the application; communicate the test data to the application and record a second output associated with the application; compare the first and second outputs associated with the application to identify excessive data exposure; determine that the test data is subject to excessive data exposure by the absence of one or more differences between the first output and the second; and provide a notification of any excessive data exposure identified.

Description:

SYSTEM, METHOD AND DEVICE FOR DETECTING EXCESSIVE DATA EXPOSURES

Technical Field

[0001] The invention relates to a system, method and device for detecting excessive data exposures that occur when an application, via an Application Programming Interface (API) response, returns more information than necessary for a user to perform a specific action.

Background of Invention

[0002] APIs allow developers to interact programmatically with other applications, generally with client-server applications, via small packets of data, sharing only what is necessary. Over the years, the term ‘API’ has often been used to describe any sort of connectivity interface to an application. More recently, however, modern APIs have taken on characteristics that make them very useful. For example, modern APIs adhere to standards (typically HyperText Transfer Protocol (HTTP) and REpresentational State Transfer (REST)) that are developer-friendly, easily accessible, widely used, and broadly understood. However, APIs, particularly web APIs, are becoming one of the most targeted attack surfaces, challenging organizations’ security.

[0003] APIs often transmit far more data to the client application than they need, and in the context of web applications, often do so over public channels. This issue is known as ‘Excessive Data Exposure’ (EDE). EDE occurs when an application, via an API response, returns more information than necessary for a user to perform a specific action, often relying on the client to do the filtering.

[0004] That is, EDEs often exist because API developers tend to rely on client apps to perform the data filtering, without thinking about the sensitivity of the exposed data. Exploitation of EDE is simple, and is usually performed by sniffing the traffic to analyze the API responses, looking for sensitive data that should not be returned to the user.

[0005] Consider a simple example of an online storefront. When a user views the page for a specific product, an API call may be made to fetch stock levels, informing the user whether the item is in stock. The API returns the stock level, but may also return extraneous data (such as the profit margin on the item or the item’s country of origin) that is not displayed to the user but is nonetheless transmitted. The transmission of the extra data constitutes an ‘excessive data exposure’. Here such an API request might look like:

[0006] https://store.com/api/v1/stock/show?id=123

[0007] The API would then pull the entire object from the database, including information about the stock level:

[0008] {

“id”: 123,

“available”: 85,

“reserved”: 0,

“ordered”: 1,

“location”: “Melbourne, VIC”,

“price”: “$40”,

“profit_margin”: “75%”,

“origin”: “China”

}

[0009] Excessive data exposure then occurs when the API returns too much information, instead of filtering the fields required, which should look like this:

[0010] {

“id”: 123,

“available”: 85

}

[0011] As EDEs do not manifest through explicit abnormal behaviours (e.g., program crashes or memory access violations) they are very difficult to detect. Automatic tools usually cannot detect this type of vulnerability because it is difficult to differentiate between legitimate data returned from the API, and sensitive data that should not be returned, without a deep understanding of the application.

[0012] Transmitting, and possibly processing, excessive data fields can reduce the performance and user experience of web applications. For example, an API response may contain hundreds of store objects, with each store object being a dictionary of tens of data fields. But only a very small percentage of the returned data fields may be necessary for the correct functioning of the web application, for example if only ten data fields are required to update stock information on the web page, from only eight (closest) stores. Optimizing web and mobile application performance for low bandwidth networks is very important in several markets, including those in the developing world. If unaccounted for, the lack of optimization and poor performance induced by EDE may lead to slow response time, high latency variability, and ultimately a compromised experience for the user.

[0013] Known techniques for identifying EDE, such as manual inspection and ad-hoc text-matching, can be labour intensive and are prone to false negatives. Notably, keyword matching techniques often use a list of terms (such as “key”, “token”, “password” etc) in order to flag exposures. Therefore, under keyword matching, when an excessive data field does not match any known keywords, it may be erroneously ignored.
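The false negative described above can be illustrated with a minimal sketch. The keyword list echoes the terms named in the paragraph; the function name `flag_by_keyword` and the matching rule (case-insensitive substring match on field names) are assumptions made for illustration only, not a description of any particular known tool.

```python
# Hypothetical keyword list, echoing the terms named in the text.
SENSITIVE_KEYWORDS = ("key", "token", "password")


def flag_by_keyword(response: dict) -> list[str]:
    """Return the field names that contain any sensitive keyword."""
    return [
        name
        for name in response
        if any(word in name.lower() for word in SENSITIVE_KEYWORDS)
    ]


# The storefront response from the background example.
stock_response = {
    "id": 123,
    "available": 85,
    "reserved": 0,
    "ordered": 1,
    "location": "Melbourne, VIC",
    "price": "$40",
    "profit_margin": "75%",
    "origin": "China",
}

# No field name matches a keyword, so "profit_margin" goes unflagged.
flagged = flag_by_keyword(stock_response)
```

In this sketch `flagged` is empty even though the response carries the profit margin, which is the kind of miss that motivates a technique not based on field names.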

[0014] It would be desirable to provide a system, method and device for automatically detecting if an API exposes more data than it should.

[0015] It would also be desirable to reduce false negatives that occur during manual inspection and ad-hoc text-matching techniques, and other known approaches.

[0016] A reference herein to a patent document or other matter which is given as prior art is not to be taken as an admission or a suggestion that the document or matter was known or that the information it contains was part of the common general knowledge as at the priority date of any of the claims.

Summary of Invention

[0017] According to an aspect of the present invention there is provided a method comprising: receiving, by a fuzzing module of a computing device, data, the data being existing data for input into an application for testing; modifying, by the fuzzing module, the existing data to generate test data; communicating the existing data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and providing a notification of any excessive data exposure identified responsive to the fuzzing. The existing data and the test data may be in the form of an API request-response pair. The application may respond to that request and the response content may be represented in the JavaScript Object Notation (JSON) format. The JSON response may be used to update or change an output associated with the application, such as updating a web page (e.g., showing more information).

[0018] The change in output associated with the application may be in the form of a formatted file or the like.

[0019] In one or more embodiments, determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the existing data and the test data.

[0020] In one or more embodiments, the change corresponds to a change in content displayed in association with the application. Advantageously, this allows the method to check if a data field in an API response is excessive based on the similarity between the originally displayed content and the new content when the data field is deleted or modified.

[0021] In one or more embodiments, the change corresponds to a change in an attribute of a markup language document. Markup languages include markup elements that may serve to identify or describe content, including elements that describe how visible content is to be rendered for display. A markup language document may include markup elements describing the content and/or formatting of content of the document. Web pages are an example of markup language documents.

[0022] In one or more embodiments, the markup language is one of hypertext markup language (HTML), extensible hypertext markup language (XHTML), extensible markup language (XML), and cascading stylesheets (CSS). As an example, a web page may be implemented as a set of one or more markup language documents, each of which may include content described using HTML elements or CSS elements, and/or elements of other markup languages, or even plain text.

[0023] In one or more embodiments, the markup language document represents a document object model (DOM). Data formats such as XML or JSON are typically syntactically parsed into a general tree data structure containing a logical node for each syntactic module of the data. Regardless of the data format, this parse tree data structure is referred to as a DOM.

[0024] In one or more embodiments, the fuzzing module is implemented in a network proxy. The network proxy or web proxy may be employed to allow a client to make an indirect network connection to a server. The client may connect to the web proxy then request a connection or other resource available on the server.

[0025] In one or more embodiments, the network proxy intercepts the data over a network. The network proxy or web proxy may be configured to perform its intermediary connection function by using API requests, whereby requests received at the network proxy to access a server are relayed to the server using a URL. The network proxy may provide the requested resource either by connecting to the server or by serving the requested resource from a cache. In some cases, the web proxy may alter the client's request or the server's response for various purposes.

[0026] In one or more embodiments, the fuzzing module further comprises forwarding, by the network proxy, the test data to the application.

[0027] In one or more embodiments, the fuzzing is implemented via an Application Programming Interface (API) or socket.

[0028] In one or more embodiments, the application is accessible using a uniform resource locator (URL).

[0029] According to an embodiment of the present invention, there is provided a system comprising: one or more processors; and memory, communicatively coupled to the one or more processors, for storing: a fuzzing module configured to: receive data, the data being existing data for input into an application for testing; modify the data to generate test data; communicate the existing data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and providing a notification of any excessive data exposure identified responsive to the fuzzing.

[0030] In one or more embodiments, determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the existing data and the test data.

[0031] In one or more embodiments, the change corresponds to a change in content displayed in association with the application.

[0032] In one or more embodiments, the change corresponds to a change in a part of a markup language document.

[0033] In one or more embodiments, the markup language is one of hypertext markup language, extensible hypertext markup language, extensible markup language, and cascading stylesheets (CSS).

[0034] In one or more embodiments, the markup language document represents a document object model (DOM).

[0035] In one or more embodiments, the fuzzing module is implemented in a network proxy.

[0036] In one or more embodiments, the network proxy intercepts the data over a network.

[0037] In one or more embodiments, the fuzzing module further comprises forwarding, by the network proxy, the test data to the application.

[0038] In one or more embodiments, the fuzzing is implemented via an Application Programming Interface (API) or socket.

[0039] In one or more embodiments, the application is accessible using a uniform resource locator (URL).

[0040] According to an aspect of the present invention, there is provided a network proxy computing device comprising: one or more processors; and a memory communicatively coupled with and readable by the one or more processors and having stored therein processor-readable instructions which, when executed by the one or more processors, cause the one or more processors to: intercept an application response communicated over a network to an application; parse the application response to identify a set of valid input data for input into the application; generate a set of test data from characteristics of the valid input data; communicate the valid input data and the test data to the application to determine whether there is a change in an output associated with the application; when it is determined that there is no change, identify any excessive data exposure in one or more code paths of the application; and provide a notification of any excessive data exposure identified responsive to the fuzzing.

[0041] In one or more embodiments, determining whether there is a change in the output associated with the application includes a comparison of the application’s response to the valid data and the test data.

[0042] In one or more embodiments, the change corresponds to a change in content displayed in association with the application.

[0043] In one or more embodiments, the change corresponds to a change in an attribute of a markup language document.

[0044] In one or more embodiments, the markup language is one of hypertext markup language, extensible hypertext markup language, extensible markup language, and cascading stylesheets (CSS).

[0045] In one or more embodiments, the markup language document represents a document object model (DOM).

[0046] In one or more embodiments, the fuzzing is implemented via an Application Programming Interface (API) or socket.

[0047] In one or more embodiments, the application is accessible using a uniform resource locator (URL).

[0048] In one or more embodiments, parsing the application response to identify a set of valid input data for input into the application includes identifying data attributes that may be subject to excessive data exposure by a user or by using user defined rules.

[0049] According to an aspect of the present invention, there is provided a method comprising: receiving, by a fuzzing module of a computing device, data, the data being existing data for input into an application for testing; modifying, by the fuzzing module, the data to generate test data; communicating the existing data to the application and recording a first output associated with the application; communicating the test data to the application and recording a second output associated with the application; and comparing the first and second outputs associated with the application to identify excessive data exposure; determining that the test data is subject to excessive data exposure by the absence of one or more differences between the first output and the second; and providing a notification of any excessive data exposure identified.
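The record, mutate, replay and compare steps of this aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `render_stock_page` is a hypothetical stand-in for the application under test, `find_excessive_fields` is a hypothetical name, the modification is limited to deleting one top-level field at a time, and outputs are compared as plain strings.

```python
import copy
from typing import Callable


def render_stock_page(response: dict) -> str:
    """Hypothetical application: displays only the item id and stock level."""
    return f"<p>Item {response.get('id')}: {response.get('available')} in stock</p>"


def find_excessive_fields(existing: dict, render: Callable[[dict], str]) -> list[str]:
    """Report fields whose deletion leaves the application's output unchanged."""
    first_output = render(existing)           # output for the existing data
    excessive = []
    for field in existing:
        test_data = copy.deepcopy(existing)   # modify a copy, not the original
        del test_data[field]
        second_output = render(test_data)     # output for the test data
        if second_output == first_output:     # no difference: field is unused
            excessive.append(field)
    return excessive


existing_data = {
    "id": 123,
    "available": 85,
    "reserved": 0,
    "ordered": 1,
    "location": "Melbourne, VIC",
    "price": "$40",
    "profit_margin": "75%",
    "origin": "China",
}

report = find_excessive_fields(existing_data, render_stock_page)
```

For this hypothetical application, `report` lists every field except `id` and `available`, since only those two influence what is displayed.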

[0050] According to an aspect of the present invention, there is provided a system comprising: one or more processors; and memory, communicatively coupled to the one or more processors, for storing: a fuzzing module configured to: receive data, the data being existing data for input into an application for testing; modify the data to generate test data; communicate the existing data to the application and record a first output associated with the application; communicate the test data to the application and record a second output associated with the application; compare the first and second outputs associated with the application to identify excessive data exposure; determine that the test data is subject to excessive data exposure by the absence of one or more differences between the first output and the second; and provide a notification of any excessive data exposure identified.

Brief Description of Drawings

[0051] The invention will now be described in further detail by reference to the accompanying drawings. It is to be understood that the particularity of the drawings does not supersede the generality of the preceding description of the invention.

[0052] Figure 1 shows a system for detecting excessive data exposure in accordance with an embodiment of the present invention;

[0053] Figure 2 shows a sequence diagram of a recording and preparation phase in accordance with an embodiment of the present invention;

[0054] Figure 3 shows a sequence diagram of a replaying/fuzzing phase in accordance with an embodiment of the present invention;

[0055] Figure 4 shows the partial content of an API response for querying delivery states, with potential sensitive information removed in accordance with an embodiment of the present invention;

[0056] Figure 5 shows the tree representation of the JSON object in Figure 4, with leaf nodes highlighted in accordance with an embodiment of the present invention;

[0057] Figure 6 shows a simplified HTML document with its hierarchy structure; and

[0058] Figure 7 shows the DOM tree representation of the HTML document in Figure 6.

Detailed Description

[0059] The invention is suitable for detecting EDE via a web API response from a server to a client application (e.g., JSON, XML, BSON, and form-urlencoded data), and it will be convenient to describe the invention in relation to that exemplary, but nonlimiting, application. However, it will be appreciated that the same approach is applicable to other applications, functions, unstructured connectivity interfaces and the like.

[0060] Figure 1 shows a system 100 for detecting excessive data exposure in a targeted API 102 in accordance with an embodiment of the present invention. The targeted API 102 is used to interface with a web application running on a remote server 114 or the like. The system 100 includes a fuzzing module 104 defined in computer-executable instructions (software).

[0061] The fuzzing module 104 includes a web proxy 106. The web proxy 106 captures all client requests 108 and server responses 110, including any requests 108 sent to the targeted API 102 and the corresponding response 110. In the embodiment shown, given a web URL and the targeted API 102, the web proxy 106 captures traffic between a client app 112, depicted as a web browser 112, and a targeted web server 114. As will be appreciated by those skilled in the art, this is typical of a client-server application, where the ‘client’ 112 runs on an access device (e.g., a personal computer (PC) or a mobile phone) with which the user interacts, and a ‘server’ 114 that runs on a remote server computer. The client 112 and the server 114 typically communicate over a network, like the internet, by sending messages back and forth to each other.

[0062] Client-server applications do not always incorporate or otherwise leverage a web browser 112 and this invention is not limited to that exemplary application. For example, the invention is suitable for non-browser type applications (e.g., mobile applications) which generally include a native or desktop client. The invention is also suitable for applications other than client-server applications developed in different languages and executed in different environments specialized for the different contexts of each application etc.

[0063] In one or more embodiments, the web proxy 106 is employed to allow the client 112 to make an indirect network connection to the server 114. The client 112 connects to the web proxy 106 then requests a connection or other resource available on the server 114, which in the disclosed embodiment is a remote server computer 114. The web proxy 106 may be an off-the-shelf and commercially available web proxy that is capable of being configured (e.g., through adding scripts, extensions, etc. to existing software). The disclosed web proxy 106 is configured to perform its intermediary connection function by using API requests, whereby requests received at the web proxy 106 to access the server 114 are relayed to the server 114 using a URL. The web proxy 106 provides the requested resource either by connecting to the server 114 or by serving the requested resource from a cache. In some cases, the web proxy 106 may alter the client's 112 request or the server's response for various purposes.

[0064] Thus, the web proxy 106 of the disclosed embodiment acts as an intermediary for requests from the client 112 to server 114. The web proxy 106 includes software, along with a database that may be a separate device or incorporated within web proxy hardware 106 that may record request and response pairs 108, 110 relating to the targeted API 102. The client 112 may be operated by a human user, or it may be operated by a web driver 126 component, which is an automated tool or script configured to interface with applications that are able to programmatically control the execution of web-browsers and the like, such as Selenium Web Driver or other mechanisms familiar to those skilled in the art. Such automated tools are usually referred to as ‘agents.’

[0065] In one or more embodiments, the fuzzing module 104 is configured to start the client 112 and record an initial state as S0. After that, the client 112 is configured to open a given URL, wait for the web page to fully load, and complete the necessary steps (e.g., fill in forms, fill in text boxes, click buttons) until a request is sent to the targeted API 102.

[0066] As will be appreciated, the process of preparing the web page such that it invokes a response from the server 114 may be performed by a user or a web driver agent 126. The web driver agent may rely on a script including a number of instructions operable to cause the client 112 to automatically navigate and fill in a sequence of one or more text boxes or forms presented by the web page to send a request to the targeted API 102. The script may be operable to automatically fill and submit a sequence of text boxes or forms, such as for example by instructions to automatically fill in text boxes or form elements with instructions to automatically click buttons and the like within various web pages comprising the sequence, such as to proceed from one web page to the next. The script may also serve to show progress updates within the sub-window for the users' 180 benefit.

[0067] Once the request 108 has been sent the client 112 state is recorded as S1. The server 114 then sends a response 110 to that request 108. In one or more embodiments, the response content is presented in an extensible format including a JavaScript Object Notation (JSON) format or an Extensible Markup Language (XML) format. JSON is a text-based open standard designed for human-readable data interchange. It is language-independent, with parsers available for many languages. The JSON format often is used for serializing and transmitting structured data over a communication network. It is used primarily to transmit data between a server and web application, serving as an alternative to extensible markup language (XML).
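A JSON response of this kind can be parsed into a tree whose leaf nodes are the individual data fields that a fuzzing step might delete. The sketch below enumerates leaf paths and produces a mutated copy with one leaf removed. The function names `leaf_paths` and `delete_leaf`, the sample response body, and the choice of deletion as the only mutation are assumptions made for illustration.

```python
import copy
import json


def leaf_paths(node, prefix=()):
    """Yield the path (a tuple of keys and list indices) to every leaf value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaf_paths(child, prefix + (index,))
    else:
        yield prefix


def delete_leaf(document, path):
    """Return a deep copy of the document with the leaf at `path` removed."""
    mutated = copy.deepcopy(document)
    parent = mutated
    for step in path[:-1]:          # walk down to the leaf's parent
        parent = parent[step]
    del parent[path[-1]]
    return mutated


# Hypothetical response body, used only to exercise the helpers.
body = '{"id": 123, "stock": {"available": 85, "reserved": 0}, "tags": ["new", "sale"]}'
response = json.loads(body)

paths = list(leaf_paths(response))
mutated = delete_leaf(response, ("stock", "reserved"))
```

Each path in `paths` identifies one candidate field; replaying the response with that field removed yields one item of test data while the recorded original stays untouched.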

[0068] It is worth noting that some web pages or applications, such as those developed with Asynchronous JavaScript and XML (AJAX), do not load the entire web page at once. Web pages or applications created with AJAX are incrementally updated by dynamically exchanging small amounts of data between their web pages and their contributing web servers. As a result, web pages do not have to be reloaded in their entirety when they are updated, and composite applications feel more responsive and interactive.

[0069] In this case, the client state is recorded as S2 when the initial set of updates is complete. At S2, a Document Object Model (DOM) tree (defined as DOMorigin) 116 of the web page is extracted using a DOM extractor module 126. This DOM tree 116 constitutes a baseline that can be compared with trees produced during the fuzzing process to check for potential excessive data exposures based on the metamorphic relation defined in Equation (1).

[0070] $\mathrm{Diff}_{DOM}(T_{origin}, T_{mutated}) \neq 0$ (1)

[0071] As will be appreciated by those skilled in the art, data formats such as XML or JSON are typically syntactically parsed into a general tree data structure containing a logical node for each syntactic module of the data. Regardless of the data format, this parse tree data structure is referred to as a DOM. The DOM extractor module 126 extracts the DOM trees 116, 118 for analysis. Each node of a DOM typically contains information about the syntactic module being represented, such as an XML element tag name or content value, as well as index or pointer values that bind the DOM node into the tree structure, including an indicator of the parent, preceding sibling and next sibling, a child list, and possibly a separate attribute list. The document order of a DOM corresponds to a visitation order of DOM nodes resulting from a depth first traversal of the DOM tree. A pre-order depth first traversal is a traversal of a tree structure in which a node is deemed visited or processed before any of its child nodes are visited or processed. However, the invention is suitable for other data formats and is not necessarily limited to tree data structures and the like.

[0072] Advantageously, the metamorphic relation defined in Equation (1) (a relation that, given two different program inputs, is expected to hold between their corresponding outputs) allows automated testing approaches to check if a data field in an API response is excessive based on the similarity between the originally displayed content and the new content when the data field is deleted.
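By way of illustration only, the parse tree and document order described in paragraph [0071] may be sketched as follows. The sketch assumes well-formed markup in which every tag is closed; the `Node`, `TreeBuilder`, `parse` and `document_order` names and the sample page are hypothetical and do not form part of the disclosed system.

```python
from html.parser import HTMLParser


class Node:
    """One DOM node: tag name, attributes, text content, parent and child list."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def document_order(node):
    """Pre-order traversal: visit a node before any of its children."""
    yield node
    for child in node.children:
        yield from document_order(child)


# Hypothetical page used only for illustration.
page = "<html><body><div class='container'><p>ETA 10:30</p><p>Queue 2</p></div></body></html>"
tags = [n.tag for n in document_order(parse(page))]
print(tags)
```

Each node is emitted before its children, so the printed sequence follows the order in which the opening tags appear in the markup.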

[0073] For example, assume there is an API response under analysis Rorigin 110 that is a set of data fields. A web client 112 (e.g., a web browser) uses Rorigin 110 to render a page that can be represented by a DOM tree Torigin 116. A data field d ∈ Rorigin 110 is considered non-excessive if Equation (1) holds, where a DiffDOM module 120 calculates the difference between two DOM trees Torigin 116 and Tmutated 118. Tmutated 118 is constructed by transmitting Rmutated 122 to the web client. Rmutated 122 is obtained by removing or otherwise modifying the in-question data field d from Rorigin 110. If a data field violates Equation (1), it is deemed excessive.
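The check described in paragraph [0073] may be sketched as below. This is a minimal sketch under stated assumptions: the `render` function is a hypothetical stand-in for the web client (a real client would be a browser producing a DOM tree), `diff_dom` simply returns 0 for identical trees, and the response values are invented for illustration.

```python
import copy


def render(response):
    """Hypothetical client: only the driver name and the ETA reach the page."""
    driver = response.get("driver", {})
    return ("div", [("h1", driver.get("name", "")), ("p", response.get("eta", ""))])


def diff_dom(t_origin, t_mutated):
    """Simplified stand-in for DiffDOM: 0 when the trees are identical, else 1."""
    return 0 if t_origin == t_mutated else 1


def remove_field(response, path):
    """Return a copy of the response with the field at `path` deleted."""
    mutated = copy.deepcopy(response)
    parent = mutated
    for key in path[:-1]:
        parent = parent[key]
    del parent[path[-1]]
    return mutated


def is_excessive(r_origin, path):
    """A field is deemed excessive when Equation (1) is violated (difference is 0)."""
    t_origin = render(r_origin)
    t_mutated = render(remove_field(r_origin, path))
    return diff_dom(t_origin, t_mutated) == 0


# Hypothetical response; the values are illustrative only.
r_origin = {"eta": "10:30", "driver": {"name": "Sam", "speed": 42}}
print(is_excessive(r_origin, ("driver", "speed")))
print(is_excessive(r_origin, ("eta",)))
```

In this sketch the `speed` field never influences the rendered tree, so removing it leaves the tree unchanged, whereas removing `eta` changes the page.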

[0074] This metamorphic relation allows one to construct a system that significantly reduces false negatives that occur during manual inspection and ad-hoc text-matching approaches. Keyword matching techniques often use a list of terms (such as “key”, “token”, “password” etc.) in order to flag exposures. Therefore, if an excessive data field does not match any known keywords, it is erroneously ignored.
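The limitation of keyword matching noted in paragraph [0074] can be illustrated with a short sketch. The keyword list mirrors the terms quoted above; the `keyword_flags` helper and the `session_token` field name are hypothetical, while `location`, `bearing` and `speed` are the field names used in the delivery example described elsewhere in this specification.

```python
KEYWORDS = ("key", "token", "password")


def keyword_flags(field_names):
    """Flag only those field names that contain one of the known keywords."""
    return [name for name in field_names if any(k in name.lower() for k in KEYWORDS)]


fields = ["location", "bearing", "speed", "session_token"]
flagged = keyword_flags(fields)
missed = [name for name in fields if name not in flagged]
print(flagged, missed)
```

The fields that match no keyword are silently ignored by such an approach, even though they may expose sensitive data.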

[0075] The system 100 does this by mutating and replaying API responses 110 into the client side of the webapp 112 and comparing the generated DOM tree 116, 118 with the original tree in each fuzzing iteration. The mutation engine 124 carries out this record and replay model, which will be described with reference to Figure 2.

[0076] The system 100 may also be integrated into a Web Application Firewall (WAF), Firewall as a Service (FWaaS) or similar system that may collect real-time threat intelligence related data for third-party organisations, for example, content delivery network (CDN) service providers. As will be appreciated by those skilled in the art, a WAF helps protect web applications by filtering and monitoring HTTP traffic between web applications and the internet.

[0077] By deploying a WAF in front of a web application, a form of protection is placed between the web application and the Internet. While a proxy server protects a client machine’s identity by using an intermediary, a WAF is a type of reverse-proxy, protecting the server from exposure, including excessive data exposure, by having clients pass through the WAF before reaching the server.

[0078] Advantageously, by integrating the system 100 into a WAF, FWaaS or similar it is capable of providing live information to security analysts when a web application is leaking potentially sensitive information in real-time. That real-time monitoring capability is of interest to security consultancies and the like, but also to large cloud services providers that already provide large-scale WAF functionality to large numbers of clients. Being able to identify and potentially execute policies to stop or filter out excessive data exposures in real-time (i.e., a protocol layer 7 defence in the OSI model) provides a significant technical advantage in content delivery and in information security generally. This method of real-time excessive data exposure mitigation may form part of a suite of other tools which together create a holistic defence against a range of attack vectors.
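One possible form of such a filtering policy is sketched below, purely as an assumption about how a WAF-side component might act on fields already reported as excessive. The `strip_fields` function, the path representation and the sample response body are hypothetical and are not taken from the disclosed embodiments.

```python
import json


def strip_fields(body, excessive_paths):
    """Remove each reported field from a JSON response body before it is relayed."""
    data = json.loads(body)
    for path in excessive_paths:
        parent = data
        for key in path[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
            if parent is None:
                break  # the path does not exist in this response
        else:
            if isinstance(parent, dict):
                parent.pop(path[-1], None)
    return json.dumps(data)


# Hypothetical response body and policy, for illustration only.
body = '{"eta": "10:30", "driver": {"name": "Sam", "location": [1.0, 2.0]}}'
policy = [("driver", "location"), ("car", "capacity")]
filtered = json.loads(strip_fields(body, policy))
print(filtered)
```

Paths that are absent from a given response are skipped, so the same policy can be applied to every response from the targeted API.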

[0079] The system 100 may be included in a network-based WAF, i.e., hardware-based and installed locally to minimize latency; a host-based WAF, i.e., fully integrated into an application’s software; or a cloud-based WAF accessed over the Internet (typically maintained and updated by a third party vendor).

[0080] Figure 2 shows a sequence diagram 200 of a recording and preparation phase in accordance with an embodiment of the present invention. The client-server communication is modelled using a sequence diagram. As shown, before the targeted request-response pair 108, 110 has been exchanged, the browser 112 and the server 114 might have completed other exchanges for fetching HTML documents 202 and other object files 204 (e.g., images, stylesheets, JavaScript). All the request-response pairs (denoted as P) 108, 110 and resources are recorded and stored in the local machine for the testing/fuzzing phase, which will be described with reference to Figure 3. In one or more embodiments, the invention may be carried out in part by using a lightweight browser plugin to capture all the steps the client app 112 needs to perform to successfully transition from state S0 to state S2, transform those steps into a script, and store it in a configuration file denoted as C. The configuration file can be used by the system 100 so that it can reach different client states (S0 to S2) in its fuzzing phase by interfacing with applications that are able to programmatically control the execution of web browsers, such as Selenium Web Driver or other mechanisms familiar to those skilled in the art.
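The role of the configuration file C may be illustrated by the following sketch. The specification does not define the format of C, so the JSON layout, the action names, the URL and the `LoggingDriver` class (a stand-in that merely records calls instead of controlling a browser) are all assumptions made for illustration.

```python
import json

# Hypothetical contents of configuration file C.
CONFIG_C = json.dumps({
    "url": "https://shop.example/track",
    "steps": [
        {"action": "fill", "selector": "#order-id", "value": "A123"},
        {"action": "click", "selector": "#submit"},
    ],
})


class LoggingDriver:
    """Stand-in for a real web driver; it only records what it is asked to do."""

    def __init__(self):
        self.log = []

    def open(self, url):
        self.log.append(("open", url))

    def fill(self, selector, value):
        self.log.append(("fill", selector, value))

    def click(self, selector):
        self.log.append(("click", selector))


def replay(config_text, driver):
    """Replay the recorded steps so the client moves from S0 towards S2."""
    config = json.loads(config_text)
    driver.open(config["url"])
    for step in config["steps"]:
        if step["action"] == "fill":
            driver.fill(step["selector"], step["value"])
        elif step["action"] == "click":
            driver.click(step["selector"])
        else:
            raise ValueError(f"unsupported action: {step['action']}")


driver = LoggingDriver()
replay(CONFIG_C, driver)
print(driver.log)
```

In practice the stand-in driver would be replaced by an adapter around a browser automation tool, leaving the replay logic unchanged.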

[0081] Figure 3 shows a sequence diagram 300 of a replaying/fuzzing phase in accordance with an embodiment of the present invention. The input for this phase includes: 1) a configuration file C to traverse through different client states, 2) the original DOM tree DOMorigin 116, and 3) all request-response pairs (P) 108, 110 recorded in the recording phase. In each fuzzing iteration, the system 100 goes through three steps (Steps 1, 2, 3). In the first step (Step-1), the web driver (or other agent) module 126 uses the configuration file C to replay all the steps until the client 112 reaches the state S1, and the client then reaches S2 through replaying stored requests. Before that state has been reached, the simulated server module 128 just accepts the client’s 112 requests and sends recorded 302 responses stored in P 108, 110 with no modification. Once the simulated server 128 receives the request sent to the targeted API 102, it mutates the originally recorded response 304 by deleting a specific data field or otherwise modifying that data field and then sends it back to the client 112. The mutation algorithm will be discussed in detail with reference to Figures 4 and 5.
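The behaviour of the simulated server in Step-1 may be sketched as follows. This is an illustrative in-memory model only: the `SimulatedServer` class, its method names, the URLs and the recorded contents are hypothetical, and a real embodiment would serve the recorded pairs over HTTP.

```python
import copy


class SimulatedServer:
    """Serves recorded responses; mutates only the response of the targeted API."""

    def __init__(self, recorded_pairs, target_url):
        self.recorded = dict(recorded_pairs)  # request URL -> recorded response
        self.target_url = target_url
        self.mutation = None  # path of the field to delete in this iteration

    def handle(self, url):
        response = copy.deepcopy(self.recorded[url])
        if url == self.target_url and self.mutation is not None:
            parent = response
            for key in self.mutation[:-1]:
                parent = parent[key]
            del parent[self.mutation[-1]]
        return response


# Hypothetical recorded pairs P, for illustration only.
pairs = {
    "/index.html": "<html>...</html>",
    "/api/delivery": {"eta": "10:30", "driver": {"name": "Sam", "speed": 42}},
}
server = SimulatedServer(pairs, target_url="/api/delivery")
server.mutation = ("driver", "speed")
print(server.handle("/index.html"))
print(server.handle("/api/delivery"))
```

Because every response is copied before it is mutated, the recorded pairs remain intact and can be reused in the next fuzzing iteration.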

[0082] The client 112 uses the mutated response 304 to update the page accordingly. If this leads to any error, the system 100 may skip the current iteration and move to the next fuzzing iteration. Otherwise, it waits until the page is fully updated and uses the DOM extractor module 126 to extract the current DOM tree, denoted as DOMmutated 118 (Step-2). In Step-3, the system 100 uses its DiffDOM module 120 (as described above) to compare the similarity between DOMmutated 118 and DOMorigin 116. If they are the same, or within a predefined threshold of similarity, the deleted or modified data field is deemed excessive according to the present invention’s metamorphic relation.
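The three steps of a fuzzing iteration, including the skipping of iterations that lead to a client error, may be sketched as a loop. The `fuzz` function, the `strict_render` stand-in for the client and DOM extractor, and the sample data are assumptions introduced for illustration; exact equality is used here in place of a similarity threshold.

```python
import copy


def fuzz(r_origin, field_paths, render):
    """Run one iteration per field; return (excessive fields, skipped fields)."""
    dom_origin = render(r_origin)
    excessive, skipped = [], []
    for path in field_paths:
        # Step-1: build the mutated response by deleting one field.
        r_mutated = copy.deepcopy(r_origin)
        parent = r_mutated
        for key in path[:-1]:
            parent = parent[key]
        del parent[path[-1]]
        # Step-2: let the client render it; skip the iteration on any error.
        try:
            dom_mutated = render(r_mutated)
        except Exception:
            skipped.append(path)
            continue
        # Step-3: an unchanged DOM means the field is deemed excessive.
        if dom_mutated == dom_origin:
            excessive.append(path)
    return excessive, skipped


def strict_render(response):
    """Hypothetical client that fails when a field it needs is missing."""
    return ("div", [("h1", response["driver"]["name"]), ("p", response["eta"])])


# Hypothetical response, for illustration only.
r_origin = {"eta": "10:30", "driver": {"name": "Sam", "speed": 42, "bearing": 90}}
paths = [("eta",), ("driver", "name"), ("driver", "speed"), ("driver", "bearing")]
excessive, skipped = fuzz(r_origin, paths, strict_render)
print(excessive, skipped)
```

The returned list of excessive fields corresponds to the candidates that would be reported for further analysis and confirmation.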

[0083] Once all fuzzing iterations are completed, the system 100 reports the potential excessive data fields to developers/testers 180 for further analysis and confirmation.

[0084] In one or more embodiments, the mutation engine 124 module is executed on a simulated server 128. The simulated server 128 brings several benefits to EDE testing, including: locally recorded contents can be supplied with minimal delay, leading to high efficiency; the system 100 does not need to communicate with the targeted remote server 114 during the fuzzing phase, allowing testing in parallel without affecting the targeted server 114; developers/testers 180 can use the system 100 to easily test their servers without setting up a dedicated testing environment; the simulated server 128 can supply consistent content, ensuring the test result is not affected by the states of the remote server 114, which may potentially change during the testing process; and the cached/recorded contents act as a snapshot of server states, allowing further testing and studying in the future even if the server is later upgraded.

[0085] Figure 4 shows a partial listing of an API response 400 for querying delivery states of an item (as commonly occurs in ecommerce applications), with potentially sensitive information removed. In this particular example, a customer may receive a unique link on the day that an item is on board for delivery. The link opens a web page that includes the name and a photo of the delivery driver, an Estimated Time of Arrival (ETA) of the delivery, and the position of the item being delivered in the queue. The page sends an API request to the server regularly, and updates contents on the page based on the API response.

[0086] The service is designed so that the accurate geographic location of the delivery driver is indicated on the map only when the parcel being delivered is at the front of the queue (i.e., it is the next item to be delivered). This implies that, while the driver is delivering an item to one customer, another customer should not be able to learn the accurate geographic location of the driver.

[0087] However, by using the system 100 it was determined that the API response always contains rich information about the delivery driver, including accurate latitude and longitude (the location field), direction of facing (the bearing field), and speed of travelling (the speed field).

[0088] As will be appreciated, knowing the timestamped location of the delivery driver 402, a customer may recover the route that the delivery driver is travelling, or even be able to identify the addresses of other customers who receive parcels from the same delivery driver. The leaked information may also be used in conjunction with other vulnerabilities to perform further attacks, such as a social engineering attack.

[0089] Figure 5 shows the tree representation of the JSON object 500 in Figure 4, with leaf nodes 502 highlighted. Each mutation (test case) is generated by removing or otherwise modifying a leaf node 502 from the tree 500. For example, a valid mutation could remove the key-value pair id: 353 from the driver dictionary 504, or capacity: 33 from the car dictionary 506. Alternatively, a valid mutation might modify the id: 353 field to become id: 2.
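The generation of test cases from leaf nodes, as described in paragraph [0089], may be sketched as follows. The helper names are hypothetical, the top level of the response is assumed to be a JSON object, and the `eta` field is invented for illustration; the `id: 353` and `capacity: 33` pairs are those of the example above.

```python
import copy


def leaf_paths(node, prefix=()):
    """Yield the path to every leaf (non-dict, non-list value) of a JSON tree."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        yield prefix
        return
    for key, child in items:
        yield from leaf_paths(child, prefix + (key,))


def remove_leaf(response, path):
    """Return a copy of the response with one leaf removed."""
    mutated = copy.deepcopy(response)
    parent = mutated
    for key in path[:-1]:
        parent = parent[key]
    del parent[path[-1]]
    return mutated


def replace_leaf(response, path, value):
    """Return a copy of the response with one leaf set to a new value."""
    mutated = copy.deepcopy(response)
    parent = mutated
    for key in path[:-1]:
        parent = parent[key]
    parent[path[-1]] = value
    return mutated


response = {"driver": {"id": 353}, "car": {"capacity": 33}, "eta": "10:30"}
paths = list(leaf_paths(response))
test_cases = [remove_leaf(response, p) for p in paths]
modified = replace_leaf(response, ("driver", "id"), 2)
print(paths)
```

One test case is produced per leaf, so the number of fuzzing iterations grows linearly with the number of data fields in the response.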

[0090] Figure 6 shows a web page (an HTML document) 600. The hierarchy structure can be represented using a DOM tree.

[0091] In one or more embodiments, when comparing two web pages, the system 100 considers both the DOM tree structure and the content within each tree node. If an API response with a particular data field removed or modified would still result in the identical web page, compared to the web page produced using the original API response, that data field is reported as excessive.

[0092] While comparing the entire DOM tree can identify if a web page is different from another, in many cases, contents in an API response will only affect a specific area of a web page. The proposed approach can optionally utilise human knowledge to allow the user to specify an area-of-interest on the web page. The area-of-interest is a subtree in the DOM tree that contains contents (that the user believes) could be affected by the API response. The area-of-interest in the above example could be the subtree rooted at the node “div” with the attribute “class=container”. With the area of interest identified, only this part of the DOM is subject to the difference comparison.

[0093] Applying the difference comparison to only a sub-part of the DOM tree allows the fuzzing process to operate more efficiently. For example, for a response that contains thousands of data fields, it is important that the difference comparison which must be run after mutating each of the fields, is as efficient as possible. Moreover, applying the difference comparison to only a sub-part of the DOM tree ensures that it is not affected by extraneous differences in the DOM trees produced from the original vs mutated response. Such extraneous differences arise from nondeterminism in the web application. For example, the DOM for a web page that displays the current time will naturally differ between the original vs mutated response (because the current time will change). Therefore it is necessary to compare the differences not between the entire DOM tree but only the area of interest (which in this example would not contain the current time). Doing so reduces the probability of false negatives.
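Restricting the comparison to an area-of-interest may be sketched as below. The sketch assumes well-formed markup that can be parsed as XML; the `area_of_interest` and `same_subtree` helpers and the two sample pages (which differ only in a displayed clock outside the area-of-interest) are hypothetical.

```python
import xml.etree.ElementTree as ET


def area_of_interest(page, tag, css_class):
    """Return the first element with the given tag and class, or None."""
    root = ET.fromstring(page)
    for element in root.iter(tag):
        if css_class in element.get("class", "").split():
            return element
    return None


def same_subtree(a, b):
    """Compare two subtrees by structure and content via their serialised form."""
    return ET.tostring(a) == ET.tostring(b)


# Hypothetical pages rendered from the original and the mutated response.
origin = ("<html><body><span id='clock'>10:00:01</span>"
          "<div class='container'><p>ETA 10:30</p></div></body></html>")
mutated = ("<html><body><span id='clock'>10:00:07</span>"
           "<div class='container'><p>ETA 10:30</p></div></body></html>")

whole_page_same = same_subtree(ET.fromstring(origin), ET.fromstring(mutated))
aoi_same = same_subtree(
    area_of_interest(origin, "div", "container"),
    area_of_interest(mutated, "div", "container"),
)
print(whole_page_same, aoi_same)
```

Comparing the whole page reports a difference caused only by the clock, whereas comparing the selected subtree does not, which is the effect the area-of-interest is intended to achieve.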

[0094] Figure 7 shows the DOM tree representation 700 of the HTML document in Figure 6. In one or more embodiments, the hierarchy structure of the document allows the system 100 to utilise human knowledge to specify a subtree for comparison 702 (highlighted with background). For example, a user may select subtree 702 to limit testing to a particular area on a web page or the like.

[0095] To further aid the reduction of false negatives that can arise from extraneous differences due to nondeterminism, in one or more embodiments, the system can identify such differences by the following method. The original response 302 is replayed twice and the DOM tree obtained from each replay is extracted. If no nondeterminism is present then these two DOM trees should be identical. Otherwise, any differences between them represent extraneous differences that should be ignored when comparing the original and mutated DOM trees. These differences-to-be-ignored therefore constitute a predefined threshold of similarity (mentioned above) when comparing two DOM trees: any differences present in these parts should be ignored.
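The method of paragraph [0095] may be sketched as follows. For brevity, each DOM tree is modelled as a nested `(tag, text, children)` tuple rather than a browser DOM; the helper names and the sample trees are hypothetical.

```python
def flatten(node, path="", out=None):
    """Map each node, addressed by its position in the tree, to its text."""
    out = {} if out is None else out
    tag, text, children = node
    here = f"{path}/{tag}"
    out[here] = text
    for index, child in enumerate(children):
        flatten(child, f"{here}[{index}]", out)
    return out


def unstable_paths(first_replay, second_replay):
    """Positions that differ between two replays of the same original response."""
    a, b = flatten(first_replay), flatten(second_replay)
    return {p for p in a.keys() | b.keys() if a.get(p) != b.get(p)}


def differs(origin, mutated, ignored):
    """True when the trees differ anywhere outside the ignored positions."""
    a, b = flatten(origin), flatten(mutated)
    return any(a.get(p) != b.get(p) for p in (a.keys() | b.keys()) - ignored)


# Hypothetical trees: the <span> shows the current time and changes on every replay.
replay_1 = ("div", "", [("span", "10:00:01", []), ("p", "ETA 10:30", [])])
replay_2 = ("div", "", [("span", "10:00:02", []), ("p", "ETA 10:30", [])])
mutated_a = ("div", "", [("span", "10:00:09", []), ("p", "ETA 10:30", [])])
mutated_b = ("div", "", [("span", "10:00:09", []), ("p", "", [])])

ignored = unstable_paths(replay_1, replay_2)
print(ignored)
print(differs(replay_1, mutated_a, ignored), differs(replay_1, mutated_b, ignored))
```

Only the clock position is marked as unstable, so a mutated tree that differs solely in the clock is treated as unchanged, while a change to the ETA is still detected.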

[0096] Although the above description has been given with specific examples from a web app, with which the user interacts, and a remote server, one having ordinary skill in the art will recognize that the present invention is not so limited, and may be applied in any type of application. Moreover, specific examples have been given of response content being presented in JSON and being rendered into DOM trees. However, one having ordinary skill in the art will understand that the present invention can be applied to other content types and protocols, such as (but not limited to) web data provided via HTTP and file transfer accomplished with FTP.

[0097] As discussed above, the various embodiments can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices which can be used to operate any of a number of applications. User or client devices can include any of a number of general-purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system also can include a number of workstations running any of a variety of commercially available operating systems and other known applications for purposes such as development and database management. These devices also can include other electronic devices, such as dummy terminals, thin clients, gaming systems, and other devices capable of communicating via a network.

[0098] Various aspects also can be implemented as part of at least one service or Web service, such as can be part of a service-oriented architecture. Services such as Web services can communicate using any appropriate type of messaging, such as by using messages in extensible markup language (XML) format and exchanged using an appropriate protocol such as SOAP (derived from the “Simple Object Access Protocol”) or JSON (derived from the “JavaScript Object Notation”). Processes provided or executed by such services can be written in any appropriate language, such as the Web Services Description Language (WSDL) or JavaScript. Using a language such as WSDL allows for functionality such as the automated generation of client-side code in various SOAP frameworks.

[0099] Some embodiments may utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially available protocols, such as TCP/IP, OSI, FTP, UPnP, NFS, and CIFS. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any suitable combination thereof.

[0100] The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information can reside in a storage-area network (“SAN”) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers, or other network devices can be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that can be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and at least one output device (e.g., a display device, printer, or speaker). Such a system can also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as random-access memory (“RAM”) or read-only memory (“ROM”), as well as removable media devices, memory cards, flash cards, etc.

[0101] Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices also typically will include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or Web browser. It should be appreciated that alternate embodiments can have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices can be employed.

[0102] Storage media and computer readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media, such as but not limited to volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer readable instructions, data structures, program modules, or other data, including RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device.

[0103] Embodiments of the present disclosure can be provided as a computer program product including a non-transitory machine-readable storage medium having stored thereon instructions (in compressed or uncompressed form) that can be used to program a computer (or other electronic device) to perform processes or methods described herein. The machine-readable storage medium can include, but is not limited to, hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, magnetic or optical cards, solid-state memory devices, or other types of media/machine-readable medium suitable for storing electronic instructions. Further, embodiments can also be provided as a computer program product including a transitory machine-readable signal (in compressed or uncompressed form). Examples of machine-readable signals, whether modulated using a carrier or not, include, but are not limited to, signals that a computer system or machine hosting or running a computer program can be configured to access, including signals downloaded through the Internet or other networks. For example, distribution of software can be via Internet download.

[0104] Where the terms “comprise”, “comprises”, “comprised” or “comprising” are used in this specification (including the claims) they are to be interpreted as specifying the presence of the stated features, integers, steps or modules, but not precluding the presence of one or more other features, integers, steps or modules, or group thereof.

[0105] While the invention has been described in conjunction with a limited number of embodiments, it will be appreciated by those skilled in the art that many alternatives, modifications and variations in light of the foregoing description are possible. Accordingly, the present invention is intended to embrace all such alternatives, modifications and variations as may fall within the spirit and scope of the invention as disclosed.
