


Title:
METHOD OF CLASSIFYING DATA WITH ACCESS AND INTEGRITY CONTROL
Document Type and Number:
WIPO Patent Application WO/2016/130029
Kind Code:
A1
Abstract:
It is an object of the present invention to provide a method of classifying data with access and integrity control, that is, the category and context of an information asset are determined and operations performed on this information asset are monitored. Monitoring as well as category and context assignment are performed on at least one device which consists of at least one processor and one non-volatile memory. The method of classifying data with access and integrity control is characterized in that after each operation performed on an information asset it is checked whether the operation conforms with the rules of information asset handling. Checking the rules makes it possible to verify whether the performed operation is allowed. In that step the content of the information asset is not verified. Next, the category to which the information asset is assigned after execution of the operation is determined. For that purpose, the information content of the information asset is analyzed. Next, a group of users is connected with the category determined in the previous step. For the category and group of users determined in the previous steps, conformity of the category and context of the information asset is verified against the rules of information asset handling.

Inventors:
BRANDT ŁUKASZ (PL)
BRANDT MATEUSZ (PL)
TOKARCZYK ANDRZEJ (PL)
Application Number:
PCT/PL2015/000018
Publication Date:
August 18, 2016
Filing Date:
February 10, 2015
Assignee:
NORD-SYSTEMS SP Z O O (PL)
International Classes:
G06F9/45; G06F21/55; G06F21/57; G06F21/60; G06F21/62; H04L29/06
Foreign References:
US8060596B12011-11-15
CN101106539A2008-01-16
US20140195485A12014-07-10
US20140165190A12014-06-12
Attorney, Agent or Firm:
SZCZEPANIAK, Bartosz (Czabajska Szczepaniak Sp. p, ul. Piecewska 27 80-288 Gdańsk, PL)
Claims:

1. Method of classifying data with access and integrity control whereby category and context of information asset are defined and operations performed on it are monitored, where definition of category, context and monitoring are performed on at least one device which consists of at least one processor and one non-volatile memory wherein

a. conformity of operations to rules of information asset handling is verified after execution of operation on information asset,

b. category of information asset is defined after execution of operation on information asset,

c. for category selected in b, group of users related with category is determined,

d. for category and group of users determined in b and c, conformity of category and context of information asset, defined in rules of information asset handling is verified.

2. The method of classifying data of claim 1, wherein information which is compared in step a with rules of information asset handling is attached to asset in form of metadata.

3. The method of classifying data of claim 1, wherein in step b, for purpose of representing information asset, vector space model is used and information asset is classified with supervised machine learning algorithm.

4. The method of classifying data of claim 3, wherein term frequency-inverse document frequency, TFIDF, is used as a model of vector space.

5. The method of classifying data of claim 3, wherein linear regression with support vector machine, SVM, is used as supervised machine learning algorithm.

6. The method of classifying data of claim 3, wherein random indexing algorithm, RI, is used for preliminary preparation of category for classifier.

7. The method of classifying data of claim 1, wherein in step c group of users is determined with the use of unsupervised machine learning algorithm.

8. The method of classifying data of claim 7, wherein k-means clustering algorithm is used as unsupervised machine learning algorithm.

9. The method of classifying data of claim 1, wherein information asset is a file.

10. The method of classifying data of claim 1, wherein information asset is an e-mail.

11. The method of classifying data of claim 1, wherein monitored information asset is an instant message.

12. The method of classifying data of claim 1, wherein at least one version of information asset is stored in the state in which it was before operation execution.

13. The method of classifying data of claim 1, wherein in step d, at least one operation defined in rules of information asset handling is executed.

14. The method of classifying data of claim 13, wherein in step d, notification about operations performed on information asset is sent to users defined in rules of information asset handling.

15. The method of classifying data of claim 14, wherein notifications are sent in form of e-mail messages.

Description:
Method of classifying data with access and integrity control

The subject of the invention is a method of classifying data with access and integrity control. Data in different forms are classified, e.g. files or e-mail messages. The purpose of classification is to assign data to categories defined by the character of the information stored in the data. To this end, the typical process of data monitoring known from existing solutions is extended with monitoring of informative content. Besides monitoring of the file or data lifecycle, changes of data content are monitored. This approach increases the precision of processes which monitor information and makes it possible to build new solutions not used before, e.g. in the data security domain or the document flow management domain. The present invention utilizes advanced machine learning mechanisms for text processing and information classification as well as for grouping users of information. The solution proposed in the present invention extends the model of information monitoring, which is based on policies saved in the form of metadata and assertions, with proactive and automatic analysis of information changes.

There are many known solutions concerning monitoring of operations performed on, for example, files or e-mail messages. Known solutions are based on mechanisms which control defined policies assigned to data, e.g. in the form of metadata. Such solutions use information generated by the applications which process the data or by the operating systems on which processing is performed. Such solutions cannot be implemented in all applications because they are not sufficiently extensible or comprehensive. The main and only function of such solutions is to monitor activities on defined files, not to monitor changes of the data's informative content. An exemplary solution is presented in patent application US 2014/0195485. The method described in that application monitors a file system on a cloud platform and verifies that operations performed on this platform do not violate policies defined for a local file system which is synchronized with the file system on the platform. That method does not provide mechanisms to interpret file content. Another solution, described in US 2014/0165190, presents a method of monitoring a file system on mobile devices. That method is based on monitoring events generated by the file system. When a specific event occurs, a security module starts scanning a file. Files from a defined location are monitored. This method, too, is not based on knowledge of a file's informative content.

In the context of this document, the following terms shall be interpreted as follows:

• information asset - representation of information in the form of data of certain type e.g. pdf file, e-mail message,

• content of information asset - information represented by information asset; the same content of information asset can be represented in different forms on different information assets e.g. in the form of docx or pdf files and entry on www website,

• operation on information asset - operation performed on information asset which intends to read or modify its content - including creation of new information asset,

• category of information asset - assignment of information asset to specific group of information assets based on content of information asset; this category is the same as category of content of information asset,

• context of information asset - information defining processes processing data and users related with information asset,

• rules of information asset handling - definition of allowed type of operation performed on information assets; set of rules can be grouped in policies,

• metadata - data which describe information asset but do not constitute content of information asset.

It is an object of the present invention to provide a method of classifying data with access and integrity control, that is, the category and context of an information asset are determined and operations performed on this information asset are monitored. Monitoring as well as category and context assignment are performed on at least one device which consists of at least one processor and one non-volatile memory.

The method of classifying data with access and integrity control is characterized in that after each operation performed on an information asset it is checked whether the operation conforms with the rules of information asset handling. Checking the rules makes it possible to verify whether the performed operation is allowed. In that step the content of the information asset is not verified. Next, the category to which the information asset is assigned after execution of the operation is determined. For that purpose, the information content of the information asset is analyzed. Next, a group of users is connected with the category determined in the previous step. For the category and group of users determined in the previous steps, conformity of the category and context of the information asset is verified against the rules of information asset handling.

According to an aspect of the present invention, rules of information asset handling are attached to the asset in the form of metadata. Metadata can be stored in different ways, e.g. in an Alternate Data Stream for Windows operating systems with the NTFS file system, or in extended file attributes used in the file systems of Linux, Windows, Mac or AIX operating systems. It is also possible to use file containers used in content management systems. An example of a file container is an XML document which contains metadata fields and the asset file location.
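The first check of the method, which compares an operation with rules carried in metadata without inspecting content, could be sketched as follows. This is a minimal illustration only: the JSON layout, the field names `allowed_operations` and `notify_on`, and the function `check_operation` are assumptions made for the example and do not come from the application.

```python
import json

# Hypothetical metadata attached to an asset; the field names are
# illustrative assumptions, not taken from the application.
SIDECAR_JSON = """
{
  "asset": "reports/q1.docx",
  "allowed_operations": ["read", "modify"],
  "notify_on": ["modify"]
}
"""


def check_operation(metadata: dict, operation: str) -> dict:
    """Check an operation against the rules carried in the metadata.

    Only the metadata is consulted; the asset content is not inspected.
    """
    return {
        "allowed": operation in metadata.get("allowed_operations", []),
        "notify": operation in metadata.get("notify_on", []),
    }


metadata = json.loads(SIDECAR_JSON)
result_modify = check_operation(metadata, "modify")
result_delete = check_operation(metadata, "delete")
```

In this sketch an operation that is not listed is simply treated as not allowed; a real policy could define other defaults.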

Another advantage of the invention is that, in the step in which the category of an information asset is determined, a vector space model, VSM, is used to represent the content of the information asset, and that the content is classified with a supervised machine learning algorithm. The vector space model represents the content of an information asset as a vector in multidimensional Euclidean space, where the dimensions represent terms. Use of the vector space model makes it possible to apply machine learning algorithms to the analysis of information content. Classification consists in assigning a resource to a certain category depending on term frequency, which is represented by the vector space model. Supervised machine learning trains a classifier on learning vectors which consist of input elements and corresponding output vectors. After the learning process, classifiers gain the ability to generalize and recognize categories of input data. In the present invention, the input vector under classification is the vector space model of the content of an information asset, for which the classifier defines a category. Classifiers for textual information are trained on language corpora, which consist of an extensive number of textual documents. Documents come from different sources and areas, but they can be targeted to specific areas.

Yet another advantage of the present invention is that, in the step in which the category of the content of an information asset is defined, term frequency-inverse document frequency, TFIDF, is used as a model of vector space. In such a representation, weights of terms are computed based on the number of their occurrences in documents (information assets) and on the frequency of term occurrence across the document collection.
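The weighting described here can be illustrated with a short sketch. The application does not state which TFIDF variant is used, so the formula below (raw term count multiplied by log(N / df)) is an assumption, as are the function name `tfidf_vectors` and the sample documents.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of term lists.

    Weight of term t in document d = tf(t, d) * log(N / df(t)), where
    tf is the raw count of t in d, N the number of documents and df(t)
    the number of documents containing t. This is one common TF-IDF
    variant, chosen here as an assumption.
    """
    n_docs = len(documents)
    vocabulary = sorted({term for doc in documents for term in doc})
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical information assets, already reduced to terms.
docs = [
    ["invoice", "payment", "invoice"],
    ["contract", "payment"],
    ["contract", "clause"],
]
vocab, vecs = tfidf_vectors(docs)
```

Each row of `vecs` is one asset expressed as a vector whose axes are the terms in `vocab`, so a term that occurs often in one asset but in few assets overall receives the largest weight.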

In other instances of the present invention, linear regression with support vector machine, SVM, is used as the supervised machine learning algorithm. Linear regression allows the space to be divided by planes into sections which correspond to different classes (categories). The support vector machine allows nonlinear boundaries between the categories determined by regression to be represented. Linear regression is used to classify documents based on their vector representation, e.g. TFIDF. It allows the similarity of the vector representation of the content of an information asset to categories of information assets to be expressed in the form of Euclidean distances.
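The application does not specify how regression and the support vector machine are combined. As a rough illustration of separating two categories with a plane, the sketch below trains a plain binary linear SVM (hinge loss with L2 regularisation, optimised by subgradient descent). This is a swapped-in simplification, not the applicant's classifier; the function names, hyperparameters and the two-dimensional toy vectors standing in for TFIDF vectors are all assumptions.

```python
def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """Train a binary linear SVM by subgradient descent on the hinge loss.

    samples: list of equal-length feature vectors
    labels:  list of +1 / -1 category labels
    Returns the weight vector and bias of the separating hyperplane.
    """
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation shrinks the weights at every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Sample violates the margin: adjust the plane so that
                # this sample's margin increases.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy vectors for two hypothetical categories (+1 and -1).
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
     [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

More than two categories would need an extension such as one-versus-rest training, and nonlinear boundaries would need a kernel; neither is shown here.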

According to an aspect of the present invention, the random indexing algorithm, RI, is used for preliminary preparation of categories for the classifier. This algorithm is based on the random projections mechanism used in sparse distributed memory. The concept of random indexing is to build a matrix of term vectors and a matrix of indices which are used to express the semantic space of a set of documents (information assets). The semantic space is a set of categories, while the measure of belonging to a category is the number of occurrences of a defined term. Random indexing is used in order to determine and update the semantic space for the whole set of data which constitute information assets. Categories determined by random indexing are used by the linear regression algorithm to improve classification of the content of a particular information asset.
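A minimal sketch of random indexing, following the scheme given later in the description (sparse index vectors of +1, -1 and 0 per document, term vectors starting at zero and accumulating the index vectors of the documents that contain the term). The vector dimension, the number of non-zero entries, the fixed seed and the function name `random_index` are assumptions chosen for the example.

```python
import random


def random_index(documents, dim=16, nonzero=4, seed=0):
    """Build term vectors by random indexing.

    documents: list of term lists, one per information asset.
    Each document gets a sparse index vector with `nonzero` entries set
    to +1 or -1 and the rest 0. Each term vector is the sum of the index
    vectors of the documents that contain the term.
    """
    rng = random.Random(seed)
    index_vectors = []
    for _ in documents:
        vec = [0] * dim
        for pos in rng.sample(range(dim), nonzero):
            vec[pos] = rng.choice((1, -1))
        index_vectors.append(vec)

    term_vectors = {}
    for doc, index_vec in zip(documents, index_vectors):
        for term in set(doc):
            tv = term_vectors.setdefault(term, [0] * dim)
            for i, value in enumerate(index_vec):
                tv[i] += value
    return index_vectors, term_vectors


# Hypothetical assets reduced to terms.
ri_docs = [["invoice", "payment"], ["invoice", "payment"], ["contract"]]
index_vectors, term_vectors = random_index(ri_docs)
```

Terms that occur in the same assets end up with identical or similar vectors, which is what allows them to be treated as belonging to the same region of the semantic space.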

According to another aspect of present invention grouping of users is done with the use of unsupervised machine learning algorithm. Such algorithms are used to determine structure of data without any prior knowledge of it, i.e. there is no learning phase of algorithm. In present invention users are grouped based on information assets categories. Each information asset has category and it is known if particular user had been working with particular asset. Accordingly, users are objects being grouped and categories of information assets are used as object features.

It is another object of the present invention that the k-means clustering algorithm is used as the unsupervised machine learning algorithm. K-means clustering aims to divide n objects into k clusters in such a way that each object, i.e. user, belongs to the cluster with the nearest mean, which is a prototype of the cluster.
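Grouping of users by the categories of the assets they worked with can be sketched with a plain k-means loop. The user names, the three categories and the counts are invented for illustration, and seeding the centroids with the first k points is an assumption made only to keep the example reproducible.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an assumption
    made for reproducibility, real systems usually seed randomly)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Rows are users, columns are hypothetical categories
# (finance, legal, engineering); values count assets the user worked with.
users = ["anna", "piotr", "ewa", "jan"]
features = [
    [9, 1, 0],   # anna
    [0, 1, 8],   # piotr
    [8, 2, 0],   # ewa
    [1, 0, 9],   # jan
]
assignment, centroids = kmeans(features, k=2)
groups = {}
for user, cluster in zip(users, assignment):
    groups.setdefault(cluster, []).append(user)
```

Here the two users who mostly handled finance assets fall into one group and the two who mostly handled engineering assets into the other; rerunning the grouping as activity changes is what makes the groups dynamic.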

Another advantage of present invention is that information assets which are monitored are files. Present invention is not limited to specific type of file i.e. file can be of different format e.g. docx, pdf, txt, xml. It can be defined which type of files shall be monitored. As well, files location can be defined i.e. directories in which files are placed.

According to another aspect of the present invention, e-mail messages are the information assets under monitoring. The following are monitored: message content, addressee, sender and subject. It is also possible to monitor attachments. Moreover, it is possible to monitor an e-mail conversation consisting of a set of exchanged messages, which is treated in this case as a single information asset.

It is another object of the present invention that instant messages are the information assets which are monitored. These can be instant messages from network communicators which are based on the Jabber protocol. Due to the nature of instant messages, which are short and sent frequently, it is possible to set the frequency of monitoring, i.e. the time periods or the number of exchanged messages for which information for the classifier is produced.

It is yet another object of the present invention that at least one version of an information asset is stored in the state in which it was before operation execution. This ensures that versions of the information asset from states before operation execution are archived. This approach, together with the knowledge gained through classification, allows precise analysis of changes made to the content of an information asset and restoration of previous versions, e.g. when an operation on the asset is blocked as a result of misuse detection during the process of classification.
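Keeping a bounded number of earlier states could look like the following in-memory sketch. The class `VersionStore`, its method names and the choice to hold content in memory rather than as file copies are assumptions made for brevity.

```python
from collections import deque


class VersionStore:
    """Keep up to `max_versions` earlier states of each asset."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._history = {}

    def snapshot(self, asset_id, content):
        """Record the state of an asset before an operation changes it."""
        versions = self._history.setdefault(
            asset_id, deque(maxlen=self.max_versions)
        )
        versions.append(content)

    def restore(self, asset_id):
        """Return and drop the most recent stored state, or None."""
        versions = self._history.get(asset_id)
        return versions.pop() if versions else None


store = VersionStore(max_versions=2)
for state in ("v1", "v2", "v3"):
    store.snapshot("a.txt", state)
first_restore = store.restore("a.txt")
second_restore = store.restore("a.txt")
third_restore = store.restore("a.txt")
```

With a limit of two versions the oldest state is discarded when the third snapshot arrives, so only the two most recent states can be restored.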

According to yet another aspect of the present invention, in the step in which conformity of the category and context of an information asset with the rules of information asset handling is verified, at least one operation defined in the rules of information asset handling is performed. These operations are reactions to operations performed on the information asset. Exemplary reactions are saving logs, blocking operations on the asset for a certain user, and sending a notification about a content change to a defined user.

Yet another advantage of the present invention is that, in the step in which conformity of the category and context of an information asset with the rules of information asset handling is verified, a notification about the operation performed on the information asset is sent to the users defined in the rules of information asset handling. It is one of many possible reactions to operations performed on an information asset.

Yet another advantage of the present invention is that, in the step in which conformity of the category and context of an information asset with the rules of information asset handling is verified, notifications about operations performed on the information asset are sent to the users defined in the rules of information asset handling in the form of e-mail messages.
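The verification step and its reactions (logging, blocking, notifying) can be illustrated with a small rule evaluator. The `Rule` structure, the reaction strings, the e-mail address and the default-deny behaviour for unmatched operations are assumptions made for this sketch and are not defined in the application.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One hypothetical handling rule: who may do what to which category."""
    category: str
    operation: str
    allowed_groups: set
    notify: list = field(default_factory=list)


def evaluate(rules, category, operation, user_group):
    """Return the reactions for an operation on an asset of a category.

    An operation with no matching rule is blocked (a default-deny
    assumption made for this sketch).
    """
    reactions = ["log"]
    for rule in rules:
        if rule.category == category and rule.operation == operation:
            if user_group not in rule.allowed_groups:
                reactions.append("block")
            reactions.extend(f"notify:{address}" for address in rule.notify)
            return reactions
    reactions.append("block")
    return reactions


rules = [Rule("finance", "modify", {"accounting"}, ["cfo@example.com"])]
allowed = evaluate(rules, "finance", "modify", "accounting")
denied = evaluate(rules, "finance", "modify", "engineering")
unmatched = evaluate(rules, "finance", "delete", "accounting")
```

The category would come from the content classifier and the user group from the grouping step, so the same rule applies to every asset whose content falls into that category.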

Exemplary embodiment

An object of the present invention is presented in exemplary embodiments in the appended drawings, which present:

Fig. 1 - infrastructure to implement invention,

Fig. 2 - components to implement invention,

Fig. 3 - flow of data classification control,

Fig. 4 - classifying of content of information asset.

The presented embodiment of the present invention is targeted at companies and organizations, for document flow process management and knowledge management. Figure 1 depicts the infrastructure for an exemplary instance of the present invention. The business logic of applications and services is implemented on server 100. Endpoint terminals 101 connect to server 100. In this exemplary embodiment of the present invention, an endpoint terminal can be a personal computer, laptop, tablet, smartphone or server. The access control function is delegated by server 100 to authorization server 102. The authentication server utilizes user databases which store access permissions to services and resources. In the presented instance of the present invention, authentication server 102 is connected to a directory services database 103 and to user databases 104 which are outside the directory services domain. Business logic server 100 is connected to file server 105 and e-mail server 106. In this exemplary instance of the present invention, servers 100, 105 and 106 are deployed on one physical server machine. In other embodiments of the present invention it is possible to deploy the servers on different physical server machines. Servers 102 and 107 are deployed on separate physical servers.

Server 107 implements the data classification function which is used during data processing. In this exemplary embodiment of the present invention, files stored in a network file system are monitored. Every time applications and services deployed on server 100 perform operations on files, appropriate information is provided to server 107 from server 105. In other embodiments of the present invention, monitoring can be applied to object storage or to a NoSQL database such as a document database which stores records in the form of semi-structured files like XML or JSON.
Verification of the relation between specified data and users is an integral part of the data classifying process in this exemplary instance of the present invention, and for that purpose server 107 is connected to server 102.

Implementation of the present invention is described based on fig. 2. Servers 200, 201, 202 and 203 correspond to servers 107, 100, 105 and 102 depicted in fig. 1, respectively. Application 204, whose business logic is realized on server 201, has functions which require execution of operations on files stored in file system 205 deployed on server 202. The file system consists of directories 208 and files 209 in these directories. For the sake of clarity, the figure does not depict the directory hierarchy. Operations performed on files are monitored by component 206. This component is configured to monitor specified operations performed on files stored in specified directories. Additionally, the configuration of this component allows file types to be specified, so that only operations on files of these types are monitored. It is possible to use several components 206 on one file server, for example in order to monitor independently operations on files which come from different applications. In this exemplary instance of the present invention, event listener 206 uses functions of the operating system which allow operations on files to be monitored. These functions can vary depending on the operating system; an example is the event poll interface of the Linux operating system with kernel version 2.6 or newer. Event listener 206 is connected with component 207, which manages communication with component 211 receiving events on the server 200 side. In this exemplary embodiment of the present invention, a message bus is used which implements the Advanced Message Queuing Protocol, AMQP.
The manner of sending events between servers is not an object of the present invention and can differ in other embodiments; for example, it can utilize web services or sockets. File state manager 210 creates working copies of files after receiving information from event listener 206 about a change of a file's content. Depending on the requirements for specific files and applications, the version of a file from before modification can be deleted or stored. The number of file versions to be stored is configurable. Server 200 can decide automatically whether to store previous file versions, based on analysis of the changes made to a file.

An event that was sent to server 200 is forwarded by component 211 to component 212. This event is composed of the type of performed operation and metadata which describe the file. Component 212 reads the metadata, which contain information on whether the performed operation is allowed and whether it is necessary to generate a message about its execution. In this exemplary instance of the present invention, such a message is sent to a defined e-mail recipient; it is possible to define e-mail recipients for each category of file. In this exemplary embodiment of the present invention, metadata are kept in JSON files which are related to the monitored files. In other embodiments of the present invention, and depending on the type of operating system, it is possible to use other means of metadata storage, for example Alternate Data Stream for the Windows operating system with the NTFS file system, or extended file attributes used in the file systems of Linux, Windows, Mac or AIX operating systems. Moreover, it is possible to use file containers used in content management systems. An example of such a container is an XML document which contains metadata fields and the location of the file. In this exemplary embodiment of the present invention, JSON metadata files also contain information about cryptographic security measures, that is ciphering and digital signature. Additionally, metadata are composed of the file's context and category.
Context characteristics are stored in register 216. A context consists of several pieces of information. In this exemplary instance of the present invention, information about the roles of users allowed to access files from a given context is utilized. Additionally, only those processes are taken into account within which given files can be processed. The definition of a data processing process can be composed of different information. For example, it can be information about a project or operating process in which a given file can be used, or information about clients which are related to a given process. Context definition is fully configurable and its parts are not a subject of the present invention. Each file has a category assigned to it. Characteristics of categories are stored in register 217. Categories are defined by component 215. Categories are related to the information content of a file, for example the occurrence of specific words in sentences.

Component 212 is connected to component 218, which is responsible for verification of information read from a received event. Verification is performed based on policies which contain rules with allowed reactions to events; this is described later in this document. A file whose content has been changed during processing is forwarded to component 213, which in turn performs preliminary operations for classification based on the information content of the file. These operations are described later in this document. Component 214 performs file classification based on the content prepared previously by component 213. The behavior of component 214 is described later in this exemplary instance of the present invention. In this exemplary embodiment of the present invention, components 207 and 211 are used to send file content. The manner of delivering file content to component 213 is not an object of the present invention. In other embodiments of the present invention, component 213 can, for example, connect directly to file system 205.

Fig. 3 depicts the control flow of the file classifying process. In step 300, in component 212, the type of event is verified, that is, the type of operation performed on a file. If the content of the file was not modified, the process proceeds to step 303. Operations which do not change the content of a file are, for example: saving a file in another file format as a separate file (for example, DOCX as PDF), print screen, or sending a file as an attachment to an e-mail. Otherwise, before step 303 is reached, the file content must undergo preparation for classification 301 in component 213 or classification of file content 302 in component 214. Step 303 is performed in component 218. Rules of operations on a file, together with the context and categories assigned to the file, compose the policy associated with the file. In this exemplary embodiment of the present invention, the following actions are specified within policies:

• disabling operations on files,

• requesting acknowledgment of operations by users who are informed about them via e-mail,

• enabling operations and informing certain users about them,

• informing about operations only those users who are related to the given category of a file,

• enabling operations provided that an additional security mechanism in the form of ciphering is used,

• enabling operations provided that an additional security mechanism in the form of digital signature is used,

• enabling operations provided that additional security mechanisms in the form of ciphering and digital signature are used.

The present invention is not limited to these exemplary types of actions; in other instances of the present invention it is possible to define other actions. Steps 301 and 302 are also executed for newly created files. In this exemplary embodiment of the present invention, events can be ordered by event listener 206 and sent for analysis as a whole sequence. An example of such a sequence is file open, file content modification and file closure.

The flow of the file classifying process according to the content of an information asset is described based on fig. 4. This process consists of four activities: 400, 411, 416 and 419, realized in components 213, 214, 215 and 218, respectively. After event processing in component 212, the information content of file 401 is forwarded by component 211 to component 213. In this exemplary embodiment of the present invention, information content is passed in the form of text placed in a JSON file. The invention is not limited to this form of passing information content; depending on needs, content preprocessing engine 213 can parse different formats, including the original type of the monitored file. Where monitored files are in semi-structured form, for example XML, JSON or XLSX based on templates specifying the location of certain data within a sheet, event processing engine 212 verifies whether the operations listed in the event are allowed for certain file fragments and only after receiving a positive response from policy engine 218 sends the information content of the file to component 213.

In order to prepare the information content of a file for analysis, content preprocessing engine 213 generates tokens 403 from words, removes stopwords 405 from the content and counts the occurrences of words 407 in the information content modified this way. Tokens 404 are sequences of lexical symbols created by removing punctuation marks, i.e. tokens are sequences without spaces, that is words. Reduction of stopwords 405 removes from the obtained tokens all words whose contribution to information content is irrelevant. Examples of such words are: and, or, is, the, at, on. Words obtained after reduction are called terms 406. The set of terms comprises a formal representation of the information content of a file. In other embodiments of the present invention, it is possible to use other mechanisms of preparing information content, e.g. stemming, that is conversion of nouns to nominative singular form and verbs to infinitive form, or to omit specific operations, e.g. stopword reduction.
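The preprocessing described above (token generation, stopword reduction, counting of term occurrences) can be sketched as follows. The stopword set contains only the examples named in the text; lowercasing the input and the regular expression used for tokenization are assumptions of this sketch.

```python
import re
from collections import Counter

# Small illustrative stopword list; the text names only these examples.
STOPWORDS = {"and", "or", "is", "the", "at", "on"}


def preprocess(text):
    """Tokenize, drop stopwords and count the remaining terms."""
    # Tokens: runs of letters or digits, so punctuation and spaces vanish.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Terms: tokens that survive stopword reduction.
    terms = [t for t in tokens if t not in STOPWORDS]
    return Counter(terms)


counts = preprocess("The invoice is due on Monday, and the invoice is final.")
```

The resulting counts are the raw term frequencies from which a weighted vector representation of the asset can then be computed.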
Next, the frequency of occurrence of terms is defined for such a representation. In this exemplary instance of the present invention, term frequency - inverse document frequency, TFIDF, is used. This method allows term frequency, TF, and inverse document frequency, IDF, to be considered simultaneously. The latter value reflects how many files contain a specific term: the fewer files a term appears in, the greater its IDF value. TFIDF is not the only method of calculating term frequency that can be used in embodiments of the present invention. In other instances of the present invention, it is possible to use term frequency TF only, phrase search only, counting supported by term tags other than specific parts of speech, or approximate string matching.

The result of the TFIDF operation is vector space model VSM 408, which represents information content as a vector in multidimensional Euclidean space where the dimensions (axes) represent terms. VSM 408 is provided to classifying engine 214, where linear regression with support vector machines, SVM, is used. The vector measure obtained from the output of component 213 and provided to component 214 allows documents to be treated as points in a metric space in which similarity is defined by Euclidean distance. Linear regression allows the space to be divided by planes into sections which correspond to different classes (categories). The support vector machine allows nonlinear boundaries between the categories selected by regression to be represented. The present invention is not limited to classification of file categories based on regression and support vector machines. In other embodiments of the present invention it is possible to use other classifiers, e.g. a Naive Bayes classifier. It is essential to choose a classifier which allows file categories to be singled out as a function of information content.

File information content 401 is also provided to semantic engine 215, where the update of the random indexing matrix RI 412 takes place.
RI matrices are based on random projections used in, e.g., sparse distributed memory. The aim is to create two matrices: the matrix of index vectors, MIV, and the matrix of term vectors, MTV. These matrices help to reflect the semantic space of a document set. MIV is a sparse matrix which is filled randomly with +1, -1 and 0 values, under the condition that there are many more 0 values than other values. MTV contains the vectors of terms, where at the beginning all values are set to zero. For each file F which contains term T, index vector number F from MIV is added to the vector which corresponds to term T in the MTV matrix. As a result, term vectors for terms occurring in similar file categories have similar content, which can be seen by looking at the +1 and -1 values. Use of this method aims to update the semantic boundaries of categories in an adaptive manner 413, and to support classification using linear regression and SVM (or analogous methods) by preliminary setting of category boundaries. Categories updated this way 414 are saved in categories register 415. In other embodiments of the present invention, the semantic space can be determined using other methods, provided that it is possible to obtain the same results as for random indexing. Examples of such methods are Singular Value Decomposition, SVD, and Latent Semantic Analysis, LSA. Combining classification focused on particular information content with supporting classification done by random indexing, which aims to update categories for the whole set of files, i.e. the corpus of their information content, allows better computational performance of the classification performed in step 407 and an appropriate level of matching of categories to the actual total information content of all files.

The present invention introduces mechanisms for grouping users 416 as a function of the file categories related to these users' activities.
For that purpose, the k-means clustering 417 method is used in this exemplary instance of the present invention. It is an unsupervised machine learning method which allows objects to be grouped according to given parameters. The objects in this case are users, and the parameters are file categories. This way, dynamic groups 418 are created, which can change over time. In this exemplary embodiment of the present invention, grouping is used to define the users who must be informed about file creation or modification. In other embodiments of the present invention, it is possible to use other grouping methods, e.g. hierarchical clustering. Groups of users and categories of information content are verified in policy engine 218 for conformance of the operation performed on a file with the operations allowed by the rules included in the policy.

The classification method covered by the present invention is based on information content. This way, policies which contain rules with allowed operations and groups of users are not limited to a single file with given information content; e.g. when a DOCX file is saved as PDF, the policies for the content of the DOCX file are inherited by the PDF file.

The architecture presented in fig. 1 and fig. 2 is not the only possible architecture for the present invention. In other embodiments of the present invention, components can be located differently than presented in this exemplary instance, e.g. all except policy engine 218, policy registry 219, semantic engine 215, context register 216 and category register 217 within server 202. Solutions in which end terminals are directly connected to file server 202, or in which this server is absent and files are stored on end terminals, are also allowed. In yet another exemplary embodiment of the present invention, it is possible for server 200 to monitor the content of e-mails or instant messages.