

Title:
A SYSTEM AND METHOD OF EXTRACTING RAW DATA
Document Type and Number:
WIPO Patent Application WO/2013/178993
Kind Code:
A1
Abstract:
A system and corresponding method to extract raw data from an electronic text document in a predetermined format (particularly XML format), the document comprising raw data in the form of tag data held in a hierarchical framework formed from a plurality of tag sets, wherein the system comprises: a tag analysis module operable to analyse the formatted document to identify: a plurality of tags in the hierarchical framework; and the tag data in the hierarchical framework; a grouping module operable to group at least two of the plurality of tags, wherein the at least two tags are grouped into a tag set, each tag set having a tag index specifying the location of the tag set in the hierarchical framework of the document; and an output module operable to associate a tag set with the tag data of that tag set.

Inventors:
VAN MOLENDORFF STEFFAN OCKERT (GB)
Application Number:
PCT/GB2013/051341
Publication Date:
December 05, 2013
Filing Date:
May 22, 2013
Assignee:
VAN MOLENDORFF STEFFAN OCKERT (GB)
International Classes:
G06F17/30; G06F40/143
Other References:
LIANG JEFF CHEN ET AL: "Mapping XML to a Wide Sparse Table", 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2012) : ARLINGTON, VIRGINIA, USA, 1 - 5 APRIL 2012, IEEE, PISCATAWAY, NJ, 1 April 2012 (2012-04-01), pages 630 - 641, XP032198128, ISBN: 978-1-4673-0042-1, DOI: 10.1109/ICDE.2012.24
THOMAS W COX: "Advanced XML Processing with SAS 9.3", SAS GLOBAL FORUM, 22 April 2012 (2012-04-22), Orlando FL, USA, pages 1 - 11, XP055075762, Retrieved from the Internet [retrieved on 20130820]
ANONYMOUS: "SAS 9.2 XML LIBNAME Engine, User's Guide, Second Edition", 2010, pages i - 139, XP055075764, Retrieved from the Internet [retrieved on 20130820]
Attorney, Agent or Firm:
HOARTON, Lloyd (Sherborne House, 119-121 Cannon Street, London, Greater London EC4N 5AT, GB)
Claims:

1. A system to extract raw data from an electronic text document in a predetermined format, the document comprising raw data in the form of tag data held in a hierarchical framework formed from a plurality of tag sets, wherein the system comprises: a tag analysis module operable to analyse the formatted document to identify: a plurality of tags in the hierarchical framework; and the tag data in the hierarchical framework; a grouping module operable to group at least two of the plurality of tags, wherein the at least two tags are grouped into a tag set, each tag set having a tag index specifying the location of the tag set in the hierarchical framework of the document; and an output module operable to associate a tag set with the tag data of that tag set.

2. The system of claim 1, wherein the predetermined format is based on SGML, XML, HTML or RSS.

3. The system of any preceding claim, wherein the document has a filename and the output module labels the tag data and the tag set with the filename.

4. The system of any preceding claim, wherein the tag data is associated with the plurality of tags by being located in the document within a respective tag.

5. The system of any preceding claim, wherein the hierarchical framework is formed from the relative locations of the plurality of tag sets.

6. The system of any preceding claim, wherein the tag index is an identifier which is unique to the document.

7. The system of any preceding claim, wherein the output module is operable to reduce a plurality of tag sets into a single tag set.

8. The system of claim 7, wherein the tag sets have identical tag names and the output module concatenates all the tag data associated with the plurality of tag sets into a single tag data.

9. The system of any preceding claim, wherein the tag analysis module uses a temporary vector to hold the plurality of tags and the tag data.

10. The system of any preceding claim, wherein the tag set and the associated tag data are held in a flat file having no hierarchical framework.

11. The system of any preceding claim, wherein the tag data is a complex embedded data value or a NULL value.

12. The system of any preceding claim, wherein the tag data is a variable with conditional logic, or the tag data is a variable dependent on related tag data.

13. The system of any preceding claim, wherein the tag set and the associated tag data are held in a file with a hierarchical framework.

14. The system of claim 13, wherein the hierarchical framework of the file is different from the hierarchical framework of the original document.

15. The system of any preceding claim, wherein the tag comprises a tag name and a tag index.

16. A method of extracting raw data from an electronic text document in a predetermined format, the document comprising raw data in the form of tag data held in a hierarchical framework formed from a plurality of tag sets, the method comprising: analysing the formatted document to identify: a plurality of tags in the hierarchical framework; and the tag data in the hierarchical framework; grouping at least two of the plurality of tags into a tag set, each tag set having a tag index specifying the location of the tag set in the hierarchical framework of the document; and associating a tag set with the tag data of that tag set.

17. A computer storage medium carrying instructions operable to perform the method of claim 16.

Description:
"A system and method of extracting raw data"

Field of the Invention

The invention relates to a system and method of extracting raw data from a formatted document.

Background

An example of a formatted document is the Extensible Markup Language (XML) format, which has been adopted by many industries because its design specification emphasises simplicity, generality and usability over networks. XML is not limited to network use, and many organisations, institutions and governments use it to exchange information between standalone devices capable of interpreting XML documents. HTML and XHTML are other examples of Standard Generalized Markup Language (SGML)-based languages.

One of the key features of the XML format is its ability to represent arbitrary data structures in a human-readable syntax. The data structure is normally represented in a hierarchical format using a tagging system which holds the user-specified data. A list of XML markup languages is available at: http://en.wikipedia.org/w/index.php?title=List_of_XML_markup_languages&oldid=551591574.

Some familiar document formats that use XML are RSS, XHTML, the OpenDocument format and documents produced using Microsoft Office. Each format has a specific set of tags that are interpreted by the application receiving the document.

Methods and functions to extract structured and unstructured data from text-based sources have been used for many years. Perl Regular Expression functions, one such example, have been incorporated in the SAS Base V9 programming language. Other functions used to extract data have been developed, but they are generally used in a manual extraction process or an overnight batch process. Many organisations, institutions and governments also need to extract valuable data from XML-based documents and add the data to a database system. The hierarchical structure of XML makes extracting data a non-trivial task. There are systems available that use a schema, based on a particular format and tag-naming regime, to extract data. The schema is specific to a particular XML format, and any variation or addition to the XML format requires updating the schema. Updating the schema to reflect changes to the XML document is generally a manual process which has the potential to introduce errors.

With the number of XML documents being processed by companies numbering in the thousands, an error in the schema will result in flawed data, which is either incomplete or has been incorrectly or only partially extracted. The consequence of end users relying on flawed data is potentially life-changing, especially if the data relates to a patient's medical records, for example.

There is a need to provide improved data extraction from text in formatted documents such as XML-formatted documents. For example, Schema or MAP files used to extract data are auto-generated by various applications, e.g. SAS XMLMapper, and an incorrect auto-generated Schema or MAP file can output flawed data. The present invention mitigates the risk of flawed data caused by manually updated schema files by introducing a system that does not require specific XML document schemas.

The common interfaces that are necessary to extract data from XML files are generally referred to as 'MAP' or 'Schema' files. Various technology companies have developed their own mechanisms for creating MAP or Schema files to read data from hierarchical XML file structures. For example, the SAS Institute has developed the 'XMLMapper' application for this purpose.

The purpose of the MAP file is to identify the variables (i.e. variable name, data type, length) and the associated variable values in the XML file structure during the data extraction, transformation and load (ETL) process. For each XML file structure a separate MAP or Schema file must be created to extract the raw data. When organisations have many different types of XML file structure, it becomes an onerous task to create and/or amend MAP files if, for example, new variables are introduced in Microsoft InfoPath XSN, Word or Excel template files.

For example, if a user has more than 30 different XML file structures, the process of defining a MAP file or modifying a Schema must be repeated for each XML file type. The process is normally manual and consists of a series of steps. Furthermore, for each future amendment to an XML file, the process of modifying the MAP file or Schema must be repeated to capture the new XML file structure. This is a costly, time-consuming and resource-intensive exercise.

The process of extracting data from a three-level hierarchical file structure with no potential for future change is relatively straightforward and known in the state of the art. However, in many instances there are more than three levels in the XML hierarchy, and extracting data from embedded hierarchical structures within XML files is complicated; data is not always adequately extracted using MAP or Schema files. One such complication arises where there are variables with conditional logic and dependencies on other variable values, which are embedded within subgroups, within groups and within other groups and/or subgroups.

Some of the most complex hierarchical XML files can contain in excess of 800 individual variables with multiple variable values and conditional logic in a single XML file structure.

The nature of XML is such that there is no upper limit to the number of permutations that an XML file structure can adopt. This in turn opens the possibility of having, in theory, an unlimited number of variables.

The number of different types of XML file used in industry can give rise to in excess of 25 different XML file structures, which could contain in excess of 3000 variables, making it an almost impossible task to search manually through XML file structure source code to identify where MAP files have been generated incorrectly.

Errors can occur, for example, where the variable length has been set too short, in which case the data value is only partially extracted from the XML source file: if the defined data length is 3 digits but the data in the XML file has 9 digits, the last 6 digits will not be extracted. Incomplete extraction introduces errors.
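
By way of illustration only, and not as part of any claimed embodiment, the following minimal Python sketch shows how a field length set too short in a schema silently truncates a value during extraction; the variable names are hypothetical.

```python
# Hypothetical illustration: a schema field length that is too short
# silently truncates the extracted value.
schema_field_length = 3          # length recorded in the MAP/Schema file
raw_value = "123456789"          # nine-digit value held in the XML source

extracted = raw_value[:schema_field_length]
print(extracted)                 # "123" - the last six digits are lost
```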

Generally, data issues are only identified during quality assurance or user acceptance testing. Unfortunately, as a result of the large quantities of data, some issues are only identified in reporting, which then brings the integrity of the database into question.

There are few resources available on the successful extraction of data, or of data with conditional logic, from complex hierarchical XML file structures. In general, automated MAP file application software does not have the intelligence to extract a multi-level hierarchical file structure successfully. As a result, complex embedded data values are often not extracted at all; field lengths are not allocated accurately, which results in partial extraction of data values, more specifically text field values; and multiple data values associated with a single variable are often only partially extracted.

Within industries the need exists to extract data from multiple XML files which are differentiated by their XML file structures without reverting to manually modifying MAP or Schema files, and ideally doing away with the intermediate MAP and Schema system entirely.

One aspect of the present invention provides a system to extract raw data from an electronic text document in a predetermined format, the document comprising raw data in the form of tag data (variable value) held in a hierarchical framework formed from a plurality of tag sets (variable name), wherein the system comprises: a tag analysis module operable to analyse the formatted document to identify: a plurality of tags in the hierarchical framework; and the tag data in the hierarchical framework; a grouping module operable to group at least two of the plurality of tags, wherein the at least two tags are grouped into a tag set, each tag set having a tag index specifying the location of the tag set in the hierarchical framework of the document; and an output module operable to associate a tag set with the tag data of that tag set.

The present invention further provides a system and method as claimed.

Brief Description of the Drawings

In order that the present invention can be more readily understood, embodiments thereof will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1a is a schematic of a manual XML extraction method;

Figure 1b is a schematic of an example of the invention and direct extraction from a hierarchical XML file;

Figure 1c is a schematic workflow showing a preferred order in which modules operate in a preferred embodiment of the invention;

Figure 2 is a schematic of an input module (102) which extracts an XML document (201) filename and structure, labels the table with the same name (202) and populates the table with extracted line data (203);

Figure 3 is a schematic of a first embodiment of the invention showing further processing modules used to output a modified table in a form readily readable by a database system, for example a SAS style table;

Figure 4 is a schematic showing output table (106) with duplicate variables consolidated into a de-duplicated output table (107);

Figures 5a and 5b are functional views of a first embodiment from a user's perspective;

Figures 6a and 6b are detailed functional views of the first embodiment of the invention, detailing module processes and incorporating the user's perspective to give a better understanding of the working of the invention;

Figures 7a and 7b are schematics of the key tables created in an extraction process embodying the invention.

Detailed Description

The present invention relates to a system and method to extract raw data from text in a syntactically distinguishable document. Examples of the invention are discussed in relation to the XML document format. In the XML example, the system extracts XML tags and XML data from an XML document. The extraction process particularly relates to extracting the XML tags and XML tag data into a set of tables which are readily readable by database applications.

The present invention analyses an XML document and determines the tags and the tag data. The tags are grouped into sets by virtue of the tag names which define the opening and closing of the tag set. The tag data is preferably defined as the data contained between the opening and closing tags of the tag set. In a document with a deep hierarchical structure the tag set has an index indicative of its position in the hierarchy. A tag set preferably comprises tag names and a tag index associated with the tag names.
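
Purely by way of illustration, a tag set of this kind could be represented in memory as in the Python sketch below; the names and the path-style tag index are assumptions for the example and do not describe the claimed implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TagSet:
    """Illustrative representation of a tag set: a tag name plus a tag index
    recording the set's position in the document hierarchy."""
    tag_name: str            # e.g. "my:amount"
    tag_index: str           # e.g. "1.1.1" - position in the hierarchy
    tag_data: Optional[str]  # data between the opening and closing tags, or None

# <group><subgroup><my:amount>£1,000</my:amount></subgroup></group> might be
# captured, among other entries, as:
example = TagSet(tag_name="my:amount", tag_index="1.1.1", tag_data="£1,000")
```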

The tag set and tag data are output to an output table. In particular embodiments the output table is transposed and de-duplicated into a form that is readily readable by database applications. Preferably the name of the output table corresponds to the structure name of the XML document. The structure name is often associated with the XML structure type; for example, in the context of legal documents, agreements have a pre-defined XML format and specific tags. As an example, the structure name could be based on an FX agreement, a futures agreement, a non-disclosure agreement, etc. This feature helps classify the output table.

This document does not aim to claim known functions but rather a new generic method and system, preferably using modules. The modules consist of a combination of steps and functions to extract character and numeric data from various structured and unstructured data sources, in particular, but not limited to, XML-based file structures.

Software versions of Microsoft InfoPath, Word and Excel are based on an XML file structure and can be identified by the following file type extensions: ".xml", ".doc", ".xls", ".docx" and ".xlsx"; however, the system is not limited to these applications. The method and system to extract data is based upon a unique platform consisting of modules that preferably use embedded computer languages which, in combination with various optional applications, web server configurations and functions, provide a dynamic platform. The key functionalities delivered by the dynamic platform are: an efficient data extraction method from XML file based structures and other source system databases and files; extracting multiple variable values into a single variable; and automated database updating.

The dynamic platform and the data extraction method and system require the setting of initial client environment parameters (library path names, data directory names, source file directories, etc.). The system's application interface can be adjusted to suit the user's preferences, expectations and requirements.

The benefits of examples of the invention are manifold and include: data integrity; resource efficiency; cost effectiveness; low maintenance; navigation simplicity; robust data extraction and normalisation; automated database updating and database enhancements; powerful reporting functionality; and system scalability.

Further benefits of examples of the invention relate in particular to legal risk management globally across industries, sectors, governments, organisations, institutions, etc., where such legal data is captured in an electronic format, in particular XML format(s), but not limited only to XML format(s); other formats may include Microsoft InfoPath, Word and/or Excel, or other relevant formats associated with the capture of legal data in an electronic document or template format.

The invention is explained by way of various embodiments. The invention is not limited to the components or steps of the preferred embodiments or other examples in this document. The method and system use a series of modules to extract data from an XML document. Figure 1 is a schematic of a preferred embodiment of the invention. A setup module (101) is used to set up initial parameters which are based on a user's needs. The setup module is optional and the initial parameters can vary depending on the type of filing system.

The initial setup parameters may include, amongst other things, the initial setup of the architectural environment, application software deployment, library path names, library folder names (root directory), data library names and modules used to extract XML data.

Once the setup module parameters are set the remaining modules operate to extract data from any type of XML file structure. The data to be extracted can optionally be specified based upon the user's request through a user interface or other input means.

The extraction process for varying numbers of complex XML file structures uses an optional virtual storage module, for example a Program Data Vector (PDV) of the SAS system. The SAS system is developed by the SAS Institute, which specialises in systems based on data stored in tables. The PDV is a virtual list of storage areas, preferably in a transient storage medium, for varying virtual variables such as tag data, a tag name, a tag set or tag attributes; it is not limited to tag-related data and can be used to store any type of variable or data.
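
The PDV itself is a SAS concept; as a rough, non-authoritative analogue, the virtual storage module can be pictured as a set of named slots, as in the Python sketch below. The slot names are illustrative assumptions, not the SAS-defined PDV layout.

```python
# Rough analogue of the Program Data Vector (PDV): transient named storage
# slots for the variables handled during extraction. The slot names are
# illustrative assumptions, not the SAS-defined PDV layout.
pdv = {
    "tag_name": None,    # current tag name
    "tag_index": None,   # position of the tag set in the hierarchy
    "tag_data": None,    # value held between the opening and closing tags
    "file_name": None,   # XML file currently being read
    "file_count": 0,     # number of files found in the current folder
}
```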

Tag names vary for every XML file structure. The modules used after the initial setup module identify and assign virtual variables which correspond to each unique XML hierarchical file structure; these may include a tag name, tag index and tag data, but are not limited to tag-related data and can be used to store any type of variable.

The extraction process is able to communicate both complex and numerous virtual variable names, their associated multiple variable values and table names across all stages of the dynamic system, from data extraction to database build, database update and report delivery, as per the user input request parameters entered through the user interface.

The new method scans the top-level data directory or file archiving system (also known as the root directory) that hosts the individual folders and the data files in each folder. For each folder that is scanned, virtual folder variables are created which uniquely identify the specific folder. The files within the folder are counted and the number of files is stored in one of the virtual variables.

All the file names are read from the specific folder and then each file name is uniquely stored in a virtual file variable.
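
A minimal sketch of this scanning step, assuming a local filesystem and hypothetical path and variable names, might look as follows; it is an illustration of the idea rather than the claimed implementation.

```python
import os

def scan_root(root_dir):
    """Walk the top-level data directory (root directory), recording each
    folder, the number of files it holds and the individual file names.
    A minimal sketch only."""
    folders = {}
    for folder, _subdirs, files in os.walk(root_dir):
        folders[folder] = {
            "file_count": len(files),      # number of files in this folder
            "file_names": sorted(files),   # each file name stored once
        }
    return folders

# Example (hypothetical path): folders = scan_root("/data/xml_archive")
```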

Figure 2 is a schematic of an optional input module (102). The input module (102) starts to read the XML file structure (201) of each individual file in the specific folder. Starting with the first file, the input module continues to read the files until the last file in the folder has been read. The input module is not limited to reading files sequentially from a fixed storage device; it can, for example, take XML input through a live feed communicated across a network.

The input module is operable to assign a temporary virtual table name (202) for all files that are identified and read from the folder containing the list of XML files. Each line, or partial line if required, is read from the XML file into a single variable ('Value') (203) which contains all the characters in that line of the XML file structure.

A tag analysis module (103) is operable to analyse the XML document to determine a plurality of tag names and a plurality of tag data. There are various methods of analysing the XML document to extract tag names and tag data. A preferred method reads the structure of the XML file (102) and assigns five additional virtual variables in memory. The virtual variables are required to read the XML pattern value, the XML pattern identifier (for example the characters '<' and '>'), the XML pattern comparison value, the variable name (for example a tag name) and the variable value (for example tag data), and to retain the virtual variable values in a temporary storage medium. The preferred tag analysis method searches for the pattern identifier until it locates the first data line of the XML file and continues to read it into the PDV by assigning the virtual variables, namely: the XML pattern value, the XML pattern identifier and the XML pattern comparison value.

As an example, tag data such as '£1,000' is stored between the XML pattern open ('>') and pattern close ('<') characters (i.e. ...>£1,000<...), and Perl Regular Expression functionality can be used to extract the common pattern value and save it in a variable that will be used as a match for comparing all the subsequent pattern values in the individual lines of the XML file. The pattern value is also used to create a pattern identification variable in the PDV. The pattern identification variable is used as an identifier in one of the known Perl Regular Expression functions.
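
The preferred embodiment uses SAS Perl Regular Expression functions; the sketch below shows the same pattern-matching idea in Python with a hypothetical line of XML, and is not the claimed implementation.

```python
import re

# A single line of an XML file (hypothetical example).
line = "<my:amount>£1,000</my:amount>"

# Capture the tag name and the tag data held between the pattern open ('>')
# and pattern close ('<') identifiers on that line.
match = re.search(r"<([^/!?][^>]*)>([^<]*)</", line)
if match:
    tag_name, tag_data = match.group(1), match.group(2)
    print(tag_name, tag_data)    # my:amount £1,000
```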

The grouping module (104) is operable to group at least two of the plurality of tags, wherein the plurality of tags are grouped into the plurality of tag sets and each tag set has a tag index according to the tag set's position in the XML document.

The grouping module (104) uses the pattern value to group at least two of the tag names, each tag name preferably having an associated tag index value (102c), into a tag set (102d).

The first variable name, for example a tag name at the highest level in the hierarchy, is extracted from the XML file structure and group names are identified within the XML structure using the Perl Regular Expression functionality. Identification of the XML structure can be carried out using a number of methods and is not limited to using a Perl Regular Expression function. Group names in an XML file hold multiple data values for a specific field, or tag set, in an XML template file. The group names may also have a number structure, for example a tag index associated with a tag name, to identify groups (tag sets) within groups (tag sets) with multiple values.

Each group number is listed as an independent variable name, but the grouping module will identify groups that are associated with one another and assign the same group name to each group in the PDV. The grouped tag names, and their associated tag index, form a tag set.
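
For illustration only, the Python sketch below assigns a path-style tag index to nested group and tag names by keeping a counter per level of the hierarchy; a production system would use a proper XML parser, and the tag names shown are hypothetical.

```python
import re

def index_tags(lines):
    """Assign each opening tag a path-style tag index reflecting its position
    in the hierarchy (groups within groups). A sketch only."""
    open_tag = re.compile(r"<([A-Za-z][\w:.-]*)[^>/]*>")
    close_tag = re.compile(r"</([A-Za-z][\w:.-]*)>")
    stack, counters, out = [], [0], []
    for line in lines:
        for token in re.findall(r"</?[^>]+>", line):
            if close_tag.fullmatch(token) and stack:
                stack.pop()
                counters.pop()
            elif (m := open_tag.fullmatch(token)):
                counters[-1] += 1                       # next sibling at this level
                stack.append(m.group(1))
                out.append((m.group(1), ".".join(str(c) for c in counters)))
                counters.append(0)                      # start counting children
    return out

xml = ["<group>", "<subgroup>", "<my:amount>£1,000</my:amount>",
       "</subgroup>", "</group>"]
print(index_tags(xml))
# [('group', '1'), ('subgroup', '1.1'), ('my:amount', '1.1.1')]
```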

This module searches the specific line of the XML file in the PDV for certain special known characters (i.e. '</my:') and saves the extracted variable name, for example a tag set, and variable value, for example tag data, in the virtual variables associated with the variables in the PDV.

Once all of the XML document has been read and the variables extracted, the contents of the PDV are transferred to an output table (106). Each output table has a name that is represented by the filename, or a structure representing the filename, of the XML document. Each variable with multiple values is listed in the table as duplicate tag names but with different tag data values.

An output module (105) is operable to provide an output table (106) comprising the filename, the plurality of tag sets and the plurality of tag data. The output table (106) has two columns, namely 'Variable' for tag names, preferably with an associated tag index, and 'Value' for tag data.
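
As a purely illustrative picture of this two-column layout, with hypothetical tag names and values, the output table might contain rows such as the following sketch.

```python
# Hypothetical rows of the two-column output table: 'Variable' holds the tag
# set (tag name, preferably with a tag index) and 'Value' holds the tag data.
output_table = [
    {"Variable": "my:party_1.1",  "Value": "ACME Ltd"},
    {"Variable": "my:party_1.2",  "Value": "Example Plc"},
    {"Variable": "my:amount_1.1", "Value": "£1,000"},
]
```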

In this manner, the modules have extracted a plurality of tag data and a plurality of tag sets from an XML document.

Figure 3 is a schematic of a first embodiment, incorporating the modules of the preferred embodiment. The output table of the preferred embodiment contains duplicate tag names with associated tag data. The first embodiment consolidates the output table from the preferred embodiment.

Figure 4 is a schematic of an example of an output table (106) and a de-duplicated table (107). The duplicate variable names (401) in the output table, for example the tag sets, and the associated values, for example the tag data, are consolidated into a single variable name (tag set) (402) with multiple values (tag data values) per record in the output table (106). It is known that when automated MAP files are created, not all the data values are necessarily read from the XML file structure, owing to the complexity of certain XML file structures within industries.

A series of Do-End loops is used to identify the beginning and end of a specific duplicate variable name, for example a tag set. The first duplicate tag set and value, for example the tag data value, that is encountered is placed in virtual variables. For every subsequent line with the same tag set that is read from the output table, the tag data value is appended to the first tag data associated with the tag set in memory.

When the end of the specific duplicate tag set is reached, the tag set and tag data value(s) in memory overwrite the tag data in an intermediate output table.

The result is that there is only one record per unique tag set with the associated tag data value(s) in the de-duplicated output table (107).
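
A minimal Python sketch of this consolidation step, assuming the two-column rows shown earlier and a hypothetical separator, is given below; it is not the claimed Do-End loop implementation.

```python
def deduplicate(rows, separator="; "):
    """Collapse duplicate 'Variable' entries into one record per tag set,
    appending every 'Value' that shares the same tag set. A sketch only."""
    merged = {}
    for row in rows:
        key, value = row["Variable"], row["Value"]
        merged[key] = value if key not in merged else merged[key] + separator + value
    return [{"Variable": k, "Value": v} for k, v in merged.items()]

rows = [
    {"Variable": "my:party_1.1", "Value": "ACME Ltd"},
    {"Variable": "my:party_1.1", "Value": "Example Plc"},
]
print(deduplicate(rows))
# [{'Variable': 'my:party_1.1', 'Value': 'ACME Ltd; Example Plc'}]
```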

The de-duplicated output table can be transposed and prepared in a form that is readily readable by a database application specified by the user or defined by the setup module.

The next step is to transpose (108) the layout of the table. This embodiment uses the de-duplicated table (107). The unique variable names (rows), for example the tag sets, are transposed to columns, and the variable values, for example the tag data, form the new record in the transposed output table. The transposition of the de-duplicated output table results in several transposed tables (109). Every transposed output table has multiple unique columns but only one record. The name of every transposed output table corresponds to the file name in the specific folder which has been processed.

All the transposed tables derived from the output table are concatenated into a virtual variable in memory. A new database-ready table (110) is created which contains all the records that represent the individual XML files of a similar structure in a specific folder. The database-ready table is appended with an additional variable which is the value of the name of the transposed table from which the record was derived.
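
By way of illustration, the transposition and concatenation steps can be sketched as follows in Python, with hypothetical table names derived from the source files; this is a sketch of the idea, not the claimed implementation.

```python
def transpose(dedup_rows):
    """Turn the two-column (Variable, Value) table into one wide record whose
    columns are the tag sets. A sketch only."""
    return {row["Variable"]: row["Value"] for row in dedup_rows}

def build_database_ready_table(tables_by_name):
    """Concatenate the transposed records into a database-ready table, adding
    the name of the source table/file each record was derived from."""
    database_ready = []
    for table_name, dedup_rows in tables_by_name.items():
        record = transpose(dedup_rows)
        record["source_table"] = table_name   # transposed table the record came from
        database_ready.append(record)
    return database_ready

# Hypothetical example with two files of similar structure:
tables = {
    "agreement_001": [{"Variable": "my:amount_1.1", "Value": "£1,000"}],
    "agreement_002": [{"Variable": "my:amount_1.1", "Value": "£2,500"}],
}
print(build_database_ready_table(tables))
```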

In this manner the modules have extracted a plurality of tag data and a plurality of tag sets from an XML document. The optional modules have further transposed the extracted plurality of tag data and plurality of tag sets into a form more readily readable by a database.

Figures 5a and 5b are functional views of a preferred embodiment from a user's perspective. The setup module is configured to set up the user environment (501). A file structure (502) with a top level (503), sublevels (504) and files (505) is scanned and the filenames are inserted into virtual variables. A document (505a), with variable names (506) and variable values (507), has a formatted document structure, preferably in XML format (505b), with tags (505c) and tag data (505d).

Using a SAS system, or other database management software, the tags (505c) and tag data (505d) are passed through the tag analysis and grouping modules and written to a program data vector (PDV) (508) as tag sets and tag data. The output module operates on the PDV and outputs the output table (509).

Figures 6a and 6b are schematics of another embodiment detailing the operation of the modules in a processor, a storage medium and a transient storage medium.

The embodiment is based on utilising the invention in a SAS system using a program data vector (PDV), although the invention is not limited to using a PDV and other systems are compatible with the invention.

The setup module is configured to set up the user environment (600). A file structure (601) and files are scanned and filenames are inserted into virtual variables (602) in the transient storage medium. The virtual variables (602) have a heading (603) which is preferably a table name corresponding to the top level of the file structure, and each line of the structure is read thereafter. Additional virtual variables (605) are added in the transient storage medium. The grouping module (607) operates on the formatted document, preferably in XML format (606), and the tag analysis module (608) identifies the tags and tag data and writes a PDV (609).

The output module (610) operates on the PDV and outputs the output table (611). The output table contains duplicate tag sets and tag data (612).

Duplicated tag sets are de-duplicated, with the associated tag data appended to a single tag set with the same name (613). A plurality of tables are created for each tag set with associated tag data (614). All the tables are concatenated into a single permanent table (615). In another embodiment, prior to the concatenation process of the plurality of tables (614), a do-end loop runs sequentially through all of the plurality of tables.

Figures 7a and 7b are schematics of the de-duplication process. The loop (701) determines the greatest variable length and the variable type for all the variables in each specific table. The plurality of tables (614) are concatenated into a single permanent table. The permanent table comprises variables with the correct type and length, mitigating any loss of data due to incorrect type or length.
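
As a final illustrative sketch, and assuming the records are held as Python dictionaries as in the earlier sketches, the length-and-type determination performed by the loop (701) could be approximated as follows; the function and variable names are hypothetical.

```python
def column_specs(records):
    """For each variable, determine the greatest value length seen and whether
    every value parses as a number, so the permanent table can be sized and
    typed without loss of data. A sketch only."""
    specs = {}
    for record in records:
        for name, value in record.items():
            text = str(value)
            length, numeric = specs.get(name, (0, True))
            try:
                float(text)
            except ValueError:
                numeric = False
            specs[name] = (max(length, len(text)), numeric)
    return specs

records = [{"my:ref": "123"}, {"my:ref": "123456789"}]
print(column_specs(records))   # {'my:ref': (9, True)}
```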