Title:
AN ELECTRONIC DEVICE AND A METHOD FOR EXTRACTING CLAUSE DATA
Document Type and Number:
WIPO Patent Application WO/2023/148066
Kind Code:
A1
Abstract:
Disclosed is an electronic device. The electronic device comprises memory circuitry, processor circuitry, and an interface. The electronic device is configured to obtain first data indicative of a document. The first data comprises text. The electronic device is configured to generate, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text. The font style parameter is indicative of a font style of a corresponding line of text. The electronic device is configured to determine, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document. The electronic device is configured to classify, based on the style occurrence parameter, each line of text as heading or paragraph text.

Inventors:
PRADHAN ANSHUMAN (IN)
SENTHILVASAN NIRANJAN (IN)
CHINNAMGARI SUNIL KUMAR (IN)
Application Number:
PCT/EP2023/051831
Publication Date:
August 10, 2023
Filing Date:
January 25, 2023
Assignee:
MAERSK AS (DK)
International Classes:
G06F40/205; G06F16/30; G06F40/258; G06Q50/18
Foreign References:
EP3862889A1, 2021-08-11
Other References:
RODRIGUEZ GONZALEZ LUIS CARLOS: "Logical Structure Identifier and Provision Classifier of Procurement Contracts in Spanish", 6 October 2021 (2021-10-06), XP093033715, Retrieved from the Internet [retrieved on 20230322]
SHARMA ARJUN DATT ET AL: "Too Long-Didn't Read: A Practical Web Based Approach towards Text Summariza", 13 January 2014, SAT 2015 18TH INTERNATIONAL CONFERENCE, AUSTIN, TX, USA, SEPTEMBER 24-27, 2015; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 198 - 208, ISBN: 978-3-540-74549-5, XP047265635
Attorney, Agent or Firm:
AERA A/S (DK)
Claims:
CLAIMS

1. An electronic device comprising memory circuitry, processor circuitry, and an interface, wherein the electronic device is configured to: obtain first data indicative of a document, wherein the first data comprises text; generate, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text, wherein the font style parameter is indicative of a font style of a corresponding line of text; determine, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document; and classify, based on the style occurrence parameter, each line of text as heading or paragraph text.

2. The electronic device of claim 1, wherein the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises:

- determining, based on the style occurrence parameter, a font style pertaining to a first category of font style, wherein the first category is indicative of a heading;

- determining whether a line of text has a font style pertaining to the first category; and

- upon determining that the line of text has a font style pertaining to the first category, classifying the line of text as heading.

3. The electronic device of any of the previous claims, wherein the heading is indicative of a clause heading.

4. The electronic device of any of the previous claims, wherein the paragraph text is indicative of a clause text.

5. The electronic device of claim 4, wherein the electronic device is configured to provide, based on the first category, clause data indicative of the clause heading and the clause text.

6. The electronic device of claim 5, wherein the clause data comprises the clause heading and the clause text.

7. The electronic device of any of the previous claims, wherein the electronic device is configured to determine the paragraph text by including the text between a first delimiter and a second delimiter.

8. The electronic device of claims 2 and 7, wherein the first delimiter and/or the second delimiter are a line of text pertaining to the first category.

9. The electronic device of any of the previous claims, wherein the electronic device is configured to determine a font style pertaining to a second category of font style, wherein the second category is indicative of the paragraph text.

10. The electronic device of any of the previous claims, wherein the style occurrence parameter indicates a number of occurrences of a font style amongst all the lines of text in the document.

11. The electronic device of any of the previous claims, wherein the electronic device is configured to group the lines of text per font style parameter.

12. The electronic device of any of claims 5-11, wherein the electronic device is configured to determine, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading by: determining whether the style occurrence parameter of the font style satisfies a criterion; and upon determining that the style occurrence parameter of the font style satisfies the criterion, determining the font style as pertaining to the first category of font style indicative of the heading.

13. The electronic device of claim 12, wherein the criterion is based on a range of occurrence of the font style.

14. The electronic device of claim 13, wherein the range is 3-12%.

15. The electronic device of any of claims 12-14, wherein the criterion is based on a first threshold of word count in the corresponding line of text.

16. The electronic device of any of claims 12-15, wherein the criterion is based on a second threshold of words in the corresponding line of text that are in capital letters.

17. The electronic device of any of the previous claims, wherein the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises labelling, based on the first category, a corresponding line of text as heading.

18. The electronic device of any of the previous claims, wherein the first category is indicative of a sub-heading.

19. A method, performed by an electronic device, for providing clause data, the method comprising: obtaining (S102) first data indicative of a document, wherein the first data comprises text; generating (S104), based on the first data, second data comprising each line of text and a font style parameter associated with each line of text, wherein the font style parameter is indicative of a font style of a corresponding line of text; determining (S106), based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document; and classifying (S108), based on the style occurrence parameter, each line of text as heading or paragraph text.

20. The method of claim 19, wherein the classifying (S108) of each line of text as heading or paragraph text based on the style occurrence parameter comprises: determining (S108A), based on the style occurrence parameter, a font style pertaining to a first category of font style, wherein the first category is indicative of a clause heading; determining (S108B) whether a line of text has a font style pertaining to the first category; and upon determining that the line of text has a font style pertaining to the first category, classifying (S108C) the line of text as heading.

Description:
AN ELECTRONIC DEVICE AND A METHOD FOR EXTRACTING CLAUSE DATA

The present disclosure pertains to the field of electronic documents and control thereof. The present disclosure relates to an electronic device and a method for providing clause data.

BACKGROUND

Companies, which deal with many clients on a long-term and/or short-term basis, generally create many documents, such as legal documents and/or contracts. The document can be generated based on templates and/or based on a client-specific document, such as customer-specific contract. The document can include headings and paragraph text. For example, legal documents are built primarily with clauses. A clause can be a specific provision in a legal document that relates to an important point of understanding or agreement between the parties engaged in the legal document. A clause can dictate certain conditions under which the parties agree to act during the term of the legal document, such as contract.

How each clause is written in a legal document may affect a liability metric and/or a pricing metric of a consignment. Non-standard/customer-specific legal documents, unlike legal documents built from the standard template, do not follow any structure of clause headings and clause texts.

SUMMARY

Evaluating these documents to identify specific clauses and pinpoint the variation in clauses may require a tremendous amount of human involvement, which is time-consuming and prone to errors.

Accordingly, there is a need for an electronic device and a method which mitigate, alleviate, or address the existing shortcomings and provide a solution for classifying the text of a document as heading or paragraph text, such as clause heading and clause text.

An electronic device is disclosed, the electronic device comprising memory circuitry, processor circuitry, and an interface. The electronic device is configured to obtain first data indicative of a document. The first data can comprise text. The electronic device can be configured to generate, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text. The font style parameter can be indicative of a font style of a corresponding line of text. The electronic device can be configured to determine, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document. The electronic device can be configured to classify, based on the style occurrence parameter, each line of text as heading or paragraph text.

Disclosed is a method, performed by an electronic device, for providing clause data. The method comprises obtaining first data indicative of a document. The first data can comprise text. The method comprises optionally generating, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text. The font style parameter can be indicative of a font style of a corresponding line of text. The method comprises optionally determining, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document. The method comprises optionally classifying, based on the style occurrence parameter, each line of text as heading or paragraph text.

Disclosed is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods disclosed herein.

It is an advantage of the present disclosure that the disclosed electronic device and method allow for improved accuracy in identifying a heading and a paragraph text, such as a clause heading and a clause text in a document.

The present disclosure allows for improved accuracy in identifying headings, e.g., by differentiating the outliers, such as page numbers, from headings and text.

The present disclosure provides a technique that can be applied to various types of documents to classify headings and paragraph text, e.g. to label paragraph text, headings, and sub-headings. The disclosed technique can be applied to any font styles used in each document, as long as there is a variation between the font styles of the headings and the font styles of the paragraph text under the headings. For example, for a legal document, after the classification of clause headings and clause text in the legal document, the classified data (e.g. the clause data) may be stored in a database. The database may feed the clause data to one or more applications, and/or to one or more external devices. The recipient of the clause data is capable of generating a legal document and/or evaluating a legal document. Further, the classification performed for a number of legal documents can support deriving risk metrics and/or liability metrics associated with an entity, e.g. how many legal documents have a particular provision, resulting in a determination of the risk and/or liability metrics.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features and advantages of the present disclosure will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:

Fig. 1 is a diagram illustrating an example document according to this disclosure,

Fig. 2A is a table diagram illustrating lines of text of an example document according to this disclosure,

Fig. 2B is a table diagram illustrating examples of style occurrence parameters for example font styles according to this disclosure,

Fig. 3A-3B is a flow-chart illustrating an example method, performed by an electronic device, for providing clause data according to this disclosure, and

Fig. 4 is a block diagram illustrating an example electronic device according to this disclosure.

DETAILED DESCRIPTION

Various exemplary embodiments and details are described hereinafter, with reference to the figures when relevant. It should be noted that the figures may or may not be drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the disclosure or as a limitation on the scope of the disclosure. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated, or if not so explicitly described.

The figures are schematic and simplified for clarity, and they merely show details which aid understanding the disclosure, while other details have been left out. Throughout, the same reference numerals are used for identical or corresponding parts.

Fig. 1 is a diagram illustrating an exemplary document 1 comprising text. The document 1 may be a portable document format, PDF, document. The document 1 may be a legal document, e.g., a contract document. For example, the legal document can be a contract drafted using Microsoft Word.

The document 1 comprises text. The document 1 may be in a machine-readable format, e.g., a digital document. For example, the document 1 can be a PDF document, and/or a Microsoft Word document, etc.

The document 1 can comprise heading and paragraph text. In other words, the text can comprise heading and paragraph text.

Paragraph text may be seen as plain text, such as text which is not part of a heading nor a sub-heading.

The document 1 may be a document, e.g., a legal document, comprising clauses, sub-clauses, and clause text. Paragraph text can be clause text, e.g. in a legal document. A heading can be a clause heading.

The document 1 includes three headings 2, 4, and 6, such as three clause headings. The document 1 includes paragraph text 8 (e.g. clause text) corresponding to the heading 2. The document 1 includes paragraph text 10 (e.g. clause text) corresponding to the heading 4. The document 1 includes paragraph text 12 (e.g. clause text) corresponding to the heading 6.

The document 1 includes text 14. The text 14 may be seen as a page number.

The headings 2, 4, and 6 may have the same font style. The paragraph texts 8, 10, and 12 may have the same font style. The headings 2, 4, and 6 may have font style different from the font style of paragraph texts 8, 10, and 12. The headings 2, 4, and 6 may have font style different from the font style of text 14. The paragraph texts 8, 10, and 12 may have font styles different from the font style of text 14.

The paragraph texts 8, 10, and 12 may comprise one or more lines of text. The paragraph texts 8, 10, and 12 may comprise one or more paragraphs. The one or more lines of text may comprise one or more letters, words, and/or symbols, e.g., currency symbols, Roman numbers, etc.

The headings 2, 4, and 6 may comprise one or more lines of text, such as a first text line, and/or a second text line. The headings 2, 4, and 6 may comprise words having one or more bold letters. The headings 2, 4, and 6 may comprise words having the first letter capitalized. The one or more lines of text may comprise one or more font styles, e.g., Times New Roman, Helvetica, Calibri, Arial Bold Narrow, Arial Narrow, etc. For example, the headings may have the font style Arial Bold Narrow, and the paragraph text may have the font style Arial Narrow. The document 1 may include text representing a document number, e.g., a page number. The page number may have a different font style than the clause headings and the clause text.

Fig. 2A is a table diagram illustrating an exemplary table 20 comprising a line of text from a document, such as the document 1 of Fig. 1, and a font style parameter associated with the line of text of the document. The table 20 may be a representation of the second data disclosed herein which can be generated based on the first data indicative of document 1.

The font style parameter may be indicative of a font style of a corresponding line of text. The document may be obtained by an electronic device. The electronic device may be configured to determine a font style associated with one or more lines of text of the document, such as each line of text of the document. Each line of text and the font style corresponding to that line of text are represented in a table format.

The table 20 comprises two columns, such as a first column 22 and a second column 24.

The first column 22 comprises one or more rows, wherein each row shows a line of text in the document. The second column 24 comprises one or more rows, wherein each row shows the font style parameter associated with the line of text of the document in the corresponding row of the first column 22.
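As an illustration only, the two-column layout of table 20 could be held in memory as a list of (line of text, font style parameter) records. The following Python sketch is not part of the disclosure: the names StyledLine, second_data, and format_table are hypothetical, and the sample lines of text are invented for the example.

```python
from typing import List, NamedTuple


class StyledLine(NamedTuple):
    """One row of a table like table 20 (hypothetical record type)."""
    text: str        # first column 22: the line of text
    font_style: str  # second column 24: the font style parameter


# Invented sample rows; the font style names are taken from the description.
second_data: List[StyledLine] = [
    StyledLine("1. Definitions", "Arial Bold Narrow"),
    StyledLine("In this agreement the following terms apply.", "Arial Narrow"),
    StyledLine("2. Payment", "Arial Bold Narrow"),
    StyledLine("Invoices are due within thirty days.", "Arial Narrow"),
    StyledLine("1", "Calibri"),
]


def format_table(rows: List[StyledLine]) -> str:
    """Render the rows as a plain-text two-column table."""
    width = max(len(row.text) for row in rows)
    return "\n".join(f"{row.text:<{width}} | {row.font_style}" for row in rows)


print(format_table(second_data))
```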

Fig. 2B is a table diagram illustrating examples of style occurrence parameters associated with example font styles used in a document according to this disclosure.

Fig. 2B shows a style occurrence parameter 32 indicative of an occurrence of a font style in the document. For example, Fig. 2B can be considered as showing the style occurrence parameters associated with the respective font styles of the lines of text of document 1 of Fig. 2A.

The style occurrence parameter 32 may indicate the total count of occurrence of a given font style associated with the one or more lines of the text of the document.

The style occurrence parameter 32 may be represented in the form of percentages of the lines of text having a font style with respect to the total number of lines of text in the document.

The font style that provides the highest style occurrence parameter, such as the highest percentage, may indicate paragraph text for the line(s) of text having this particular font style.

The font style that provides the lowest style occurrence parameter, such as the lowest percentage, may indicate a line of text that occurs the least, e.g., page numbers.

The font style that provides a style occurrence parameter meeting a criterion may indicate a heading, e.g., a clause heading.

In one or more examples where the disclosed technique is applied, a PDF reader reads document 1, e.g. the first data. For example, the second data may be a read file, such as a list of lines of text and a list of their respective font style parameters. For example, for the given document, the lines of text can be grouped by font style parameter, and the number of lines for each font style parameter is counted. For example, from the distribution of the font style parameters, it is determined which font style parameters occur in the document fewer times than the others. This may reflect that most of the lines in a contract document are not headings. In other words, the most frequent font style parameters in the distribution can represent the paragraph text, such as plain text in the file (e.g. the read file), rather than the headings. The rare font styles can represent the headings and the sub-headings. For example, for the given file (e.g. the read file), the font styles are added to the category of rare font styles, which may constitute 4%-10% of all the lines of text in the document. To remove some of the noisy text detected in the rare font style category, some filters may be used after font style detection. For example, a heading flag can be created for a line of text only when the line is written in a font style of the rare font style category, when its word count is within a certain range (e.g. 1 <= word count <= 12 for headings), and when at least half of its words start with a capital letter.
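The filters mentioned above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the disclosure: the function name is_heading and its parameters are hypothetical, words are split on whitespace, and "starts with a capital letter" is interpreted as the first character of the word being upper case.

```python
from typing import Set


def is_heading(text: str, font_style: str, rare_styles: Set[str],
               max_words: int = 12) -> bool:
    """Return True when a line passes all three example filters.

    Hypothetical helper: rare_styles is the set of font styles already
    placed in the rare font style category.
    """
    # Filter 1: the line must be written in a rare font style.
    if font_style not in rare_styles:
        return False
    # Filter 2: word count within the example range 1..max_words.
    words = text.split()
    if not 1 <= len(words) <= max_words:
        return False
    # Filter 3: at least half of the words start with a capital letter.
    capitalised = sum(1 for word in words if word[0].isupper())
    return capitalised * 2 >= len(words)


rare = {"Arial Bold Narrow"}
print(is_heading("Limitation of Liability", "Arial Bold Narrow", rare))
print(is_heading("3", "Arial Bold Narrow", rare))
```

Under these assumptions a bare number such as a page number fails the third filter even when it is set in a rare font style, because its only word does not start with a capital letter.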

Fig. 3A-3B shows a flow diagram of an exemplary method 100, performed by an electronic device according to the disclosure, for providing clause data. The electronic device is the electronic device disclosed herein, such as the electronic device 300 of Fig. 4.

The method 100 comprises obtaining S102 first data indicative of a document. For example, the document may be seen as a digital document. For example, the document may be a legal document, such as a contract document. The document may be in machine readable format. The document may be in digital format, e.g., a PDF document, and/or a Microsoft word document. In one or more example methods, the first data may be obtained from the memory of the electronic device. In one or more example methods, the first data may be obtained from a second electronic device. In one or more example methods, the first data may be obtained from a database, such as a local server and/or a cloud database, and/or from an application programming interface, API.

In one or more example methods, the first data comprises text. For example, the first data comprises textual data, such as data indicative of the text of the document. The first data may comprise data indicative of text, such as headings and/or paragraph text. In one or more example methods, the text may be arranged into one or more lines, such as one or more lines of text. In one or more example methods, the one or more lines of text may comprise one or more headings, such as one or more clause headings. In one or more example methods, the one or more lines of text may comprise one or more sub-headings, such as one or more sub-clause headings. In one or more example methods, the one or more lines of text may comprise paragraph text, such as clause text. In one or more example methods, the one or more lines of text, such as a first line of text, a second line of text, and/or a third line of text, may have one or more font styles.

The method 100 comprises generating S104, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text. In one or more example methods, the font style parameter is indicative of a font style of a corresponding line of text. For example, the second data may be represented in a table such as table 20 of Fig. 2A, where the line of text is in column 22 and the corresponding font style parameter is in column 24.

In one or more example methods, the second data may be in a digital format, such as PDF format, Hyper Text Markup Language, HTML, format, Text, TXT, format, and/or Document, DOC, format. In one or more example methods, the second data may comprise one or more lines of text from the first data. In one or more example methods, the second data may comprise a font style parameter associated with each line of text. In one or more example methods, the font style parameter may be indicative of a font style, such as Arial Bold, and/or Arial Narrow, of a corresponding line of text.

In one or more example methods, a font style, such as Times New Roman Bold, Arial Narrow, and/or Calibri light, may be seen as an attribute of the text of the first data. In one or more example methods, the first data may comprise one or more lines of text having one or more fonts, and/or represented and/or formatted in one or more font styles. For example, as illustrated in Fig. 2A, the first line of text may be represented in Arial Bold Narrow while the second line of text may be represented in Arial Narrow.

The method 100 comprises determining S106, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document. For example the second data may be a list of lines of text of the document, with their respective font style parameter.

In one or more example methods, the style occurrence parameter may be indicative of an occurrence of a font style in the first data. In one or more example methods, the style occurrence parameter may be indicative of the number of lines of text having a given font style. For example, Fig. 2B shows the style occurrence parameter for each font style of document 1 of Fig. 2A. For example, a document may comprise 22 lines of text. The second data may be generated based on the 22 lines of text. The second data may comprise each line of text from the 22 lines of text and a font style parameter associated with each line of text. The font style parameter may be indicative of the font style of a corresponding line of text from the 22 lines of text. In other words, a font style for each line of text is determined. The 22 lines of text may have one or more font styles associated with one or more lines of text. For example, in the 22 lines of text, there may be 18 lines of text having a font style, such as a first font style. The style occurrence parameter may indicate that 18 lines of text have the first font style, such as 18/22. In the 22 lines of text, there may be 3 lines of text having a font style, such as a second font style different from the first font style. The style occurrence parameter may indicate that 3 lines of text have the second font style, such as 3/22. In the 22 lines of text, there may be 1 line of text having a font style, such as a third font style different from the first font style and the second font style. The style occurrence parameter may indicate that 1 line of text has the third font style, such as 1/22. In one or more example methods, the style occurrence parameter may represent the occurrence of a font style in the document in terms of percentages.
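The 22-line example can be reproduced with a short Python sketch. The function name style_occurrence and the placeholder style labels "first", "second", and "third" are assumptions made for illustration; the sketch reports both the count and the percentage for each font style.

```python
from collections import Counter
from typing import Dict, List, Tuple


def style_occurrence(font_styles: List[str]) -> Dict[str, Tuple[int, float]]:
    """Map each font style to (number of lines, percentage of all lines)."""
    counts = Counter(font_styles)
    total = len(font_styles)
    return {style: (n, 100.0 * n / total) for style, n in counts.items()}


# One font style parameter per line of text: 18 + 3 + 1 = 22 lines.
font_styles = ["first"] * 18 + ["second"] * 3 + ["third"] * 1
occurrence = style_occurrence(font_styles)

for style, (count, percent) in occurrence.items():
    print(f"{style}: {count}/{len(font_styles)} ({percent:.1f}%)")
```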

The method 100 comprises classifying S108, based on the style occurrence parameter, each line of text as heading or paragraph text. For example, a flag may be assigned based on the line text being classified as a heading or paragraph text. The electronic device may be a client device and/or a server device. For example, the electronic device may comprise an API configured to obtain, from a user or from an external device, a document, via the first data. For example, the electronic device may be an API configured to classify, based on the document (e.g. via the first data) and the style occurrence parameter, each line of text as heading or paragraph text. The API can be hosted in a server device or on a distributed cloud. The electronic device may be a clause extraction device.

In one or more example methods, the one or more lines may have one or more font styles. Based on the style occurrence parameter corresponding to each line of text, each line of text in the first data may be classified as a heading, e.g., a clause heading, a sub-heading and/or a paragraph text, such as a paragraph text associated with a clause and/or a sub-clause, such as clause text and/or sub-clause text. In one or more example methods, the paragraph text may be seen as non-heading text, and/or body text and/or plain text. For example, the present disclosure provides a technique that can extract the paragraph text under a sub-heading.

In one or more example methods, the style occurrence parameter indicates a number of occurrences of a font style amongst all the lines of text in the document. In other words, the style occurrence parameter indicates the number of lines of text in the first data having a particular font style, such as the first font style, the second font style, and/or the third font style.

In one or more example methods, the method comprises grouping S107 the lines of text per font style parameter.

In other words, the lines of text in the first data may be grouped based on the font style pertaining to each line of text. In one or more example methods, the font style parameter may be indicative of a font style of a corresponding line of text.
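A minimal Python sketch of such grouping is given below, assuming the second data is available as (line of text, font style parameter) pairs. The function name group_by_font_style and the sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_font_style(
    second_data: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group lines of text per font style parameter, keeping document order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, font_style in second_data:
        groups[font_style].append(text)
    return dict(groups)


# Invented sample lines for illustration.
lines = [
    ("1. Definitions", "Arial Bold Narrow"),
    ("In this agreement the following terms apply.", "Arial Narrow"),
    ("2. Payment", "Arial Bold Narrow"),
    ("Invoices are due within thirty days.", "Arial Narrow"),
]
groups = group_by_font_style(lines)
print({style: len(texts) for style, texts in groups.items()})
```

The size of each group is the number of lines of text having that font style, which is one way of obtaining the style occurrence parameter.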

In one or more example methods, the classifying S108 of each line of text as heading or paragraph text based on the style occurrence parameter comprises determining S108A, based on the style occurrence parameter, a font style pertaining to a first category of font style. In one or more example methods, the first category is indicative of a heading, such as a clause heading, and/or a clause sub-heading.

In one or more example methods, based on the style occurrence parameter, the one or more font styles associated with the one or more lines of text of the document may be categorized into one or more categories such as a first category, a second category, and/or a third category. In one or more example methods, the first category may be a category for a first font style. In one or more example methods, the second category may be a category for a second font style. In one or more example methods, the third category may be a category for a third font style. In one or more example methods, the first category of font style may be indicative of a heading. In one or more example methods, the second category of font style may be indicative of paragraph text. In one or more example methods, the third category of font style may be indicative of outlier text, such as page numbers, and/or reference numbers and/or footnotes. In one or more example methods, the classifying S108 of each line of text as heading or paragraph text based on the style occurrence parameter comprises determining S108B whether a line of text has a font style pertaining to the first category. In other words, the electronic device can determine, based on the style occurrence parameter of a font style, whether a line of text has a font style pertaining to the first category.
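One possible way to assign font styles to the three categories is sketched below in Python. The decision rule is an assumption made for illustration: the most frequent font style is taken as the second category (paragraph text), any other font style whose percentage lies within a range is taken as the first category (heading), and everything else is taken as the third category (outlier). The default range of 3-12% follows claim 14; the description also mentions 4%-10% as an example. The function name categorize_styles and the sample percentages are hypothetical.

```python
from typing import Dict


def categorize_styles(percentages: Dict[str, float],
                      low: float = 3.0, high: float = 12.0) -> Dict[str, str]:
    """Label each font style as 'heading', 'paragraph', or 'outlier'.

    percentages maps a font style to its share of all lines, in percent.
    """
    if not percentages:
        return {}
    # Assumption: the most frequent style carries the paragraph text.
    paragraph_style = max(percentages, key=percentages.get)
    categories: Dict[str, str] = {}
    for style, percent in percentages.items():
        if style == paragraph_style:
            categories[style] = "paragraph"   # second category
        elif low <= percent <= high:
            categories[style] = "heading"     # first category
        else:
            categories[style] = "outlier"     # third category
    return categories


# Invented distribution: 90 body lines, 8 headings, 2 page numbers per 100.
shares = {"Arial Narrow": 90.0, "Arial Bold Narrow": 8.0, "Calibri": 2.0}
print(categorize_styles(shares))
```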

In one or more example methods, the classifying S108 of each line of text as heading or paragraph text based on the style occurrence parameter comprises, upon determining that the line of text has a font style pertaining to the first category, classifying S108C the line of text as heading. For example, upon determining that the line of text has a font style pertaining to the first category, then the line of text may be classified as heading, e.g. a clause heading.

In one or more example methods, the method comprises forgoing S108E the classification of the line of text, when the line of text does not have a font style pertaining to the first category.

In one or more example methods, the heading is indicative of a clause heading.

In one or more example methods, a clause may be seen as one or more lines of text that dictates the conditions under which a contract is legally enforceable, and/or determines the terms of the contract. In one or more example methods, the one or more lines of text may have one or more clauses. In one or more example methods, each clause may have a clause heading and a clause text.

In one or more example methods, the paragraph text is indicative of a clause text.

In one or more example methods, a clause may have a clause heading, and/or paragraph text, such as a clause text, associated with the heading, e.g., clause heading.

In one or more example methods, the method comprises providing S110, based on the first category, clause data indicative of the clause heading and the clause text.

In one or more example methods, the clause data comprises the clause heading and the clause text. In one or more example methods, the clause data may comprise a clause heading and/or one or more clause sub-headings. In one or more example methods, the clause data may comprise clause text. In one or more example methods, the text in the first document may have one or more lines of text indicative of clause data.

In one or more example methods, the method comprises determining S112 the paragraph text by including the text between a first delimiter and a second delimiter.

In one or more example methods, clause text associated with a clause heading is determined based on the text between a first delimiter and a second delimiter, as illustrated in Fig. 2A. In one or more example methods, a delimiter may be seen as a line of text classified as heading immediately preceding or immediately following a line of text classified as paragraph text.

In one or more example methods, the first delimiter and/or the second delimiter are a line of text pertaining to the first category.

In one or more example methods, the first delimiter and/or the second delimiter may be indicative of headings, such as clause headings. In one or more example methods, the first delimiter and/or the second delimiter may be indicative of sub-headings, such as clause sub-headings.

In one or more example methods, the paragraph text between the first delimiter and/or the second delimiter may be indicative of paragraph text associated with the first delimiter.

In one or more example methods, determining the paragraph text comprises identifying, based on the first category, a first delimiter. In one or more example methods, determining the paragraph text comprises determining whether a line of text after the first delimiter does not belong to the first category, and repeating the determining process until a line of text, after the first delimiter, is identified as a line of text belonging to the first category, i.e. the second delimiter. For example, by iterating through all lines, each heading can be segregated as the key (clause heading) and all text after it until the next heading as the value (clause text). In one or more example methods, the method comprises determining S114 a font style pertaining to a second category of font style. In one or more example methods, the second category is indicative of the paragraph text.

In one or more example methods, the second category of font style may have a font style, such as the second font style. In one or more example methods, the font style of the second category of font style may be different from the font style of the first category of font style. In one or more example methods, the second font style may be different from the first font style. In one or more example methods, the second category is indicative of the paragraph text, such as the clause text associated with a clause heading and/or clause subheading.

In one or more example methods, when a line of text does not pertain to the first category, the method may comprise determining whether the line of text has a font style pertaining to a second category. In one or more example methods, upon determining that the line of text has a font style pertaining to the second category, the line of text may be classified as paragraph text, e.g., clause text.

In one or more example methods, when a line of text does not pertain to the second category, the method may comprise determining whether the line of text has a font style pertaining to a third category. In one or more example methods, upon determining that the line of text has a font style pertaining to the third category, the line of text may be classified as outlier text, such as page numbers.

In one or more example methods, the method comprises forgoing S108E the classification of the line of text, when a line of text does not pertain to any one of the categories, such as the first category, the second category, and/or the third category.
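The cascade described above (first category, then second category, then third category, otherwise forgoing the classification) may be sketched as follows. This is a minimal illustration only, not the implementation of the disclosure. The function name classify_line, the representation of a font style as a string, and the category names "first", "second" and "third" are assumptions made for the example.

```python
from typing import Dict, Optional


def classify_line(font_style: str, style_category: Dict[str, str]) -> Optional[str]:
    """Return a label for a line of text based on the category of its font style.

    style_category maps a font style (hypothetical string key) to "first",
    "second" or "third". None means the classification is forgone.
    """
    category = style_category.get(font_style)
    if category == "first":
        return "heading"      # e.g. clause heading
    if category == "second":
        return "paragraph"    # e.g. clause text
    if category == "third":
        return "outlier"      # e.g. page numbers, footnotes
    return None               # font style pertains to none of the categories


# Hypothetical font style keys, for illustration only.
categories = {"bold-12": "first", "regular-10": "second", "regular-7": "third"}
labels = [classify_line(s, categories) for s in ["bold-12", "regular-10", "regular-7", "italic-9"]]
```

In this sketch, a line whose font style is unknown to the mapping is left unlabelled, which corresponds to forgoing S108E the classification.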

In one or more example methods, the method comprises providing, based on the first category and the second category, clause data indicative of the clause heading and the clause text.

In one or more example methods, determining the paragraph text comprises identifying, based on the first category, a first delimiter. In one or more example methods, determining the paragraph text comprises determining whether a line of text after the first delimiter belongs to the second category, and repeating the determining process until a line of text, after the first delimiter, is identified as a line of text belonging to the first category, i.e. the second delimiter.
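The delimiter-based determination of paragraph text may be illustrated by the sketch below, in which each heading acts as a key and the paragraph text up to the next heading acts as the value. The function name extract_clauses and the labels "heading", "paragraph" and "outlier" are assumptions made for the example. The sketch further assumes that the end of the document closes the last clause and that lines preceding the first heading are discarded, neither of which is specified in the disclosure.

```python
from typing import Dict, List, Optional, Tuple


def extract_clauses(labelled_lines: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each heading to the paragraph text found before the next heading.

    labelled_lines holds (text, label) pairs in document order.
    """
    parts: Dict[str, List[str]] = {}
    heading: Optional[str] = None
    for text, label in labelled_lines:
        if label == "heading":
            # A heading is a delimiter: it closes the previous clause
            # and opens a new one.
            heading = text
            parts[heading] = []
        elif label == "paragraph" and heading is not None:
            # Only second-category lines are kept; outlier text is skipped.
            parts[heading].append(text)
    return {h: " ".join(p) for h, p in parts.items()}


lines = [
    ("Preamble", "paragraph"),
    ("1. Payment", "heading"),
    ("The buyer pays", "paragraph"),
    ("7", "outlier"),
    ("within 30 days.", "paragraph"),
    ("2. Liability", "heading"),
    ("Liability is limited.", "paragraph"),
]
clauses = extract_clauses(lines)
```

Note that, in this sketch, two headings with identical text would overwrite one another; a list of pairs could be used instead of a dictionary if repeated headings are expected.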

In one or more example methods, determining S108A, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading comprises determining whether the style occurrence parameter of the font style satisfies a criterion.

In one or more example methods, the criterion may be based on the style occurrence parameter of a font style.

In one or more example methods, determining S108A, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading comprises, upon determining that the style occurrence parameter of the font style satisfies the criterion, determining the font style as pertaining to the first category of font style indicative of the heading.

In one or more example methods, determining whether the style occurrence parameter of the font style satisfies the criterion comprises determining whether the style occurrence parameter of the font style is within a range of occurrence of the font style.

In one or more example methods, determining S108A, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading comprises determining the font style, e.g., the first font style, as a font style pertaining to the first category of font style indicative of a heading when the style occurrence parameter of the font style satisfies the criterion.

In one or more example methods, determining S108A, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading comprises forgoing determining the font style as pertaining to the first category of font style indicative of a heading when the style occurrence parameter of the font style does not satisfy the criterion.

In one or more example methods, the criterion is based on a range of occurrence of the font style. In one or more example methods, the range is 3-12%. In one or more example methods, the range may be 4-10%. In one or more example methods, the range may be 5-8%. In one or more example methods, the range may be 6-7%.
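The range criterion may be illustrated by the following sketch. It assumes that the percentage is the share of lines of text having a given font style among all lines of text in the document, and that the bounds of the range are inclusive; both are assumptions of the example. The function name heading_styles and the font style keys are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def heading_styles(line_styles: Iterable[str], low: float = 3.0, high: float = 12.0) -> Set[str]:
    """Return the font styles whose share of lines lies within [low, high] percent."""
    counts = Counter(line_styles)          # occurrences per font style
    total = sum(counts.values())
    if total == 0:
        return set()
    return {
        style
        for style, n in counts.items()
        if low <= 100.0 * n / total <= high
    }


# 90 body lines, 8 bold lines and 2 small-print lines (hypothetical styles).
styles = ["body"] * 90 + ["bold"] * 8 + ["tiny"] * 2
first_category = heading_styles(styles)
```

With the default range of 3-12%, only the style used on 8% of the lines falls inside the range in this example; the dominant body style and the rare small-print style fall outside it.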

In one or more example methods, the criterion is based on a first threshold of word count in the corresponding line of text.

In one or more example methods, the criterion may comprise a first primary criterion. The first primary criterion may comprise a first threshold. The first threshold may be based on the word count. The word count may be indicative of a count of words, i.e. the total number of words in a line of text. In one or more example methods, the first primary criterion may be seen as satisfied by a line of text when the word count of the line of text is below the first threshold. For example, the first threshold of word count may comprise values between 1 and 12. In one or more example methods, when a line of text having a font style pertaining to the first category of font style has a word count below 12, then the line of text may be considered as satisfying the first primary criterion. In one or more example methods, the first primary criterion may be seen as not satisfied by a line of text when the word count of the line of text is above the first threshold. For example, 1 <= Wordcount <= 12 for headings.
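The word count criterion may be sketched as below. The sketch follows the inequality 1 <= Wordcount <= 12 given above and assumes that words are separated by whitespace; the function name satisfies_word_count and the parameter max_words are assumptions of the example.

```python
def satisfies_word_count(line: str, max_words: int = 12) -> bool:
    """Return True when the line has between 1 and max_words words (inclusive)."""
    word_count = len(line.split())   # whitespace-separated words
    return 1 <= word_count <= max_words


short_line = "Limitation of Liability"
long_line = " ".join(["word"] * 13)
```

An empty line has a word count of zero and therefore does not satisfy the criterion in this sketch.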

In one or more example methods, the criterion is based on a second threshold of words in the corresponding line of text that are in capital letters.

In one or more example methods, the criterion may comprise a first secondary criterion. The first secondary criterion may comprise a second threshold. The second threshold may be based on the number of words that start with a capital letter.

In one or more example methods, the first secondary criterion may be seen as satisfied when the number of words of a line of text starting with a capital letter is above or equal to the first secondary threshold.

In one or more example methods, the first secondary criterion may be seen as not satisfied, when the number of words of the line of text starting with a capital letter is below the first secondary threshold.

In one or more examples, the first secondary threshold may be 50%. In one or more example methods, when less than 50% of the words of the line of text start with a capital letter, the first secondary criterion may be seen as not satisfied by the line of text having a font style pertaining to the first category of font style. In one or more example methods, when 50% or more of the words of the line of text start with a capital letter, the first secondary criterion may be seen as satisfied by the line of text having a font style pertaining to the first category of font style.
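The capital letter criterion may be sketched as follows, on the reading that the criterion is satisfied when the share of words starting with a capital letter is above or equal to the threshold. The function name satisfies_capital_share, the parameter min_share and the whitespace-based word splitting are assumptions of the example.

```python
def satisfies_capital_share(line: str, min_share: float = 0.5) -> bool:
    """Return True when at least min_share of the words start with a capital letter."""
    words = line.split()
    if not words:
        return False                 # no words, nothing to evaluate
    capitalised = sum(1 for word in words if word[0].isupper())
    return capitalised / len(words) >= min_share


heading_like = "Limitation of Liability"     # 2 of 3 words capitalised
body_like = "the buyer shall pay the price"  # 0 of 6 words capitalised
```

In this sketch a token such as "1." does not count as capitalised, so a numbered heading such as "1. Payment" sits exactly at the 50% threshold and still satisfies the criterion.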

In one or more example methods, the classifying S108 of each line of text as heading or paragraph text based on the style occurrence parameter comprises labelling S108D, based on the first category, a corresponding line of text as heading.

In one or more example methods, labelling a line of text as heading may comprise labelling the line of text as clause heading. In one or more example methods, labelling a line of text as heading may comprise labelling the line of text as clause sub-heading.

In one or more example methods, the first category is indicative of a clause sub-heading. For example, the present disclosure provides a technique that can extract the paragraph text under a sub-heading.

In one or more example methods, when the style occurrence parameter, for a font style, such as the first font style, meets a first primary criterion (e.g. inside a range), then the first style may be classified as a font style associated with a heading and/or subheading.

In one or more example methods, when the style occurrence parameter, for a font style, such as the second font style, does not meet the first primary criterion (e.g. outside a range) then the second style may be classified as a font style associated with paragraph text, e.g., clause text, non-heading text, and/or body text.

In one or more example methods, when the style occurrence parameter, for a font style, such as the third font style, meets a first secondary criterion, then the third style may be seen as font style associated with page numbers.

In one or more example methods, the lines of text having a font style pertaining to the first category of font style may be indicative of a sub-heading, such as a clause sub-heading.

It may be envisaged that the electronic device can detect whether a line is written in bold style and flag it as heading. In one or more example methods, the method comprises providing clause data, to a clause repository, such as a clause database, such as a server for clause control.

Fig. 4 shows a block diagram of an exemplary electronic device 300 according to the disclosure. The electronic device 300 comprises a memory circuitry 301, a processor circuitry 302, and an interface 303. The electronic device 300 is configured to perform any of the methods disclosed in Figs. 3A-B. In other words, the electronic device 300 is configured for providing clause data.

The electronic device 300 is configured to obtain (such as using the processor circuitry 302, and/or via the interface 303) first data indicative of a document. In one or more example electronic devices, the first data comprises text.

The electronic device 300 is configured to generate (such as using the processor circuitry 302), based on the first data, second data comprising each line of text and a font style parameter associated with each line of text. In one or more example electronic devices, the font style parameter is indicative of a font style of a corresponding line of text.

The electronic device 300 is configured to determine (such as using the processor circuitry 302), based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document.

The electronic device 300 is configured to classify (such as using the processor circuitry 302), based on the style occurrence parameter, each line of text as heading or paragraph text. The electronic device may be a client device and/or a server device. For example, the electronic device may comprise an API configured to obtain, from a user or from an external device, a document, via the first data. For example, the electronic device may be an API configured to classify, based on the document (e.g. via the first data) and the style occurrence parameter, each line of text as heading or paragraph text. The API can be hosted in a server device or on a distributed cloud. The electronic device may be a clause extraction device.

In one or more example electronic devices, the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises determining, based on the style occurrence parameter, a font style pertaining to a first category of font style.

In one or more example electronic devices, the first category is indicative of a heading.

In one or more example electronic devices, the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises determining whether a line of text has a font style pertaining to the first category.

In one or more example electronic devices, the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises, upon determining that the line of text has a font style pertaining to the first category, classifying the line of text as heading.

In one or more example electronic devices, the heading is indicative of a clause heading.

In one or more example electronic devices, the paragraph text is indicative of a clause text.

In one or more example electronic devices, the electronic device 300 is configured to provide (such as using the processor circuitry 302 and/or interface 303), based on the first category, clause data indicative of the clause heading and the clause text.

In one or more example electronic devices, the clause data comprises the clause heading and the clause text.

In one or more example electronic devices, the electronic device 300 is configured to determine (such as using the processor circuitry 302) the paragraph text by including the text between a first delimiter and a second delimiter.

In one or more example electronic devices, the first delimiter and/or the second delimiter are a line of text pertaining to the first category.

In one or more example electronic devices, the electronic device 300 is configured to determine (such as using the processor circuitry 302) a font style pertaining to a second category of font style. In one or more example electronic devices, the second category is indicative of the paragraph text.

In one or more example electronic devices, the style occurrence parameter indicates a number of occurrences of a font style amongst all the lines of text in the document. In one or more example electronic devices, the electronic device is configured to group (such as using the processor circuitry 302) the lines of text per font style parameter.
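Grouping the lines of text per font style parameter and deriving the number of occurrences of each font style may be sketched as below. The representation of the second data as (text, font style) pairs, the string form of the font style parameter, and the function names group_by_style and style_occurrences are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_style(second_data: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group lines of text per font style parameter, keeping document order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, font_style in second_data:
        groups[font_style].append(text)
    return dict(groups)


def style_occurrences(groups: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of lines of text per font style."""
    return {style: len(texts) for style, texts in groups.items()}


# Hypothetical second data: each line of text with its font style parameter.
second_data = [
    ("1. Payment", "bold-12"),
    ("The buyer pays", "regular-10"),
    ("within 30 days.", "regular-10"),
    ("2. Liability", "bold-12"),
    ("Liability is limited.", "regular-10"),
]
groups = group_by_style(second_data)
occurrences = style_occurrences(groups)
```

The resulting counts can be divided by the total number of lines when a percentage, rather than an absolute count, is needed for the range criterion.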

In one or more example electronic devices, the electronic device 300 is configured to determine (such as using the processor circuitry 302), based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading by determining whether the style occurrence parameter of the font style satisfies a criterion.

In one or more example electronic devices, the electronic device 300 is configured to determine (such as using the processor circuitry 302), based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading by, upon determining that the style occurrence parameter of the font style satisfies the criterion, determining the font style as pertaining to the first category of font style indicative of the heading.

In one or more example electronic devices, the criterion is based on a range of occurrence of the font style.

In one or more example electronic devices, the range is 3-12%.

In one or more example electronic devices, the criterion is based on a first threshold of word count in the corresponding line of text.

In one or more example electronic devices, the criterion is based on a second threshold of words in the corresponding line of text that are in capital letters.

In one or more example electronic devices, the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises labelling, based on the first category, a corresponding line of text as heading.

In one or more example electronic devices, the first category is indicative of a sub-heading, such as clause sub-heading.

In one or more example electronic devices, the electronic device 300 is configured to provide (e.g. output, e.g. via the interface 303) the clause data, e.g. to a clause database, such as a clause repository, such as a clause repository capable of being queried. The processor circuitry 302 is optionally configured to perform any of the operations disclosed in Figs. 3A-B (such as any one or more of: S102, S104, S106, S107, S108, S108A, S108B, S108C, S108D, S108E, S110, S112, S114). The operations of the electronic device 300 may be embodied in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable medium (e.g., the memory circuitry 301) and are executed by the processor circuitry 302.

Furthermore, the operations of the electronic device 300 may be considered a method that the electronic device 300 is configured to carry out. Also, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of hardware, firmware and/or software.

The memory circuitry 301 may be one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), or other suitable device. In a typical arrangement, the memory circuitry 301 may include a non-volatile memory for long term data storage and a volatile memory that functions as system memory for the processor circuitry 302. The memory circuitry 301 may exchange data with the processor circuitry 302 over a data bus. Control lines and an address bus between the memory circuitry 301 and the processor circuitry 302 also may be present (not shown in Fig. 4). The memory circuitry 301 is considered a non-transitory computer readable medium.

The memory circuitry 301 may be configured to store the first data, the second data, and/or clause data, in a part of the memory.

Embodiments of methods and products (electronic device) according to the disclosure are set out in the following items:

Item 1. An electronic device comprising memory circuitry, processor circuitry, and an interface, wherein the electronic device is configured to: obtain first data indicative of a document, wherein the first data comprises text; generate, based on the first data, second data comprising each line of text and a font style parameter associated with each line of text, wherein the font style parameter is indicative of a font style of a corresponding line of text; determine, based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document; and classify, based on the style occurrence parameter, each line of text as heading or paragraph text.

Item 2. The electronic device of item 1, wherein the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises:

- determining, based on the style occurrence parameter, a font style pertaining to a first category of font style, wherein the first category is indicative of a heading;

- determining whether a line of text has a font style pertaining to the first category; and

- upon determining that the line of text has a font style pertaining to the first category, classifying the line of text as heading.

Item 3. The electronic device of any of the previous items, wherein the heading is indicative of a clause heading.

Item 4. The electronic device of any of the previous items, wherein the paragraph text is indicative of a clause text.

Item 5. The electronic device of item 4, wherein the electronic device is configured to provide, based on the first category, clause data indicative of the clause heading and the clause text.

Item 6. The electronic device of item 5, wherein the clause data comprises the clause heading and the clause text.

Item 7. The electronic device of any of the previous items, wherein the electronic device is configured to determine the paragraph text by including the text between a first delimiter and a second delimiter.

Item 8. The electronic device of items 2 and 7, wherein the first delimiter and/or the second delimiter are a line of text pertaining to the first category.

Item 9. The electronic device of any of the previous items, wherein the electronic device is configured to determine a font style pertaining to a second category of font style, wherein the second category is indicative of the paragraph text.

Item 10. The electronic device of any of the previous items, wherein the style occurrence parameter indicates a number of occurrences of a font style amongst all the lines of text in the document.

Item 11. The electronic device of any of the previous items, wherein the electronic device is configured to group the lines of text per font style parameter.

Item 12. The electronic device of any of items 5-11, wherein the electronic device is configured to determine, based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading by: determining whether the style occurrence parameter of the font style satisfies a criterion; and upon determining that the style occurrence parameter of the font style satisfies the criterion, determining the font style as pertaining to the first category of font style indicative of the heading.

Item 13. The electronic device of item 12, wherein the criterion is based on a range of occurrence of the font style.

Item 14. The electronic device of item 13, wherein the range is 3-12%.

Item 15. The electronic device of any of items 12-14, wherein the criterion is based on a first threshold of word count in the corresponding line of text.

Item 16. The electronic device of any of items 12-15, wherein the criterion is based on a second threshold of words in the corresponding line of text that are in capital letters.

Item 17. The electronic device of any of the previous items, wherein the classifying of each line of text as heading or paragraph text based on the style occurrence parameter comprises labelling, based on the first category, a corresponding line of text as heading.

Item 18. The electronic device of any of the previous items, wherein the first category is indicative of a sub-heading.

Item 19. A method, performed by an electronic device, for providing clause data, the method comprising: obtaining (S102) first data indicative of a document, wherein the first data comprises text; generating (S104), based on the first data, second data comprising each line of text and a font style parameter associated with each line of text, wherein the font style parameter is indicative of a font style of a corresponding line of text; determining (S106), based on the second data, for each font style, a style occurrence parameter indicative of an occurrence of a font style in the document; and classifying (S108), based on the style occurrence parameter, each line of text as heading or paragraph text.

Item 20. The method of item 19, wherein the classifying (S108) of each line of text as heading or paragraph text based on the style occurrence parameter comprises: determining (S108A), based on the style occurrence parameter, a font style pertaining to a first category of font style, wherein the first category is indicative of a clause heading; determining (S108B) whether a line of text has a font style pertaining to the first category; and upon determining that the line of text has a font style pertaining to the first category, classifying (S108C) the line of text as heading.

Item 21. The method according to any of items 19-20, wherein the heading is indicative of a clause heading.

Item 22. The method according to any of items 19-21, wherein the paragraph text is indicative of a clause text.

Item 23. The method of item 22, the method comprising providing (S110), based on the first category, clause data indicative of the clause heading and the clause text.

Item 24. The method of item 23, wherein the clause data comprises the clause heading and the clause text.

Item 25. The method according to any of items 19-24, the method comprising determining (S112) the paragraph text by including the text between a first delimiter and a second delimiter.

Item 26. The method of items 20 and 25, wherein the first delimiter and/or the second delimiter are a line of text pertaining to the first category.

Item 27. The method according to any of items 19-26, the method comprising determining (S114) a font style pertaining to a second category of font style, wherein the second category is indicative of the paragraph text.

Item 28. The method according to any of items 19-27, wherein the style occurrence parameter indicates a number of occurrences of a font style amongst all the lines of text in the document.

Item 29. The method according to any of items 19-28, the method comprising grouping (S107) the lines of text per font style parameter.

Item 30. The method according to any of items 23-29, wherein determining (S108A), based on style occurrence parameters of the document, the font style pertaining to the first category indicative of the heading comprises: determining whether the style occurrence parameter of the font style satisfies a criterion; and upon determining that the style occurrence parameter of the font style satisfies the criterion, determining the font style as pertaining to the first category of font style indicative of the heading.

Item 31. The method of item 30, wherein the criterion is based on a range of occurrence of the font style.

Item 32. The method of item 31, wherein the range is 3-12%.

Item 33. The method according to any of items 30-32, wherein the criterion is based on a first threshold of word count in the corresponding line of text.

Item 34. The method according to any of items 30-33, wherein the criterion is based on a second threshold of words in the corresponding line of text that are in capital letters.

Item 35. The method according to any of items 19-34, wherein the classifying (S108) of each line of text as heading or paragraph text based on the style occurrence parameter comprises labelling (S108D), based on the first category, a corresponding line of text as heading.

Item 36. The method according to any of items 19-35, wherein the first category is indicative of a sub-heading.

Item 37. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods of items 19-36.

The use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not imply any particular order, but are included to identify individual elements. Moreover, the use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not denote any order or importance, but rather the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used to distinguish one element from another. Note that the words “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used here and elsewhere for labelling purposes only and are not intended to denote any specific spatial or temporal ordering. Furthermore, the labelling of a first element does not imply the presence of a second element and vice versa.

It may be appreciated that Figs. 1-4 comprise some circuitries or operations which are illustrated with a solid line and some circuitries or operations which are illustrated with a dashed line. The circuitries or operations which are comprised in a solid line are circuitries or operations which are comprised in the broadest example embodiment. The circuitries or operations which are comprised in a dashed line are example embodiments which may be comprised in, or a part of, or are further circuitries or operations which may be taken in addition to the circuitries or operations of the solid line example embodiments. It should be appreciated that these operations need not be performed in the order presented. Furthermore, it should be appreciated that not all of the operations need to be performed. The exemplary operations may be performed in any order and in any combination.

It is to be noted that the word "comprising" does not necessarily exclude the presence of other elements or steps than those listed.

It is to be noted that the words "a" or "an" preceding an element do not exclude the presence of a plurality of such elements.

It should further be noted that any reference signs do not limit the scope of the claims, that the exemplary embodiments may be implemented at least in part by means of both hardware and software, and that several "means", "units" or "devices" may be represented by the same item of hardware.

The various exemplary methods, devices, nodes and systems described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc. Generally, program circuitries may include routines, programs, objects, components, data structures, etc. that perform specified tasks or implement specific abstract data types. Computer-executable instructions, associated data structures, and program circuitries represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

Although features have been shown and described, it will be understood that they are not intended to limit the claimed disclosure, and it will be made obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense. The claimed disclosure is intended to cover all alternatives, modifications, and equivalents.