SYSTEM AND METHOD TO DETECT AND GENERATE RELEVANT CONTENT FROM UNIFORM RESOURCE LOCATOR (URL)

Title:

SYSTEM AND METHOD TO DETECT AND GENERATE RELEVANT CONTENT FROM UNIFORM RESOURCE LOCATOR (URL)

Document Type and Number:

WIPO Patent Application WO/2020/101479

Kind Code:

Abstract:

Disclosed is a system and method to detect and generate relevant content from a received Uniform Resource Locator (URL). The system comprises an image analysis module (202), text analysis module (204), layout analysis module (206), and extraction module (208). The image analysis module (202) analyzes a plurality of images by capturing images from the received URL. The image analysis module (202) captures a screenshot of the received URL for processing by an OCR engine. The text analysis module (204) analyzes the text by reading information about an HTML file from the received URL. The text analysis module (204) utilizes a headless browser to download the HTML file from the received URL and removes a plurality of HTML tags from the HTML file. The layout analysis module (206) analyzes a web layout by dividing the document object model (DOM) element blocks by scoring the height and width of the web layout. The layout analysis module (206) uses a JavaScript module to mark and sort each DOM element block by height and width. The extraction module (208) utilizes a system parser (210) to retrieve a title, and a date to classify an array of text. The extraction module (208) uses HTML metadata received by the text analysis module (204) and generates the relevant content in a textual format.

Inventors:

AMRUDDIN AMRU YUSRIN (MY)
MOHD HELMI MOHD MARZUQ IKRAM (MY)
JOHARI MUHAMMAD AWIS JAMALUDDIN (MY)
MUSTAFFA ROSNIN (MY)
GOON WOOI KIN (MY)

Application Number:

PCT/MY2019/050094

Publication Date:

May 22, 2020

Filing Date:

November 14, 2019

Assignee:

MIMOS BERHAD (MY)

International Classes:

G06F16/81

Domestic Patent References:

WO2015122620A1	2015-08-20

Foreign References:

JP2012118737A	2012-06-21
JPH11120202A	1999-04-30
JP2011123740A	2011-06-23

Attorney, Agent or Firm:

PYPRUS SDN BHD (MY)

Claims:

1. A system to detect and generate relevant content from a received Uniform Resource Locator, URL, characterized in that, the system comprising:

a memory to store machine-readable instructions pertaining to generation of relevant content; and

a processor coupled to the memory and operable to execute the machine-readable instructions stored in the memory, wherein the processor comprises:

an image analysis module (202) for analyzing a plurality of images by capturing images from the received URL, wherein the image analysis module (202) captures a webpage of the received URL in a form of image file for processing by an Optical Character Recognition, OCR, engine;

a text analysis module (204) to analyze the text by reading information pertaining to a Hypertext Markup Language, HTML, file from the received URL, wherein the text analysis module (204) utilizes a headless browser to download the HTML file from the received URL and removes a plurality of HTML tags from the HTML file;

a layout analysis module (206) to analyze a web layout by dividing a plurality of document object model, DOM, element blocks by scoring height and width of the web-layout, wherein the layout analysis module (206) uses a JavaScript module to mark and sort each DOM element block by height and width; and

an extraction module (208) utilizes a system parser (210) to retrieve a title, and a date to classify an array of text, wherein the extraction module (208) uses HTML metadata received by the text analysis module (204) and generates the relevant content in a textual format.

2. The system according to claim 1, wherein the OCR engine provides an array of text and transmits the array of text to the text analysis module (204).

3. The system according to claim 1, wherein the text analysis module (204) removes the HTML tags from the HTML file and processes the remaining array of text.

4. The system according to claim 1, wherein the layout analysis module (206) further analyses number of words in each DOM element block to determine an article content and processes an array of text blocks sorted based on size of the web layout.

5. The system according to claim 1, wherein the system parser (210) retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module (204), the layout analysis module (206), and the extraction module (208).

6. A method for detecting and generating relevant content from a received Uniform Resource Locator, URL, characterized in that the method comprising steps of:

receiving, by an image analysis module (202), a URL;

analyzing a plurality of images by capturing images from the received URL, wherein a webpage is captured in a form of image file from the received URL and processed by an Optical Character Recognition, OCR, engine;

analyzing, by a text analysis module (204), texts by reading information pertaining to a Hypertext Markup Language, HTML, file from the received URL, wherein the text analysis module utilizes a headless browser to download the HTML file and remove HTML tags from the HTML file;

analyzing, by a layout analysis module (206), a web-layout by dividing a plurality of document object model, DOM, element blocks by scoring height and width of the web layout;

retrieving by an extraction module (208), a title and a date of the URL through a system parser (210), wherein the extraction module (208) uses HTML metadata received by the text analysis module (204); and

analyzing and generating detected relevant content in a textual format.

7. The method according to claim 6, wherein the OCR engine provides an array of text and transmits the array of text to the text analysis module (204).

8. The method according to claim 6, further comprising removing the HTML tags from the HTML file and processing the remaining array of text by the text analysis module (204).

9. The method according to claim 6, wherein the system parser (210) retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module (204), the layout analysis module (206), and the extraction module (208).

10. The method according to claim 9, wherein each array of text data from the text analysis module (204), the layout analysis module (206), and the extraction module (208) are compared by using an intersection of a plurality of predefined sets of a matrix.

Description:

SYSTEM AND METHOD TO DETECT AND GENERATE RELEVANT CONTENT FROM UNIFORM RESOURCE LOCATOR (URL)

FIELD OF INVENTION

[0001] The present invention relates to a system and method for content retrieval, in particular to a system and method to detect and generate relevant content from a received Uniform Resource Locator (URL).

BACKGROUND

[0002] The continued growth of data on the internet has great commercial value and has become an important source of intelligence information. Detection and retrieval of relevant and important data from a uniform resource locator (URL) depend on various parameters. Existing systems and methods capture important data and extract the captured data from the URL only if the page has a proper HTML format, does not contain AJAX pages, has sufficient text information, and has a DOM structure that can be navigated (by using either XPath or CSS queries).

[0003] In case the criteria mentioned above are not met, the probability of detecting and extracting important data becomes low. Further, the DOM structure is bound to change depending on the website owner. Some websites use dynamic CSS classes or dynamic DOM elements, making it hard to get the right data.

[0004] US patent number 9,442,903 B2 issued to Katie discloses a method for generating preview data for online content. It receives a hyperlink from a user client device and acquires a set of data from a target referenced by the hyperlink. Then it stores a portion of the set of data on a server for analyzing the set of data to automatically generate a set of candidates for each of a summary and title of a preview for the hyperlink. The preview provides a synopsis of content at the target referenced by the hyperlink. Then it ranks the summary and title preview candidates based on relevance to the target referenced by the hyperlink. The relevance is determined via analysis of positive and negative signals for each candidate. It selects a top-ranked summary and title from the set of candidates. It transmits the preview candidate with the selected summary and title to the user client device. However, the method for generating preview data for online content uses an algorithm to provide positive and negative signals for the data elements to select the proper data to use for the link preview. Positive signals indicate that a particular image, name, or sentence is relevant or appropriate to use in describing the linked content. Negative signals indicate that a particular image, or other data element, may not be suitable for use as automatically generated preview data.

[0005] US patent application number 2015/0066895 A1 issued to Komissarchik; Julia; et al. discloses a system and method of creating a domain-specific facts network. It includes a set of document acquisition servers to collect information from the internet and uses surface and deep web crawling mechanisms. It also includes a document repository database that stores all the collected documents. Then it includes a set of knowledge agent servers to process the documents stored in the database and extract candidate facts from these documents. The candidate facts are stored in the candidate database. It also includes inference and verification servers to integrate and verify candidate facts from the database and stores the results in the knowledge database. The knowledge database can be used as a source for data feeds and can be copied to a database server for an internet application, such as a business information search, job search or travel search. However, the creation of a domain-specific facts network converts unstructured and semi-structured information into a structured format to be used as a knowledge repository for different search applications and is not able to accurately detect and extract the important and high fidelity data when sufficient textual information is not provided by the internet and websites.

[0006] Therefore, there is a need for a reliable, efficient and cost-effective system and method to detect and generate relevant content from a received Uniform Resource Locator (URL).

SUMMARY

[0007] The present invention discloses a system and method to detect and generate relevant content from a received Uniform Resource Locator (URL). Furthermore, there is also a need for a system and method to perform high fidelity content detection on a targeted URL by at least one of analyzing images (still image) and text (text extraction) of the web content, vision-based content extraction, and analysing block web-layout (size of web-layout) of the web content to extract useful information therefrom.

[0008] In one aspect of the present invention, there is provided a system to detect and generate relevant content from a received Uniform Resource Locator, URL. The system comprises a memory to store machine-readable instructions pertaining to generation of relevant content; and a processor coupled to the memory and operable to execute the machine-readable instructions stored in the memory. The processor comprises an image analysis module for analyzing a plurality of images by capturing images from the received URL, wherein the image analysis module captures a webpage of the received URL in a form of image file for processing by an Optical Character Recognition, OCR, engine; a text analysis module to analyze the text by reading information pertaining to a Hypertext Markup Language, HTML, file from the received URL, wherein the text analysis module utilizes a headless browser to download the HTML file from the received URL and removes a plurality of HTML tags from the HTML file; a layout analysis module to analyze a web layout by dividing a plurality of document object model, DOM, element blocks by scoring height and width of the web-layout, wherein the layout analysis module uses a JavaScript module to mark and sort each DOM element block by height and width; and an extraction module that utilizes a system parser to retrieve a title, and a date to classify an array of text, wherein the extraction module uses HTML metadata received by the text analysis module and generates the relevant content in a textual format.

[0009] In one embodiment, the OCR engine provides an array of text and transmits the array of text to the text analysis module. Further, the text analysis module removes the HTML tags from the HTML file and processes the remaining array of text.

[0010] In another embodiment, the layout analysis module is adapted to further analyze the number of words in each DOM element block to determine an article content and to process an array of text blocks sorted based on the size of the web layout. In a further embodiment, the system parser retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module, the layout analysis module, and the extraction module.

[0011] In another aspect of the present invention, there is provided a method for detecting and generating relevant content from a received Uniform Resource Locator, URL. The method comprises steps of receiving, by an image analysis module, a URL; analyzing a plurality of images by capturing images from the received URL, wherein a webpage is captured in a form of image file from the received URL and processed by an Optical Character Recognition, OCR, engine; analyzing, by a text analysis module, texts by reading information pertaining to a Hypertext Markup Language, HTML, file from the received URL, wherein the text analysis module utilizes a headless browser to download the HTML file and remove HTML tags from the HTML file; analyzing, by a layout analysis module, a web-layout by dividing a plurality of document object model, DOM, element blocks by scoring height and width of the web layout; retrieving by an extraction module, a title and a date of the URL through a system parser, wherein the extraction module uses HTML metadata received by the text analysis module; and analyzing and generating detected relevant content in a textual format.

[0012] In one embodiment, the OCR engine provides an array of text and transmits the array of text to the text analysis module. In another embodiment, the method further comprises removing the HTML tags from the HTML file and processing the remaining array of text by the text analysis module.

[0013] In another embodiment, the system parser retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module, the layout analysis module, and the extraction module.

[0014] In a further embodiment, each array of text data from the text analysis module, the layout analysis module (206), and the extraction module (208) are compared by using an intersection of a plurality of predefined sets of a matrix.

[0015] Accordingly, one advantage of the present invention is that it captures the important data and extracts the important data from the URL if it contains dynamic HTML5 and AJAX pages, and further extracts the important data when the URL contains insufficient textual information.

[0016] Accordingly, one advantage of the present invention is that it analyses text by reading HTML tags from the URL received from a user computing device.

[0017] Accordingly, one advantage of the present invention is that it analyses web-layout by dividing the DOM element block by height and width for scoring.

[0018] Accordingly, one advantage of the present invention is that it detects title and date by using a system parser.

[0019] Accordingly, one advantage of the present invention is that it analyses content by returning final content in text format.

[0020] Other features of embodiments of the present invention will be apparent from accompanying drawings and from the detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description applies to any one of the similar components having the same first reference label irrespective of the second reference label.

[0022] FIG. 1 illustrates a flowchart of the method for detecting and generating relevant content from a received Uniform Resource Locator (URL), in accordance with an embodiment of the present invention.

[0023] FIG. 2 illustrates a system for detecting and generating relevant content from a received Uniform Resource Locator (URL), in accordance with an embodiment of the present invention.

[0024] FIG. 3 illustrates a flowchart of the URL inserted into the system, in accordance with an embodiment of the present invention.

[0025] FIG. 4 illustrates a flowchart of the present system for capturing images and analyzing the inserted URL, in accordance with an embodiment of the present invention.

[0026] FIG. 5 illustrates a flowchart of the present system for reading HTML tags and analyzing the inserted URL, in accordance with an embodiment of the present invention.

[0027] FIG. 6 illustrates a flowchart of the present system for analyzing the web-layout block of the content from an HTML page, in accordance with an embodiment of the present invention.

[0028] FIG. 7 illustrates a flowchart of the system parser for obtaining title and date, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0029] Systems and methods are disclosed for detecting and generating relevant content from a received Uniform Resource Locator (URL). Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.

[0030] Various methods described herein may be practised by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product.

[0031] The present invention discloses a system and method whereby the relevant, important and high fidelity data is automatically detected and generated from the Uniform Resource Locator (URL). The system and method include an image analysis module, a text analysis module, a layout analysis module, and an extraction module. The image analysis module analyzes images by capturing images from the received URL. The image analysis module captures a screenshot of the received URL for processing by the OCR engine. The text analysis module analyzes the text by reading information about an HTML file from the received URL. The text analysis module utilizes a headless browser to download the HTML file from the received URL and removes a plurality of HTML tags from the HTML file. The layout analysis module analyzes a web layout by presenting the webpage through the Document Object Model (DOM) Application Programming Interface (API). Through the live DOM API, the website is divided into DOM element blocks. Each of the DOM element blocks is scored by its height and width in the web layout. In one embodiment, the DOM element block is scored based on an area of the element block. The layout analysis module may use a JavaScript module to mark and sort each DOM element block accordingly. The extraction module utilizes a system parser to retrieve a title, and a date to classify an array of text. The extraction module uses HTML metadata received by the text analysis module and generates the relevant content in a textual format.

[0032] Although the present invention has been described with the purpose of detecting and generating relevant content from a received Uniform Resource Locator (URL), it should be appreciated that the same has been done merely to illustrate the invention in an exemplary manner and any other purpose or function for which explained structures or configurations could be used, is covered within the scope of the present invention.

[0033] Exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this invention will be thorough and complete and will fully convey the scope of the invention to those of ordinary skill in the art. Moreover, all statements herein reciting embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).

[0034] Thus, for example, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating systems and methods embodying this invention. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the entity implementing this invention. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular name.

[0035] Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practised without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments.

[0036] FIG. 1 illustrates a process 100 of high fidelity content detection in accordance with an embodiment of the present invention. The process extracts important information from, for example, a Uniform Resource Locator (URL) of a website, and based on the webpage contents, the system processes the images, texts, layout, title and date to output the extracted information in text format. The process is initiated with a step 102 of receiving the URL. In an embodiment, the received URL is encoded in a predefined format. In an embodiment, the URL in the predefined format is posted to an image analysis module 202, a text analysis module 204, and a layout analysis module 206. The image analysis module 202 analyzes a plurality of images by capturing images from the received URL. The image analysis module 202 captures an image of the webpage of the given URL, which is processed by an OCR engine. In an embodiment, the OCR engine provides an array of text and transmits the array of text to the image analysis module 202.

[0037] The method then includes a step 104 of analyzing a plurality of images by capturing images from the received URL. Further, the method includes the step 106 of analyzing texts by reading a plurality of HTML tags from the received URL. The method then includes the step 108 of analyzing a web-layout through the Document Object Model (DOM) Application Programming Interface (API). The web-layout is divided into a plurality of DOM element blocks. Each DOM element block is further scored by its respective height and width. In one embodiment, the element block can be scored by area size. Generally, the DOM is a cross-platform and language-independent application programming interface that treats an HTML, XHTML, or XML document as a tree structure wherein each node is an object representing a part of the document. The objects can be manipulated programmatically, and any visible changes occurring as a result may then be reflected in the display of the document.

[0038] The method then includes the step 110 of obtaining a title and date of the URL through a system parser 210 (shown in FIG. 2). Furthermore, the method includes the step 112 of analyzing and generating detected relevant content in a textual format. In an embodiment, the system parser 210 retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module 204, the layout analysis module 206, and the extraction module 208.

[0039] In an embodiment, each array of text data from the text analysis module 204, the layout analysis module 206, and the extraction module 208 is compared by using an intersection of predefined sets of a matrix. Following is an example of the matrix. Each array of text data from each of the modules is compared by using an intersection-of-sets method in the table matrix.

Table 1: Matrices of array of text data rendered by the text analysis module, the layout analysis module and the extraction module.

[0040] The final content is obtained in textual format by using the below formula:

[0041]

[0042] With the above scheme, the final content of the important data of the webpage is captured in text form, whereby non-important or less-important data is filtered out. Based on the formula, the system is able to extract the important data based on the text that appears most often across the various modules of the present invention. The non-important data include banners, advertisements, images and videos, etc.
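By way of non-limiting illustration only, the intersection-of-sets comparison may be sketched in Python as follows. Table 1 and the formula of paragraph [0041] are not reproduced in this text, so the sketch does not implement them; it assumes a simple rule in which a text fragment is retained when at least two modules return it. The module keys, the function names `pairwise_intersections` and `consensus_text`, the `min_votes` parameter, and the sample fragments are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_intersections(arrays):
    """Build a matrix (dict keyed by module pair) of shared text fragments."""
    return {
        (a, b): set(arrays[a]) & set(arrays[b])
        for a, b in combinations(arrays, 2)
    }


def consensus_text(arrays, min_votes=2):
    """Keep fragments returned by at least `min_votes` modules (assumed rule)."""
    votes = Counter()
    for fragments in arrays.values():
        votes.update(set(fragments))  # one vote per module per fragment

    kept, seen = [], set()
    for fragments in arrays.values():
        for fragment in fragments:
            # Preserve the order in which fragments first appear.
            if fragment not in seen and votes[fragment] >= min_votes:
                seen.add(fragment)
                kept.append(fragment)
    return kept


# Hypothetical arrays of text returned by three modules.
arrays = {
    "text": ["City Approves Transit Plan", "Buy now", "The council voted on Monday."],
    "layout": ["The council voted on Monday.", "City Approves Transit Plan"],
    "extraction": ["City Approves Transit Plan", "Subscribe"],
}

matrix = pairwise_intersections(arrays)
final_content = consensus_text(arrays)
print(final_content)
```

In this sketch, fragments such as a banner caption that only one module reports receive a single vote and are filtered out, which mirrors the stated aim of discarding non-important data.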

[0043] FIG. 2 illustrates a system 200 for detecting and generating relevant content from a received Uniform Resource Locator (URL), in accordance with an embodiment of the present invention. The system 200 includes an image analysis module 202, a text analysis module 204, a layout analysis module 206, and an extraction module 208. The image analysis module 202 is configured to analyze a plurality of images by capturing images from the received URL. The image analysis module 202 captures an image file of a webpage of the received URL for processing by an OCR engine. In an embodiment, the OCR engine provides an array of text and transmits the array of text to the image analysis module 202.

[0044] The text analysis module 204 is configured to analyze the text by reading information about an HTML file from the received URL. The text analysis module 204 utilizes a headless browser, i.e. a web browser without a Graphical User Interface (GUI), to download the HTML file from the received URL and removes a plurality of HTML tags from the HTML file. In an embodiment, the text analysis module 204 removes the HTML tags from the HTML file and processes the remaining array of text.

[0045] The layout analysis module 206 is configured to analyze a web layout by dividing a plurality of document object model (DOM) element blocks by scoring based on, for example, the height and width of the web layout. The layout analysis module 206 may use a JavaScript module to mark and sort each DOM element block by height and width. In an embodiment, the layout analysis module 206 further analyses the number of words in each DOM element block to determine an article content and processes an array of text blocks sorted based on the size of the web layout.

[0046] The extraction module 208 is configured to utilize a system parser 210 to retrieve a title, and a date to classify an array of text. The extraction module 208 uses HTML metadata received by the text analysis module 204 and generates the relevant content in a textual format. In an embodiment, the system parser 210 retrieves the title by using DOM extraction for a plurality of <title></title> tags and retrieves the date using a regular expression on text data received from the text analysis module 204, the layout analysis module 206, and the extraction module 208.

[0047] Through the system 200, the URL screenshot 250 is processed and output with a texture layout 252 of the URL screenshot 250.

[0048] FIG. 3 exemplifies a process 300 of text array extraction from the URL in accordance with an embodiment of the present invention. In this example, the YAHOO! main page is used to illustrate the process 300 only, and not as a limitation. At step 302, the URL such as http://www.yahoo.com is inserted into the system 200. At step 304, the URL is encoded in a unified format to conserve the URL’s integrity. A Base64 scheme, for example, may be used to encode the URL, and in this case, http://www.yahoo.com is encoded as aHR0cDovL3d3dy55YWhvby5jb20. At step 306, the encoded URL is posted to the image analysis module 202, the text analysis module 204, and the layout analysis module 206 for processing. Through the image analysis module 202, the system captures an image of the URL and analyses it to extract text with at least an OCR module accordingly. The text analysis module 204 is adapted to extract the textual content from the HTML through a headless browser. The layout analysis module 206 extracts text blocks sorted based on the layout sizing. In one embodiment, the extraction of the layout analysis may be carried out with JavaScript. At step 308, the system receives the response from the image analysis module 202, the text analysis module 204, and the layout analysis module 206.
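By way of non-limiting illustration only, the Base64 encoding of step 304 may be sketched in Python using the standard library. The function names `encode_url` and `decode_url` are hypothetical. Note that a standard Base64 encoder pads its output, so the encoded form of the example URL ends with a trailing `=` character.

```python
import base64


def encode_url(url: str) -> str:
    """Encode a URL as standard Base64 text (UTF-8 bytes in, ASCII text out)."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def decode_url(token: str) -> str:
    """Recover the original URL from its Base64 form."""
    return base64.b64decode(token.encode("ascii")).decode("utf-8")


encoded = encode_url("http://www.yahoo.com")
print(encoded)              # aHR0cDovL3d3dy55YWhvby5jb20=
print(decode_url(encoded))  # http://www.yahoo.com
```

Because the encoded form contains only letters, digits and a small set of symbols, it can be posted to each module without the reserved characters of the URL being reinterpreted in transit.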

[0049] FIG. 4 illustrates a process for image analysis 400 in accordance with an embodiment of the present invention. At step 402, the image analysis module captures the entire webpage of the given URL as an image file. The image file can be in a compressed (lossless/lossy), uncompressed, vector or compound format. At step 404, an Optical Character Recognition (OCR) engine is utilised to process the image file. At step 406, the OCR engine extracts and returns the array of text on the captured image file of the given URL’s page. Then at step 408, the system receives an array of text from the image analysis module 202. In one embodiment, the image analysis may apply a readability process to filter and eliminate unwanted text.

[0050] FIG. 5 illustrates a process of text analysis 500 in accordance with an embodiment of the present invention. At step 502, an HTML file from the targeted URL is downloaded by using a headless browser. At step 504, HTML tags are removed from the HTML file. At step 506, the remaining array of text is obtained. At step 508, the system receives an array of text from the text analysis module 204. In the text analysis process, the metadata of the HTML file is also extracted. During the text analysis, in one embodiment, a readability method or process can be used for filtering out unwanted texts.
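By way of non-limiting illustration only, steps 504 and 506 may be sketched in Python using the standard library HTML parser. The sketch assumes the HTML file has already been downloaded (the headless browser of step 502 is not shown), and it additionally skips the bodies of script and style elements, which is an assumption and not a stated requirement. The names `TextCollector` and `strip_tags` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text fragments, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def strip_tags(html):
    """Remove the HTML tags and return the remaining array of text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


sample_html = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><h1>Headline</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
text_array = strip_tags(sample_html)
print(text_array)
```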

[0051] FIG. 6 illustrates a process 600 of layout analysis in accordance with an embodiment of the present invention. At step 602, the HTML file from the targeted URL is downloaded using the headless browser. At step 604, the layout analysis module marks and sorts each DOM element block by height and width by using JavaScript. At step 606, the layout analysis module performs an analysis to compute the number of words in each block to determine the article content. At step 608, the system receives an array of text blocks sorted based on layout sizing (big to small layout) from the layout analysis module 206. In one embodiment, the layout sizing comprises area size.
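By way of non-limiting illustration only, the sorting and word counting of steps 604 to 608 may be sketched in Python. The sketch assumes each DOM element block has already been measured in the browser and is supplied as a width, a height and its text. It scores blocks by area, in line with the area-size embodiment, and it assumes that the article content is the largest block containing at least a minimum number of words. The `Block` class, the function names, the `min_words` threshold and the sample blocks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A measured DOM element block (dimensions in pixels)."""
    width: int
    height: int
    text: str

    @property
    def area(self):
        return self.width * self.height

    @property
    def word_count(self):
        return len(self.text.split())


def sort_by_layout_size(blocks):
    """Sort blocks from the biggest layout area to the smallest."""
    return sorted(blocks, key=lambda block: block.area, reverse=True)


def likely_article(blocks, min_words=5):
    """Return the largest block with at least `min_words` words, if any."""
    for block in sort_by_layout_size(blocks):
        if block.word_count >= min_words:
            return block
    return None


blocks = [
    Block(300, 600, "Related links"),
    Block(640, 900, "The council approved the new transit plan on Monday after a long debate"),
    Block(1200, 600, "Buy now"),
]

ordered = sort_by_layout_size(blocks)
article = likely_article(blocks)
print([block.text for block in ordered])
print(article.text)
```

The word count is what keeps a large banner from being mistaken for the article: the banner has the biggest area in this sample but only two words.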

[0052] FIG. 7 illustrates a process 700 of text parsing in accordance with an embodiment of the present invention. At step 702, the system 200 uses the HTML metadata obtained from the text analyzing process. At step 704, a title is extracted using DOM extraction for <title></title> tags. At step 706, a date is extracted using a regular expression on the text block data. At step 708, an array of text classified by title and date is obtained.
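By way of non-limiting illustration only, steps 704 and 706 may be sketched in Python using the standard library. The specification does not state which date formats the regular expression covers, so the pattern below is an assumption that recognises only ISO dates (2019-11-14) and dates written as "November 14, 2019". The names `TitleParser`, `extract_title`, `extract_date` and `DATE_PATTERN`, and the sample inputs, are hypothetical.

```python
import re
from html.parser import HTMLParser

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumed date formats: YYYY-MM-DD or "Month D, YYYY".
DATE_PATTERN = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|(?:" + MONTHS + r") \d{1,2}, \d{4})\b"
)


class TitleParser(HTMLParser):
    """Collect the text found inside every <title></title> pair."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_title(html):
    """Return the first non-empty title, or None when there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    titles = [title.strip() for title in parser.titles if title.strip()]
    return titles[0] if titles else None


def extract_date(text_blocks):
    """Return the first date found in the text blocks, or None."""
    for block in text_blocks:
        match = DATE_PATTERN.search(block)
        if match:
            return match.group(0)
    return None


page = "<html><head><title> City Approves Transit Plan </title></head><body></body></html>"
title = extract_title(page)
date = extract_date(["By Staff Writer", "Published November 14, 2019 in Local News"])
print({"title": title, "date": date})
```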

[0053] Thus the present system and method automatically captures the important data and extracts the important data from the URL if it contains dynamic HTML5 and AJAX pages. Further, the present invention can also extract the important data when the URL contains less textual information. The present invention analyses text by reading HTML tags from the URL received from a user computing device. Additionally, the present invention analyses web-layout by dividing the DOM element block by height and width for scoring. Furthermore, the present invention analyses content by returning final content in text format.

[0054] While embodiments of the present invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the scope of the invention, as described in the claims.
