

Title:
SYSTEM AND METHOD FOR HOT TOPICS AGGREGATION USING RELATIONSHIP GRAPH
Document Type and Number:
WIPO Patent Application WO/2022/005272
Kind Code:
A1
Abstract:
The present invention relates to a system and method for hot topics aggregation using relationship graph. The present invention is characterized by at least one data processing unit (102) for processing real-time streaming data received from a plurality of remote data sources (202); at least one data relation unit (104) for extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204); and at least one topic ranking unit (106) for extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic. In particular, the present invention constructs a graph of semantic relationship between each named entities to form an entity relation topic for hot topic extraction.

Inventors:
KOK WEIYING (MY)
DOMINGO MA STELLA TABORA (MY)
EFTEKHARYPOUR YASAMAN (MY)
LATIP ABDUL AZIZ (MY)
Application Number:
PCT/MY2020/050158
Publication Date:
January 06, 2022
Filing Date:
November 17, 2020
Assignee:
MIMOS BERHAD (MY)
International Classes:
G06F16/2458; G06F16/26; G06F40/205; G06F40/268; G06F40/279
Foreign References:
US20130085745A12013-04-04
US20100082331A12010-04-01
US20150120717A12015-04-30
US20030033333A12003-02-13
KR20180053325A2018-05-21
Attorney, Agent or Firm:
MIRANDAH ASIA (MALAYSIA) SDN BHD (MY)
Claims:
CLAIMS

1. A system (100) for hot topics aggregation using a relationship graph, the system (100) having a plurality of databases (108, 110, 112, 114) for storing processed and unprocessed documents is characterized by: at least one data processing unit (102) having means for: receiving real-time streaming data from a plurality of remote data sources; cleansing, analysing and processing received data in text-based; breaking down data processed into sentences, phrases or words to extract named entities and discover dependency between the named entities and sentences to derive meanings from documents containing natural language by applying natural language processing; at least one data relation unit (104) configured to the at least one data processing unit (102) for extracting and constructing semantic relationship between documents based on processed sentences, phrases or words; and at least one topic ranking unit (106) configured to the at least one data relation unit (104) for extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic.

2. The system (100) according to Claim 1 , wherein the at least one data relation unit (104) further comprises: at least one key phrase extraction module for selecting a set of phrases describing extracted entities as named entity topic and its value; and named entities relation graph clusters for building a relationship graph between documents to identify semantic similarity based on extracted named entities and key phrases.

3. The system (100) according to Claim 1 , wherein the plurality of databases (108, 110, 112, 114) are preferably a content database (108), a named entities database (110), a graph database (112) and a hot topic database (114).

4. A method (200) for hot topics aggregation using a relationship graph is characterized by: processing real-time streaming data received from a plurality of remote data sources (202); extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204); extracting topic clusters from the relationship graph (206); and determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (208).

5. The method (200) according to Claim 4, wherein processing real-time streaming data received from a plurality of remote data sources (202) further comprises steps of (300): retrieving title and content of each document from a content database (302); removing all unwanted contents including symbols, emoticons, images and uniform resource locator (304); tokenizing contents of all documents into sentences (306); marking title and sentences from the contents of all documents by part of speech tagging (308); extracting named entities from the marked title and sentences from contents of all documents (310); checking the extracted named entities extracted and matching named entities with abbreviation in a named entities database (312); harmonizing abbreviation of the extracted named entities back to its original name (314); determining occurrence of named entities in the contents of all documents (316); ranking a list of named entities found based on the occurrence of named entities determined and position in title and content of documents in descending order and saving ranked list of named entities in the content database (318); and performing cross sentence dependency parser for each named entity between all sentences related to the named entity (320).

6. The method (200) according to Claim 4, wherein extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204) further comprises steps of (400): extracting key phrase by selecting a set of phrases that describe extracted entities as named entity topic and its value (402); and building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404).

7. The method (200) according to Claim 6, wherein extracting key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value (402) further comprises steps of (500): retrieving ranked extracted named entities from step 318 (502); compiling sentences from cross sentence dependency parser from step 320 with related named entities (504); tokenizing sentences into words (506); removing at least one stop word to create phrases (508); stemming created phrases (510); converting phrases into vectors (512); constructing similarity matrix between the vectors (514); scoring of phrases by using graph based ranking algorithm (516); ranking of phrases by descending order of rank score (518); selecting a top ranked phrase as a key phrase (520); and using the key phrase as named entity topic value and storing the key phrase into the named entity database (522).

8. The method (200) according to Claim 6, wherein building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404) further comprises steps of (600): determining if the named entity, NE exist in the relationship graph (602); creating a new named entity, NE node (604) if the name entity does not exist in the relationship graph (602); and proceeding to step 606 if the named entity exist in the relationship graph; determining if the named entity, NE node has a named entity topic, NeT value (606); adding a new named entity topic, NeT to the named entity, NE node (608) if the named entity topic does not exist in the named entity node (606); and proceeding to step 610 node if the named entity topic exist in the graph; determining if the named entity topic, NeT value exist (610); adding the named entity topic, NeT value to the named entity topic (612) if the named entity topic value does not exist in the named entity topic (610); proceeding to step 614; adding document ID, docID to the named entity topic (614); adding document published time, DocPT to the named entity topic and updating in a graph database (616); and determining if all named entity, NE extracted from the content has been added into the graph (618); if Yes, finish constructing the named entity relation graph; and if No (620), reiterating steps 500 onwards for extracting key phrase by selecting a set of phrases that describe extracted entities as named entity topic and its value and continuing with steps 602 onwards until the named entity graph is constructed.

9. The method (200) according to Claim 4, wherein extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (206) further comprises steps of (700): setting a timeframe in day(s) (702); obtaining the named entity, NE nodes that contain named entity topic, NeT value with DocPT within the timeframe (704); constructing a topic title using the named entity, NE and named entity topic, NeT value (706); determining the named entity topic, NeT weight (708); ranking of topics by descending order of the named entity topic, NeT weight (710); and listing top N topics as hot topics for the timeframe (712).

Description:
SYSTEM AND METHOD FOR HOT TOPICS AGGREGATION USING RELATIONSHIP GRAPH

FIELD OF INVENTION

The present invention relates to a system and method for hot topics aggregation using relationship graph. In particular, the present invention provides a system and method that breaks data into sentences, phrases or words to extract named entities from content of a document and further constructs a graph of semantic relationship between each named entities to form an entity relation topic for hot topic extraction.

BACKGROUND ART

A hot topic generally refers to a subject that is widely discussed in many sources in the same period of time and whose popularity varies over time. The hotness of a topic can be ranked based on the frequency of publication, the number and/or frequency of visits and reposts, or how often the topic is discussed or searched for on social media and search engines.

Every minute, a vast amount of information such as news articles, blog posts, announcements, reports, social media opinions, statuses, etc. is published and shared on the internet. It is impossible to regularly monitor such a wide variety of information sources in a short period of time.

Hot topic detection is an important knowledge discovery task on social media streams that determines the topics being discussed the most on social media. This service is usually used by news aggregators or social media platforms to collect, aggregate and categorize multiple media sources and present them on a single page, which saves users the time of visiting multiple sites for new updates. The aggregated hot news or social media topics are not only a means of keeping up with the most recent trends, but also a source of analysis and a way to further understand the interests and needs of the public or of users across all domains such as government agencies, ecommerce businesses and online health communities, etc.

However, the existing hot topic detection methods have some shortcomings, including clustering methods in which the number of clusters must be defined in advance or data is assigned to a pre-determined set of topic clusters, which is not effective for real-time, rapidly changing and unpredictable content. The existing methods generally produce poor results in selecting the representative topic because topics overlap semantically. Each topic is clustered by a different keyword or phrase and is treated as an independent topic with no relation to the others, and is clustered in a different group of topics. Often, event topics containing multiple sub-topics are not linked together. For example, consider a topic on “Indonesia Earthquake”, a topic on “Indonesia volcano eruption” and the event “Indonesia Tsunami caused at least 1,340 people are known to have died”. The series of volcano eruption and tsunami events shows a strong relationship with the past event, occurring five days after the earthquake, which took place on the same island.

One example of a system and method for extracting topics is disclosed in China Patent Application Publication No. CN 104915446 A (hereinafter referred to as CN 446 A Publication) entitled “Automatic Extracting Method and System of Event Evolving Relationship Based on News”, 29 June 2015, Applicant: Univ South China Tech. The CN 446 A Publication describes a system and method for automatically extracting news-based evolutionary relations, including news information pre-processing, news lead extraction, news event time extraction, event extraction, event keyword extraction, and event evolution relationship analysis. The CN 446 A Publication is limited to extraction of nouns or noun phrases as characteristics of word review of an event and does not extract the named entity for the entire reviewed data.

Another example of the use of real-time trending topics analysis is disclosed in United States Patent No. US 10095686 B2 (hereinafter referred to as US 686 B2 Patent) entitled “Trending Topic Extraction from Social Media” having a filing date of 6 April 2015, Applicant: Adobe Inc. The US 686 B2 Patent discloses that a real-time topic analysis for social listening is performed to discover and understand trending topics in varying degrees of granularity. Further, the US 686 B2 Patent utilizes a lightweight natural language processing, NLP method for topic extraction from the data and ranks the topics by an ATF-IDF algorithm for handling dynamically-changing content. The US 686 B2 Patent also discloses classification of trending topics into clusters which provides insight for decision making and business intelligence.

A further example of a hot topic extraction method is disclosed in China Patent Application Publication No. CN 102982157 A (hereinafter referred to as CN 157 A Publication) entitled “Device and method used for mining microblog hot topics” having a filing date of 3 December 2012, Applicant: Beijing Qihoo Technology Co; Qizhi Software Beijing Co Ltd. The CN 157 A Publication relates to a device and method used for mining microblog hot topics from an open interface. The CN 157 A Publication discloses that mining the much-talked-about microblog topics is done by carrying out hot word extraction according to the microblog contents gathered by the device. Further, the CN 157 A Publication teaches that the hot topics gathered are weighted and calculated according to the microblog parameters of microblog quantity and corresponding microblogs by obtaining the temperature value of popular keyword sets. The hot topics are sorted and displayed according to the user's request or activity of the user.

The existing clustering methods need to be improved to handle real-time streaming news topics more efficiently without the need to predetermine the number of clusters. Finding the relationship between topics by discovering their semantic relationship is important for understanding the cause or causes of events, as well as the relevant background and insights on the development of events, from the climax until the end of the entire topic. Further, the strength of the relationship between a topic and its related topics should be considered in determining the hotness of a topic.

SUMMARY OF INVENTION

The present invention relates to a system and method for hot topics aggregation using relationship graph. In particular, the present invention provides a system and method that breaks data into sentences, phrases or words to extract named entities from content of a document and further constructs a graph of semantic relationship between each named entities to form an entity relation topic for hot topic extraction.

One aspect of the invention provides a system (100) for hot topics aggregation using a relationship graph. The system (100) having a plurality of databases (108, 110, 112, 114) for storing processed and unprocessed documents is characterized by at least one data processing unit (102) having means for receiving real-time streaming data from a plurality of remote data sources; cleansing, analysing and processing received data in text-based; breaking down data processed into sentences, phrases or words to extract named entities and discover dependency between the named entities and sentences to derive meanings from documents containing natural language by applying natural language processing; at least one data relation unit (104) configured to the at least one data processing unit (102) for extracting and constructing semantic relationship between documents based on processed sentences, phrases or words; and at least one topic ranking unit (106) configured to the at least one data relation unit (104) for extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic.

Another aspect of the invention provides that the at least one data relation unit (104) further comprises at least one key phrase extraction module for selecting a set of phrases describing extracted entities as named entity topic and its value; and named entities relation graph clusters for building a relationship graph between documents to identify semantic similarity based on extracted named entities and key phrases.

A further aspect of the invention provides that the plurality of databases (108, 110, 112, 114) are preferably a content database (108), a named entities database (110), a graph database (112) and a hot topic database (114).

Another aspect of the invention provides a method (200) for hot topics aggregation using a relationship graph. The method (200) comprising steps of processing real-time streaming data received from a plurality of remote data sources (202); extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204); and extracting topic clusters from the relationship graph (206), determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (208).

A further aspect of the invention provides that the step of processing real-time streaming data received from a plurality of remote data sources (202) further comprises steps of retrieving title and content of each document from a content database (302); removing all unwanted contents including symbols, emoticons, images and uniform resource locator (304); tokenizing contents of all documents into sentences (306); marking title and sentences from the contents of all documents by part of speech tagging (308); extracting named entities from the marked title and sentences from contents of all documents of step 308 (310); checking the extracted named entities extracted in step 310 and matching named entities with abbreviation in the named entities database (312); harmonizing abbreviation of the extracted named entities back to its original name (314); determining occurrence of named entities in the contents of all documents (316); ranking a list of named entities found based on the occurrence of named entities determined in step 316 and position in title and content of documents in descending order and saving ranked list of named entities in the content database (318); and performing cross sentence dependency parser for each named entity between all sentences related to the named entity (320).
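By way of illustration only, the cleansing, sentence tokenizing, abbreviation harmonizing and ranking steps (304, 306, 314, 316, 318) might be sketched in Python as follows. The abbreviation table, the regular expressions, the function names and the title bonus of 2 are assumptions made for this sketch and are not taken from the specification; part-of-speech tagging and named entity extraction (308, 310) are omitted, and the entity list is supplied by hand.

```python
import re
from collections import Counter

# Hypothetical abbreviation table standing in for the named entities database (110).
ABBREVIATIONS = {
    "ECRL": "East Coast Rail Link",
    "MRL": "Malaysia Rail Link Sdn Bhd",
}

URL_RE = re.compile(r"https?://\S+")
# Crude stand-in for "symbols and emoticons": anything that is not a word
# character, whitespace or basic punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,;:!?'()-]")


def clean(text):
    """Step 304: drop URLs, then stray symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Step 306: naive split after sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def harmonize(text):
    """Step 314: expand known abbreviations back to their original names."""
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text


def rank_entities(title, sentences, entities):
    """Steps 316-318: count occurrences and sort in descending order."""
    counts = Counter()
    for entity in entities:
        counts[entity] = sum(s.count(entity) for s in sentences)
        if entity in title:
            counts[entity] += 2  # assumed bonus for appearing in the title
    return counts.most_common()
```

The title bonus is only one possible reading of ranking by "position in title and content"; the specification does not state how position is weighted.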

Yet another aspect of the invention provides that the step of extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204) further comprises steps of (400) extracting key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value (402); and building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404).

Still another aspect of the invention provides that the step of extracting key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value (402) further comprises steps of (500) retrieving ranked extracted named entities from step 318 (502); compiling sentences from cross sentence dependency parser from step 320 with related named entities (504); tokenizing sentences into words (506); removing at least one stop word to create phrases (508); stemming created phrases (510); converting phrases into vectors (512); constructing similarity matrix between the vectors (514); scoring of phrases by using graph based ranking algorithm (516); ranking of phrases by descending order of rank score (518); selecting a top ranked phrase as a key phrase (520); and using the key phrase as named entity topic value and storing the key phrase into the named entity database (522).

A further aspect of the invention provides that the step of building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404) further comprises steps of (600) determining if the named entity, NE exist in the relationship graph (602); creating a new named entity, NE node (604) if the named entity does not exist in the relationship graph (602); and proceeding to step 606 if the named entity exist in the relationship graph (602); determining if the named entity, NE node has a named entity topic, NeT value (606); adding a new named entity topic, NeT to the named entity, NE node (608) if the named entity topic does not exist in the named entity node (606); and proceeding to step 610 node if the named entity topic exist in the graph; determining if the named entity topic, NeT value exist (610); adding the named entity topic, NeT value to the named entity topic (612) if the named entity topic value does not exist in the named entity topic (610); proceeding to step 614; adding document ID, docID to the named entity topic (614); adding document published time, DocPT to the named entity topic and updating in a graph database (616); and determining if all named entity, NE extracted from the content has been added into the graph (618); if Yes, finish constructing the named entity relation graph; and if No (620), reiterating steps 500 onwards for extracting key phrase by selecting a set of phrases that describe extracted entities as named entity topic and its value and continuing with steps 602 onwards until the named entity graph is constructed.
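By way of illustration only, the graph construction steps (602 to 618) might be sketched in Python with nested dictionaries standing in for the graph database (112). The function names, the dictionary layout and the sample data are assumptions made for this sketch. For brevity the sketch treats the named entity topic, NeT and its value as a single key, whereas steps 606 to 612 check them separately.

```python
from datetime import datetime


def add_to_graph(graph, named_entity, topic_value, doc_id, published):
    """Create any missing level, then attach docID and DocPT."""
    # 602/604: reuse the NE node if it exists, otherwise create it.
    node = graph.setdefault(named_entity, {})
    # 606-612: reuse the NeT entry if it exists, otherwise create it.
    topic = node.setdefault(topic_value, {"docs": []})
    # 614/616: record the document ID together with its published time.
    topic["docs"].append((doc_id, published))
    return graph


def build_graph(extracted):
    """618/620: repeat until every extracted named entity has been added."""
    graph = {}
    for named_entity, topic_value, doc_id, published in extracted:
        add_to_graph(graph, named_entity, topic_value, doc_id, published)
    return graph
```

Because documents that mention the same named entity and key phrase end up under the same entry, related documents are grouped without the number of clusters being fixed in advance.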

Yet another aspect of the invention provides that the step of extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (206) further comprises steps of (700) setting a timeframe in day(s) (702); obtaining the named entity, NE nodes that contain named entity topic, NeT value with DocPT within the timeframe (704); constructing a topic title using the named entity, NE and named entity topic value (706); determining the named entity topic, NeT weight (708); ranking of topics by descending order of the named entity topic, NeT weight (710); and listing top N topics as hot topics for the timeframe (712).
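By way of illustration only, the topic ranking steps (702 to 712) might be sketched in Python as follows. The nested dictionary layout of the graph, the function name and its parameters are assumptions made for this sketch. In particular, the named entity topic, NeT weight is assumed here to be the number of documents published within the timeframe; this passage of the specification does not define how the weight is determined.

```python
from datetime import datetime, timedelta


def hot_topics(graph, now, days, top_n):
    """Return the top_n (title, weight) pairs for the last `days` days."""
    cutoff = now - timedelta(days=days)  # 702: timeframe in days
    topics = []
    for named_entity, node in graph.items():
        for topic_value, topic in node.items():
            # 704: keep only documents whose DocPT falls within the timeframe.
            recent = [doc_id for doc_id, published in topic["docs"]
                      if cutoff <= published <= now]
            if recent:
                title = f"{named_entity} {topic_value}"  # 706: topic title
                topics.append((title, len(recent)))      # 708: assumed weight
    topics.sort(key=lambda item: item[1], reverse=True)  # 710: descending
    return topics[:top_n]                                # 712: top N
```

A weight that also reflects the strength of the relationship between related topics would replace `len(recent)` in step 708.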

The present invention consists of features and a combination of parts hereinafter fully described and illustrated in the accompanying drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.

BRIEF DESCRIPTION OF ACCOMPANYING DRAWINGS

To further clarify various aspects of some embodiments of the present invention, a more particular description of the invention will be rendered by references to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings in which:

FIG. 1.0 illustrates a general architecture of the system of the present invention.

FIG. 1.0a illustrates the sub modules within the data relation unit of the system of the present invention.

FIG. 2.0 is a flowchart illustrating a general methodology of the present invention.

FIG. 3.0 is a flowchart illustrating steps of processing real-time streaming data received from a plurality of remote data sources of the present invention.

FIG. 4.0 is a flowchart illustrating steps of extracting and constructing semantic relationship between documents based on processed sentences, phrases or words of the present invention.

FIG. 5.0 is a flowchart illustrating steps of extracting key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value of the present invention.

FIG. 6.0 is a flowchart illustrating steps of building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases of the present invention.

FIG. 7.0 is a flowchart illustrating steps of extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic of the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The present invention relates to a system and method for hot topics aggregation using relationship graph. In particular, the present invention provides a system and method that breaks data into sentences, phrases or words to extract named entities from the content of a document and further constructs a graph of semantic relationship between each named entities to form an entity relation topic for hot topic extraction. Hereinafter, this specification will describe the present invention according to the preferred embodiments. It is to be understood that limiting the description to the preferred embodiments of the invention is merely to facilitate discussion of the present invention and it is envisioned without departing from the scope of the appended claims.

Reference is first made to FIG. 1.0 and FIG. 1.0a. FIG. 1.0 illustrates a general architecture of the system of the present invention and FIG. 1.0a illustrates the sub modules within the data relation unit of the system of the present invention. As illustrated in FIG. 1.0, the system (100) of the present invention for hot topics aggregation using a relationship graph having a plurality of databases (108, 110, 112, 114) for storing processed and unprocessed documents is characterized by three modules. The modules are namely at least one data processing unit (102), at least one data relation unit (104) and at least one topic ranking unit (106). The at least one data processing unit (102) having means for receiving real-time streaming data from a plurality of remote data sources; cleansing, analysing and processing received data in text-based; breaking down data processed into sentences, phrases or words to extract named entities and discover dependency between the named entities and sentences to derive meanings from documents containing natural language by applying natural language processing. The at least one data relation unit (104) configured to the at least one data processing unit (102) having means for extracting and constructing semantic relationship between documents based on processed sentences, phrases or words; and the at least one topic ranking unit (106) configured to the at least one data relation unit (104) for extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic. The at least one data relation unit (104) further comprises at least one key phrase extraction module (104a) and named entities relation graph clusters (104b).
The key phrase extraction module (104a) selects a set of phrases describing extracted entities as named entity topic and its value; and the named entities relation graph clusters (104b) build a relationship graph between documents to identify semantic similarity based on extracted named entities and key phrases. These modules are coupled to a plurality of databases wherein the plurality of databases (108, 110, 112, 114) are preferably a content database (108), a named entities database (110), a graph database (112) and a hot topic database (114). In particular, the at least one data processing unit (102) is coupled to the content database (108) and the named entities database (110), the at least one data relation unit (104) is coupled to the named entities database (110) and the graph database (112), and the at least one topic ranking unit (106) is coupled to the hot topic database (114).

Reference is now made to FIG. 2.0 which illustrates the general methodology of the present invention for hot topics aggregation using a relationship graph. As illustrated in FIG. 2.0, the general methodology comprises steps of processing real-time streaming data received from a plurality of remote data sources (202); extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204); and extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (206).

Reference is now made to FIG. 3.0 which illustrates the steps for processing real-time streaming data received from a plurality of remote data sources in the at least one data processing unit (102). The steps for processing real-time streaming data received from a plurality of remote data sources are initiated by first retrieving title and content of each document from the content database (302). Thereafter, all unwanted contents including symbols, emoticons, images and uniform resource locator are removed from the content of the document (304). Upon removing all unwanted contents, the contents of all documents are tokenized into sentences (306), followed by marking the title and sentences from contents of all documents with a POS tag or part-of-speech tag (308). A POS tag is a special label assigned to each tokenized word in a text corpus to indicate the part of speech for purposes of corpus searches and for text analysis. Named entities are subsequently extracted from the marked title and sentences from contents of all documents of step 308 (310), and the named entities extracted in step 310 are checked and further matched with abbreviations in the named entities database (312). Upon matching the abbreviations in the named entities database, the abbreviations of extracted named entities are harmonized back to their original names (314). The occurrence of named entities in the contents of all documents is thereafter determined (316). The list of named entities found is ranked in descending order based on the occurrence of named entities determined in step 316 and the position in the title and content of the documents, and the ranked list of named entities is saved in the content database (318). Finally, cross sentence dependency parser is performed for each named entity between all sentences related to the named entity (320).
Cross sentence dependency parser searches across all sentences for those in which the entity is mentioned in the context, including a separate sentence in which the entity is replaced by a pronoun, which usually occurs in the next sentence. Cross sentence dependency parser is used to link or gather separate sentences which refer to or mention the entity. An example of an application of cross sentence dependency parser is provided herewith.

Example:

"KUALA LUMPUR: The East Coast Rail Link, ECRL project will be unveiling its proposed realignment from Kota Bharu to Dungun during its Public Inspection exercise on Nov 25, 2019, according to Malaysia Rail Link Sdn Bhd, MRL in a statement on Friday. It said the public would be able to give feedback and suggestions during the three-month ECRL Public Inspection for the Kota Bharu-Dungun stretch also known as Section A, comprising a 210.4km mainline, a proposed future spurline 14.4 km, and six-passenger freight station."

In the above example, the word 'It' at the start of the second sentence refers to MRL, which is named in the first sentence; the entity is thus mentioned across two separate sentences. Therefore, cross sentence dependency is applied in the present invention in order to gather all relevant sentences which mention the relevant entity in the context for further processing.
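The gathering of sentences described above may be illustrated by the following Python sketch. It is a deliberately crude heuristic and not a dependency parser: a sentence is linked to an entity if it names the entity, or if it opens with a pronoun and the immediately preceding sentence was linked. The pronoun list and the function name are hypothetical.

```python
import re

# Hypothetical pronoun list used by the heuristic.
PRONOUNS = {"it", "he", "she", "they", "its", "their", "his", "her"}


def gather_entity_sentences(sentences, aliases):
    """Collect sentences that mention the entity directly or via a leading pronoun."""
    linked = []
    previous_linked = False
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        mentions = any(
            re.search(r"\b" + re.escape(alias) + r"\b", sentence) for alias in aliases
        )
        starts_with_pronoun = bool(words) and words[0].lower() in PRONOUNS
        if mentions or (previous_linked and starts_with_pronoun):
            linked.append(sentence)
            previous_linked = True
        else:
            previous_linked = False
    return linked


sentences = [
    "The ECRL project will unveil its proposed realignment, according to MRL in a statement on Friday.",
    "It said the public would be able to give feedback.",
    "Traffic was heavy on Friday.",
]
mrl_sentences = gather_entity_sentences(sentences, ["MRL", "Malaysia Rail Link Sdn Bhd"])
```

Here `mrl_sentences` holds the first two sentences. The heuristic cannot tell whether 'It' refers to MRL or to the ECRL project, which is precisely the ambiguity a real cross sentence dependency parser is intended to resolve.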

Reference is now made to FIG. 4.0, 5.0 and 6.0 respectively. FIG. 4.0 illustrates the steps for extracting and constructing semantic relationship between documents based on processed sentences, phrases or words (204) in the at least one data relation unit (104), FIG. 5.0 illustrates the steps of extracting a key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value of the present invention, and FIG. 6.0 illustrates the steps of building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases of the present invention. As illustrated in FIG. 4.0 (400), in extracting and constructing semantic relationship between documents based on processed sentences, phrases or words, a key phrase is first extracted by selecting a set of phrases that best describe extracted entities as named entity topic and its value (402); thereafter, a relationship graph between documents is built from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404). As illustrated in FIG. 5.0 (500), extracting the key phrase (402) further comprises the steps of first retrieving the ranked extracted named entities from step 318 (502), followed by compiling the sentences obtained from the cross sentence dependency parser in step 320 with their related named entities (504). Thereafter, the sentences are tokenized into words (506), and at least one stop word is removed to create phrases (508). Stop words are words that are filtered out before or after processing of natural language. Generally, stop words are commonly used words that a search engine has been instructed to ignore. The created phrases are stemmed (510) and converted into vectors (512).
Subsequently, a similarity matrix is constructed between the vectors (514). The phrases are scored using a graph-based ranking algorithm (516) and ranked in descending order of the rank score (518). Further, the top ranked phrase is selected as the key phrase (520), which is used as the named entity topic, NeT value. Finally, the selected key phrase is stored in the named entity database (522), which is also known as the named entity, NE dictionary.
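Steps 506 to 520 may be illustrated by the following Python sketch. The specification does not name the graph-based ranking algorithm, the vector representation or the stemmer, so the sketch assumes a weighted PageRank-style iteration over cosine similarities of bag-of-words vectors and a crude suffix stripper; the stop word list and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list (step 508).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "will", "be", "its", "for", "by"}


def stem(word):
    """Crude suffix stripper standing in for a real stemmer (step 510)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_phrases(sentence):
    """Steps 506 to 508: tokenize into words and split the word stream at stop words."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9]+", sentence.lower()):
        if token in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_phrases(sentences, damping=0.85, iterations=50):
    """Steps 510 to 518: vectorize, build the similarity matrix and rank the phrases."""
    phrases = sorted({p for s in sentences for p in make_phrases(s)})
    vectors = [Counter(stem(w) for w in p.split()) for p in phrases]  # steps 510, 512
    n = len(phrases)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]  # step 514
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):  # step 516: PageRank-style update
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j] for j in range(n) if out_weight[j]
            )
            for i in range(n)
        ]
    return sorted(zip(phrases, scores), key=lambda ps: (-ps[1], ps[0]))  # step 518


ranked = rank_phrases(
    ["rail link realignment", "rail link project", "rail realignment", "public feedback"]
)
key_phrase = ranked[0][0]  # step 520
```

In this small example the phrase that shares the most words with the other phrases receives the highest score and would be stored as the NeT value, while a phrase with no overlap keeps only the base score.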

As illustrated in FIG. 6.0 (600), building a relationship graph between documents from named entities relation graph clusters to identify the semantic similarity based on extracted named entities and key phrases (404) further comprises the steps of first determining if the named entity, NE exists in the relationship graph (602). If the named entity does not exist in the relationship graph, a new named entity, NE node is created (604); if the named entity exists in the relationship graph, the process proceeds to step 606. It is then determined if the named entity, NE node has a named entity topic, NeT value (606). If the named entity topic does not exist in the named entity node, a new named entity topic, NeT is added to the named entity, NE node (608); if the named entity topic exists in the graph, the process proceeds to step 610. It is further determined if the named entity topic, NeT value exists (610). If the named entity topic value does not exist in the named entity topic, the named entity topic, NeT value is added to the named entity topic (612), and the process proceeds to step 614. Thereafter, the document ID, docID is added to the named entity topic (614), and the document published time, DocPT is added to the named entity topic and updated in a graph database (616). It is further determined if all named entities, NE extracted from the content have been added into the graph (618). If “Yes”, the named entity relation graph is constructed; if “No” (620), steps 500 onwards as illustrated in FIG. 5.0 are reiterated for extracting key phrase by selecting a set of phrases that best describe extracted entities as named entity topic and its value, followed by steps 602 onwards in FIG. 6.0, until the named entity relation graph is constructed.
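The decision flow of steps 602 to 620 may be illustrated by the following Python sketch. It assumes, purely for illustration, that the relationship graph is held as a nested dictionary in memory rather than in a graph database, with each NE node holding a "NeT" mapping from NeT value to a list of document records; the function names and the record layout are hypothetical.

```python
from datetime import datetime


def add_document(graph, entity, net_value, doc_id, published):
    """Insert one (NE, NeT value, docID, DocPT) record into the graph."""
    ne_node = graph.setdefault(entity, {})        # steps 602, 604: create NE node if absent
    net = ne_node.setdefault("NeT", {})           # steps 606, 608: add NeT if absent
    documents = net.setdefault(net_value, [])     # steps 610, 612: add NeT value if absent
    documents.append({"docID": doc_id, "DocPT": published})  # steps 614, 616
    return graph


def build_graph(records):
    """Steps 618 to 620: repeat until every extracted named entity has been added."""
    graph = {}
    for entity, net_value, doc_id, published in records:
        add_document(graph, entity, net_value, doc_id, published)
    return graph


graph = build_graph([
    ("MRL", "ecrl realignment", "doc-1", datetime(2020, 11, 1)),
    ("MRL", "ecrl realignment", "doc-2", datetime(2020, 11, 10)),
    ("MRL", "public inspection", "doc-2", datetime(2020, 11, 10)),
])
```

Because each level is created only when it is missing, documents that share a named entity and NeT value accumulate under the same entry, which is what later allows topics to be clustered and counted.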

Reference is now made to FIG. 7.0, which illustrates the steps of extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic of the present invention. As illustrated in FIG. 7.0 (700), in extracting topic clusters from the relationship graph, determining topic ranking and constructing hot topics abstract from ranked topic clusters to form a specific topic and a general topic (206), a timeframe in day(s) is first set (702). Thereafter, the named entity, NE nodes that contain a named entity topic, NeT value with DocPT within the timeframe are obtained (704), and a topic title is constructed using the named entity, NE and the named entity topic value (706). Subsequently, the named entity topic, NeT weight is determined (708). Finally, the topics are ranked in descending order of the named entity topic weight (710), and the top N topics are listed as hot topics for the timeframe (712).
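Steps 702 to 712 may be illustrated by the following Python sketch. The graph is assumed to be a nested dictionary mapping each NE to a "NeT" mapping from NeT value to a list of records carrying a DocPT; this layout, the topic title format and the function names are hypothetical. The weight function used here is only a stand-in that counts the documents published within the timeframe; it is not the NeT weight formula of the present invention, and any other weight function may be passed in its place.

```python
from datetime import datetime, timedelta


def document_count(published_times, now, timeframe):
    """Stand-in NeT weight: number of documents published within the timeframe."""
    return float(len(published_times))


def hot_topics(graph, now, timeframe_days, top_n, weight_fn=document_count):
    """Return the top N (topic title, NeT weight) pairs for the timeframe."""
    timeframe = timedelta(days=timeframe_days)  # step 702
    topics = []
    for entity, ne_node in graph.items():
        for net_value, documents in ne_node.get("NeT", {}).items():
            recent = [
                d["DocPT"] for d in documents
                if timedelta(0) <= now - d["DocPT"] <= timeframe
            ]  # step 704: keep documents whose DocPT falls within the timeframe
            if not recent:
                continue
            title = f"{entity}: {net_value}"  # step 706 (assumed title format)
            topics.append((title, weight_fn(recent, now, timeframe)))  # step 708
    topics.sort(key=lambda t: (-t[1], t[0]))  # step 710
    return topics[:top_n]  # step 712


example_graph = {
    "MRL": {"NeT": {
        "ecrl realignment": [
            {"docID": "doc-1", "DocPT": datetime(2020, 11, 16)},
            {"docID": "doc-2", "DocPT": datetime(2020, 11, 15)},
            {"docID": "doc-3", "DocPT": datetime(2020, 10, 1)},
        ],
        "public inspection": [{"docID": "doc-4", "DocPT": datetime(2020, 11, 10)}],
    }},
    "ECRL": {"NeT": {"section a": [{"docID": "doc-5", "DocPT": datetime(2020, 9, 1)}]}},
}
top = hot_topics(example_graph, now=datetime(2020, 11, 17), timeframe_days=7, top_n=2)
```

Keeping the weight as a parameter separates the filtering and ranking logic, which the specification describes, from the weighting formula itself.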

The hotness of a topic is determined by its named entity topic, NeT weight which is determined using the formula as shown below:

[Formula not reproduced in the source text.]

Wherein the terms of the formula denote: document count; time range; current date or time; document published date or time; set timeframe in day(s) or hours; and total up-to-date document for a NeT.

The present invention in summary provides for extraction of hot topics and of the semantic relationship between entity topics by breaking data into sentences, phrases or words to extract all named entities from the content and by applying cross sentence dependency related to each named entity. Sentences related to a named entity are compiled from the cross sentence dependency to extract the keywords or phrases, and a graph of the semantic relationship between the named entities is further constructed to form an entity relation topic for hot topics extraction. Topic clusters are extracted from the relationship graph and topic ranking is determined based on the similarity matrix between named entity topic values with inclusion of an up-to-date weightage calculation to construct hot topics abstract from the ranked topic clusters.

Throughout this specification, unless the context requires otherwise, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated step or element or integer or group of steps or elements or integers, but not the exclusion of any other step or element or integer or group of steps, elements or integers. Thus, in the context of this specification, the term “comprising” is used in an inclusive sense and thus should be understood as meaning “including principally, but not necessarily solely”.