Title:
AUTOMATICALLY EXTRACTING INFORMATION ABOUT LOCAL EVENTS FROM WEB PAGES
Document Type and Number:
WIPO Patent Application WO/2007/118305
Kind Code:
A1
Abstract:
A system and method for electronically extracting events from electronic documents such as web pages uses a probabilistic extraction tool to linguistically process the electronic documents and generate event data. The probabilistic extraction tool may be configured using Hidden Markov Model techniques and trained with manually extracted event descriptions. The list of events extracted automatically may be useful to define local event data portals making notice of local events electronically available and searchable to users.

Inventors:
O'REILLY STEVEN (CA)
BROWNE KEITH (CA)
IMRIE ROBERT (CA)
GILCHRIST ROD (CA)
GRANTHAM SIMON (CA)
Application Number:
PCT/CA2007/000582
Publication Date:
October 25, 2007
Filing Date:
April 10, 2007
Assignee:
DEMANDCAST CORP (CA)
O'REILLY STEVEN (CA)
BROWNE KEITH (CA)
IMRIE ROBERT (CA)
GILCHRIST ROD (CA)
GRANTHAM SIMON (CA)
International Classes:
G06Q30/00; G06F17/30
Domestic Patent References:
WO2000041090A1 (2000-07-13)
WO2007005465A2 (2007-01-11)
Foreign References:
US20060241860A1 (2006-10-26)
Other References:
FREITAG ET AL.: "Information Extraction with HMMs and Shrinkage", PROCEEDINGS OF THE AAAI-99 WORKSHOP ON MACHINE LEARNING FOR INFORMATION EXTRACTION, 1999
BALDI ET AL.: "Modeling the Internet and the Web: Probabilistic Methods and Algorithms", WILEY & SONS, Chapter 4, May 2003, pages 77-78, 80-81, 121-122
TEZUKA ET AL.: "Integrated Model for a Region-Specific Search System and Its Implementation", PROCEEDINGS OF THE 2003 IRC INTERNATIONAL CONFERENCE ON INTERNET INFORMATION RETRIEVAL, KOYANG, KOREA, 2003, pages 243-248
KURASHIMA ET AL.: "Blog Map of Experiences: Extracting and Geographically Mapping Visitor Experiences from Urban Blogs", WEB INFORMATION SYSTEMS ENGINEERING - WISE 2005, LECTURE NOTES IN COMPUTER SCIENCE, SPRINGER-VERLAG BERLIN HEIDELBERG, October 2005, pages 496-503
TEZUKA ET AL.: "Toward Tighter Integration of Web Search with a Geographic Information System", PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, WWW'06, EDINBURGH, SCOTLAND, ACM PRESS, NEW YORK, NY, 23-26 May 2006, pages 277-286
Attorney, Agent or Firm:
GOWLING LAFLEUR HENDERSON LLP (1 First Canadian Place, 100 King Street West, Toronto Ontario M5X 1G5, CA)
Claims:

WHAT IS CLAIMED IS:

1. A method for extracting event data from an electronic document comprising:

providing the electronic document to a probabilistic extraction tool configured to automatically identify event data using probabilistic processing techniques; and

operating the probabilistic extraction tool to extract the event data.

2. The method of claim 1 comprising repeating said steps of providing and operating to extract event data from a plurality of electronic documents.

3. The method of claim 2 comprising performing said repeating step periodically to update the event data automatically.

4. The method of claim 2 comprising providing a search interface to said event data to facilitate a user to determine a desired list of events.

5. The method of claim 4 comprising providing said list of events in response to a user query.

6. The method of claim 2 comprising providing said event data in the form of a data feed to a third party.

7. The method of claim 1 comprising the step of training the probabilistic extraction tool with a set of manually extracted event data.

8. A method of extracting event data from an electronic document comprising:

receiving the electronic document for automatic processing, said document comprising event data describing one or more events;

tokenizing the electronic document to define a token stream; and

evaluating the token stream with a probabilistic extraction tool to determine the event data for each event.

9. The method of claim 8 comprising:

receiving a set of training data comprising manually extracted event data from one or more electronic documents; and,

using the set of training data to train the probabilistic extraction tool.

10. The method of claim 8 wherein the probabilistic extraction tool is configured in accordance with Hidden Markov Model techniques for linguistic processing of the electronic document.

11. The method of claim 8 comprising providing said event data to a user electronically.

12. The method of claim 8 comprising providing a user interface for searching the event data using web-based protocols.

13. The method of claim 8 comprising providing said event data to a third party via a data feed.

14. A computer system configured to perform the method according to any one of claims 1 to 7.

15. A computer system configured to perform the method according to any one of claims 8 to 13.

16. A computer readable medium comprising instructions and data which when executed by a processor adapt a computer system to implement the method of any one of claims 1 to 7.

17. A computer readable medium comprising instructions and data which when executed by a processor adapt a computer system to implement the method of any one of claims 8 to 13.

18. A computer system for extracting event data comprising:

an input means for receiving electronic documents, said documents comprising event data describing one or more events;

a probabilistic extraction tool configured to process electronic documents using probabilistic processing techniques to extract and provide extracted event data; and,

an output means for providing a listing of events in response to said extracted event data.

19. The computer system of claim 18 wherein the probabilistic extraction tool comprises a tokenizer to define a token stream from said electronic documents and an event component identifier to identify and separate event data from said token stream.

20. The computer system of claim 19 comprising a user interface for providing the listing of events, said interface comprising a search tool for searching the extracted event data to retrieve events of interest.

Description:

AUTOMATICALLY EXTRACTING INFORMATION ABOUT LOCAL EVENTS FROM WEB PAGES

TECHNICAL FIELD

[0001] The present invention relates to a method for automatically extracting information about local events from web pages or other electronic documents.

BACKGROUND

[0002] One kind of Internet search that is currently very difficult to do well is to search for local events. For a consumer it is highly desirable to be able to use the Internet to find a comprehensive list of the events that are within their local area, that are of a type or related to a subject that they are interested in and that are being held at a time that is convenient for them. Current general search engines (e.g. Google™, Yahoo™, MSN Search™/Windows Live™, etc.) are not well-suited to find this type of information or to present their output in a user-friendly manner.

[0003] Local event information may be available on the Internet from a number of sources such as newspaper sites, ticket selling sites, venue sites, city portals, event organizers, and interest group sites. Information on these sites tends to be fragmented and non-comprehensive. There are many sites that focus on specific content categories and provide depth of detail for events related to these categories. There is presently no site or search strategy that can provide both depth and breadth of events in a geographically limited area.

[0004] As an alternative to this type of site, it would be useful to the consumer for there to be a single source that offered a comprehensive and complete listing of his local events, and it would be useful if this up-to-date information could be accessed anytime and anywhere. This would help people identify what's going on that is of interest to them and help them plan their social calendars. Anecdotal information from local portal sites suggests that the search for this type of content is very common. Currently some of these portals find that local event search is the third most common type of user search.

[0005] In addition to consumers, this type of an event data source would be useful to local information businesses. It would allow them to reduce costs by having fewer people involved in collecting event data. It would improve the depth and breadth of the event information that they could provide to their readers. And it would provide an opportunity to increase their revenues by providing more attractive content, and thus attracting readership, as well as by selling targeted advertising to local businesses and connecting consumers to local business advertising that matches their particular search criteria and hence their specific interests.

[0006] One method of operation for local event data portal sites is for human content editors or content managers to collect the data manually from sources such as press releases and manually input it into their own, individual content management databases. Typically this process is unique to each site and is inefficient and not standardized. It is costly and time consuming, and it lacks depth and scalability.

[0007] A somewhat more automated approach uses tool-assisted web "scraping" for the aggregation of local event data. Specific web sites are identified and handcrafted content extraction scripts are developed to extract event information and related links while making very specific format assumptions about a web site's pages. Scraping is an iterative process, requiring content extraction review and script refinement.

[0008] Google™ is a widely used automated search system that uses keywords and a rating system based primarily on the number and type of links to a page to find specific web pages and to present them to a user with the "best and most relevant" pages being presented first. The results obtained from a Google search give a potentially large list of web pages belonging to sites that list, manage, sponsor or provide venues for events. The sites tend to be specialized into business events, entertainment events, events requiring tickets, music events, club events, wine tasting events and so on. In general the results provide a bewildering array of places to look for specific event information.

[0009] A preferable approach for searching for events is to search the events themselves based on a specified set of criteria rather than searching all the web pages that have a general keyword such as "event" on the page. Preferably, with this type of search (for example "what jazz events are on downtown this Tuesday evening"), rather than getting a vast number of relatively low relevance web page pointers, the searcher receives a list of events, without duplicates, sorted into a logical rank order based on a suitable measure, such as the number of different web sites found during the web crawl that referenced the particular event, or other relevance, desirability and quality metrics.

[0010] In order to provide such an improved search of local event data on web pages, event data is extracted and categorized (i.e. by finding and recognizing event descriptions rather than just keywords) from web pages. This is a fundamentally different approach to searching from what Google does.

[0011] Other event search approaches in use today employ templates. Zvents.com uses a template based system that functions simply as a form of acceleration for manual extraction of event related data from a defined collection of Internet web sites. The scalability of such a solution is extremely limited since skilled staff or developers are required to adapt the template system for any new or modified site. In addition to this template based system, the majority of Zvents.com event listings are directly entered and uploaded by end users. Eventful.com uses a template approach that is similar to Zvents.com but simpler. They also obtain event data from "partner" sites such as stubhub.com which sells tickets, and by encouraging end users to pre-format and submit event listings.

[0012] One problem with allowing end users to upload events is that it provides an opportunity for the end user to fill the event database with "spam".

[0013] It is clear from the preceding discussion that Internet searches of local event data in general and the local event aggregation business in particular could be significantly improved by the automated finding and extracting of local event data from web pages.

[0014] Therefore, there is a need for a method for automatically extracting event data from web pages and other electronic documents.

SUMMARY OF THE INVENTION

[0015] These and other problems are generally solved or circumvented, and technical advantages are generally achieved, by preferred embodiments of the present invention.

[0016] In accordance with aspects of the invention, systems and methods for electronically extracting events from electronic documents such as web pages use a probabilistic extraction tool to linguistically process the electronic documents and generate extracted event data with little to no human intervention. The probabilistic extraction tool may be configured using Hidden Markov Model techniques and trained with manually extracted event descriptions. The list of events extracted automatically from web pages, etc. may be useful to define local event data portals making notice of local events electronically available and searchable to users.

[0017] In one embodiment there is provided a method for extracting event data from an electronic document comprising providing the electronic document to a probabilistic extraction tool configured to automatically identify event data using linguistic or probabilistic processing techniques; and operating the probabilistic extraction tool to extract the event data. The steps of providing and operating may be repeated to extract event data from a plurality of electronic documents and repeated periodically to update the list of events automatically. A search interface to the event data may be provided to facilitate a user in determining a desired list of events such as local events. Preferably the probabilistic extraction tool is trained with a set of manually extracted event data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] In order that the invention may be readily understood, embodiments of the invention are illustrated by way of examples in the accompanying drawings, in which:

[0019] FIG. 1 is a flowchart of operation for automatically extracting event information from a web page;

[0020] FIG. 2 is a block diagram of an extraction system according to an embodiment and illustrated in two phases;

[0021] FIG 3 is an illustration of web page tokenization;

[0022] FIG 4 is an illustration of event component identification;

[0023] FIG 5 shows state transition probabilities used in event component identification;

[0024] FIG 6 shows word based transition probabilities in event component identification; and

[0025] FIG 7 shows the determination of the most probable state transition path resulting in probabilistic identification of event components.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

[0026] This section describes embodiments of the instant invention with reference to the accompanying figures.

[0027] The following description discloses a system and method, among other aspects, for automatically extracting event data from web pages or other electronic documents.

[0028] Referring to FIG 1 and FIG 2, the first step (101) in automatically extracting event data from a web page is to train the extraction tool with a set of manually extracted event components.

[0029] In order to provide training, a set of web pages containing local events is collected and presented to human editors by a program. The program chooses pages by using a Bayesian classifier on candidate pages obtained by web crawling, by issuing search queries to a standard search engine, or by using other well known search engine techniques. The program then presents the candidate web pages to the human editors, who then highlight the various components (i.e. date, location, time, event description, etc.) of the local events that are present on each web page. Based on the highlighting, the program inserts additional tokens into the web pages to mark the beginning and ending of each of the local event components. This process results in a collection of training web pages with human generated markup (201).
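
The patent does not publish the Bayesian classifier used to choose candidate pages, so the following is only a minimal sketch of that selection step, assuming a scikit-learn naive Bayes model; the example page texts, labels and the CountVectorizer/MultinomialNB choice are illustrative assumptions, not part of the disclosure.

```python
# Hedged sketch of the candidate-page selection described in paragraph [0029]:
# a Bayesian (here, naive Bayes) classifier scores crawled pages so that only
# likely event pages are shown to human editors for highlighting.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: page text paired with a has-local-event label.
pages = [
    "Jazz night at the Rex, Saturday 9pm, tickets $10",
    "Farmers market every Sunday, 8am to 1pm, Main Street",
    "Privacy policy and terms of service for this web site",
    "Contact us: phone, fax and mailing address",
]
labels = [1, 1, 0, 0]  # 1 = page likely contains local event data

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(pages), labels)

# Score a newly crawled candidate page before presenting it to an editor.
candidate = ["Wine tasting event downtown this Tuesday evening"]
print(classifier.predict(vectorizer.transform(candidate)))  # e.g. [1]
```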

[0030] These training web pages are then fed to a training version of a web page tokenizer (202) that breaks the web page up into tokens. Tokens are, in general, words on a web page or markup elements. Not all markup elements become tokens since many of these elements deal with the appearance of the words and are therefore not relevant to the contents of the local event components.

[0031] The training version of the web page tokenizer (202) performs the same tokenizing of web pages that the extraction version of the web page tokenizer (207) does, but rather than emitting a token stream, it generates the various data tables that the event component identifier uses in identifying local events.

[0032] In fact the two tokenizers (202, 207) may be the same program invoked with different behavior modifying flags.

[0033] The output of the training step (101) of the automatic local event data extraction process is the contents of the three tables shown in FIG 2 (203, 204, 205). Based on the contents of these three tables, the event component identifier (209) is able to identify event components from the token stream (208).

[0034] The second step (102) in automatically extracting event data from a web page is to tokenize the web page (206) from which local event data is to be extracted. This is done by the tokenizer (207) and the result of this operation is a token stream (208).

[0035] The third step (103) in automatically extracting event data from a web page is to identify the components of the local events in the web page by evaluating the token stream (208) with the event component identifier (209), which then produces a stream of event components (210). Groups of these event components provide complete local event descriptions.

[0036] FIG 3 shows the operation of the tokenizer in more detail.

[0037] The contents of the web page being processed to extract local event data (301) are read by the tokenizer (302), which discards markup that describes the page's appearance but retains markup that relates to the structure of the page content.

[0038] In the example, the table element's height and background attributes are discarded (first line), but the fact that words are contained in table rows (<tr>) and table data elements (<td>) is retained.
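
The patent does not give source for the tokenizer (207/302), so the following is a minimal sketch, under assumptions, of the behaviour described in paragraphs [0030] and [0037]-[0038]: words and structural markup become tokens while appearance markup is dropped. The set of structural tags and the sample markup are illustrative only.

```python
# Sketch of a tokenizer that keeps structural tags (<table>, <tr>, <td>, ...)
# as tokens, drops appearance attributes such as height and background, and
# splits text content into word tokens.
from html.parser import HTMLParser

STRUCTURAL_TAGS = {"table", "tr", "td", "ul", "li", "p", "br"}  # assumed set

class EventTokenizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes (height, background, style, ...) describe appearance only,
        # so they are discarded; the structural tag itself is kept as a token.
        if tag in STRUCTURAL_TAGS:
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())

tok = EventTokenizer()
tok.feed('<table height="200" background="b.gif"><tr><td>Jazz at the Rex</td>'
         '<td>Mar 4, 2004</td></tr></table>')
print(tok.tokens)
# ['<table>', '<tr>', '<td>', 'Jazz', 'at', 'the', 'Rex', '</td>',
#  '<td>', 'Mar', '4,', '2004', '</td>', '</tr>', '</table>']
```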

[0039] The output of the tokenizer is the token stream (304).

[0040] As an illustration, FIG 3 shows the word list (303) that would be generated for this web page if the tokenizer were in training mode (i.e. the table 203 on FIG 2).

[0041] FIG 4 shows the operation of the event component identifier in more detail.

[0042] The token stream (401) previously produced by the tokenizer is read by the event component identifier (402) which in turn produces an event component stream (403).

[0043] In the process the event component identifier (402) determines which parts of the web page being processed contain the actual description of each local event found.

[0044] In order to do this identification of local event components, the event component identifier (402) uses the training tables generated during the training process (203, 204, 205) in conjunction with a Hidden Markov Model (HMM) (See, Manning, Christopher D. and Schutze, Hinrich (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts & London, England: The MIT Press).

[0045] In practice it has been found to be advantageous to have the tokenizer (207) perform recognition of simple phrases. For example it is relatively straightforward to have the tokenizer (207) recognize the string "Mar 4, 2004" as a date and emit a 'date' token (while the training tokenizer 202 stores this 'meta-word' in the word list (203)). This behavior substantially reduces the complexity of the HMMs required in the component identifier (209).

[0046] Similarly, the tokens 'address' for a string like "567 Queen St. W.", 'venue' for a string like "Air Canada Center", and 'artist/performer' for a string like "Barenaked Ladies" provide substantial benefit.
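
As a concrete illustration of the phrase recognition in paragraphs [0045]-[0046], the sketch below collapses date strings and known venue names into single meta-words before tokenization; the regular expression, the meta-word spellings ('__date__', '__venue__') and the venue list are assumptions made for this example, not the patent's actual rules.

```python
# Replace recognizable phrases (dates, known venues) with single meta-word
# tokens so the downstream HMM sees 'date' and 'venue' rather than raw strings.
import re

DATE_RE = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},?\s+\d{4}\b"
)
KNOWN_VENUES = ["Air Canada Center", "The Rex"]  # assumed lookup table

def collapse_phrases(text):
    """Collapse recognizable phrases into meta-words before word tokenization."""
    text = DATE_RE.sub(" __date__ ", text)
    for venue in KNOWN_VENUES:
        text = text.replace(venue, " __venue__ ")
    return text.split()

print(collapse_phrases("Barenaked Ladies play the Air Canada Center on Mar 4, 2004"))
# ['Barenaked', 'Ladies', 'play', 'the', '__venue__', 'on', '__date__']
```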

[0047] Events are often very similar to each other. Particular events may be repeated regularly in the same locations (e.g. football, baseball, basketball, hockey and other sporting events), there may be many different events in the same venue and there may be many events in different locations with the same participants (i.e. a band tour). Since the training tokenizer (202) can label short phrases as relatively generic tokens such as 'venue', the overall system can take significant advantage of these similarities between events to improve accuracy and speed of recognition.

[0048] The transition probability table (205) is shown in FIG 5.

[0049] As the training tokenizer (202) processes the training web pages, it encounters the various highlighted local event components. The different types of highlighted regions (i.e. date, time, location, event description, none of the above, etc.) are referred to as 'states' in FIG 5, a term commonly used in describing HMMs. Four states are shown in FIG 7 and assigned representative meanings (a reduced number of states are shown to simplify the illustration).

[0050] For each token encountered on the training web pages (201) the training tokenizer (202) keeps a count of the type of state transition that occurred. After all the training web pages are processed the training tokenizer (202) uses this table of counts to calculate the probability of transition from state to state for each new token.

[0051] This set of probabilities is emitted as the transition probability table (205).

[0052] The word based transition probability table (204) is shown in FIG 6.

[0053] In addition to counting the state transitions that occurred with each token during the processing of the training web pages (201), the training tokenizer also records the count in a table of state transitions that is associated with each unique word.

[0054] For example if the word 'December' was encountered 100 times during the processing of all of the training pages (201), each time it was encountered a count was incremented corresponding to what state transition occurred following the occurrence of the word. After all the training web pages are processed the training tokenizer (202) uses this table of counts to calculate the probability of transition from state to state for each time the word 'December' was encountered.

[0055] The third table, the 'Word List' (203) is the list of all unique tokens encountered during the processing of the training web pages (illustration in FIG 3, component 303).
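
A minimal sketch of how the three tables of paragraphs [0049]-[0055] could be accumulated from editor-labelled token streams is shown below; the four-state labelling, the toy training sequence and the start-in-state-D assumption are illustrative only.

```python
# Build the word list (203), the word based transition counts (204) and the
# overall transition counts (205) from labelled training tokens, then convert
# counts into probabilities.
from collections import Counter, defaultdict

# Each training token is paired with the state of the highlighted region it
# came from; A = date, B = location, C = event description, D = none of the
# above (a reduced, assumed state set, as in FIG 7).
labelled = [("Jazz", "C"), ("Night", "C"), ("at", "D"), ("__venue__", "B"),
            ("on", "D"), ("__date__", "A")]

word_list = set()                    # word list (203)
word_trans = defaultdict(Counter)    # word based transition counts (204)
trans = Counter()                    # overall transition counts (205)

prev_state = "D"                     # assume each page starts outside an event
for word, state in labelled:
    word_list.add(word)
    word_trans[word][(prev_state, state)] += 1
    trans[(prev_state, state)] += 1
    prev_state = state

def normalise(counts):
    """Convert transition counts out of each state into probabilities."""
    totals = Counter()
    for (frm, _), n in counts.items():
        totals[frm] += n
    return {(frm, to): n / totals[frm] for (frm, to), n in counts.items()}

transition_probs = normalise(trans)                                       # 205
word_transition_probs = {w: normalise(c) for w, c in word_trans.items()}  # 204
print(transition_probs[("D", "A")])  # probability of entering the date state
```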

[0056] Given these three tables (203, 204, 205) generated by the training tokenizer (202), the event component identifier (402) identifies event components (210) from the token stream (208) by finding the most probable state transition path as shown in FIG 7.

[0057] The event component identifier (402) can begin and end in any state, but for the purpose of illustration, FIG 7 has been simplified to show it always starting and ending in state D. With each token encountered the state of the event component identifier (402) can change to any of the other states, as illustrated with the arrows on the drawing. The probability of each transition is found by looking at the tables (203, 204, 205) generated by the training tokenizer (202).

[0058] For each state transition there can be either one or two probabilities. In all cases the probability from the transition probability table (205) is used. If the token is found in the word list (203) generated by the training tokenizer (202), the probability from the word based transition probability table (204) is also used. In the case where there are two probabilities, the probabilities are multiplied together to get the probability of the state transition.
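
The scoring rule of paragraph [0058] can be written as a short helper; the smoothing constant for unseen transitions and the toy table contents are assumptions added for this sketch.

```python
# Score one candidate state transition as in paragraph [0058]: always use the
# transition probability table (205); if the token is in the word list (203),
# multiply in the word based transition probability (204) as well.
def step_probability(prev_state, state, token,
                     transition_probs, word_transition_probs, word_list):
    p = transition_probs.get((prev_state, state), 1e-9)  # assumed smoothing
    if token in word_list:
        p *= word_transition_probs[token].get((prev_state, state), 1e-9)
    return p

# Toy tables: a 'date' meta-token strongly favours entering the date state A.
word_list = {"__date__"}
transition_probs = {("D", "A"): 0.1, ("D", "D"): 0.9}
word_transition_probs = {"__date__": {("D", "A"): 0.8, ("D", "D"): 0.2}}

print(step_probability("D", "A", "__date__",
                       transition_probs, word_transition_probs, word_list))  # ~0.08
print(step_probability("D", "A", "Tuesday",
                       transition_probs, word_transition_probs, word_list))  # 0.1
```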

[0059] The probability of each possible path through the network shown in FIG 7 is calculated by multiplying the probabilities of all of the state transitions along the path.

[0060] The number of possible paths through a network such as shown in FIG 7 grows geometrically with the number of tokens and so a naive search of all possible paths for the one with the highest probability can quickly become computationally infeasible.

[0061] However it is possible to reduce the search for the most probable path to be linear with the number of tokens and therefore be practical.

[0062] In order to make this optimization we associate a 'distance' value with the nodes in FIG 7 that is proportional to the negative logarithm of the probability rather than using the probability itself. Using the logarithm makes our path calculations a summation rather than a set of multiplications. Since probability values are by definition all less than one, the logarithm of a probability is always negative, and taking the negative logarithm assures us that our 'distance' values are always positive.

[0063] With this substitution our search for the path with the maximum probability becomes a search for the shortest path.

[0064] Now if we observe that there is only one shortest distance from the starting state to any given node in the graph, we can cycle through the graph from the beginning to the end, watching for the shortest path as we go, with a constant (and therefore optimized) number of computational steps for each new token.

[0065] To illustrate this, consider the path in FIG 7 from 701 to 704 to 708. This path is the shortest path from 701 to 708 and is drawn in a heavy weight. Alternate paths from 701 to 708 include 701 to 702 to 708, 701 to 703 to 708 and 701 to 705 to 708. We observe that any path from 701 to 714 must pass through one of the four nodes 706, 707, 708 or 709. Each of these four nodes (706, 707, 708, 709) has a shortest path to it from 701, and one of those shortest paths must be on the shortest path from 701 to 714. We can therefore drop from consideration all of the alternate paths (such as 701 to 702 to 708) that are not the shortest.

[0066] If we move forward to the node labeled 711, the shortest path from 701 to 711 is the shortest path to one of the nodes 706, 707, 708, 709 plus the distance to node 711. In this case the shortest of these four paths is the one drawn in heavy weight from 701 to 704 to 708 to 711.

[0067] This optimization of the calculation of the most probable path is called the 'Viterbi' algorithm (See, Manning, Christopher D. and Schutze, Hinrich (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts & London, England: The MIT Press) and is most commonly used to do probabilistic error correction in data streams.
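
The shortest-path search of paragraphs [0059]-[0067] corresponds to the standard Viterbi recursion; the sketch below is a compact, assumption-laden illustration (toy tables, a fixed start state, a small smoothing constant), not the patent's implementation.

```python
# Viterbi decoding over negative-log 'distances', as in paragraphs [0062]-[0067]:
# the most probable state sequence is the shortest path through the lattice.
import math

STATES = ["A", "B", "C", "D"]   # assumed: date, location, description, other
SMALL = 1e-9                    # assumed smoothing for unseen transitions

def step_probability(prev, cur, token, trans_p, word_p, word_list):
    p = trans_p.get((prev, cur), SMALL)
    if token in word_list:
        p *= word_p[token].get((prev, cur), SMALL)
    return p

def viterbi(tokens, trans_p, word_p, word_list, start="D"):
    dist = {s: (0.0 if s == start else math.inf) for s in STATES}
    back = []  # one dict of back-pointers per token
    for token in tokens:
        new_dist, pointers = {}, {}
        for cur in STATES:
            best_prev, best_cost = None, math.inf
            for prev in STATES:
                cost = dist[prev] - math.log(
                    step_probability(prev, cur, token, trans_p, word_p, word_list))
                if cost < best_cost:
                    best_prev, best_cost = prev, cost
            new_dist[cur], pointers[cur] = best_cost, best_prev
        dist = new_dist
        back.append(pointers)
    # Trace the shortest path back from the best final state.
    state = min(dist, key=dist.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))[1:]  # one state label per token

# Toy run: the 'date' meta-token pulls the path into the date state A.
trans_p = {("D", "D"): 0.7, ("D", "A"): 0.1, ("A", "A"): 0.6, ("A", "D"): 0.4}
word_p = {"__date__": {("D", "A"): 0.9}}
print(viterbi(["doors", "open", "__date__"], trans_p, word_p, {"__date__"}))
# ['D', 'D', 'A']
```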

[0068] Once the event component identifier (402) has established the most probable state transition path it is able to assign tokens to specific states and thus divide the incoming token stream (208) into event components (210).

[0069] In FIG 7 tokens 1, 3 and 4 cause state transitions and token 2 is in state C and therefore emitted as part of the event component associated with that state.

[0070] As will be apparent to persons of ordinary skill in the art, event components (i.e. event data) for particular events may be collected and stored in a data store (not shown) such as a database. The event data, and hence a list or other collection of events prepared from the event data, may be updated by periodically repeating the extraction operations, typically over a plurality of web pages or other source electronic documents, for example to pick up updates to those documents.
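
The data store mentioned in paragraph [0070] is not specified in the patent; the following sketch assumes a simple SQLite table whose schema and example record are invented for illustration.

```python
# Store extracted event components so the listing can be refreshed each time
# the source pages are re-crawled (paragraph [0070]).
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("""CREATE TABLE IF NOT EXISTS events (
                    source_url  TEXT,
                    event_date  TEXT,
                    venue       TEXT,
                    description TEXT,
                    UNIQUE(source_url, event_date, venue, description))""")

# INSERT OR IGNORE keeps periodic re-extraction of the same page idempotent.
conn.execute("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
             ("http://example.com/listings", "2004-03-04",
              "Air Canada Center", "Barenaked Ladies in concert"))
conn.commit()

for row in conn.execute("SELECT event_date, venue, description FROM events"):
    print(row)
conn.close()
```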

[0071] The event data may be provided to others in a variety of ways (or combinations thereof (all not shown)). For example, event data may be made available to users through a web-based search interface that allows users to query and locate a list of events or a particular event of interest. The search interface may be configured with various search tools such as calendars, drop down menus, etc. to direct the search, or may be natural language based (e.g. "Who is playing at the Rex this Saturday night?").

[0072] The event data may be provided to others as a service through a data feed (e.g. RSS feed, XML stream or other feed (all not shown)). For example, the event data may be extracted by a service provider and provided to a third party (e.g. newspaper publisher, city/regional magazine or other web site operator) for its web site or other use. The event data may be in a raw form or a more formatted form ready for publication such as to a list of events on a web page.
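
Paragraph [0072] leaves the feed format open; as one possibility, the sketch below renders extracted events as a small RSS 2.0 document using the Python standard library, with invented channel details and event records.

```python
# Format extracted event data as an RSS 2.0 feed for third-party consumption.
import xml.etree.ElementTree as ET

events = [{"title": "Jazz Night at the Rex", "date": "2007-04-14",
           "link": "http://example.com/events/1"}]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Local events feed"
ET.SubElement(channel, "link").text = "http://example.com/events"
ET.SubElement(channel, "description").text = "Automatically extracted local events"

for ev in events:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = ev["title"]
    ET.SubElement(item, "link").text = ev["link"]
    ET.SubElement(item, "pubDate").text = ev["date"]

print(ET.tostring(rss, encoding="unicode"))
```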

[0073] Although the above description relates to specific embodiments as presently contemplated by the inventors, it is understood that the invention in its broad aspect includes mechanical and functional equivalents of the elements described herein.