Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
PRODUCTION OF ANALYTIC TIME SERIES BASED ON THE COMBINATION OF CODES IN TIME-STAMPED, TAGGED NEWS
Document Type and Number:
WIPO Patent Application WO/2008/113416
Kind Code:
A1
Abstract:
Language for expressing combinations of codes for being evaluated by an automatic processing system, wherein an expression completely and unambiguously defines a time series, and wherein an expression allows combining codes using at least AND, OR, and NOT operators; Computer implemented method for producing analytic time series based on arbitrary combinations of codes from news items, each news item containing textual content and tagged codes, each code being descriptive for textual content of the news item, each news tem being times-tamped, wherein the combinations of codes are defined and evaluated by use of the inventive language.

Inventors:
CORNEZ JASON (ES)
GONZALEZ ARMANDO (ES)
CROSBIE KEVIN (ES)
GAGNER PHILIP (ES)
Application Number:
PCT/EP2007/052599
Publication Date:
September 25, 2008
Filing Date:
March 19, 2007
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
POLLERT HEINER (DE)
CORNEZ JASON (ES)
GONZALEZ ARMANDO (ES)
CROSBIE KEVIN (ES)
GAGNER PHILIP (ES)
International Classes:
G06F17/30
Other References:
KIENREICH W ET AL: "A Visual Query Interface for a Very Large Newspaper Article Repository", DATABASE AND EXPERT SYSTEMS APPLICATIONS, 2005. PROCEEDINGS. SIXTEENTH INTERNATIONAL WORKSHOP ON COPENHAGEN, DENMARK 22-26 AUG. 2005, PISCATAWAY, NJ, USA,IEEE, 22 August 2005 (2005-08-22), pages 415 - 419, XP010835627, ISBN: 0-7695-2424-9
HAVRE S ET AL: "Themeriver: visualizing thematic changes in large document collections", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 8, no. 1, January 2002 (2002-01-01), pages 9 - 20, XP011094161, ISSN: 1077-2626
Attorney, Agent or Firm:
SCHNEIDER, Guenther et al. (München, DE)
Download PDF:
Claims:

Cl aims

1. Language for expressing combinations of codes for being evaluated by an automatic processing system, wherein an expression completely and unambiguously defines a time series, and wherein an expression allows combining codes using at least AND, OR, and NOT operators, wherein:

the AND operator performs counting of news per interval tagged with all of the specified codes or sub-expressions;

the OR operator performs counting of news per interval tagged with any of the specified codes or sub-expressions;

the NOT operator performing counting of news per interval not tagged with the specified code or sub-expression .

2. Computer implemented method for producing analytic time series based on arbitrary combinations of codes from news items, each news item containing textual content and tagged codes, each code being descriptive for textual content of the news item, each news tern being times-tamped, wherein the com-

binations of codes are defined and evaluated by use of the language according to claim 1.

3. Method for evaluation of an expression written in the language according to claim 1, wherein evaluation is performed by set logic arithmetic according to the following rules:

AND meaning a set intersection; OR meaning a set union; NOT meaning a difference.

4. System for transforming a language expression of claim 1 and compiling the language expression into machine executable code before processing with the process of claim 3.

5. A computer system adapted for performing the method according to one of the preceding claims.

6. A computer-readable storage media comprising pro ¬ gram code for performing the method according to one of claim 1 to 3, when loaded into a computer system.

Description:

Specification

Production of Analytic Time Series Based on the Combi ¬ nation of Codes in Time-Stamped, Tagged News

BACKGROUND OF THE INVENTION

This application relates to the field of real-time news processing.

DESCRIPTION OF RELATED ART

A news item is content in the form of a headline or a headline plus a story, or a story body without a head- line optionally with additional data that describes, classifies, or is otherwise about the content of the news item. This additional information is often known as metadata.

Tagged news are news items that are annotated with codes that provide information about the news item. A code is a discreet and unique identifier which indi ¬ cates something that the publisher wants to convey about the story. The process of annotating a news item with codes is known as tagging. The code for a news item are part of its metadata.

A time series is a time-ordered set of data with one or more values for each defined point in time. A stock market price is an example of a time series.

A relational database is a well-defined system for storing and retrieving data, where different sets of data may be related by identifiers or keys.

The co-pending Patent Application "Process for Convert- ing Time-Stamped, Tagged News Items into Real-Time Ana ¬ lytic Time-Series", by the same applicant, is included herein by reference.

SUMMARY OF THE INVENTION

In general, in one aspect, this invention provides methods a language for expressing combinations of codes for being evaluated by an automatic processing system, wherein an expression completely and unambiguously de ¬ fines a time series, and wherein an expression allows combining codes using at least AND, OR, and NOT opera ¬ tors, wherein:

the AND operator performs counting of news per interval tagged with all of the specified codes or sub-expressions;

the OR operator performs counting of news per in- terval tagged with any of the specified codes or sub-expressions;

the NOT operator performing counting of news per interval not tagged with the specified code or sub-expression .

Furthermore, the invention provides a computer imple ¬ mented method for producing analytic time series based on arbitrary combinations of codes from news items, each news item containing textual content and tagged codes, each code being descriptive for textual content of the news item, each news tern being times-tamped, wherein the combinations of codes are defined and evaluated by use of the language according to the in ¬ vention .

Yet further, the invention provides a method for evaluation of an expression written in the language according to the invention, wherein evaluation is performed by set logic arithmetic according to the follow ¬ ing rules : AND meaning a set intersection; OR meaning a set union; NOT meaning a difference.

Yet further, the invention provides a system for trans- forming a language expression of the invention and compiling the language expression into machine executable code before processing with the above mentioned inven ¬ tive method.

Advantageous implementations can include one or more of the features according to the dependent claims.

In particular, the invention comprises also computer systems for performing the inventive methods.

Furthermore, the invention comprises computer-readable storage media comprising program code for performing the inventive methods, when loaded into a computer sys ¬ tem.

Thus, the present invention deals with the following problem: There is a huge volume of news (both histori ¬ cally, and also arriving in real-time) and there is a desire to identify a particular type of news. This is related to a type of search problem, but here, there is no necessary interest in reading the news items them- selves, but rather to see when they occur with the hopes of correlating that with some market activity.

The present invention provides a way of creating a time series based on news that is tagged with codes to do exactly this. Using the present invention, codes can be combined via a language, and expressions can be evaluated in this language to produce time series that show us when the news we are interested in occurs.

Thus, the technique described in "Process for Convert ¬ ing Time-Stamped, Tagged News Items into Real-Time Ana ¬ lytic Time-Series" is extended in a novel way so that one can produce analytic time series based combinations of multiple codes from news that is tagged and time- stamped.

In "Process for Converting Time-Stamped, Tagged News Items into Real-Time Analytic Time-Series", the set of time series which can be produced are strictly limited by the codes used to tag the news items.

So there might be a code for tagging news about Europe and there might be a code for tagging news about the US Dollar. But what if we are interested in news about how the US Dollar affects Europe, that is, we want news which is tagged with both codes.

The present invention describes both a language for specifying expressions which represent combinations of codes as well as a method of efficiently evaluating those expressions. The expressions can directly state the simple example above and naturally allow the ex ¬ pression of much more complex combinations as well.

Computation of the values of the series is well-defined and typically quite rapid.

BRIEF DESCRIPTION OF DRAWINGS

Fig. 1 illustrates set operations and expressions used in the present invention;

Fig. 2 illustrates the process for computing value of combination as performed by the present invention .

DETAILED DESCRIPTION

With reference to Fig. 1, the grammar of the inventive expression language for series based on code combina ¬ tions is described.

Just as English has a grammar that specifies the rules for how to put words together to form sentences, the combinations also have a grammar that describes pre ¬ cisely how these expressions may be formed and exactly what each expression means. In this language, the ex- pression for each combination represents the count of news items on a per interval basis that match the ex ¬ pression, according to how the news items are tagged.

According to the invention there are a minimum of three operators that this language defines. These operators are AND, OR, NOT.

AND: Count includes news tagged with ALL of the listed codes and those news items specified by all sub- expressions.

OR: Count includes news tagged with ANY of the listed codes or those news items specified by any sub ¬ expressions .

NOT: Count excludes news tagged with the code or those news items specified by the sub-expression.

The full grammar is formally described below using a BNF notation.

i . expr <== ( AND args ( OR args ( NOT arg )

ii. args <== arg | arg args

iii. arg <== code | expr

iv. code <== Any code with which the news might be tagged

In the grammar, <== roughly means is defined by, and (the vertical bar) means or, designating a choice of one of the options. Lowercase words, like expr and args are rules in the grammar, each of which will be described in detail below. Uppercase words and other symbols such as ( and ) are to be taken literally, meaning that those characters or words will be part of the expression itself. Two words next to each other are to be separated by a space character. Additional spaces are ignored. Unknown operators result in an er ¬ ror .

What follows is a detailed description of each line of the grammar.

i. In the grammar, an expression is denoted by expr.

This says that an expression is defined to be one of the following: ( followed by AND then a space then args and finally ) , or ( followed by OR then a space then args and finally ) , or ( followed by NOT then a space then an arg and finally ) .

ii. Arguments or parameters are denoted by args . This says that arguments are formed either from a single ar ¬ gument, or from a single argument followed by a space and then more arguments. This is a recursive defini- tion, since it refers to itself. It is a short way of saying that args designates one or more arguments, each of which is designated by arg.

iii. A single argument or parameter is denoted by arg. This rule says that an argument can be either a code or an expression, as we defined in rule (i) . If an argu ¬ ment is itself an expression, it is often referred to as a sub-expression, but it in fact follows exactly the same rules as an expression.

iv. A code literal is denoted by code. This rule is very simple. It just says that a code is represented literally by the code name, such as TAG/A, TAG/B, or TAG/C, meaning when you want to specify a code in an expression, you can just enter the code name directly.

The following points are worth noting. In the above AND, OR, and NOT are known as operators. Operators al ¬ ways come at the beginning of an expression or sub- expression, immediately following the ( symbol. Also note that all expressions and sub-expressions must be enclosed in matching parentheses. And finally, the case of AND, OR, and NOT is not important; any combina ¬ tion of upper- and lower-case characters are permitted.

This is known as a prefix grammar and it has the advan ¬ tage of making the groupings and precedence unambigu-

ous. In addition, it easily allows an operator to have not just two, but any number of arguments.

This grammar is similar to that of a Boolean logic grammar, and in fact, when considering a single news item, the expression can be considered as evaluating to TRUE if and only if the news item is tagged exactly ac ¬ cording to the expression.

Finally we observe that it is possible to consider ad ¬ ditional operators for this language. One such opera ¬ tor is XOR which could mean that a news items is tagged with exactly one of the codes supplied. That is, (XOR a b) will count a news item if the item was tagged with either "a" or "b", but not if it was tagged with both. The XOR operator can be expressed in terms of our ex ¬ isting operators and the same is true for any other op ¬ erator one might wish to add to the language.

(XOR a b) == (AND (OR a b) (NOT (AND a b) ) )

Second, the inventive process for evaluation of expres ¬ sions described above to produce new analytic time se- ries is described.

We observe that for any regular interval of time, the news which is tagged with any given code can be repre ¬ sented as a set of news identifiers. Here, "set" is meant in the mathematical sense, where elements have identity, each element can occur at most once, and the

ordering of elements is not important, such that {a b} and {b a} denote the same set.

This observation, which is a basic principle of the present invention, that the news tagged with any code can be treated as a set, is extremely powerful because it means that set arithmetic can now be applied to tagged news items to compute new analytic time series. Now we explore how this works in more detail.

Recall from co-pending application (5) that the time series for a news code is defined to be the count of news items per interval which are tagged with that code. And just above we have defined a set which we denote as set (code) . It should be clear that the size of set (code) must always be equal to this count. This provides a basis for what follows.

As outlined above, AND is defined to mean, in part, the count of news items per interval tagged with ALL the codes given as parameters. So given the example (AND a b) , this can be computed by applying the set intersec ¬ tion operation to set (a) and set (b) , and counting the elements in the resulting set.

In the following, it is explained why this must be true without supplying a rigorous proof. First, set opera ¬ tions are guaranteed to themselves produce a set. In particular, this means that the result will not contain duplicates. Now recall that set intersection is de ¬ fined to be the set of elements which occur in both in ¬ put sets, or in all input sets in the case of more than

two inputs. So, if we have an news item in set (a) and the same news item in set (b) , this means that the news item must be tagged with code "a" and also with code "b", according to the definitions of the sets. Fur- ther, we know that if a news item is tagged only with code "a" then that news item will not be part of the set intersection. And clearly, if a news item is tagged with neither "a" nor "b", then it will also not be part of the set intersection. Thus, the size of the set resulting from the intersection of set (a) and set (b) is exactly equal to the number of news items tagged with both code "a" and code "b" .

The OR operator can be implemented by applying the set union operator to the input sets. The union of two sets contains all the elements which are members of ei ¬ ther input set, and because the result is itself a set, if an element is a member of both input sets, it ap ¬ pears in the output set only once.

The equivalence of (OR a b) with the size of the union of set (a) and set (b) should be clear for the same rea ¬ sons that AND can be computed via set intersection.

By itself the NOT operator can be taken to mean the count of news items which are not tagged with a par ¬ ticular code. But it is actually more general than that. Its argument might itself be an expression and, thinking in terms of sets, this expression will evalu- ate to some set of news items. Applying NOT to this set means to remove all elements of this set from the

set of all news items which occurred during the inter ¬ val. This is a set difference operator.

To compute NOT in isolation requires we compute the full set of all news items in an interval, which we de ¬ note by ALL. So the following is the set-based method for computing (NOT a) :

(NOT a) ==> (set-difference ALL a)

But in combination with AND, we can compute NOT without the need for retrieving the full set of news items in the interval. In particular, one of the following transformations can be used:

(AND a (NOT b) ) ==> (set-difference a b)

Or more generally:

(AND a b (NOT c) d) ==> (set-difference (AND a b d) c)

Notice that after the more general transformation, the expression contains a sub-expression. In this case, it is an AND, so it is processed as described above in (2a) .

While it is possible to evaluate an expression by in ¬ terpreting it according the rules presented in (2), we can optionally transform some expressions into simpler ones and also compile any expression into a normal com ¬ piled function which one can then execute without the

overhead associated with interpreting the expression each time.

The following is a partial list of symbolic transforma- tions that are possible within the framework of this language. Of course, all of these transformations ex ¬ ist for the purpose of simplifying the expression or improving the efficiency of its execution. Other transformations which make things more verbose or lower are expressly avoided. In the following, NIL is the set of no news items in an interval, ALL is the set of every news item in an interval, and diff is short for set-difference .

a. (and ALL a b) == (and a b) b. (and ALL) == ALL c. (and a) == a d. (and NIL a b) == NIL e. (and a (and b c) d) == (and a b e d) f. (or NIL a b) == (or a b) g. (or a) == a h. (or ALL a b) == ALL i. (or a (or b c) d) == (or a b e d) j. (not NIL) == ALL k. (not ALL) == NIL

1. (not (diff a b) ) == (or (not a) b) m. (and a (diff b c) d) == (diff (and a b d) c)

Other simplifying transformations are certainly possi- ble as well. And note that transformations, whether listing above or not, can be applied recursively on the resulting expression after any given transformation.

Care must be taken to ensure there are no transforma ¬ tions which result in a new form which when transformed again results in the original form.

After the transformations we end up with a new, simpler expression that when evaluated returns the set results as the original expression.

The final, optional step is to compile this expression into a native code routine so that it can be executed rapidly. This is not difficult if we notice that our expression represented exactly the way a parse-tree of a compiler might represent it. One can implement a code generator to produce source code and invoke a com- piler on this, or if using a language such as Common

Lisp where the compiler is available at runtime, the task is completely natural.

The following is an example of a lisp function that takes as input an expression and returns as output a compiled function. The resulting compiled function takes a time as input and returns the count at the in ¬ terval defined by the specified time:

(defun make-combination-function (expression)

(compile nil " (lambda (time) (length , expression) ) ) )

Finally we note that it is an explicit property of this process that it can be applied to a streaming newsfeed to produce a time series in real time. So if there is a sudden burst of news about exactly some set of topics

but not about others, then this can be seen on a minute by minute basis as it occurs.

Fig. 2 illustrates an example of the process for com- puting the value of a combination of sets. Given are the codes A, B, C. Step 1 defines the combination of codes A, B, C to be evaluated. In the example, the news are searched in which code A or code B are com ¬ prised but not code C. Step 2 gives the formulation of the corresponding expression:

(and (or A B) (not C) )

In step 3, this expression is transformed to:

(diff (union AB) C) .

In step 4, the sets for each code are computed:

The set for code A may be { 1 2 3 4 5 6 } ; the set for code B may be { 4 5 6 7 8 9 } , and the set for code C may be {2 4 6 8} .

Thus, the evaluating the expression in step 5 gives

(A U B) - C = {1 3 5 7 9} .

Step 6 gives the value, i.e., the size of the combina ¬ tion :

{1 3 5 7 9} I = 5

The present techniques can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Apparatus of the invention can be implemented in a computer pro ¬ gram product tangibly embodied in a machine-readable storage device for execution by a programmable proces ¬ sor. Method steps according to the invention can be performed by a programmable processor executing a pro- gram of instructions to perform functions of the inven ¬ tion by operating on the basis of input data, and by generating output data. The invention may be imple ¬ mented in one or several computer programs that are ex ¬ ecutable in a programmable system, which includes at least one programmable processor coupled to receive data from, and transmit data to, a storage system, at least one input device, and at least one output device, respectively. Computer programs may be implemented in a high-level or object-oriented programming language, and/or in assembly or machine code. The language or code can be a compiled or interpreted language or code. Processors may include general and special purpose mi ¬ croprocessors. A processor receives instructions and data from memories, in particular from read-only memo- ries and/ or random access memories. A computer may in ¬ clude one or more mass storage devices for storing

data; such devices may include magnetic disks, such as internal hard disks and removable disks; magneto- optical disks; and optical disks. Storage devices suit ¬ able for tangibly embodying computer program instruc- tions and data include all forms of non-volatile mem ¬ ory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by or incorporated in ASICs (application-specific integrated circuits) .

The computer systems or distributed computer networks as mentioned above may be used, for example, for pro ¬ ducing goods, delivering parts for assembling products, controlling technical or economical processes, or im ¬ plementing telecommunication activities.

To provide for interaction with a user, the invention can be implemented on a computer system having a display device such as a monitor or LCD screen for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer system. The computer system can be programmed to provide a graphical or text user interface through which computer programs interact with users .

A computer may include a processor, memory coupled to the processor, a hard drive controller, a video con-

troller and an input/output controller coupled to the processor by a processor bus. The hard drive controller is coupled to a hard disk drive suitable for storing executable computer programs, including programs em- bodying the present technique. The I/O controller is coupled by means of an I/O bus to an I/O interface. The I/O interface receives and transmits in analogue or digital form over at least one communication link. Such a communication link may be a serial link, a parallel link, local area network, or wireless link (e.g. an RF communication link) . A display is coupled to an interface, which is coupled to an I/O bus. A keyboard and pointing device are also coupled to the I/O bus. Alter ¬ natively, separate buses may be used for the keyboard pointing device and I/O interface.