

Title:
COHORT DATA ANALYSIS METHODS AND SYSTEMS AND DATA STRUCTURES FOR PERFORMING COHORT DATA ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2018/048350
Kind Code:
A1
Abstract:
Methods, systems and data structures for performing cohort data analysis are disclosed. User activity data is stored as an activity table that comprises a plurality of tuples. Each of the tuples has a plurality of attributes comprising a user identifier, a time stamp and an activity identifier. The activity table is sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier. A cohort query which specifies a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts can be evaluated by iterating over the activity table.

Inventors:
JIANG DAWEI (CN)
CAI QINGCHAO (SG)
OOI BENG CHIN (SG)
TAN KIAN LEE (SG)
TUNG KUM HOE ANTHONY (SG)
Application Number:
PCT/SG2017/050443
Publication Date:
March 15, 2018
Filing Date:
September 06, 2017
Assignee:
NAT UNIV SINGAPORE (SG)
International Classes:
G06F17/30
Domestic Patent References:
WO2015178697A12015-11-26
Foreign References:
US20100332459A12010-12-30
US20120005227A12012-01-05
US20090319549A12009-12-24
Other References:
JIANG, D. ET AL.: "Cohort Query Processing", Proceedings of the VLDB Endowment, vol. 10, no. 1, 1 September 2016 (2016-09-01), pages 1-12, XP058308893, [retrieved on 2017-09-29]
BARBOSA, S. ET AL.: "Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior", Proceedings of the 25th International Conference on World Wide Web, 15 April 2016 (2016-04-15), pages 829-841, XP058080297, [retrieved on 2017-09-29]
Attorney, Agent or Firm:
LINDSAY, Jonas Daniel (SG)
Claims:
CLAIMS

1. A method for analyzing user activity data, the user activity data comprising indications of activities for a plurality of users, the method comprising

storing the user activity data as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier; and

evaluating a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts.

2. A method according to claim 1, wherein the cohort query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort.

3. A method according to claim 1 or 2 wherein the cohort query comprises an age selection operator which retrieves tuples which satisfy a specified condition.

4. A method according to claim 1, wherein the cohort query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort and an age selection operator which retrieves tuples which satisfy a specified condition and wherein evaluating the cohort query comprises applying the birth selection operator to the activity table and then applying the age selection operator.

5. A method according to any preceding claim wherein the cohort query comprises an aggregation operator which assigns each user to a cohort and aggregates tuples for each cohort.

6. A method according to any preceding claim, further comprising dividing the activity table into a plurality of data chunks such that all tuples for any one user are included in one data chunk; and compressing the data chunks.

7. A method according to claim 6, wherein compressing the data chunks comprises applying run-length encoding to a column of the data chunks comprising the user identifier.

8. A method according to claim 7, wherein iterating over the activity table comprises using run length encoded data of user identifiers to advance through the activity table.

9. A method according to any one of claims 6 to 8, wherein compressing the data chunks comprises applying delta encoding to a column of the data chunks comprising the time stamp and / or a column of the data chunks comprising an integer attribute.

10. A method according to any one of claims 6 to 9, wherein compressing the data chunks comprises: for an attribute of the plurality of attributes, constructing a global dictionary which associates each unique value of the attribute with a global identifier; for each data chunk constructing a chunk dictionary which maps a chunk specific identifier for each unique value of the attribute occurring in the chunk to the global identifier corresponding to that unique value for the attribute; and replacing the attribute values for that attribute with respective chunk specific identifiers.

11. A method according to claim 10 wherein the attribute of the plurality of attributes encoded with the global dictionary is the activity identifier.

12. A computer readable carrier medium carrying processor executable instructions which when executed cause a processor to carry out a method according to any one of claims 1 to 11.

13. A data structure for storing user activity data comprising indications of activities for a plurality of users, the data structure comprising an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier.

14. A data structure according to claim 13, the activity table comprising a plurality of data chunks wherein all tuples for any one user are included in one data chunk.

15. A data structure according to claim 14 wherein a column of the data chunks comprising the user identifier is encoded according to a run-length encoding scheme.

16. A data structure according to any one of claims 14 to 15 wherein a column of the data chunks comprising the time stamp and / or a column of the data chunks comprising an integer attribute is encoded according to a delta encoding scheme.

17. A data structure according to any one of claims 14 to 16 wherein a column of the data chunks comprising an attribute is encoded with a global dictionary which associates each unique value of the attribute with a global identifier and a chunk dictionary which maps a chunk specific identifier for each unique value of the attribute occurring in the chunk to the global identifier corresponding to that unique value for the attribute.

18. A data structure according to claim 17, wherein the attribute of the plurality of attributes encoded with the global dictionary is the activity identifier.

19. A data processing system comprising a storage management module and a query processing module,

the storage management module being operable to store user activity data comprising indications of activities for a plurality of users as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier; the query processing module being operable to evaluate a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts.

20. A data processing system according to claim 19, wherein the cohort query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort.

21. A data processing system according to claim 19 or 20 wherein the cohort query comprises an age selection operator which retrieves tuples which satisfy a specified condition.

22. A data processing system according to claim 19, wherein the cohort query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort and an age selection operator which retrieves tuples which satisfy a specified condition and wherein evaluating the cohort query comprises applying the birth selection operator to the activity table and then applying the age selection operator.

23. A data processing system according to any one of claims 19 to 22, wherein the cohort query comprises an aggregation operator which assigns each user to a cohort and aggregates tuples for each cohort.

24. A data processing system according to any one of claims 19 to 23 wherein the storage management module is further operable to divide the activity table into a plurality of data chunks such that all tuples for any one user are included in one data chunk; and to compress the data chunks.

25. A data processing system according to claim 24 wherein the storage management module is operable to compress the data chunks by applying run-length encoding to a column of the data chunks comprising the user identifier.

26. A data processing system according to claim 25, wherein the query processing module is operable to iterate over the activity table using run length encoded data of user identifiers to advance through the activity table.

27. A data processing system according to any one of claims 24 to 26, wherein the storage management module is further operable to compress the data chunks by applying delta encoding to a column of the data chunks comprising the time stamp and / or a column of the data chunks comprising an integer attribute.

28. A data processing system according to any one of claims 24 to 27, wherein the storage management module is further operable to construct a global dictionary which associates each unique value of the attribute with a global identifier for an attribute of the plurality of attributes, to construct a chunk dictionary which maps a chunk specific identifier for each unique value of the attribute occurring in the chunk to the global identifier corresponding to that unique value for the attribute; and to replace the attribute values for that attribute with respective chunk specific identifiers.

29. A data processing system according to claim 28 wherein the attribute of the plurality of attributes encoded with the global dictionary is the activity identifier.

30. A data processing method of evaluating a query on a data structure, the data structure comprising an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier, the query comprising a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort and an age selection operator which retrieves tuples which satisfy a specified condition, the method comprising:

applying the birth selection operator to the activity table and then applying the age selection operator to the activity table.

31. A method according to claim 30, wherein the activity table is divided into a plurality of data chunks such that all tuples for any one user are included in one data chunk, the data chunks being run-length encoded, the method further comprising: iterating over the activity table using run length encoded data of user identifiers to advance through the activity table.

32. A computer readable carrier medium carrying processor executable instructions which when executed cause a processor to carry out a method according to any one of claims 30 to 31.

Description:
COHORT DATA ANALYSIS METHODS AND SYSTEMS AND DATA STRUCTURES FOR PERFORMING COHORT DATA ANALYSIS

FIELD OF THE INVENTION

The present disclosure relates to data analysis and more specifically to the analysis of user activity data using cohort analysis.

BACKGROUND OF THE INVENTION

Cohort analysis, which originates from social science, is a powerful data exploration technique for finding unusual user behavior trends in large activity datasets using the concept of cohorts. Cohort analysis can be beneficial to modern Internet applications, such as online commercial websites and mobile game applications, which normally generate a large volume of user behavioral data. With cohort analysis, such applications can efficiently find abnormal user behavioral patterns and inspect the associated factors to enhance their business. Internet applications often accumulate a huge amount of activity data representing information associated with user actions. Such activity data is often tabulated to provide insights into the behavior of users in order to increase sales and ensure user retention.

In social science, cohort analysis is used as an analytical technique for assessing the effects of aging on human behavior in a changing society. In particular, it allows us to tease apart the effect of aging from the effect of social change, and hence can offer insights that are more valuable. With cohort analytics, social scientists study human behavioral trends in three steps: 1) group users into cohorts; 2) determine the age of user activities; and 3) compute aggregations for each (cohort, age) bucket. The first step employs the so-called cohort operation to capture the effect of social differences. Social scientists choose a particular action e (called the birth action) and divide users into different groups (called cohorts) based on the first time (called the birth time) that users performed e. Each cohort is then represented by the time bin (e.g., day, week or month) associated with the birth time. Each activity tuple of a user is then assigned to the same cohort that this user belongs to. In the second step, social scientists capture the effect of aging by partitioning the activity tuples in each cohort into smaller sub-partitions based on age. The age of an activity tuple t is the duration between the birth time of that user and the time at which the user performed t. Finally, an aggregated behavioral measure is reported for each (cohort, age) bucket. The classic cohort analysis presented so far is extremely useful for user retention analysis. By comparing the cardinality and user behavior between different cohorts, one can infer the possible factors that affect user behavior by first finding certain time epochs from which users behaved differently than they used to, and then exploring the events that happened at the same time. For example, Internet startups can use this method to evaluate the impact of new functionalities or versions of their products in terms of user acquisition and retention; an online shopping website can investigate whether a new product recommendation algorithm increases sales. However, there are two limitations in the standard social science style cohort analysis.

First, social scientists typically analyze a whole dataset. This is because the datasets used are usually small and are specifically collected for a certain cohort analysis task. As such, there is no mechanism for extracting a portion of users or activity tuples for cohort analysis. While this task seems trivial, it has to be handled with care as a naive selection may lead to incorrect answers.

Second, social scientists use only the time attribute to identify cohorts. This is because time is considered to be the key attribute that determines social change. However, it would also be desirable to provide support for a more general cohort analysis task where cohorts can be defined with respect to other attributes of interest. Although this seems only a minor extension, it can significantly widen the application spectrum of cohort analytics. Below are several interesting problems, drawn from traditional retention analysis, health care and financial investment, that cannot be handled by classic cohort analytics but can be mapped into an extended cohort analytics task.

Example 1. Understanding the retention of users of different attributes (e.g., age, salary and location) can be of interest to a company's business. By grouping users into cohorts based on the attributes of interest and studying the user retention of different cohorts, one can understand where to put effort to improve user retention and develop appropriate business plans.

Example 2. A doctor might be interested in finding whether there exists a correlation between patient readmission and the physical status of patients when they are initially admitted. This can be achieved by grouping patients into cohorts based on their physical status, and then comparing between cohorts the number of readmissions in different time periods to see how this number is related to the initial physical status of the patients in each cohort.

Example 3. Venture capital firms are eager to find what kinds of startups have the potential to provide a positive investment return. To this end, they can group startups into cohorts based on many aspects, such as products, user acquisition, net revenue, user retention and company structure, at the time when the startups received investment, and then identify the attribute values of the cohorts containing more companies that ultimately survived or survived for a long time.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention a data analysis method for analyzing user activity data is provided. The user activity data comprises indications of activities for a plurality of users. The method comprises storing the user activity data as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier; and evaluating a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts. Storing the user activity data in an activity table sorted in this way allows efficient processing of cohort queries.

The cohort queries may comprise birth selection operators and / or age selection operators. When a cohort query comprises both birth selection operators and age selection operators, since the operators are commutative, the birth selection operator can be moved in the query plan to optimize the evaluation of the cohort query.

The cohort query may also comprise an aggregation operator which assigns each user to a cohort and aggregates tuples for each cohort.

The storage of the activity table may comprise dividing the activity table into a plurality of data chunks such that all tuples for any one user are included in one data chunk. The chunks may be compressed by various different compression schemes. Run-length encoding may be applied to a column of the data chunks comprising the user identifier. This run length encoded data may be used when advancing through the activity table.

A column of the data chunks comprising the time stamp and / or a column of the data chunks comprising an integer attribute may be compressed using delta encoding.

A dictionary encoding scheme may be applied to the data chunks as follows. For an attribute of the plurality of attributes, a global dictionary is constructed which associates each unique value of the attribute with a global identifier; for each data chunk, a chunk dictionary is constructed which maps a chunk specific identifier for each unique value of the attribute occurring in the chunk to the global identifier corresponding to that unique value for the attribute; and the attribute values for that attribute are replaced with respective chunk specific identifiers. Such encoding may be applied to the activity identifier.

According to a second aspect of the present invention there is provided a data structure for storing user activity data comprising indications of activities for a plurality of users. The data structure comprises an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier.

The activity table may comprise a plurality of data chunks wherein all tuples for any one user are included in one data chunk. Columns of the chunks may be encoded by run-length encoding, delta encoding and / or by a global dictionary in combination with a chunk specific dictionary.

According to a third aspect of the present invention there is provided a data processing system comprising a storage management module and a query processing module. The storage management module is operable to store user activity data comprising indications of activities for a plurality of users as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier. The query processing module is operable to evaluate a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts.

According to a fourth aspect of the present invention, there is provided a data processing method of evaluating a query on a data structure. The data structure comprises an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier. The activity table is sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier. The query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort and an age selection operator which retrieves tuples which satisfy a specified condition. The method comprises: applying the birth selection operator to the activity table and then applying the age selection operator to the activity table.

According to a yet further aspect, there is provided a non-transitory computer-readable medium. The computer-readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present invention will be described as non-limiting examples with reference to the accompanying drawings in which:

Figure 1 shows the corresponding SQL query Qs for an example cohort analytics task;

Figure 2 shows the correspondence between the three proposed cohort operators and the equivalent SQL statements;

Figure 3 is a block diagram showing a technical architecture 100 of a data processing system according to an embodiment of the present invention;

Figure 4 is a flowchart showing a method of analysing user data according to an embodiment of the present invention;

Figure 5 shows a query plan for the cohort query used in an embodiment of the present invention;

Figures 6a to 6d show the query performance for queries Q1 to Q4 respectively with different chunk sizes using an embodiment of the present invention;

Figure 7 shows the storage size requirements for different chunk sizes using an embodiment of the present invention;

Figure 8 is a graph of processing times of queries in an embodiment of the present invention showing the effect of birth selection;

Figure 9 is a graph of processing times of queries in an embodiment of the present invention showing the effect of age selection;

Figure 10 shows a comparison of the time for generating a materialized view for COHANA (which is an embodiment of the present invention) with the alternative methods MonetDB and Postgres (PG); and

Figures 11a to 11d are graphs showing a comparison of the performance of different evaluation schemes.

DETAILED DESCRIPTION

1. Cohort Analytics

Table 1 below provides an example which is used in the following description. In the example of a mobile game, with samples of player activities listed in Table 1, players tend to buy more weapons in their initial game sessions than they do in later game sessions - this is the effect of aging. On the other hand, social changes may also affect the players' shopping behavior, e.g., with new weapons being introduced by iterative game development, players may start to spend again in order to acquire these weapons.

Table 1 shows some samples of a real dataset containing user activities collected from a mobile game. Each tuple in this table represents a user action and its associated information. For example, tuple t1 represents that player 001 launched the game on 2013/05/19 in Australia in a dwarf role.

Table 1: Player activity samples of a mobile game

With cohort analytics, we can study the trend of human behavior in three steps. First, users are divided into different cohorts (each cohort is therefore a group of users) based on the time at which they first perform an action. This step is of vital importance in cohort analysis since its purpose is to isolate the effect of social changes. Social scientists think that people who are born in the same period will exhibit similar behavioral patterns. In our mobile game example, suppose we cohort players based on the week when they first launched the game. Then, player 001 is assigned to the 2013-05-19 launch cohort. In the second step, the activity tuples of interest are also partitioned accordingly so that the tuples of a user are assigned to the same cohort as the user. Suppose we are interested in players' shop actions. Then, tuples t2-t4 are assigned to the 2013-05-19 launch cohort to which player 001 belongs. Finally, in the third step, to capture the aging effect, the tuples of each cohort are further split into smaller sub-partitions based on age (time). The desired aggregate is then applied on each such sub-partition. In our game example, we organize the shopping tuples into partitions where each partition corresponds to a week's duration (i.e., age). In other words, for each shopping tuple in a cohort, we calculate the number of weeks (i.e., the age) that have passed since the time the player first launched the game, and assign the tuple to the respective partition. Finally, we report the average gold each cohort spent at each age.
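These three steps can be illustrated with a short Python sketch. This is an illustration only; the sample tuples, the day granularity of the timestamps and the week binning are assumptions made for the example, not part of the disclosure.

from collections import defaultdict

# (user, day, action, gold) -- toy activity tuples, assumed data
rows = [
    ("001", 0, "launch", 0), ("001", 1, "shop", 50), ("001", 9, "shop", 30),
    ("002", 7, "launch", 0), ("002", 8, "shop", 80),
]

# Step 1: a user's birth time is the first time it performed the birth action.
birth = {}
for u, t, a, g in sorted(rows, key=lambda r: (r[0], r[1])):
    if a == "launch" and u not in birth:
        birth[u] = t

# Steps 2 and 3: assign shop tuples to (cohort, age) buckets and aggregate.
buckets = defaultdict(list)
for u, t, a, g in rows:
    if a == "shop" and u in birth:
        cohort = birth[u] // 7        # cohort = week of birth
        age = (t - birth[u]) // 7     # age in weeks
        buckets[(cohort, age)].append(g)

for (cohort, age), golds in sorted(buckets.items()):
    print(f"week-{cohort} cohort, age {age}: avg gold {sum(golds) / len(golds):.1f}")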

An obvious solution to obtain insights from such activity data is to apply traditional SQL GROUP BY operators. For example, if we want to look at the players' shopping trend in terms of the gold (the virtual currency) they spent, we may run the following SQL query Qs.

SELECT week, Avg(gold) as avgSpent
FROM GameActions
WHERE action = "shop"
GROUP BY Week(time) as week

Executing this query against a sample dataset (of which Table 1 shows some records) results in Table 2, where each tuple represents the average gold that users spent each week. The results seem to suggest that there was a slight drop in shopping, and then a partial recovery. However, it is hard to draw meaningful insights.

Table 2: Results of Qs

Table 3 presents results of the cohort analysis of the shopping trend in our game example.

Table 3: Cohort report for shopping trend

Each row of this table represents the shopping trend of a cohort, and each column captures the aging effect on the average expenditure of that cohort since its "birth" with respect to the time the game was launched for the first time. By looking at each row horizontally, we can see the aging effect, i.e., players spent more gold on buying weapons in their initial game sessions than in their later game sessions. On the other hand, by comparing different rows (i.e., reading rows vertically), we can observe that the drop-off trend improves from cohort to cohort. This suggests that the iterative game development indeed improves the players' gaming experience, as they seemed to be spending more gold on buying weapons to win battles, an insight which cannot be drawn from the OLAP-style results in Table 2. Moreover, the analysis result can be used as a training dataset for other analytical techniques. For example, we can regard the age as an independent variable and the average gold players spent as a dependent variable. Thus, for each cohort, we can train a regression model to predict the average amount of gold that cohort will spend in subsequent weeks. Such information can be further utilized by a recommendation engine for recommending weapons of similar price to that cohort. The ability to pipeline analytical results to downstream analytical algorithms makes cohort analysis an interesting and powerful tool for transforming the raw activity data of Table 1 into actionable insights such as trend analysis and/or recommendations.

While the classic cohort analysis presented so far has been very useful in many analytic applications such as retention analysis, we hope to further widen its application. One limitation of classical cohort analysis is that time is the only dimension used for identifying cohorts. This is natural in social science research, where birth time is considered to be the key attribute that determines people's behavior. However, in many other applications, there are other factors that determine the similarity of people's behavior.

Referring to our running example (from Table 1), suppose we choose launch as the birth action, and we are interested in performing a cohort analysis on those tuples where time > 2013/05/22:0000. Now, the resultant subset of tuples is {t5, t8}. However, we can no longer perform any cohort analysis, as the birth activity tuple t1, i.e., the activity tuple representing the first launch action of player 001, has been removed.

2. A Non-intrusive Approach to Cohort Analytics

The least intrusive approach to supporting cohort analytics is to use an existing relational DBMS and express the cohort analysis task as a SQL query. We illustrate such an approach using the following cohort analysis task:

Example 4. Given the launch birth action and the activity table as shown in Table 1 (denoted as D), for players who play the dwarf role at their birth time, cohort those players based on their birth countries and report the total gold that country launch cohorts spent since they were born.

Figure 1 shows the corresponding SQL query Qs for this task. To save space, we use the abbreviations p, a, t and c to denote the player, action, time and country attributes in Table 1, respectively. Qs employs four sub-queries (i.e., Figure 1(a) - Figure 1(d)) and one outer query (i.e., Figure 1(e)) to produce the results. Overall, this SQL approach performs poorly for three reasons:

• The SQL statement Qs is verbose, and its complexity renders it prone to mistakes.

• The SQL statement Qs requires many joins to perform the analysis task. As we shall see in our experimental study, such a query processing scheme can be up to 5 orders of magnitude slower than our proposed solution.

• It requires manual tuning. For example, one may notice that we can push the selection condition (i.e., birthRole = "dwarf") from the outer query (Figure 1(e)) to the inner sub-query (Figure 1(c)) to reduce the size of intermediate tables. Ideally, such an optimization can be performed by an intelligent optimizer. However, our evaluation shows that few database systems can perform such an optimization.

To speed up the processing of the analysis task, we can adopt a materialized view (MV) approach that stores some intermediate results. For example, we can materialize the intermediate table cohortT produced by the sub-query in Qs (Figure 1(c)) as follows.

CREATE MATERIALIZED VIEW cohorts AS cohortT

With cohorts, we can express the query Qs in a simpler SQL statement consisting of a single sub-query (Figure 1(d)) and an outer query (Figure 1(e)). The performance of the resulting SQL expression is also improved since it only involves a single join. However, the materialized view approach also suffers from a number of issues.

• The cost of generating the MV is still high since it involves two joins (Figure 1(b) and 1(c)).

• The storage space for the MV is huge if the approach is used as a general cohort query processing strategy. Figure 1(c) only includes a single calculated birth attribute birthRole as it is the only attribute appearing in the birth selection condition (i.e., the condition of playing as the dwarf role at birth time) of the analysis task. However, if other calculated birth attributes are also involved in the birth selection condition, we need to include those attributes in the MV as well. In the extreme case, every possible birth attribute shall be included in the MV, doubling the storage space as compared to the original activity table.

• The MV only answers cohort queries introduced by the launch birth action. If another birth action (e.g., shop) is used, one more MV is required. Obviously, this per-birth-action-per-MV approach does not scale even for a small number of birth actions due to the cost of MV construction and maintenance.

• The query performance is still not optimal. By the definition of the analysis task, if a player did not play the dwarf role when that player was born, we should exclude all activity tuples of that player from the result set. Ideally, if the birth activity tuple indicates that the player is not qualified, we can safely skip all activity tuples of that player without further checking. However, as shown in Figure 1(e), the MV approach needs to, unnecessarily, check each activity tuple of a player to perform the filtering operation (i.e., comparing the value in the birthRole attribute against dwarf). Building an index on the birthRole attribute cannot improve the situation much since index lookup will introduce too many random seeks on large activity tables.

3. Cohort Analysis Foundations

In the present disclosure, we seek to extend an existing relational database system to support cohort analytics. This section presents the data model, which includes a central new concept of an activity table, and the proposed new cohort operators.

We use the term cohort to refer to a number of individuals who have some common characteristic in performing a particular action for the first time; we use this particular action and the attribute values of the common characteristics to identify the resulting cohort. For example, a group of users who first login (the particular action) in January 2015 (the common characteristic) is called the January 2015 login cohort. Similarly, customers who make their first purchase in the USA form a USA purchase cohort. Broadly speaking, cohort analysis is a data exploration technique that examines longitudinal behavioral trends of different cohorts since they were born.

3.1. Data Model

We represent a collection of activity data as an instance of an activity relation, a special relation where each tuple represents the information associated with a single user activity. We will also call an activity relation an activity table. In this disclosure, the two terms, i.e., activity relation and activity table are used interchangeably.

An activity table D is a relation with attributes A_u, A_t, A_e, A_1, ..., A_n where n ≥ 1. A_u is a string uniquely identifying a user; A_e is also a string, representing an action chosen from a pre-defined collection of actions, and A_t records the time at which A_u performed A_e. Every other attribute in D is a standard relational attribute. Furthermore, an activity table has a primary key constraint on (A_u, A_t, A_e). That is, each user i can only perform a specific action e once at each time instant. As exemplified in Table 1, the first three columns correspond to the user (A_u), timestamp (A_t) and action (A_e) attributes, respectively. Role and Country are dimension attributes, which respectively specify the role and the country of player A_u when performing A_e at A_t. Following the two dimension attributes is gold, a measure attribute representing the virtual currency that player A_u spent on this action. We shall continue to use Table 1 as our running example for describing each concept in cohort analysis.
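By way of illustration, an activity table can be modeled as tuples kept in the primary-key sort order described above. The following minimal Python sketch is an illustration only; the Activity type and the sample values are assumptions, and the later sketches in this section reuse them.

from typing import NamedTuple

class Activity(NamedTuple):
    user: str     # A_u, uniquely identifies a user
    time: int     # A_t, e.g. minutes since some epoch
    action: str   # A_e, drawn from a pre-defined set of actions
    role: str     # dimension attribute
    country: str  # dimension attribute
    gold: int     # measure attribute

table = [
    Activity("001", 1000, "launch", "dwarf", "Australia", 0),
    Activity("001", 1086, "shop", "dwarf", "Australia", 50),
    Activity("002", 1500, "launch", "elf", "USA", 0),
    Activity("002", 1520, "shop", "elf", "USA", 80),
]
# The primary-key sort order (A_u, A_t, A_e) used throughout this disclosure.
table.sort(key=lambda d: (d.user, d.time, d.action))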

3.2. Basic Concepts of Cohort Analysis

We present three core concepts of cohort analysis: birth action, birth time and age. Given an action e, the birth time of user i is the first time that i performed e, or -1 if i never performed e, as shown in Definition 1. An action e is called a birth action if e is used to define the birth time of users.

Definition 1. Given an activity table D and a birth action e ∈ Dom(A_e), a time value t_i^e is called the birth time of user i if and only if

t_i^e = min(π_{A_t}(σ_{A_u = i ∧ A_e = e}(D)))

where π and σ are the standard projection and selection operators.

Definition 2. Given an activity table D and a birth action e, a tuple d ∈ D is called the birth activity tuple of user i, denoted d_i^e, if and only if d[A_u] = i, d[A_e] = e and d[A_t] = t_i^e.

Since (A_u, A_t, A_e) is the primary key of D, we conclude that for each user i, there is only one birth activity tuple of i in D for any birth action e that i performed.

Definition 3. Given the birth time t_i^e, a numerical value g is called the age of user i in tuple d ∈ D if and only if d[A_u] = i and g = d[A_t] - t_i^e.

The concept of age is designed for specifying the time point to aggregate the behavioral metric of a cohort. In cohort analysis, we calculate the metric only at positive ages, and an activity tuple with a positive age is called an age activity tuple. Furthermore, in practical applications, the age g is normalized by a certain time unit such as a day, week or month. Without loss of generality, we assume that the granularity of g is a day.

Consider the example activity relation in Table 1. Suppose we use the action launch as the birth action. Then, the activity tuple t1 is the birth activity tuple of player 001, and the birth time is 2013/05/19:1000. The activity tuple t2 is an age activity tuple of player 001 produced at age 1.
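In code, Definitions 1 to 3 amount to finding the first occurrence of the birth action for a user. A minimal Python sketch, continuing the Activity example above (illustrative only; -1 encodes "never performed e" as in Definition 1):

def birth_time(table, user, e):
    # Definition 1: the earliest A_t at which `user` performed e, or -1.
    times = [d.time for d in table if d.user == user and d.action == e]
    return min(times) if times else -1

def age(table, d, e):
    # Definition 3: time elapsed in tuple d since the birth time of d's user.
    return d.time - birth_time(table, d.user, e)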

3.3. Cohort Operators

We now present operations on a single activity table. In particular, we propose two new operators to retrieve a subset of activity tuples for cohort analysis. We also propose a cohort aggregation operator for aggregating activity tuples for each (cohort, age) combination. As we shall see, these three operators enable us to express a cohort analysis task in a very concise and elegant way that is easy to understand.

3.3.1. The Birth Selection Operator

The birth selection operator σ^b is used to retrieve the activity tuples of qualified users, i.e., users whose birth activity tuples satisfy a specific condition C.

Definition 4. Given an activity table D, the birth selection operator is defined as

σ^b_{C, e}(D) = {d ∈ D | C(d_i^e) holds, where i = d[A_u]}

where C is a propositional formula and e is a birth action.

Consider the activity relation D in Table 1. Suppose we want to derive an activity table from D which retains all activity tuples of users who were born from performing the launch action in Australia. This can be achieved with the expression σ^b_{country = "Australia", launch}(D).
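A sketch of the birth selection semantics in Python, reusing the helpers above (illustrative only; the engine described in Section 4 evaluates this operator quite differently, via a compressed sequential scan):

def birth_select(table, e, cond):
    # Keep all tuples of users whose birth activity tuple satisfies cond.
    qualified = set()
    for d in table:
        if d.action == e and d.time == birth_time(table, d.user, e) and cond(d):
            qualified.add(d.user)
    return [d for d in table if d.user in qualified]

# All activity tuples of users born performing launch in Australia:
result = birth_select(table, "launch", lambda d: d.country == "Australia")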

3.3.2. The Age Selection Operator

The age selection operator σ^g is used to generate an activity table from D which retains all birth activity tuples in D but only the subset of age activity tuples which satisfy a condition C.

Definition 5. Given an activity table D, the age selection operator is defined as

σ^g_{C, e}(D) = {d ∈ D | d = d_i^e or C(d) holds, where i = d[A_u]}

where C is a propositional formula and e is a birth action. For example, suppose shop is the birth action, and we want to derive an activity table which retains all birth activity tuples in Table 1 but only includes age activity tuples which indicate users performing in-game shopping in all countries but China. The expression σ^g_{A_e = "shop" ∧ country ≠ "China", shop}(D) can be used to obtain the desired activity table.

The result set of the above selection operation is {t2, t3, t4, t7, t8}, where t2 is the birth activity tuple of player 001, and t3 and t4 are the qualified age activity tuples of player 001. The activity tuples t7 and t8 are the birth activity tuple and the qualified age activity tuple of player 002. A common requirement in specifying the age selection operation is that we often want to reference the attribute values of birth activity tuples in C. For example, given the birth action shop, we may want to select age activity tuples whose users perform in-game shopping in the same location as their country of birth. We introduce a Birth() function for this purpose. Given an attribute A, for any activity tuple d, Birth(A) returns the value of attribute A in d[A_u]'s birth activity tuple:

Birth(A) = d_i^e[A]

where i = d[A_u] and e is the birth action.

In our running example, suppose shop is the birth action, and we want to obtain an activity table which retains all birth activity tuples but only includes age activity tuples which indicate that players performed shopping in the same role as the one in which they were born. The expression σ^g_{A_e = "shop" ∧ role = Birth(role), shop}(D) can be used to retrieve the desired results. The result set of this operation is {t2, t3, t7, t8}, where t2 and t7 are the birth activity tuples of player 001 and player 002, respectively, and t3 and t8 are the qualified age activity tuples.

3.3.3. The Cohort Aggregation Operator

We now present the cohort aggregation operator γ^c. This operator produces cohort aggregates in two steps: 1) cohort users and 2) aggregate activity tuples.

In the first step, given an activity table D with its attribute set A and a birth action e, we pick a cohort attribute set L ⊆ A and assign each user i to a cohort c specified by π_L(d_i^e). Essentially, we divide users into cohorts based on the projection of users' birth activity tuples onto a specified cohort attribute set.

In our running example, suppose launch is the birth action and the cohort attribute set is L = {country}; then player 001 in Table 1 is assigned to the Australia launch cohort, player 002 is assigned to the USA launch cohort and player 003 is assigned to the China launch cohort.

Definition 6. Given an activity table D, the cohort aggregation operator is defined as

γ^c_{L, f_A, e}(D) = R

where L is a cohort attribute set, e is a birth action and f_A is a standard aggregation function with respect to the attribute A. In the second step, for each possible combination of cohort and age, we select the corresponding age activity tuples of the users belonging to that cohort and apply the aggregation function to them.

In summary, the cohort aggregation operator takes an activity table D as input and produces a normal relational table R as output. Each row in the output table R consists of four parts (d_L, g, s, m), where d_L is the projection of the birth activity tuples onto the cohort attribute set L and identifies the cohort, g is the age, i.e., the time point at which we report the aggregates, s is the size of the cohort, i.e., the number of users in the cohort specified by d_L, and m is the aggregated measure produced by the aggregate function f_A. Note that we only apply f_A to age activity tuples with g > 0.
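The two aggregation steps can be sketched as follows (Python, continuing the example above; ages are left in raw time units here rather than normalized to days or weeks):

from collections import defaultdict

def cohort_aggregate(table, e, L, agg, measure):
    # Step 1: cohort each user by projecting its birth tuple onto L.
    births = {d.user: d for d in table
              if d.action == e and d.time == birth_time(table, d.user, e)}
    size = defaultdict(set)
    for u, b in births.items():
        size[tuple(getattr(b, a) for a in L)].add(u)
    # Step 2: aggregate the measure per (cohort, age) bucket, positive ages only.
    buckets = defaultdict(list)
    for d in table:
        b = births.get(d.user)
        if b is not None and d.time > b.time:
            c = tuple(getattr(b, a) for a in L)
            buckets[(c, d.time - b.time)].append(getattr(d, measure))
    # Rows of (d_L, g, s, m) as described above.
    return [(c, g, len(size[c]), agg(v)) for (c, g), v in sorted(buckets.items())]

report = cohort_aggregate(table, "launch", ["country"], sum, "gold")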

3.3.4. Properties of Cohort Operators

We note that the two selection operators σ^b and σ^g are commutative if they involve the same birth action:

σ^b_{C1, e}(σ^g_{C2, e}(D)) = σ^g_{C2, e}(σ^b_{C1, e}(D))    (1)

Based on this property, we can, as we shall see in Section 4, push the birth selection operator down the query plan to optimize cohort query evaluation.

3.4. The Cohort Query

Given an activity table D and the operators σ^b, σ^g and γ^c, a cohort query Q : D → R can be expressed as a composition of those operators that takes D as input and produces a relation R as output, with the constraint that the same birth action e is used for all cohort operators in Q. To specify a cohort query, we propose to use the following SQL-style SELECT statement.
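The statement itself is not reproduced legibly in this text. Based on the clauses discussed in the following paragraph (BIRTH FROM, AGE ACTIVITIES IN, and the AGE and COHORTSIZE keywords), a plausible general form is the reconstruction below; the COHORT BY clause name for the cohort attribute set is an assumption here, not a confirmed part of the proposed syntax:

SELECT attribute-list, AGE, COHORTSIZE, f(A)
FROM activity-table
BIRTH FROM e [AND birth-selection-condition]
[AGE ACTIVITIES IN age-selection-condition]
COHORT BY cohort-attribute-list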

In the above syntax, e is the birth action that is specified by the data analyst for the whole cohort query. The order of the BIRTH FROM and AGE ACTIVITIES IN clauses is irrelevant, and the birth selection and age selection clauses are optional. We also introduce two keywords, AGE and COHORTSIZE, for data analysts to retrieve the calculated columns produced by the cohort aggregation operator in the SELECT list. Note that, except for projection, we disallow other relational operators such as σ (i.e., SQL WHERE) and γ (i.e., SQL GROUP BY), and binary operators like intersection and join, in a basic cohort query.

With the newly developed cohort operators, the cohort analysis task presented in Example 4 can be expressed by the following query, referred to below as Q1:
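The query is not reproduced legibly in this text; in the reconstructed syntax sketched above, the task of Example 4 (birth action launch, birth selection on the dwarf role, cohorting by country, reporting total gold) might read as follows, with every keyword following the assumed form rather than a confirmed grammar:

SELECT country, AGE, COHORTSIZE, SUM(gold)
FROM GameActions
BIRTH FROM launch AND role = "dwarf"
COHORT BY country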

3.5. Extensions

Our cohort query proposal can be extended in many directions to enable even more complex and deep analysis. First, cohort queries can be mixed with SQL queries in a single query. For example, one may want to use a SQL query to retrieve specific cohort trends produced by a cohort query for further analysis. This mixed querying requirement can be met by applying the standard SQL WITH clause to encapsulate a cohort query as a sub-query that can be processed by an outer SQL query. The following example demonstrates how to use a mixed query to retrieve specific cohort spending trends reported by Q1 for further analysis.
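The example is not reproduced legibly in this text; a plausible form, wrapping the reconstructed Q1 above in a standard SQL WITH clause (the view name cohortSpent and the filter values are hypothetical), is:

WITH cohortSpent AS (
    SELECT country, AGE, COHORTSIZE, SUM(gold) AS spent
    FROM GameActions
    BIRTH FROM launch AND role = "dwarf"
    COHORT BY country
)
SELECT * FROM cohortSpent
WHERE country = "Australia" AND AGE < 10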

Another extension is to introduce binary cohort operators (e.g., join, intersection etc.) for analyzing multiple activity tables.

3.6. Mapping Cohort Operations to SQL Statements

Before leaving this section, we shall demonstrate that given a materialized view (MV) built for a specific birth action, the proposed cohort operators can be implemented by SQL sub-queries. This enables us to pose cohort queries composed of newly developed operators in the context of a non-intrusive mechanism.

As shown in Section 2, the MV approach stores each activity tuple of user i along with i's birth attributes. Thus, to implement the birth selection operator, one can use a SQL SELECT statement with a WHERE clause specifying the birth selection condition on the materialized birth attributes. Similarly, the age selection operator can be simulated by a SQL SELECT statement with a WHERE clause specifying the age selection condition along with an additional predicate to include birth activity tuples. The cohort aggregation operator can be implemented by applying a SQL GROUP BY aggregation operation on the joined results between the cohortSize table and the qualified age activity tuples.

As an example, Figure 2 demonstrates, for the query Q1 of Example 4, the correspondence between the three proposed cohort operators and the equivalent SQL statements posed on the MV built for the launch birth action. As in Figure 1, the player, action and time attributes are respectively abbreviated to p, a, and t. be, br, bt and age are four attributes additionally materialized along with the original activity table. The first three attributes, be, br and bt, respectively represent the birth attributes for country, role and time. It should be noted that the SQL statements of Figure 2 are separated out for ease of exposition: one can optimize them by combining Figure 2(a) and 2(b) into a single SQL statement, as we do in all experiments.

4. Implementation of a Cohort Query Engine

Figure 3 is a block diagram showing a technical architecture 100 of a data processing system according to an embodiment of the present invention. Typically, the methods of storing activity tables and evaluating cohort queries according to embodiments of the present invention are implemented on a computer or a number of computers each having a data-processing unit. The block diagram shown in Figure 3 illustrates a technical architecture 100 of a computer which is suitable for implementing one or more embodiments herein. The technical architecture 100 includes a processor 122 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 124 (such as disk drives), read only memory (ROM) 126 and random access memory (RAM) 128. The processor 122 may be implemented as one or more CPU chips. The technical architecture 100 may further comprise input/output (I/O) devices 130, and network connectivity devices 132. The technical architecture 100 further comprises activity table storage 140 which may be implemented as a hard disk drive or other type of storage device.

The secondary storage 124 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if RAM 128 is not large enough to hold all working data. Secondary storage 124 may be used to store programs which are loaded into RAM 128 when such programs are selected for execution. In this embodiment, the secondary storage 124 has a storage management module 124a, and a query processing module 124b comprising non-transitory instructions operative by the processor 122 to perform various operations of the method of the present disclosure. As depicted in Figure 3, the modules 124a-124b are distinct modules which perform respective functions implemented by the data processing system. It will be appreciated that the boundaries between these modules are exemplary only, and that alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into sub-modules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or sub-module. It will also be appreciated that, while a software implementation of the modules 124a-124b is described herein, these may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software. The ROM 126 is used to store instructions and perhaps data which are read during program execution. The secondary storage 124, the RAM 128, and/or the ROM 126 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media. During operation, the functionality of storage management module 124a controls the writing and modification of user activity data to the activity table storage 140.

The I/O devices may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.

The network connectivity devices 132 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 132 may enable the processor 122 to communicate with the Internet or one or more intranets. With such a network connection, it is contemplated that the processor 122 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 122, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.

The processor 122 executes instructions, codes, computer programs, scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 124), flash drive, ROM 126, RAM 128, or the network connectivity devices 132. While only one processor 122 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.

It is understood that by programming and/or loading executable instructions onto the technical architecture 100, at least one of the CPU 122, the RAM 128, and the ROM 126 are changed, transforming the technical architecture 100 in part into a specific purpose machine or apparatus having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules.

Although the technical architecture 100 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task. For example, but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application. Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the technical architecture 100 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 100. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.

Figure 4 is a flowchart showing a method 400 of analysing user data according to an embodiment of the present invention. In step 402 the storage management module 124a stores user data in the activity table storage 140 as an activity table. Step 402 may comprise sorting and compressing the user data as described in more detail below. The activity table is stored in a fine-tuned hierarchical storage format for persisting activity tables. In step 404, the query processing module 124b evaluates a cohort analysis query on the activity table stored in the activity table storage 140. Step 404 may comprise scanning the activity table with a modified scan operator, implementing cohort operations and optimizing queries as described in more detail below. The modified table scan operator is capable of skipping the age activity tuples of unqualified users. The method may involve a native, efficient implementation of the cohort operators. The queries may be optimized by a query planner capable of utilizing the cohort operator property (i.e., Equation (1)) for optimization. We have implemented the proposed techniques in a columnar query engine, COHANA, for a performance study.

4.1. The Activity Table Storage Format

We store an activity table D in the sorted order of its primary key (A_u, A_t, A_e). This storage layout has two desirable properties: 1) activity tuples of the same user are clustered together; we refer to this as the clustering property; 2) the activity tuples of each user are stored in chronological order; this is called the time ordering property.

With these two properties, we can efficiently find the birth activity tuple of any user for any birth action in a single sequential scan. Suppose the activity tuples of user i are stored between d_j and d_k. To find the birth activity tuple of i for any birth action e, we just iterate over each tuple between d_j and d_k and return the first tuple d_b satisfying d_b[A_e] = e. We employ a chunking scheme and various compression techniques to speed up cohort query processing. We first horizontally partition the activity table into multiple data chunks such that the activity tuples of each user are included in exactly one chunk. Then, in each chunk, the activity tuples are stored column by column. For each column in a data chunk, we choose an appropriate compression scheme for storing values based on the column type.

For the user column Au, we choose the Run-Length Encoding (RLE) scheme. The values in Au are stored as a sequence of triples (u, f, n), where u is a user in Au, f is the position of the first appearance of u in the column, and n is the number of appearances of u in the column. As we shall see in Section 4.3, a modified table scan operator can directly process these triples and efficiently skip to the activity tuples of the next user if the birth activity tuple of the current user is not qualified with respect to the birth selection condition.
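A minimal sketch of this encoding, under the assumption that the chunk is sorted by user (the function name is ours):

```python
# Illustrative sketch: Run-Length-Encode the user column Au of one chunk
# as (u, f, n) triples, where f is the position of u's first tuple and
# n is the number of u's tuples. Assumes the column is sorted by user.

def rle_encode_user_column(user_column):
    triples = []
    for pos, u in enumerate(user_column):
        if triples and triples[-1][0] == u:
            u0, f, n = triples[-1]
            triples[-1] = (u0, f, n + 1)
        else:
            triples.append((u, pos, 1))
    return triples

# Example: ['a', 'a', 'a', 'b', 'b'] -> [('a', 0, 3), ('b', 3, 2)]
```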

For the action column Ae and other string columns, we employ the two-level compression scheme presented in A. Hall, O. Bachmann, R. Bussow, S. Ganceanu, and M. Nunkesser. Processing a trillion cells per mouse click. PVLDB, 5(11):1436-1446, 2012, for storing the values. For each such column A, we first build and persist a global dictionary which consists of the sorted unique values of A. Each unique value of A is then assigned a global-id, which is the position of that value in the global dictionary. For each data chunk, the sorted global-ids of the unique values of A in that chunk form a chunk dictionary. Given the chunk dictionary, each value of A in that chunk can be represented as a chunk-id, which is the position of the global-id of that value in the chunk dictionary. The chunk-ids are then persisted immediately after the chunk dictionary in the same order as the respective values appear in A. This two-level encoding scheme enables efficient pruning of chunks in which no users perform the birth action. For a given birth action e, we first perform a binary search on the global dictionary to find its global-id ge. Then, for each data chunk, we perform a binary search for ge in the chunk dictionary. If ge is not found, we can safely skip the current data chunk, since no users in that data chunk perform e.
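The scheme and the pruning test can be sketched as follows (a simplified in-memory illustration; the function names are ours, and the persisted layout is not modelled):

```python
import bisect

# Illustrative sketch of the two-level scheme for a string column:
# a global dictionary of sorted unique values, per-chunk dictionaries of
# sorted global-ids, and chunk-ids for the stored values.

def build_global_dictionary(all_values):
    return sorted(set(all_values))      # global-id of a value = its position

def encode_chunk(chunk_values, global_dict):
    gids = [bisect.bisect_left(global_dict, v) for v in chunk_values]
    chunk_dict = sorted(set(gids))      # chunk dictionary: sorted global-ids
    chunk_ids = [bisect.bisect_left(chunk_dict, g) for g in gids]
    return chunk_dict, chunk_ids

def chunk_may_contain(chunk_dict, global_dict, birth_action):
    # Prune a chunk if no tuple in it carries the birth action.
    g = bisect.bisect_left(global_dict, birth_action)
    if g == len(global_dict) or global_dict[g] != birth_action:
        return False                    # action unknown globally
    i = bisect.bisect_left(chunk_dict, g)
    return i < len(chunk_dict) and chunk_dict[i] == g
```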

For At and other integer columns, we employ a two-level delta encoding scheme which is similar to the one designed for string columns. For each column A of this type, we first store the MIN and MAX values of A for the whole activity table as the global range. Then, for each data chunk, the MIN and MAX values are extracted as the chunk range from the segment of A in that chunk and persisted as well. Each value of the column segment is then stored as the delta (difference) between it and the chunk MIN value. Similar to the encoding scheme for string columns, this two-level delta encoding scheme also enables the efficient pruning of chunks in which no activity tuples fall in the range specified in the birth selection or age selection operation. With the above two encoding schemes, the final representations of string columns and integer columns are arrays of integers within a small range. We therefore further employ integer compression techniques to reduce the storage space. For each integer array, we compute the minimum number of bits, denoted by n, needed to represent the maximum value in the array, and then sequentially pack as many values as possible into a computer word such that each value occupies exactly n bits. Finally, we persist the resulting computer words to the storage device. This fixed-width encoding scheme is by no means the most space-saving scheme. However, it enables the compressed values to be read randomly without decompression: for each position in the original integer array, one can easily locate the corresponding bits in the compressed computer words and extract the value from those bits. This feature is of great importance for efficient cohort query processing.
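A minimal sketch of fixed-width packing with random access, assuming 64-bit words, non-negative deltas and a non-empty array (helper names are ours):

```python
# Illustrative sketch of fixed-width bit packing: each value occupies
# exactly n bits, so the bits for position i can be located directly,
# without decompressing the whole array.

WORD_BITS = 64

def pack(values):
    n = max(1, max(values).bit_length())   # bits per value
    assert n <= WORD_BITS                  # sketch assumes one word suffices
    per_word = WORD_BITS // n              # values packed per word
    words = []
    for i, v in enumerate(values):
        if i % per_word == 0:
            words.append(0)
        words[-1] |= v << (n * (i % per_word))
    return n, per_word, words

def get(packed, i):
    # Random access: locate the word and bit offset for position i.
    n, per_word, words = packed
    word = words[i // per_word]
    return (word >> (n * (i % per_word))) & ((1 << n) - 1)

# Deltas against the chunk MIN keep the packed values small, e.g.:
# packed = pack([v - chunk_min for v in column_segment])
```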

It should be noted that the proposed hierarchical storage format, although highly customized for cohort query processing, is also applicable to database tables and OLAP cubes with the restriction on the order of (Au, At, Ae) removed. Consequently, one can also support conventional database and cube operators on top of this storage format.

4.2. Cohort Query Evaluation

This section presents how to evaluate a cohort query over the activity table compressed with the techniques proposed in Section 4.1. We shall use the cohort query Q1 as our running example. The overall query processing strategy is as follows. We first generate a logical query plan, and then optimize it by pushing down the birth selections along the plan. Next, the optimized query plan is executed against each data chunk. Finally, all partial results produced by the third step are merged together to produce the final result. The final merging step is trivial, so we only present the first three steps.

The cohort query plan we introduce in this paper is a tree of physical operators consisting of four operators: TableScan, birth selection, age selection and cohort aggregation. Like other columnar databases, the projection operation is implemented in a pre-processing step: we collect all required columns at the query preparation stage and then pass those columns to the TableScan operator, which retrieves the values for each column.

In the query plan, the root and the only leaf node are the cohort aggregation operator and the TableScan operator, respectively, and between them is a sequence of birth selection operators and age selection operators.

Then, we push down the birth selection operators along the query plan such that they are always below the age selection operators. This push-down optimization is always feasible since, according to Equation (1), we can arbitrarily swap the order of operators in any sequence consisting of these two operator types.

Figure 5 shows the query plan for the cohort query Q1. As shown in Figure 5, the root node of the query plan 500 is the cohort aggregation operator 502. The leaf node of the query plan is the TableScan operator 504. As shown in Figure 5, in the query plan the birth selection operator 508 and the age selection operator 506 may be swapped. We employ this push-down optimization since, as we shall see in Section 4.3, a specially designed TableScan implementation can efficiently skip, without further processing, the age activity tuples of users whose birth activity tuples do not satisfy the birth selection condition. Therefore, in terms of the number of activity tuples processed, the cost of evaluating birth selection operators before age selection operators is always less than the cost incurred by the reverse evaluation sequence. After pushing down the birth selections, the resulting query plan is executed against each data chunk. Before the execution, we apply an additional filtering step that utilizes the two-level compression scheme of the Ae column to skip data chunks in which no users perform the birth action e. The concrete processing strategy is presented in Section 4.1. In practice, we find that this intermediate filtering step is particularly useful if the birth action is highly selective (i.e., only a few users performed that birth action). We present the implementation of the physical operators in the rest of this section.
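Since the two selection operator types commute, the push-down itself can be expressed as a stable partition over the linear operator sequence between the root and the leaf. The following sketch models that step only; representing operators as tagged pairs is our illustrative assumption, not the planner's actual data structure:

```python
# Illustrative sketch of the push-down step: Equation (1) lets birth and
# age selections be reordered freely, so a stable partition places every
# birth selection below (closer to the TableScan than) every age selection.
# Operators are modelled as ('birth', pred) / ('age', pred) tags.

def push_down_birth_selections(ops):
    births = [op for op in ops if op[0] == 'birth']
    ages = [op for op in ops if op[0] == 'age']
    # The sequence is written root-first, so age selections come first and
    # birth selections sit immediately above the TableScan leaf.
    return ages + births
```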

4.3. The TableScan Operator

We extend the standard TableScan operator of columnar databases for efficient cohort query processing. The modified TableScan operator performs the scanning operation over the compressed activity table proposed in Section 4.1. We mainly add two functions to a standard columnar database TableScan operator: GetNextUser() and SkipCurUser(). The GetNextUser() function returns the activity tuple block of the next user; the SkipCurUser() function skips the activity tuples of the current user.

The modified TableScan operator is implemented as follows. For each data chunk, in the query initialization stage, the TableScan operator collects all (compressed) chunk columns referenced in the query and maintains for each chunk column a file pointer, which is initialized to point to the beginning of that chunk column. The implementation of the GetNext() function is identical to that of the standard TableScan operator of a columnar database.

The GetNextUser() function is implemented by first retrieving the next triple (u, f, n) of the Au column and then advancing the file pointer of each other referenced column to the beginning of the column segment corresponding to user u. The SkipCurUser() function is implemented in a similar way: when it is called, it first calculates the number of remaining activity tuples of the current user, and then advances the file pointers of all columns by that number.
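An in-memory sketch of this operator is given below. It is a simplification under stated assumptions: decoded columns are plain Python lists, one shared cursor stands in for the per-column file pointers, and the class and method names mirror the description above but the code itself is ours.

```python
# Illustrative sketch of the modified TableScan over one chunk: the RLE
# triples of the Au column drive GetNextUser()/SkipCurUser(), and a
# cursor plays the role of the per-column file pointers.

class TableScan:
    def __init__(self, user_triples, columns):
        self.triples = user_triples   # [(u, f, n), ...] from the Au column
        self.columns = columns        # {name: decoded value list}
        self.t = -1                   # index of the current user triple
        self.pos = 0                  # cursor into the columns
        self.end = 0                  # one past the current user's tuples

    def get_next_user(self):
        self.t += 1
        if self.t >= len(self.triples):
            return None               # chunk exhausted
        u, f, n = self.triples[self.t]
        self.pos, self.end = f, f + n
        return u

    def skip_cur_user(self):
        self.pos = self.end           # jump past the remaining tuples

    def get_next(self):
        if self.pos >= self.end:
            return None               # current user's block exhausted
        row = {name: col[self.pos] for name, col in self.columns.items()}
        self.pos += 1
        return row
```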

4.4. Cohort Algorithms

This section develops algorithms implementing the cohort operators on the proposed storage format for activity tables.

Algorithm 1 presents the implementation of the birth selection operator. It employs an auxiliary function GetBirthTuple(d, e) (line 1 - line 5) for finding the birth activity tuple of user i = d[Au], given that d is the first activity tuple of i in the data chunk and e is the birth action. The GetBirthTuple() function finds i's birth activity tuple by iterating over each subsequent tuple d in D and checking whether d belongs to i and whether d[Ae] is the birth action e (line 3). The first activity tuple d matching the condition is the required birth activity tuple.
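A sketch of GetBirthTuple() built on the TableScan sketch above (Algorithm 1 itself is not reproduced in this text; the row-as-dict framing and the 'action' key are our assumptions):

```python
# Illustrative sketch of GetBirthTuple(d, e): starting from a user's first
# tuple, scan forward within that user's block and return the first tuple
# whose action equals the birth action e. `scan` is a TableScan whose
# cursor sits just past `first_tuple`, as in the sketch above.

def get_birth_tuple(scan, first_tuple, birth_action):
    d = first_tuple
    while d is not None:              # get_next() stops at the block's end
        if d['action'] == birth_action:
            return d                  # first match is the birth tuple
        d = scan.get_next()
    return None                       # user never performed the action
```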

To evaluate the birth selection operator, Algorithm 1 first opens the input data chunk D and initializes the global variable uc (line 7 - line 8), which points to the user currently being processed. In the GetNext() function, we return the next activity tuple d of uc if uc is qualified with respect to the birth selection condition (line 11). If uc's activity tuples are exhausted, we retrieve the next user block by calling the GetNextUser() function of the TableScan operator (line 13). Then, we find the birth activity tuple of the new user and check whether it satisfies the birth selection condition (line 16 - line 17). If the new user is qualified, its birth activity tuple is returned; otherwise, all the activity tuples of this user are skipped using the SkipCurUser() function so that the next user is ready for processing. Therefore, one can continuously call the GetNext() function to retrieve the activity tuples of exactly those users that are qualified with respect to the birth selection condition.
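The control flow just described can be sketched as follows, again on top of the TableScan and GetBirthTuple sketches above (a simplified single-chunk illustration; class and field names are ours):

```python
# Illustrative sketch of the birth selection operator's GetNext() loop:
# stream the tuples of the current qualified user; on exhaustion, advance
# to the next user, test its birth tuple against the predicate, and skip
# the whole user block if the test fails.

class BirthSelect:
    def __init__(self, scan, birth_action, predicate):
        self.scan = scan
        self.birth_action = birth_action
        self.predicate = predicate    # condition on the birth tuple
        self.qualified = False        # is the current user qualified?

    def get_next(self):
        while True:
            if self.qualified:
                d = self.scan.get_next()
                if d is not None:
                    return d
                self.qualified = False            # user exhausted
            if self.scan.get_next_user() is None:
                return None                       # chunk exhausted
            first = self.scan.get_next()
            b = get_birth_tuple(self.scan, first, self.birth_action)
            if b is not None and self.predicate(b):
                self.qualified = True
                return b              # the birth tuple is the first output
            self.scan.skip_cur_user() # skip all tuples of this user
```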

The implementation of the age selection operator is much simpler than that of the birth selection operator. We again employ the user block processing strategy: for each user block, we first locate the birth activity tuple and then return the birth activity tuple together with the qualified age activity tuples.
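A compact sketch of that per-block strategy (the generator framing, and the convention that each block starts with its birth tuple, are our illustrative assumptions):

```python
# Illustrative sketch of the age selection operator: within each user
# block, pass the birth tuple through and keep only the age activity
# tuples that satisfy the age selection predicate.

def age_select(blocks, predicate):
    for block in blocks:              # block[0] is the user's birth tuple
        birth, ages = block[0], block[1:]
        yield [birth] + [d for d in ages if predicate(birth, d)]
```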

Algorithm 2 presents the implementation of the cohort aggregation operator. The main logic is implemented in the Open() function. The function first initializes two hash tables, H and G, which respectively store the cohort size and the per-data-chunk aggregation result for each (cohort, age) partition (line 2 - line 6). Then, the Open() function iterates over each user block and updates H for each qualified user (as determined by the birth selection condition) and G for all qualified age activity tuples (as determined by the age selection condition) (line 10 - line 14). To speed up the query processing, we further follow the suggestions presented in J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In ICDE, pages 152-159, 1996, and in A. Hall, O. Bachmann, R. Bussow, S. Ganceanu, and M. Nunkesser. Processing a trillion cells per mouse click. PVLDB, 5(11):1436-1446, 2012, and use array-based hash tables for aggregation. In practice, we find that the use of array-based hash tables in the inner loop of cohort aggregation significantly improves performance, since modern CPUs can highly pipeline array operations.
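The array-based aggregation can be sketched as below. Mapping cohorts and ages to small dense integer ids so that H and G become flat arrays is the essence of the technique; the parameter names and the sum aggregate are our illustrative choices:

```python
# Illustrative sketch of Algorithm 2's aggregation with array-based hash
# tables: H counts users per cohort, and G holds one aggregate per
# (cohort, age) cell, indexed by cohort * num_ages + age, which pipelines
# well on modern CPUs.

def aggregate(user_blocks, num_cohorts, num_ages, cohort_of, age_of, measure):
    H = [0] * num_cohorts                    # cohort sizes
    G = [0.0] * (num_cohorts * num_ages)     # per-(cohort, age) aggregates
    for block in user_blocks:                # block[0] is the birth tuple
        birth, ages = block[0], block[1:]
        c = cohort_of(birth)
        H[c] += 1
        for d in ages:
            a = age_of(birth, d)
            if 0 <= a < num_ages:
                G[c * num_ages + a] += measure(d)
    return H, G
```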

4.5. Optimizing for User Retention Analysis

One popular application of cohort analysis is to show the trend of user retention. These cohort queries involve counting the number of distinct users for each (cohort, age) combination. This computation is very costly in terms of memory for fields with a large cardinality, such as Au. Fortunately, our proposed storage format has the nice property that the activity tuples of any user are included in exactly one chunk. We therefore implement a UserCount() aggregation function for the efficient counting of distinct users by performing the counting against each chunk and returning the sum of the obtained numbers as the final result.
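Why this works: because no user spans two chunks, per-chunk distinct counts never double-count a user, so they can simply be summed. A minimal sketch, under the simplifying assumption that every user in a chunk is counted (in a real query only qualified users would be):

```python
# Illustrative sketch of UserCount(): each user lives in exactly one
# chunk, and the RLE encoding of Au yields one (u, f, n) triple per user
# per chunk, so distinct users in a chunk is just the triple count, and
# the global count is the sum over chunks. No global distinct structure
# is needed.

def user_count(chunks_user_triples):
    return sum(len(triples) for triples in chunks_user_triples)
```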

4.6. Analysis of Query Performance

Given that there are n users in the activity table D and each user produces m activity tuples, it can be seen that, to evaluate a cohort query composed of birth selection, age selection and cohort aggregation operators, the query evaluation scheme presented so far only needs to process O(l*m) activity tuples in a single pass, where l is the number of users qualified with respect to the birth selection condition. The query processing time therefore grows linearly with l, and approaches optimal performance.

5. Performance Study

This section presents a performance study to evaluate the effectiveness of our proposed COHANA engine. We perform two sets of experiments. First, we study the effectiveness of COHANA and its optimization techniques. In the second set of experiments, we compare the performance of different query evaluation schemes.

5.1 Experimental Environment

All experiments are run on a high-end workstation. The workstation is equipped with a quad-core Intel Xeon E3-1220 v3 3.10GHz processor and 8GB of memory. The disk speed reported by hdparm is 14.8 GB/s for cached reads and 138 MB/s for buffered reads. The dataset we used is produced by a real mobile game application. The dataset consists of 30M activity tuples contributed by 57,077 users worldwide from 2013-05-19 to 2013-06-26, and occupies 3.6GB of disk space in its raw CSV format. In addition to the required user, action and action time attributes, we also include the country, city and role attributes as dimensions, and session length and gold as measures. Users in the game performed 16 actions in total, and we choose the launch, shop and achievement actions as the birth actions. In addition, we manually scale the dataset and study the performance of the three cohort query evaluation schemes on different dataset sizes. Given a scale factor X, we produce a dataset consisting of X times the number of users. Each user has the same activity tuples as in the original dataset, except with a different user attribute.
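The scaling procedure amounts to replaying every user's history under fresh user identifiers. A sketch, under our own assumptions about the tuple layout and renaming scheme (the disclosure only states that each copy carries a different user attribute; the result would be re-sorted by the primary key before loading):

```python
# Illustrative sketch of dataset scaling: for scale factor x, emit x
# copies of every user's activity tuples, each copy under a fresh user
# identifier, leaving all other attributes unchanged.

def scale_dataset(tuples, x):
    out = []
    for k in range(x):
        for (user, *rest) in tuples:
            out.append((f"{user}#{k}", *rest))
    return out
```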

We implement the SQL-based approach and the materialized view (MV) approach on top of two state-of-the-art relational databases: Postgres and MonetDB. For the SQL-based approach, we manually translate the cohort query into SQL queries as exemplified in Figure 1. For the MV approach, we first materialize the view beforehand using the CREATE TABLE AS command. Specifically, for each birth action, we materialize the age and a birth attribute set consisting of the time, role, country and city attributes in its materialized view. This materialization scheme adds 15 additional columns to the original table by performing six joins in total. Given the materialized view, we then follow the method described in Section 3.6 to translate the cohort query into standard SQL queries as well. To speed up the two approaches, we further build a clustered index on the primary key and indices on the birth attributes, and allow the two databases to use all the free memory for buffering during query processing. For COHANA, we choose a chunk size of 256K, that is, each chunk contains 256K activity tuples. We allow slightly more tuples to be included in a chunk where necessary to ensure that all activity tuples of each user are included in a single chunk.

5.2 Benchmark Queries

We design four queries (described with COHANA's cohort query syntax) for the benchmark by incrementally adding the cohort operators proposed in this disclosure. The first query Q1 evaluates a single cohort aggregation operator. The second query Q2 evaluates a combination of birth selection and cohort aggregation. The third query Q3 evaluates a combination of age selection and cohort aggregation. The fourth query Q4 evaluates a combination of all three cohort operators. For each query, we report the average execution time of five runs for each system.

Q1 : For each country launch cohort, report the number of retained users who did at least one action since they first launched the game.

Q2: For each country launch cohort born in a specific date range, report the number of retained users who did at least one action since they first launched the game.

Q3: For each country shop cohort, report the average gold they spent in shopping since they made their first shop in the game.

Q4: For each country shop cohort born with respect to the dwarf role in a given date range, report the average gold they spent in shopping in their birth country.

In order to investigate the impact of the birth selection operator and the age selection operator on the query performance of COHANA, we further design two variants each of Q1 and Q3 by adding to them a birth selection condition (resulting in Q5 and Q6) or an age selection condition (resulting in Q7 and Q8). The details of Q5-Q8 are shown below.

Q5: For each country launch cohort, report the number of retained users who did at least one action during the date range [d1, d2] since they first launched the game.

Q6: For each country shop cohort, report the average gold they spent in shopping during the date range [d1, d2] since they made their first shop in the game.

Q7: For each country launch cohort whose age is less than g, report the number of retained users who did at least one action since they first launched the game.

Q8: For each country shop cohort whose age is less than g, report the average gold they spent in shopping since they made their first shop in the game.

5.3. Performance Study of COHANA

In this section we report on a set of experiments in which we vary the chunk size and the birth/age selection conditions and investigate how COHANA adapts to such variation.

5.3.1. Effect of Chunk Size

Figures 6 and 7 present the storage space COHANA requires for the activity table compressed with different chunk sizes, and the corresponding query performance.

Figures 6a to 6d show the query performance for queries Q1 to Q4 respectively with different chunk sizes.

Figure 7 shows the storage size requirements for different chunk sizes.

It is clearly seen from Figure 7 that increasing the chunk size also increases the storage cost. This is because an increase in the size of a chunk leads to more users being included in that chunk. As a result, the number of distinct values in the columns of each chunk also increases, which in turn requires more bits for encoding the values. We also observe that cohort queries can be processed slightly faster with a smaller chunk size than with a larger one. This is expected, as fewer bytes are read. However, for large datasets, a larger chunk size can be a better choice. For example, at scale 64, COHANA processes Q1 and Q3 most efficiently with a 1M chunk size. This is because the processing of Q1 and Q3 at scale 64 is dominated by disk accesses, whose granularity is normally a 4KB block. Compared with a large chunk size, a small one causes a larger portion of the neighboring columns to be read simultaneously when reading a compressed chunk column, and hence results in a longer disk read time and lower memory efficiency due to the memory contention between the useful columns and their unused neighbors within the same chunk.

5.3.2. Effect of Birth Selection

In Section 4.6, we showed that the running time of COHANA grows linearly with the number of users qualified with respect to the birth selection condition. This experiment studies the query performance of COHANA with respect to the selectivity of the birth selection. We run Q5 and Q6, which are respectively variants of Q1 and Q3, fixing d1 to be the earliest birth date and incrementing d2 by one day each time. The dataset used in this experiment is at scale 1.

Figure 8 presents the processing times of Q5 and Q6, normalized respectively by those of Q1 and Q3. The cumulative distribution function (CDF) of user births is also given in this figure. We do not differentiate the birth distributions between the launch and shop birth actions, as the birth distributions with respect to both birth actions are similar. It can be clearly observed from this figure that the processing time of Q5 closely follows the birth distribution. We attribute this to the optimization of pushing down the birth selection operator and to the refined birth selection algorithm, which is capable of skipping unqualified users. The processing time of Q6, however, is not very sensitive to the birth distribution. This is because in Q6 users are born with respect to the shop action, and there is a cost in finding the birth activity tuple for each user. This cost is avoided in Q5, as the first activity tuple of each user is that user's birth activity tuple (recall that the first action each user performed is launch).

5.3.3. Effect of Age Selection

Figure 9 shows the effect of age selection. In this experiment, we run Q7 and Q8, further variants of Q1 and Q3, on the dataset of scale 1, varying g from 1 day to 14 days to study the query performance of COHANA under different age selection conditions. As with the results presented in Figure 8, the processing times of Q7 and Q8 are normalized respectively by those of Q1 and Q3. It can be seen from this figure that the processing times of Q7 and Q8 exhibit different trends.

5.4. Comparative Study

Figure 10 shows a comparison of the time taken to generate the materialized view for COHANA, MonetDB and Postgres (PG). As shown in Figure 10, at scale 64, MonetDB needs more than 60,000 seconds (16.7 hours) to generate the materialized view from the original activity table. This time cost is even higher for Postgres, which needs more than 100,000 seconds (27.8 hours) at scale 32. The result for Postgres at scale 64 is not available, as Postgres is unable to generate the materialized view without using up all free disk space, which also implies a high storage cost during the generation of the materialized view. In sharp contrast, COHANA only needs 1.25 hours to compress the activity table at scale 64.

Figure 11 shows, for each scale factor, the execution time each system takes to execute the four queries. The results of the Postgres and MonetDB databases are respectively shown in the lines labelled "PG-S/M" and "MONET-S/M", where "S" and "M" respectively denote the SQL and materialized view approaches. As expected, the SQL-based approach is the slowest, as it needs multiple joins to process cohort queries. With the elimination of joins, the materialized view based approach reduces the query processing time by an order of magnitude. This figure also shows the power of columnar storage for cohort query processing: MonetDB, a state-of-the-art columnar database, can be up to two orders of magnitude faster than Postgres.

Although the combination of a materialized view and columnar storage can handle cohort queries reasonably well on small datasets, it is not able to handle large datasets. For example, this approach takes half an hour to process Q1 at scale 64. The proposed system, COHANA, performs extremely well not only on small datasets but also on large ones. Moreover, for each query, COHANA performs better than MonetDB equipped with the materialized view at every scale. The performance gap between them is one to two orders of magnitude in most cases, and can be up to three orders of magnitude (Q4 at scale 32). We also observe that the two retention queries (Q1 and Q2) enjoy a larger performance gain than Q3 does, and attribute this in part to the optimization for user retention analysis presented in Section 4.5. Finally, the generation of the materialized view is much more expensive than COHANA's compression of the activity table.

6. Conclusion

Cohort analysis is a powerful tool for finding unusual user behavioral trends in large activity tables. Embodiments of the present invention provide support for cohort analysis. To this end, we have introduced an extended relation for modeling activity data and extended SQL with three new operators for composing cohort queries. We have developed a columnar query engine, COHANA, for efficient cohort query processing. Our experimental results show that COHANA can achieve query performance two orders of magnitude faster than simply running SQL queries over conventional database systems, demonstrating the benefit of extending a database system for cohort queries over implementing cohort queries on top of it.

Whilst the foregoing description has described exemplary embodiments, it will be understood by those skilled in the art that many variations of the embodiments can be made within the scope and spirit of the present invention.