Title:
LARGE-SCALE PROCESSING OF DATA RECORDS WITH EFFICIENT RETRIEVAL
Document Type and Number:
WIPO Patent Application WO/2020/170187
Kind Code:
A1
Abstract:
A method and system are provided for processing of data records. The method includes processing raw data records including a first type having static values and a second type relating to transactional records with timestamps of events. The data records are filtered and transformed to standardized formats. The transformed data records are persisted into a first data store for static data relating to entities for retrieval based on static values, and a second data store for transactional data for retrieval based on timestamps. Different categories of transactional records are grouped according to the unique identifiers of entities. The categories are persisted to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed. The persisted data is used for retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

Inventors:
LELIS STELIOS (AE)
CHATZISTAMATIOU ANTONIOS (AE)
Application Number:
PCT/IB2020/051423
Publication Date:
August 27, 2020
Filing Date:
February 20, 2020
Assignee:
CHANNEL TECH FZE (AE)
International Classes:
G06F16/21; G06F16/25; G06F16/28; G06Q40/02
Foreign References:
US20170323280A12017-11-09
US9542710B12017-01-10
US20130124392A12013-05-16
US20180374119A12018-12-27
KR100777670B12007-11-19
Claims:
CLAIMS

1. A computer-implemented method for large-scale processing of data records, comprising:

receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events;

filtering and transforming the data records to standardized formats;

persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events;

grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and

retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

2. The method as claimed in claim 1, wherein retrieving features relating to an entity includes feature generation with dimensions of behaviour of the entity at a given reference point in time, and includes:

retrieving features from different tables;

joining the features; and

transforming the features to the reference point in time.

3. The method as claimed in claim 1, wherein the second data store is partitioned for time periods and/or by type of event.

4. The method as claimed in claim 1, wherein the category of transactional records that is most frequently accessed is open transactions that are ongoing.

5. The method as claimed in claim 1, including processing the static data records first to obtain identifying information of the entities.

6. The method as claimed in claim 1, wherein the raw data records relate to telecommunications and the entities are users, and wherein the first type of data records are user data records with static values of user attributes and the second type of data records are call detail records of events.

7. The method as claimed in claim 6, including generating a unique identifier for each user as a combination of a Mobile Station International Subscriber Directory Number (MSISDN) and a Subscriber Identity Module (SIM) activation date.

8. The method as claimed in claim 6, wherein filtering and transforming the data records to standardized formats include applying filtering rules to extract relevant records from transactional records including one or more of the group of: records relating to events including: calls, messages, monetary events, data usage events; lifecycle events; advance usage events; loan events; mobile wallet events.

9. The method as claimed in claim 6, including grouping different categories of transactional records according to the unique identifiers of an entity includes categories of open credit advances and closed credit advances, and a category of transactional records that is most frequently accessed are open credit advances available for subsequent analysis.

10. The method as claimed in claim 1, wherein the output relating to an entity is a user profile for subsequent credit risk analysis.

11. A system for large-scale processing of data records, the system including a memory for storing computer-readable program code and a processor for executing the computer-readable program code, the system comprising:

a data receiving component for receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events;

a transforming component for transforming data records to standardized formats;

a filtering component for filtering relevant data records;

a data store persisting component for persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events;

a grouping component for grouping different categories of transactional records according to the unique identifiers of entities and a table persisting component for persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and

an entity profile component for retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

12. The system as claimed in claim 11, wherein the entity profile component for retrieving features relating to an entity includes feature generation with dimensions of behaviour of the entity at a given reference point in time, including:

a feature retrieving component for retrieving features from different tables;

a feature joining component for joining the features; and

a feature transforming component for transforming the features to the reference point in time.

13. The system as claimed in claim 11, wherein the second data store is partitioned for time periods and/or by type of event.

14. The system as claimed in claim 11, wherein the category of transactional records that is most frequently accessed is open transactions that are ongoing.

15. The system as claimed in claim 11, wherein the raw data records relate to telecommunications and the entities are users, and wherein the first type of data records are user data records with static values of user attributes and the second type of data records are call detail records of events.

16. The system as claimed in claim 15, including a unique identifier component for generating a unique identifier for each user as a combination of a Mobile Station International Subscriber Directory Number (MSISDN) and a Subscriber Identity Module (SIM) activation date.

17. The system as claimed in claim 15, wherein the filtering component applies filtering rules to extract relevant records from transactional records including one or more of the group of: records relating to events including: calls, messages, monetary events, data usage events; lifecycle events; advance usage events; loan events; mobile wallet events.

18. The system as claimed in claim 15, wherein the grouping component includes grouping categories of open credit advances and closed credit advances, and a category of transactional records that is most frequently accessed are open credit advances available for subsequent analysis.

19. A computer program product for large-scale processing of data records, the computer program product comprising a computer-readable medium having stored computer-readable program code for performing the steps of:

receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events;

filtering and transforming the data records to standardized formats;

persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity and a second data store for transactional data for retrieval based on timestamps of events;

grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and

retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

20. A computer-implemented method for large-scale processing of telecommunication data records, comprising:

receiving raw data records including a first type of data records having static values relating to a user entity and a second type of data records relating to transactional records of telecommunication events with timestamps of events;

filtering and transforming the data records to standardized formats;

persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events;

grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed relating to open credit advances to an entity; and

retrieving features relating to an entity from multiple tables for processing to provide an output in the form of a credit score relating to the entity.

Description:
LARGE-SCALE PROCESSING OF DATA RECORDS WITH EFFICIENT RETRIEVAL

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from United Kingdom patent application number 1902413.2 filed on 22 February 2019, which is incorporated by reference herein.

FIELD OF THE INVENTION

This invention relates to processing of data records with efficient retrieval. In particular, the invention relates to large-scale processing of data records in the field of telecommunications.

BACKGROUND TO THE INVENTION

Mobile Network Operator (MNO) subscribers regularly consume services in advance of payment and/or receive loans through the operator. For example, prepaid subscribers (i.e. subscribers who have to first prepay airtime to use the network) are, in many MNOs, offered the option to consume airtime and mobile bundles in advance of payment, and pay back with a next recharge. Other types of loans that MNO subscribers may receive, are the crediting of money to their mobile wallet, the provisioning of a good (e.g. a mobile phone) in advance of payment, etc.

The consumption of services in advance and loans incur, in most of the cases, a cost to the subscriber. The cost is a fee or interest that is paid upon repaying the advance or loan. This cost is a gain for the party distributing the advance/loan (either the MNO or a third party). On the other hand, if a subscriber does not pay back the advance or loan, the party realizes losses.

The entity distributing the advances or loans seeks to generate profit by maximizing the gains while minimizing the losses. A way to achieve this goal is to perform credit analysis for each subscriber, determine his/her credit worthiness, and assign an appropriate credit limit to each subscriber, including a credit limit of zero indicating that the subscriber will not be able to borrow.

Usually credit analysis is one of the functions of banking institutions and is based on financial data, such as banking transactions or information on assets held by the client. In general, the MNOs do not possess such data for the subscribers of their network. MNOs have only data about the usage of their network and in some cases basic demographic data about their subscribers.

The data that the MNOs can provide, namely Call Detail Records (CDR), also called Event Data Records (EDR), and the limited demographic data in the form of Know Your Customer (KYC) data are difficult to analyse for credit analysis and other uses as these data, CDR and KYC, do not usually follow a well-defined format and require large-scale processing. Such challenges may make use of this data for frequent (e.g. daily) updates to credit scoring especially difficult.

The preceding discussion of the background to the invention is intended only to facilitate an understanding of the present invention. It should be appreciated that the discussion is not an acknowledgment or admission that any of the material referred to was part of the common general knowledge in the art as at the priority date of the application.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention there is provided a computer-implemented method for large-scale processing of data records, comprising: receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events; filtering and transforming the data records to standardized formats; persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events; grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

The raw data records may be large-scale raw data records and may have no single or well-defined format. The first data store may be configured for efficient retrieval based on static values including a unique identifier of an entity. The second data store may be configured for efficient retrieval based on timestamps of events.

Retrieving features relating to an entity may include feature generation with dimensions of behaviour of the entity at a given reference point in time, and may include: retrieving features from different tables; joining the features; and transforming the features to the reference point in time.

The second data store may be partitioned for time periods and/or by type of event. The category of transactional records that is most frequently accessed may be open transactions that are ongoing. The method may process the static data records first to obtain identifying information of the entities.

In one embodiment, the raw data records relate to telecommunications and the entities are users, and wherein the first type of data records are user data records with static values of user attributes and the second type of data records are call detail records of events. The method may include generating a unique identifier for each user as a combination of a Mobile Station International Subscriber Directory Number (MSISDN) and a Subscriber Identity Module (SIM) activation date.

In this embodiment, filtering and transforming the data records to standardized formats may include applying filtering rules to extract relevant records from transactional records including one or more of the group of: records relating to events including: calls, messages, monetary events, data usage events; lifecycle events; advance usage events; loan events; mobile wallet events. Grouping different categories of transactional records according to the unique identifiers of an entity may include categories of open credit advances and closed credit advances, and a category of transactional records that is most frequently accessed are open credit advances available for subsequent analysis. The output relating to an entity may be a user profile for subsequent credit risk analysis.

According to another aspect of the present invention there is provided a system for large-scale processing of data records, the system including a memory for storing computer-readable program code and a processor for executing the computer-readable program code, the system comprising: a data receiving component for receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events; a transforming component for transforming data records to standardized formats; a filtering component for filtering relevant data records; a data store persisting component for persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events; a grouping component for grouping different categories of transactional records according to the unique identifiers of entities and a table persisting component for persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and an entity profile component for retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

The entity profile component for retrieving features relating to an entity may include feature generation with dimensions of behaviour of the entity at a given reference point in time, and may include: a feature retrieving component for retrieving features from different tables; a feature joining component for joining the features; and a feature transforming component for transforming the features to the reference point in time.

In one embodiment, the raw data records relate to telecommunications and the entities are users, and wherein the first type of data records are user data records with static values of user attributes and the second type of data records are call detail records of events. In such an embodiment, the system may include a unique identifier component for generating a unique identifier for each user as a combination of a Mobile Station International Subscriber Directory Number (MSISDN) and a Subscriber Identity Module (SIM) activation date. The filtering component may apply filtering rules to extract relevant records from transactional records including one or more of the group of: records relating to events including: calls, messages, monetary events, data usage events; lifecycle events; advance usage events; loan events; mobile wallet events. The grouping component may include grouping categories of open credit advances and closed credit advances, and a category of transactional records that is most frequently accessed are open credit advances available for subsequent analysis.

According to a further aspect of the present invention there is provided a computer program product for large-scale processing of data records, the computer program product comprising a computer-readable medium having stored computer-readable program code for performing the steps of: receiving raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events; filtering and transforming the data records to standardized formats; persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events; grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed; and retrieving features relating to an entity from multiple tables for processing to provide an output relating to the entity.

Further features provide for the computer-readable medium to be a non-transitory computer-readable medium and for the computer-readable program code to be executable by a processing circuit.

According to a further aspect of the present invention there is provided a computer-implemented method for large-scale processing of telecommunication data records, comprising: receiving raw data records including a first type of data records having static values relating to a user entity and a second type of data records relating to transactional records of telecommunication events with timestamps of events; filtering and transforming the data records to standardized formats; persisting the transformed data records into two different data stores, including a first data store for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store for transactional data for retrieval based on timestamps of events; grouping different categories of transactional records according to the unique identifiers of entities and persisting the categories to different tables including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed relating to open credit advances to an entity; and retrieving features relating to an entity from multiple tables for processing to provide an output in the form of a credit score relating to the entity.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

Figure 1 is a flow diagram showing an example embodiment of a method in accordance with the invention;

Figure 2 is a flow diagram showing a further example embodiment of an aspect of the method of Figure 1;

Figure 3 is a block diagram of an example embodiment of a system in accordance with the invention;

Figure 4 illustrates an example of a computing device in which various aspects of the disclosure may be implemented.

DETAILED DESCRIPTION WITH REFERENCE TO THE DRAWINGS

The described method and system provide for processing of data records, typically large-scale processing of data records, in which the data records include a first type of data records having static values and a second type of data records relating to transactional records with timestamps of recorded events. Large-scale processing may refer to processing hundreds of millions to billions of data records upwards. In the described example embodiments, the data records are in the form of telecommunication records including a first type of data records being customer demographic records such as Know Your Customer (KYC) records and a second type of data records being Call Detail Records (CDR). However, the described method and system may equally apply to other forms of data records in which there are similar first and second types of data records.

An application of the described method and system is to utilize the telecommunications data for credit scoring subscribers of Mobile Network Operators (MNO), and subsequently using the credit score to assign appropriate credit limits. This may be applied to any type of lending at the MNO including, but not limited to, the provisioning of network usage advances, monetary loans credited on mobile wallets, consumer loans for the purchase of goods (e.g. mobile phones), etc.

Referring to Figure 1, a flow diagram (100) shows an example embodiment of the described method as carried out by a computer-implemented data processing system. The method is described in general terms that are applicable to different forms of data records, with examples given in the field of telecommunication data records. The method incorporates or ingests raw input data records to a data store that is usable for further analysis of the records.

The method may receive and process (101) raw data input records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events. The raw data input records may be large-scale raw data input records. The data input records in the form of telecommunication records may include the first type of data records being customer demographic records such as Know Your Customer (KYC) records and the second type of data records being Call Detail Records (CDR).

The raw telecommunication data which are the input to the system may be provided in a text format (e.g., comma-separated files) or in a binary format. CDRs are of a transactional nature: each describes an event that took place at a specific moment in time, and its attributes (e.g., timestamp) are immutable. KYC records, on the other hand, refer to attributes of a subscriber that are mostly static (e.g., the date a subscriber joined a Mobile Network is a static piece of information), while new attributes may become known in the future with new records.

The method may process (102) the first type of data records having static values first as they contain identifying information for entities to which the data records relate. In the example where data records relate to telecommunications, entities may be users in the form of customers or subscribers to mobile telecommunication services and the first type of data records may be customer data records (e.g., KYC records) with static values of user attributes. The system may perform many passes over the data if needed and may also combine data from different sources and in different formats.

The method may configure (103) data record fields to a standard format. This may be applied to both the first and second types of data records to ensure that fields use a consistent, standard format. For example, configuration may be specific to different MNOs and such configuration items include the time-zone(s) in the country served by the MNO, the currency used, the country calling code, and others.

The method may extract (104) specific values from the raw data in a usable format whilst removing information that is not needed in the subsequent analysis. The purpose of this step is to extract, in a usable format, all the necessary values from the input records. Values include the MSISDN (or other identifier) of the subscriber, the SIM activation date for the subscriber, the timestamp of the event, the duration of the event, if applicable, the monetary amount involved, and other event-specific values. This may extract a unique identifier for an entity, such as a user identifier, if available in the first type of data records.

The method may filter (105) the data records to determine the event type of each input record by applying filtering rules. These rules are either provided by the MNO along with the data or they are constructed after analysis of the data. The event type in telecommunications data may be Recharge, Network Usage Advance, Repayment, Bundle purchase, P2P transfer, etc. The raw input data records may include records that are not needed in the subsequent analysis (such as records regarding the internal workings of the systems of the MNO). The purpose of this step is to classify each record and reject any unnecessary records.
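The classification step can be illustrated with a minimal sketch. The field name `service_code` and the code values are hypothetical, since the actual rules are supplied by the MNO or derived from analysis of the data; only the idea of assigning an event type and rejecting unmatched records comes from the description above.

```python
from typing import Callable, Optional

# Hypothetical filtering rules: each pairs an event type with a predicate
# over a raw record. "service_code" and its values are assumptions.
RULES: list[tuple[str, Callable[[dict], bool]]] = [
    ("Recharge", lambda r: r.get("service_code") == "RCH"),
    ("Network Usage Advance", lambda r: r.get("service_code") == "ADV"),
    ("Repayment", lambda r: r.get("service_code") == "RPY"),
]


def classify(record: dict) -> Optional[str]:
    """Return the event type of the first matching rule, or None to reject."""
    for event_type, predicate in RULES:
        if predicate(record):
            return event_type
    return None


def filter_records(records: list[dict]) -> list[dict]:
    """Keep only records that match a rule, tagging each with its event type."""
    kept = []
    for record in records:
        event_type = classify(record)
        if event_type is not None:
            kept.append({**record, "event_type": event_type})
    return kept
```

Records that match no rule (for example, records about the internal workings of the MNO systems) are simply dropped in this sketch.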

The method may transform (106) the filtered data records into a standard format for each type of data record. After the transformation all records of a specific type (e.g., all Recharge records) have the same schema (i.e., the same set of fields) and all values conform to a standardized format. The same schema/format combination is used for all records of a specific type from all MNOs. Example transformations may include:

• the appropriate country calling code is prepended to the MSISDN;

• timestamps are converted to the ISO 8601 format using UTC as time-zone;

• amounts are converted to the appropriate currency unit;

• event-specific data are converted to standardized values.
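The first three transformations listed above might look as follows. This is a sketch under stated assumptions: the `MnoConfig` fields, the raw field names, the raw timestamp layout, and the use of a single fixed UTC offset per MNO are all hypothetical simplifications and are not taken from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


@dataclass(frozen=True)
class MnoConfig:
    # Hypothetical per-MNO configuration items.
    country_calling_code: str    # e.g. "971"
    utc_offset_hours: int        # assumes one fixed offset for the whole country
    minor_units_per_unit: int    # e.g. 100 if raw amounts are given in cents


def standardize(raw: dict, cfg: MnoConfig) -> dict:
    """Convert one raw record to the standardized field formats."""
    # Prepend the country calling code to the MSISDN.
    msisdn = cfg.country_calling_code + raw["msisdn"]

    # Interpret the raw timestamp as local time, then convert to ISO 8601 in UTC.
    local_tz = timezone(timedelta(hours=cfg.utc_offset_hours))
    local = datetime.strptime(raw["timestamp"], "%Y-%m-%d %H:%M:%S")
    utc_iso = local.replace(tzinfo=local_tz).astimezone(timezone.utc).isoformat()

    # Convert the amount to the main currency unit, avoiding float rounding.
    amount = Decimal(raw["amount"]) / cfg.minor_units_per_unit

    return {"msisdn": msisdn, "timestamp": utc_iso, "amount": amount}
```

A country with several time zones or daylight saving time would need a richer configuration than the single offset assumed here.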

The method may generate (107) a unique identifier for each entity, if this is not obtained from the static data values in the first type of data records in step (104) above. For example, the unique identifier for each user entity may be a combination of Mobile Station International Subscriber Directory Number (MSISDN) and Subscriber Identity Module (SIM) activation date. The MSISDN alone is not sufficient, since it is possible that it is reused (i.e., after a subscriber terminates its subscription the MSISDN in question may be assigned to a new subscriber). Therefore, the method may generate a combination of MSISDN with the SIM activation date, if available, which uniquely identifies each subscriber. The SIM activation date should be available from a previous processing step. The generated unique identifier is added to the stored data records.
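A minimal sketch of such an identifier is given below. The underscore separator, the date layout, and the fallback to the bare MSISDN when no activation date is known are assumptions made for illustration; the description only states that the two values are combined.

```python
from datetime import date
from typing import Optional


def subscriber_id(msisdn: str, sim_activation_date: Optional[date]) -> str:
    """Combine the MSISDN with the SIM activation date into one identifier."""
    if sim_activation_date is None:
        # Assumed fallback: without an activation date only the MSISDN is usable.
        return msisdn
    return f"{msisdn}_{sim_activation_date:%Y%m%d}"
```

Two subscribers who held the same MSISDN at different times receive different identifiers because their SIM activation dates differ.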

The method may persist (108) the transformed records to an appropriate data store that supports a large amount of records, in the order of billions or more.

The first type of data records with static values (e.g., the first KYC data records) including a generated or extracted unique identifier for entities may be persisted in a data store that provides efficient retrieval based on the unique identifier for the entity (e.g., the MSISDN concatenated with the SIM activation date).

The second type of data records in the form of transactional records (e.g., the CDR data records) may be persisted in a data store that provides efficient retrieval based on the timestamp of the event. The records may be partitioned according to their type. For more efficient retrieval of information, the records may be further partitioned per day using the partitioning methods provided by the data store (e.g., if the data store consists of comma-separated files in a typical filesystem, the partitioning may be implemented by using a different file or folder for each day).
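For the filesystem example, partitioning by type and by day could be sketched as follows. The directory layout (`root/type/YYYY-MM-DD`) and the function names are hypothetical; the point shown is that a query over a date range only needs to touch the folders for those days.

```python
from datetime import date, datetime, timedelta
from pathlib import PurePosixPath


def partition_dir(root: str, event_type: str, day: date) -> PurePosixPath:
    """Folder holding all records of one event type for one day."""
    type_dir = event_type.lower().replace(" ", "_")
    return PurePosixPath(root) / type_dir / day.isoformat()


def partition_for(root: str, event_type: str, timestamp_iso: str) -> PurePosixPath:
    """Folder in which a record with the given ISO 8601 timestamp is stored."""
    return partition_dir(root, event_type, datetime.fromisoformat(timestamp_iso).date())


def partitions_between(root: str, event_type: str, start: date, end: date) -> list[PurePosixPath]:
    """Folders to read for an inclusive date range, one per day."""
    days = (end - start).days
    return [partition_dir(root, event_type, start + timedelta(days=i)) for i in range(days + 1)]
```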

The method may group (109) different categories of transactional records according to the unique identifiers of an entity and persist the categories to different tables, including maintaining a table of stable size for efficient access of a category of transactional records that is most frequently accessed. One category may be ongoing, open, transactional records and another category may be closed transactional records. The closed transactional records will keep growing in size, whereas the open transactional records will be generally stable in size as some transactions close and some new ones open.

In the telecommunication data example, this step may examine all advances and recoveries of credit to a subscriber in order to determine which advances have been repaid, termed “closed”, and which advances have not been repaid, termed “open”. Advances and repayments may be retrieved from the data store and grouped together per subscriber (using the unique subscriber identifier). The input to this grouping includes data from previous executions of the method, namely any advances that are still open. The records in each such group are sorted by timestamp, from the least to the most recent.

Each group is then examined separately. If an advance is encountered it is recorded for further examination, while if a repayment is encountered it is matched, partially or fully, with a recorded advance, according to the rules set forth by the MNO. A possible rule is that a repayment matches the earliest open advance. An advance that has been fully matched with repayments is marked as closed.

At the end of this process advances that are still open are persisted in one table of the data store and advances that have been closed are persisted in a different table. In one embodiment, there may be one table with all open advances, and in another embodiment, there may be one document table with one document per subscriber. The reason for this split is to ensure efficient retrieval of open advances for subsequent analysis. The count of the open advances is quite stable since in normal operation, at each execution of the system, the count of new advances is in the same order of magnitude as the count of older advances that are closed in said execution. Therefore, the table holding the open advances remains stable in size (which translates to more efficient access) while the table holding the closed advances grows in size after each execution of the system. A separate table is used in order to efficiently update the table. As it is only open advances that are updated, the retrieval and update is faster if there is a table with only open advances.
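The matching of repayments to advances for one subscriber group may be sketched as follows, assuming the example rule that a repayment matches the earliest open advance. The class and function names, the integer amounts and the handling of a surplus repayment (it is discarded) are assumptions of this sketch; the actual rules are set by the MNO.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Advance:
    subscriber_id: str
    timestamp: int      # event time, e.g. seconds since the epoch
    amount: int         # amount advanced, in minor currency units
    outstanding: int    # part of the amount not yet repaid


@dataclass
class Event:
    subscriber_id: str
    timestamp: int
    kind: str           # "advance" or "repayment"
    amount: int


def match_group(open_advances: List[Advance],
                events: Iterable[Event]) -> Tuple[List[Advance], List[Advance]]:
    """Return (still_open, newly_closed) for one subscriber.

    Carried-over open advances are assumed to predate the new events.
    The Advance objects passed in are updated in place.
    """
    pending = sorted(open_advances, key=lambda a: a.timestamp)
    closed: List[Advance] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind == "advance":
            pending.append(Advance(event.subscriber_id, event.timestamp,
                                   event.amount, event.amount))
            continue
        remaining = event.amount
        while remaining > 0 and pending:
            earliest = pending[0]
            paid = min(remaining, earliest.outstanding)
            earliest.outstanding -= paid
            remaining -= paid
            if earliest.outstanding == 0:
                closed.append(pending.pop(0))
        # Assumption: any surplus repayment with no open advance is ignored.
    return pending, closed
```

The first returned list would be written to the table of open advances and the second appended to the table of closed advances, so that the next execution only needs to read the smaller open table.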

The method may retrieve (110) features relating to an entity from one or more of the multiple tables for processing to provide an output relating to the entity. Feature generation is carried out with dimensions of behaviour of the entity at a given time including: retrieving features from different tables, joining these, and transforming the extracted features to a reference point.

An output relating to an entity may relate to a user and may include a list with entity identifiers and credit limits with optional additional information. The output may be an extraction of potential borrower’s profiles, development of credit scoring models, and methods of credit scoring subscribers and assigning credit limits.

Referring to Figure 2, a flow diagram (200) shows an example embodiment of an aspect of processing data records in accordance with the method of Figure 1.

The method may receive (201) a next transactional data record from the raw input data records and may extract (202) record values. Filtering rules may be applied (203) and the type of record determined (204).

It may be determined (205) if the record should be persisted to a data store. If the record is not persisted to a data store, the method may loop to process a next record (201). If the record is persisted to a data store, the record may be transformed (206) to a standard format and a unique identifier may be used (207) for the record. The unique identifier may be generated (207) from the record if this is not available from the extracted record values. The record may be persisted (208) to the data store.

It may be determined (209) if there is another record to process. If so, the method loops to process a next record (201). If there are no further records to process, the method may update (210) an output status such as a loan status for telecommunication subscribers.
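The loop of Figure 2 may be sketched as follows. The extraction, filtering, transformation and identifier steps are passed in as functions because their details depend on the data source; all names, the dictionary record format and the `entity_id` field are hypothetical, and a list stands in for the data store.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def ingest(raw_lines: Iterable[str],
           extract: Callable[[str], Record],
           should_persist: Callable[[Record], bool],
           transform: Callable[[Record], Record],
           make_id: Callable[[Record], str],
           store: List[Record]) -> int:
    """Process raw records one by one and return the number persisted."""
    persisted = 0
    for line in raw_lines:                    # receive next record (201)
        values = extract(line)                # extract record values (202)
        if not should_persist(values):        # filter and decide (203)-(205)
            continue
        record = transform(values)            # standard format (206)
        if "entity_id" not in record:         # generate identifier if missing (207)
            record["entity_id"] = make_id(record)
        store.append(record)                  # persist (208)
        persisted += 1
    return persisted
```

The status update of step (210) would follow once the loop has finished and is omitted from this sketch.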

The method may extract the following event records from the telecommunications data.

• From CDR records:

Call events, which are events where a subscriber of the MNO makes or receives a call.

Short Message Service (SMS) events, which are events where a subscriber of the MNO sends or receives an SMS.

Monetary Recharge events (also called “top-ups”), which are events where a subscriber of a pre-paid mobile service spends an amount of money in order to increase his account balance with the Mobile Network Operator and in so doing keep using the service in the future.

Person to Person (P2P) transfers, which are events where a subscriber transfers a portion of his/her account balance to another subscriber.

Bundle purchases, which are events where a subscriber buys a product offered by the Operator that combines a number of services (voice, data, etc.) with specific volumes for each product. Such products are commonly referred to as bundles.

Bundle activations, which are events where a subscriber buys with money from his main account (and therefore already credited to the MNO) a product offered by the Operator that combines a number of services (voice, data, etc.) with specific volumes for each product.

Data usage, which are events of data consumption.

Lifecycle events, which are events of changes in the status of the subscribers at the MNO.

Network airtime usage advances, which are events where airtime/bundles are granted to a subscriber beyond their account balance or contract. These advances are paid back by the subscriber in a future Recharge event.

Network usage advance repayments, which are events where network usage advances are repaid.

Mobile wallet transactions, which are debit and credit events on the subscriber’s mobile wallet.

Mobile wallet loans, which are events where a monetary amount is credited on the mobile wallet of the subscriber with the condition to be paid later commonly along with a fee.

Mobile wallet loan repayments, which are events where mobile wallet loans are repaid.

Any other type of records that can be extracted from the CDRs.

• From Know Your Customer records: demographic information about subscribers, such as the subscriber’s name, date of birth, or address, and data about the relationship of the subscriber to the network, such as price plan selected or subscription date.

Given the input telecommunication data, the output of the method consists of a list in which each item consists of an identifier that uniquely identifies a subscriber, the credit limit for that subscriber, and additional information that encapsulates the analysis that led to the specific credit limit. The method may provide a list of any length, ranging from one subscriber to millions, and provides the output in real time for small lists and offline for large lists.

The method described in Figure 1 describes a method of efficient data ingestion including incorporating the provided raw input records into a data store that is usable for analysis. This may be used for different applications, one of which is for entity profiling (including user profiling), credit scoring and limit setting for telecommunication subscribers.

An entity profiling method, described here in the context of user profiling, may be provided that returns user profiles for subsequent credit risk analysis and limit assignment. A user profile is a collection of aggregate and static values derived from the user data processed in the data ingestion method described with reference to Figure 1. These values are called features and describe the user’s behaviour given a reference time point. The method may provide user profiles in the format required for credit risk analysis and limit assignment.

The method generates thousands of features describing the behaviour of users at a given reference point in time. The features are aggregates of transactional events (recharges, advances, etc.), categorical and cardinal features based on KYC, and several combinations of these. The features describe several dimensions of user behaviour on different time frames. Dimensions of user behaviour include recency, loyalty, frequency, and aggregates of amounts and durations. They also include minimums, maximums, counts, means, standard deviations, medians, quartiles and trends of counts, amounts and durations. In addition, they include ratios, and average amounts and activity. Time frames over which dimensions are expressed before the reference point include recent weeks, fortnights and months, as well as quarters, semesters and years. In addition, several features describe the dimensions across the full duration of a service/product usage by the subscriber.
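As a minimal sketch of such features, a few recharge aggregates over two time frames before a reference point may be computed as follows. The feature names, the 7-day and 30-day windows and the function name are illustrative assumptions and do not correspond to the actual feature set.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def recharge_features(events: List[Tuple[datetime, int]],
                      reference: datetime,
                      windows_days: Sequence[int] = (7, 30)) -> Dict[str, float]:
    """Aggregate (timestamp, amount) recharge events before a reference point."""
    # Only events strictly before the reference point may be used.
    past = [(ts, amount) for ts, amount in events if ts < reference]
    features: Dict[str, float] = {}
    for days in windows_days:
        start = reference - timedelta(days=days)
        amounts = [amount for ts, amount in past if ts >= start]
        features[f"recharge_count_{days}d"] = len(amounts)
        features[f"recharge_sum_{days}d"] = sum(amounts)
        features[f"recharge_mean_{days}d"] = mean(amounts) if amounts else 0.0
        features[f"recharge_max_{days}d"] = max(amounts) if amounts else 0.0
    if past:
        # Recency: whole days between the last recharge and the reference point.
        last = max(ts for ts, _ in past)
        features["days_since_last_recharge"] = (reference - last).days
    return features
```

Excluding events at or after the reference point keeps the profile consistent with what was known at that time, which matters when the profiles are later used to develop scoring models.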

The table below shows a small subset of features relevant to a user’s recharge pattern.

Most of the features on the user profile have a temporal dimension; e.g. rvl ma, recharge value one month ago. Retrieving aggregate features in bulk mode (i.e. for many users) is faster when data are partitioned (i.e. indexed according to time) as described in Figure 1.

The method of user profiling includes handling configuration that is specific to each source of data, such as an MNO. The configuration items include the time-zone, the list of features that can be retrieved, and the data store and tables they are to be retrieved from. For optimized execution several features can be retrieved from different tables.

The method then retrieves the features from the persisted data of the data ingestion method and joins the features retrieved from all different data stores and tables. The extracted features are then joined to the reference point in time.

A method of credit scoring and limit setting is an example use of the user profiles generated above to develop credit scoring models, to credit score the subscribers, and assign credit limits.

Credit scoring models predict with high accuracy the probability that a subscriber will default on a loan or a credit service. Credit scoring models can be developed following statistical, Machine Learning and Artificial Intelligence methods and are selected based on their performance. Either one model can be developed for the total base of the MNO, or different models can be developed for different segments of the base.

The credit scoring model can then be used to calculate the credit score of each subscriber and assign an appropriate credit limit. Given the credit score multiple ways can be used to assign a credit limit, with the rationale being that, everything else equal, subscribers with a higher credit score (and therefore more credit worthy) will receive a higher credit limit, while subscribers with a lower score will receive a lower credit limit, and subscribers with the lowest score may not be allowed to borrow.

An example of a method for the assignment of a credit limit given the credit score is the threshold-based method. This method consists of an ordered list of score thresholds and associated credit limits. The score of the subscriber is checked against this list, and the credit limit associated with the highest threshold that is smaller than said score is used as the credit limit of said subscriber.
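The threshold-based method may be sketched as follows using a binary search over the ordered thresholds. The function name, the example threshold values and the choice to return a limit of zero when no threshold lies below the score are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Sequence, Tuple


def assign_limit(score: float, table: Sequence[Tuple[float, float]]) -> float:
    """Return the limit of the highest threshold strictly below the score.

    `table` holds (threshold, credit_limit) pairs sorted by ascending threshold.
    """
    thresholds = [threshold for threshold, _ in table]
    # bisect_left counts the thresholds that are strictly smaller than score.
    index = bisect_left(thresholds, score) - 1
    if index < 0:
        # Assumption: a score below every threshold receives no credit.
        return 0.0
    return table[index][1]


# Hypothetical thresholds and limits, for illustration only.
example_table = [(300, 5.0), (500, 20.0), (700, 50.0)]
limit = assign_limit(650, example_table)
```

In this example a score of 650 lies above the thresholds 300 and 500 but not above 700, so the limit associated with 500 is assigned.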

Referring to Figure 3, a block diagram illustrates an example embodiment of the described system (300) for processing of data records. The system (300) includes at least one processor (301) or multiple processors running in parallel and memory (302) for storing computer-readable program code in the form of computer instructions (303).

The system (300) includes a data ingesting system (310), an entity profile component (330) with outputs to applications such as a model development component (340) and a credit score and limit assignment component (350).

The data ingesting system (310) may include a data receiving component (311) for receiving and processing raw data records including a first type of data records having static values and a second type of data records relating to transactional records with timestamps of events. The raw data records may be large-scale raw data records, in that they may be received in large volumes (e.g. hundreds of millions to billions of records, or more). In an example embodiment, the raw data records relate to telecommunications data. The raw data records may have no single or well-defined format: different data records may be formatted differently, so that similar types or categories of data, or similar features contained in the different data records, are recorded differently and direct extraction of a type or feature of data is not practical.

The data ingesting system (310) may include a configuration component (312) for configuring the data record fields to standard formats and an extracting component (313) for extracting specific values from the raw data in usable format.

The data ingesting system (310) may include a filtering component (315) for filtering relevant data records by applying filtering rules to extract relevant records from transactional records and a transforming component (314) for transforming data records to standardized formats.

The data ingesting system (310) may include a data store persisting component (316) for persisting the transformed data records into two different data stores (320, 321) that support a large amount of records. A first data store (320) is for static data relating to entities for retrieval based on static values including a unique identifier of an entity, and a second data store (321) is for transactional data for retrieval based on timestamps of events.

The first data store may be configured for efficient retrieval based on static values (such as unique identifier). For example, the first data store may be configured to store information using the unique identifier of the entity as a key so as to enable efficient retrieval of the entities’ static data.

The second data store may be configured for efficient retrieval based on timestamps of events. For example, in one implementation, the second data store may be configured to store information using the timestamp of the transactions as a key. Storing information with only the timestamp as the key may enable efficient retrieval of transactional data for a batch of entities, or all subscribers. For example, in another implementation, the second data store may be configured to store information using the timestamp of the transactions and the unique identifier of the entity as a key (e.g. a compound key). This may enable efficient retrieval of transactional data per entity. In either case the retrieval of transactional data per entity, or entities, at different points in time and as new transactional data are continuously added, and the subsequent creation of features based on this transactional data, may be very efficient.

The data stores may be implemented as databases (e.g. relational databases) and the keys may provide the mechanism for application software to identify, access and update information in a database table.
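As one possible illustration of the second data store in a relational database, the sketch below uses the SQLite module of the Python standard library with a compound index on the event timestamp and the entity identifier. The table and column names, the use of an index rather than a primary key, and the storage of timestamps as ISO 8601 text are assumptions of this sketch.

```python
import sqlite3
from typing import List, Optional, Tuple


def open_store() -> sqlite3.Connection:
    """Create an in-memory transactional store keyed on (timestamp, entity)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " event_ts TEXT NOT NULL,"     # ISO 8601 text sorts chronologically
        " entity_id TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " amount INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX idx_ts_entity ON transactions (event_ts, entity_id)"
    )
    return conn


def events_between(conn: sqlite3.Connection, start: str, end: str,
                   entity_id: Optional[str] = None) -> List[Tuple]:
    """Return events with start <= event_ts < end, optionally for one entity."""
    sql = ("SELECT event_ts, entity_id, event_type, amount FROM transactions"
           " WHERE event_ts >= ? AND event_ts < ?")
    params = [start, end]
    if entity_id is not None:
        sql += " AND entity_id = ?"
        params.append(entity_id)
    sql += " ORDER BY event_ts, entity_id"
    return conn.execute(sql, params).fetchall()
```

Omitting `entity_id` corresponds to retrieval for a batch of entities over a time range, while supplying it corresponds to retrieval per entity.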

The second data store (321) may be partitioned for time periods and/or by type of event.

The data store configuration may be an architecture-level arrangement that operates irrespective of the actual data being processed (it requires only that the data includes two different types or characteristics, being static and transactional). In other words, the specific data values being processed do not matter. By providing two different data stores that are configured differently for efficient retrieval of different types of data, a system with improved data processing efficiency regardless of the specific types of data may be provided. Aspects of the present disclosure may thus find application in general data processing involving the processing of static data and transactional data.

The data ingesting system (310) may include a grouping component (318) for grouping different categories of transactional records according to the unique identifiers of an entity and a table persisting component (319) may be provided for persisting the categories to different tables including maintaining a table (322) of stable size for efficient access of a category of transactional records that is most frequently accessed. For example, one category may be ongoing, open, transactional records and another category may be closed transaction records. The closed transactional records will keep growing in size, whereas the open transactional records will be generally stable in size as some transactions close and some new ones open.

The number of transactions included in the transactional data may be very large and can grow at a rate of hundreds of millions of transactions per day. Maintaining tables of relatively stable size, which do not grow at this rate, for transactions that are retrieved separately and most frequently allows more efficient access to those transactions. For example, in one embodiment, where network airtime usage advances and repayments are ingested from telecommunication data, extraction of transactional data and subsequent creation of features requires very efficient retrieval of non-repaid advances. Maintaining separate tables for paid and non-paid advances allows for very quick retrieval of non-paid advances, as their number grows at a very slow rate, in effect maintaining a relatively stable table size for non-paid network airtime usage advances.

The data ingesting system (310) may include a unique identifier component (317) for generating a unique identifier for entities of the data records, if this is not extracted by the extracting component (313). For example, entities may be telecommunication subscribers. The data ingesting system (310) may include a status update component (360) for updating status results further to processing data records.

The entity profile component (330) may retrieve features relating to an entity from multiple tables of the data stores (320, 321). The entity profile component (330) may generate user profiles when the entity is a user. The entity profile component (330) for retrieving features relating to an entity includes feature generation with dimensions of behaviour of the entity at a given time and, in one embodiment, includes a feature retrieving component (331) for retrieving features from different tables, a feature joining component (332) for joining the features, and a feature transforming component (333) for transforming the features to a reference point.

The entity profile component (330) may provide an output from an output component (334) relating to the entity. The output relating to an entity may be a user profile for subsequent credit risk analysis by use of the model development component (340) and the credit score and limit assignment component (350).

Figure 4 illustrates an example of a computing device (400) in which various aspects of the disclosure may be implemented. The computing device (400) may be embodied as any form of data processing device including a personal computing device (e.g. laptop or desktop computer), a server computer (which may be self-contained or physically distributed over a number of locations), a client computer, or a communication device, such as a mobile phone (e.g. cellular telephone), satellite phone, tablet computer, personal digital assistant or the like. Different embodiments of the computing device may dictate the inclusion or exclusion of various components or subsystems described below.

The computing device (400) may be suitable for storing and executing computer program code. The various participants and elements in the previously described system diagrams may use any suitable number of subsystems or components of the computing device (400) to facilitate the functions described herein. The computing device (400) may include subsystems or components interconnected via a communication infrastructure (405) (for example, a communications bus, a network, etc.). The computing device (400) may include one or more processors (410) and at least one memory component in the form of computer-readable media. The one or more processors (410) may include one or more of: CPUs, graphical processing units (GPUs), microprocessors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs) and the like. In some configurations, a number of processors may be provided and may be arranged to carry out calculations simultaneously. In some implementations various subsystems or components of the computing device (400) may be distributed over a number of physical locations (e.g. in a distributed, cluster or cloud-based computing configuration) and appropriate software units may be arranged to manage and/or process data on behalf of remote devices.

The memory components may include system memory (415), which may include read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS) may be stored in ROM. System software may be stored in the system memory (415) including operating system software. The memory components may also include secondary memory (420). The secondary memory (420) may include a fixed disk (421), such as a hard disk drive, and, optionally, one or more storage interfaces (422) for interfacing with storage components (423), such as removable storage components (e.g. magnetic tape, optical disk, flash memory drive, external hard drive, removable memory chip, etc.), network attached storage components (e.g. NAS drives), remote storage components (e.g. cloud-based storage) or the like.

The computing device (400) may include an external communications interface (430) for operation of the computing device (400) in a networked environment enabling transfer of data between multiple computing devices (400) and/or the Internet. Data transferred via the external communications interface (430) may be in the form of signals, which may be electronic, electromagnetic, optical, radio, or other types of signal. The external communications interface (430) may enable communication of data between the computing device (400) and other computing devices including servers and external storage facilities. Web services may be accessible by and/or from the computing device (400) via the communications interface (430).

The external communications interface (430) may be configured for connection to wireless communication channels (e.g., a cellular telephone network, wireless local area network (e.g. using Wi-Fi™), satellite-phone network, Satellite Internet Network, etc.) and may include an associated wireless transfer element, such as an antenna and associated circuitry. The computer-readable media in the form of the various memory components may provide storage of computer-executable instructions, data structures, program modules, software units and other data. A computer program product may be provided by a computer-readable medium having stored computer-readable program code executable by the central processor (410). A computer program product may be provided by a non-transient computer-readable medium, or may be provided via a signal or other transient means via the communications interface (430).

Interconnection via the communication infrastructure (405) allows the one or more processors (410) to communicate with each subsystem or component and to control the execution of instructions from the memory components, as well as the exchange of information between subsystems or components. Peripherals (such as printers, scanners, cameras, or the like) and input/output (I/O) devices (such as a mouse, touchpad, keyboard, microphone, touch-sensitive display, input buttons, speakers and the like) may couple to or be integrally formed with the computing device (400) either directly or via an I/O controller (435). One or more displays (445) (which may be touch-sensitive displays) may be coupled to or integrally formed with the computing device (400) via a display (445) or video adapter (440).

The foregoing description has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.

Any of the steps, operations, components or processes described herein may be performed or implemented with one or more hardware or software units, alone or in combination with other devices. In one embodiment, a software unit is implemented with a computer program product comprising a non-transient computer-readable medium containing computer program code, which can be executed by a processor for performing any or all of the steps, operations, or processes described. Software units or functions described in this application may be implemented as computer program code using any suitable computer language such as, for example, Java™, C++, or Perl™ using, for example, conventional or object-oriented techniques. The computer program code may be stored as a series of instructions, or commands on a non-transitory computer-readable medium, such as a random access memory (RAM), a read-only memory (ROM), a magnetic medium such as a hard-drive, or an optical medium such as a CD-ROM. Any such computer-readable medium may also reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network. Flowchart illustrations and block diagrams of methods, systems, and computer program products according to embodiments are used herein. Each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may provide functions which may be implemented by computer readable program instructions. In some alternative implementations, the functions identified by the blocks may take place in a different order to that shown in the flowchart illustrations.

Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. The described operations may be embodied in software, firmware, hardware, or any combinations thereof.

The language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Finally, throughout the specification and claims, unless the context requires otherwise, the word ‘comprise’ or variations such as ‘comprises’ or ‘comprising’ will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.