US20110153520A1 | 2011-06-23
EP2786335A1 | 2014-10-08
Claim 1: An investment analysis system comprising:
■ Information Extractor (IE) - extracts data from the web using web scraping, crawling, and harvesting techniques.
■ NoSQL Database - a NoSQL database where the extracted information is stored in JSON format.
■ Pre-processing component - removes noise such as stop words and punctuation from the data.
■ Knowledge Representation & Reasoning component - converts textual data into semantic data using concepts such as bag-of-words, TF-IDF, and n-grams.
■ Distributed Database - a mix of SQL and NoSQL databases where the semantic data is stored. It is deployed on a Hadoop cluster to improve read/write performance.
■ Topic Modeller - the heart of the framework, responsible for detecting events and topics in e-newspapers and articles. The algorithms used range from Latent Semantic Analysis to Latent Dirichlet Allocation.
■ Named Entity Recognition - an entity extraction and chunking method that discovers entities such as names, places, and organisations in e-newspapers and articles.
■ Company Portfolio - a mini database that stores company portfolio information such as PnL, FTEs, etc.
■ Correlation and Causation Analytics - investigates the impact of detected events and entities on a company's portfolio.
■ Visualisation - a real-time visualisation component that uses interactive graphs to demonstrate the effects of trending events on stock markets.
1. Information Extraction
News articles from the previous five years (or more) are extracted from the news websites http://economictimes.indiatimes.com and http://www.businessinsider.in/ with a scraper built in Python. This is done using Beautiful Soup, a Python library for pulling data out of HTML and XML files. The articles are extracted along with their primary attributes such as date, title, and the domain (e.g. market, news, politics, industry) that each article belongs to. The result is dumped into a NoSQL database, MongoDB.
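The extraction logic can be sketched as follows. This is a minimal illustration using Python's standard-library html.parser rather than Beautiful Soup, and the sample markup is invented; the real sites' HTML structure is not reproduced here.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collects the title and body text of a news article page.

    The tags handled below are illustrative assumptions, not the
    actual markup of the scraped news sites.
    """
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

# Invented sample page standing in for a scraped article.
sample = ("<html><head><title>Markets rally</title></head>"
          "<body><p>Stocks rose sharply today.</p></body></html>")
parser = ArticleParser()
parser.feed(sample)
article = {"title": parser.title,
           "body": " ".join(parser.text),
           "domain": "market"}
```

Beautiful Soup replaces the hand-written parser class with calls like `soup.find("title")`, but the output is the same kind of attribute dictionary shown here.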
2. Storage in NoSQL DB
The extracted news articles are dumped into a MongoDB database, where the data is stored in JSON format.
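The shape of a stored article document might look as follows; the field names are illustrative assumptions. With the pymongo driver the dictionary would be written via `collection.insert_one(doc)`; here only the JSON round trip is shown so the sketch stays self-contained.

```python
import json
from datetime import date

# Illustrative article document; field names are assumptions.
doc = {
    "title": "Markets rally on earnings news",
    "date": date(2015, 6, 1).isoformat(),
    "domain": "market",
    "body": "Stocks rose sharply today...",
}

# MongoDB stores documents in a JSON-like (BSON) format; a plain
# JSON round trip confirms the document is serialisable as-is.
payload = json.dumps(doc)
restored = json.loads(payload)
```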
3. Pre-Processing
Junk data in the articles that does not help in training the model has to be removed. This is done by taking all the words into account and eliminating those that are either too frequent (e.g. articles, pronouns) or too rare. Other factors taken into account are word length, part of speech, etc. At the end of this process, we have a list of only those words that will be helpful in training the model.
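A minimal sketch of this frequency-based filtering in plain Python; the threshold values are assumptions, since the text only says "too frequent" and "too rare" words are dropped.

```python
from collections import Counter

def build_vocabulary(docs, min_count=2, max_doc_frac=0.9, min_len=3):
    """Keep words that are neither too rare nor too frequent.

    Thresholds are illustrative assumptions: a word must occur at
    least `min_count` times overall, appear in at most `max_doc_frac`
    of the documents, and be at least `min_len` characters long.
    """
    doc_freq = Counter()   # in how many documents each word appears
    total = Counter()      # total occurrences of each word
    for doc in docs:
        words = [w.strip(".,!?").lower() for w in doc.split()]
        total.update(words)
        doc_freq.update(set(words))
    n_docs = len(docs)
    return {
        w for w, c in total.items()
        if c >= min_count
        and doc_freq[w] / n_docs <= max_doc_frac
        and len(w) >= min_len
    }

# Invented toy corpus.
docs = [
    "the market rallied as the bank posted profits",
    "the bank cut rates and the market dipped",
    "the election results moved the market",
]
vocab = build_vocabulary(docs)
```

Here "the" is dropped for appearing in every document, and "rates" is dropped for occurring only once.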
4. Knowledge Representation and Reasoning
The articles, along with the list of words, are fed into the knowledge representation and reasoning algorithm. The representation is modelled as bag-of-words, term frequency-inverse document frequency (TF-IDF), or n-grams, depending on the complexity of an article. All the articles are represented in this manner.
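As a worked sketch, bag-of-words and TF-IDF can be computed in plain Python as below. The smoothing-free convention idf = log(N / df) is one common textbook choice, not necessarily the one used by the described system.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Bag-of-words representation: word -> raw count."""
    return Counter(doc.split())

def tf_idf(docs):
    """TF-IDF vectors with idf = log(N / df), a common convention."""
    n = len(docs)
    bows = [bag_of_words(d) for d in docs]
    df = Counter()                 # document frequency of each word
    for bow in bows:
        df.update(bow.keys())
    vectors = []
    for bow in bows:
        total = sum(bow.values())  # words in this document
        vectors.append({
            w: (c / total) * math.log(n / df[w])
            for w, c in bow.items()
        })
    return vectors

# Invented toy corpus.
docs = ["bank rates rise", "bank profits rise", "election results announced"]
vecs = tf_idf(docs)
```

Note how "rates" (appearing in one document) outweighs "bank" (appearing in two) within the first vector, which is exactly the discriminative-term effect TF-IDF is used for.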
5. Distributed Database
The articles, once represented in a meaningful semantic format, are stored in a distributed Hadoop database accessed through the Spark infrastructure. The Spark + Hadoop combination is used to improve database performance, namely read/write throughput and fault tolerance.
6. Topic Modeller
A topic modeller is trained on the articles from the distributed database to extract topics / events from them. This is done using Latent Semantic Analysis and Latent Dirichlet Allocation, implemented with the Gensim API in Python.
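A minimal sketch of the Latent Semantic Analysis idea: factor a term-document count matrix with a truncated SVD so that each document gets coordinates in a low-dimensional "topic" space. This uses NumPy directly rather than the Gensim API named above, and the tiny corpus is invented for illustration.

```python
import numpy as np

# Toy term-document count matrix (terms x documents); documents 0-1
# are finance-themed and documents 2-3 election-themed (invented data).
terms = ["bank", "rates", "profit", "election", "vote", "poll"]
X = np.array([
    [2, 1, 0, 0],  # bank
    [1, 2, 0, 0],  # rates
    [1, 1, 0, 0],  # profit
    [0, 0, 2, 1],  # election
    [0, 0, 1, 2],  # vote
    [0, 0, 0, 1],  # poll
], dtype=float)

# LSA: truncated SVD keeps the k strongest latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
# Document coordinates in the k-dimensional topic space.
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T
```

Documents sharing vocabulary end up close together in topic space, which is what lets the modeller group articles about the same event.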
7. Named Entity Recognition
Names and entities, namely proper nouns, are extracted to help with event detection. This is done using Stanford's Natural Language Processing library. It increases the accuracy of event detection in case the topic modeller misses an event.
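The described system relies on Stanford's NLP library; as a self-contained stand-in, a crude capitalisation-based chunker shows the underlying idea of grouping adjacent proper nouns into entity candidates. A real NER model is far more robust than this heuristic.

```python
def chunk_entities(sentence):
    """Group runs of capitalised words into candidate named entities.

    A toy heuristic standing in for a real NER tagger: it wrongly
    picks up sentence-initial words and splits entities containing
    lowercase words (e.g. "Reserve Bank of India").
    """
    entities, current = [], []
    for word in sentence.split():
        token = word.strip(".,;:!?")
        if token[:1].isupper():
            current.append(token)       # extend the current chunk
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

ents = chunk_entities(
    "Shares of Tata Motors fell after Reserve Bank of India raised rates."
)
```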
8. Company Portfolio
This is all the information about the companies whose trends and indicators the model analyses. This data is obtained from Reuters.
9. Correlation and Causation Analytics
The events detected from the articles and the company portfolio data (and parameters) are fed into a machine learning (ML) algorithm (information-gain-driven Random Forests) that correlates the events to the various trends and indicators of a company portfolio. The Python ML libraries NumPy, SciPy, pandas, and scikit-learn are used to learn this model. Later, the model is also used to predict the causation between current events and the current parameters of the companies: it assigns a probability to each real-time event and selects the event with the highest probability as responsible for the current change in the trend.
10. Visualization
A live graph of a parameter from the company portfolio is displayed, and the possible events behind the trends and indicators in the graph are plotted along with it. The Python libraries Matplotlib and Plotly are used for this.
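The information-gain criterion that drives split selection in the Correlation and Causation Analytics step can be sketched in plain Python; the event/trend data below is invented for illustration, and the real system uses the scikit-learn ensemble implementation rather than this hand-rolled version.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(events, trends):
    """Reduction in trend entropy from splitting on the event feature."""
    base = entropy(trends)
    n = len(trends)
    split = 0.0
    for value in set(events):
        subset = [t for e, t in zip(events, trends) if e == value]
        split += (len(subset) / n) * entropy(subset)
    return base - split

# Invented toy data: did a "rate_hike" event occur, and did the
# portfolio indicator move "down" or "up"?
events = ["rate_hike", "rate_hike", "none", "none", "rate_hike", "none"]
trends = ["down", "down", "up", "up", "down", "up"]
gain = information_gain(events, trends)
```

Here the event feature perfectly predicts the trend, so the gain equals the full entropy of the trend labels; a Random Forest repeatedly picks the highest-gain split when growing each tree.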