US20110153520A1 | 2011-06-23
EP2786335A1 | 2014-10-08
Claim 1: An investment analysis system comprising:
■ Information Extractor (IE) - extracts data from the web using web scraping, crawling, and harvesting techniques.
■ NoSQL Database - a NoSQL database where the extracted information is stored in JSON format.
■ Pre-processing component - removes noise such as stop words and punctuation from the data.
■ Knowledge Representation & Reasoning component - converts textual data into semantic data using concepts such as bag-of-words, TF-IDF, and n-grams.
■ Distributed Database - a mix of SQL and NoSQL databases where the semantic data is stored. It is deployed on a Hadoop cluster to improve read/write performance.
■ Topic Modeller - the heart of the framework, responsible for detecting events and topics in e-newspapers and articles. The algorithms used range from Latent Semantic Analysis to Latent Dirichlet Allocation.
■ Named Entity Recognition - an entity extraction and chunking method that discovers entities such as names, places, and organisations in e-newspapers and articles.
■ Company Portfolio - a mini database that stores company portfolio information such as PnL, FTEs, etc.
■ Correlation and Causation Analytics - investigates the impact of detected events and entities on a company's portfolio.
■ Visualisation - a real-time visualisation component that uses interactive graphs to demonstrate the effects of trending events on stock markets.
1. Information Extraction
News articles from the previous five years (or more) are extracted from the news websites http://economictimes.indiatimes.com and http://www.businessinsider.in/ with a scraper built in Python. This is done using Beautiful Soup, a Python library for pulling data out of HTML and XML files. The articles are extracted along with their primary attributes such as date, title, and the domain (e.g. market, news, politics, industry) that each article belongs to. The result is dumped into a NoSQL database, MongoDB.
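The extraction logic can be sketched as follows. This is a minimal illustration using Python's standard-library html.parser rather than Beautiful Soup, and the sample markup is invented; the real sites' HTML structure is not reproduced here.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collects the title and body text of a news article page.

    The tags handled below are illustrative assumptions, not the
    actual markup of the scraped news sites.
    """
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

# Invented sample page standing in for a scraped article.
sample = ("<html><head><title>Markets rally</title></head>"
          "<body><p>Stocks rose sharply today.</p></body></html>")
parser = ArticleParser()
parser.feed(sample)
article = {"title": parser.title,
           "body": " ".join(parser.text),
           "domain": "market"}
```

Beautiful Soup replaces the hand-written parser class with calls like `soup.find("title")`, but the output is the same kind of attribute dictionary shown here.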
2. Storage in NoSQL DB
The extracted news articles are dumped into a MongoDB database, where the data is stored in JSON format.
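The shape of a stored article document might look as follows; the field names are illustrative assumptions. With the pymongo driver the dictionary would be written via `collection.insert_one(doc)`; here only the JSON round trip is shown so the sketch stays self-contained.

```python
import json
from datetime import date

# Illustrative article document; field names are assumptions.
doc = {
    "title": "Markets rally on earnings news",
    "date": date(2015, 6, 1).isoformat(),
    "domain": "market",
    "body": "Stocks rose sharply today...",
}

# MongoDB stores documents in a JSON-like (BSON) format; a plain
# JSON round trip confirms the document is serialisable as-is.
payload = json.dumps(doc)
restored = json.loads(payload)
```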
3. Pre-Processing
Junk data in the articles that does not help in training the model has to be removed. This is done by taking all the words into account and eliminating those that are either too frequent (e.g. articles, pronouns) or too rare. Other factors taken into account are word length, part of speech, etc. At the end of this process, we have a list of only those words that will be helpful in training the model.
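A minimal sketch of this frequency-based filtering in plain Python; the threshold values are assumptions, since the text only says "too frequent" and "too rare" words are dropped.

```python
from collections import Counter

def build_vocabulary(docs, min_count=2, max_doc_frac=0.9, min_len=3):
    """Keep words that are neither too rare nor too frequent.

    Thresholds are illustrative assumptions: a word must occur at
    least `min_count` times overall, appear in at most `max_doc_frac`
    of the documents, and be at least `min_len` characters long.
    """
    doc_freq = Counter()   # in how many documents each word appears
    total = Counter()      # total occurrences of each word
    for doc in docs:
        words = [w.strip(".,!?").lower() for w in doc.split()]
        total.update(words)
        doc_freq.update(set(words))
    n_docs = len(docs)
    return {
        w for w, c in total.items()
        if c >= min_count
        and doc_freq[w] / n_docs <= max_doc_frac
        and len(w) >= min_len
    }

# Invented toy corpus.
docs = [
    "the market rallied as the bank posted profits",
    "the bank cut rates and the market dipped",
    "the election results moved the market",
]
vocab = build_vocabulary(docs)
```

Here "the" is dropped for appearing in every document, and "rates" is dropped for occurring only once.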
4. Knowledge Representation and Reasoning
The articles, along with the list of words, are fed into the knowledge representation and reasoning algorithm. The representation is modelled as bag-of-words, term frequency-inverse document frequency (TF-IDF), or n-grams, depending on the complexity of an article. All the articles are represented in this manner.
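As a worked sketch, bag-of-words and TF-IDF can be computed in plain Python as below. The smoothing-free convention idf = log(N / df) is one common textbook choice, not necessarily the one used by the described system.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Bag-of-words representation: word -> raw count."""
    return Counter(doc.split())

def tf_idf(docs):
    """TF-IDF vectors with idf = log(N / df), a common convention."""
    n = len(docs)
    bows = [bag_of_words(d) for d in docs]
    df = Counter()                 # document frequency of each word
    for bow in bows:
        df.update(bow.keys())
    vectors = []
    for bow in bows:
        total = sum(bow.values())  # words in this document
        vectors.append({
            w: (c / total) * math.log(n / df[w])
            for w, c in bow.items()
        })
    return vectors

# Invented toy corpus.
docs = ["bank rates rise", "bank profits rise", "election results announced"]
vecs = tf_idf(docs)
```

Note how "rates" (appearing in one document) outweighs "bank" (appearing in two) within the first vector, which is exactly the discriminative-term effect TF-IDF is used for.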
5. Distributed Database
The articles, once represented in a meaningful semantic format, are stored in a distributed Hadoop database accessed through the Spark infrastructure. The Spark + Hadoop combination is used to improve database performance, namely read/write throughput and fault tolerance.
6. Topic Modeller
A topic modeller is trained on the articles from the distributed database to extract topics / events from them. This is done using Latent Semantic Analysis and Latent Dirichlet Allocation, implemented with the Gensim API in Python.
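A minimal sketch of the Latent Semantic Analysis idea: factor a term-document count matrix with a truncated SVD so that each document gets coordinates in a low-dimensional "topic" space. This uses NumPy directly rather than the Gensim API named above, and the tiny corpus is invented for illustration.

```python
import numpy as np

# Toy term-document count matrix (terms x documents); documents 0-1
# are finance-themed and documents 2-3 election-themed (invented data).
terms = ["bank", "rates", "profit", "election", "vote", "poll"]
X = np.array([
    [2, 1, 0, 0],  # bank
    [1, 2, 0, 0],  # rates
    [1, 1, 0, 0],  # profit
    [0, 0, 2, 1],  # election
    [0, 0, 1, 2],  # vote
    [0, 0, 0, 1],  # poll
], dtype=float)

# LSA: truncated SVD keeps the k strongest latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
# Document coordinates in the k-dimensional topic space.
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T
```

Documents sharing vocabulary end up close together in topic space, which is what lets the modeller group articles about the same event.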
7. Named Entity Recognition
Names and entities, namely proper nouns, are extracted to help with event detection. This is done using Stanford's Natural Language Processing library. It increases the accuracy of event detection in case the topic modeller misses an event.
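The described system relies on Stanford's NLP library; as a self-contained stand-in, a crude capitalisation-based chunker shows the underlying idea of grouping adjacent proper nouns into entity candidates. A real NER model is far more robust than this heuristic.

```python
def chunk_entities(sentence):
    """Group runs of capitalised words into candidate named entities.

    A toy heuristic standing in for a real NER tagger: it wrongly
    picks up sentence-initial words and splits entities containing
    lowercase words (e.g. "Reserve Bank of India").
    """
    entities, current = [], []
    for word in sentence.split():
        token = word.strip(".,;:!?")
        if token[:1].isupper():
            current.append(token)       # extend the current chunk
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

ents = chunk_entities(
    "Shares of Tata Motors fell after Reserve Bank of India raised rates."
)
```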
8. Company Portfolio
This is all the information about the companies whose trends and indicators the model analyses. This data is obtained from Reuters.
9. Correlation and Causation Analytics
The events detected from the articles and the company portfolio data (and parameters) are fed into a machine learning (ML) algorithm (information-gain-driven Random Forests) that correlates the events to the various trends and indicators of a company portfolio. The Python ML libraries NumPy, SciPy, pandas, and scikit-learn are used to learn this model. Later, the model is also used to predict the causation between current events and the current parameters of the companies: it assigns a probability to each real-time event and selects the event with the highest probability as responsible for the current change in the trend.
10. Visualization
A live graph of a parameter from the company portfolio is displayed, and the possible events behind the trends and indicators in the graph are plotted along with it. The Python libraries Matplotlib and Plotly are used for this.
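The information-gain criterion that drives split selection in the Correlation and Causation Analytics step can be sketched in plain Python; the event/trend data below is invented for illustration, and the real system uses the scikit-learn ensemble implementation rather than this hand-rolled version.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(events, trends):
    """Reduction in trend entropy from splitting on the event feature."""
    base = entropy(trends)
    n = len(trends)
    split = 0.0
    for value in set(events):
        subset = [t for e, t in zip(events, trends) if e == value]
        split += (len(subset) / n) * entropy(subset)
    return base - split

# Invented toy data: did a "rate_hike" event occur, and did the
# portfolio indicator move "down" or "up"?
events = ["rate_hike", "rate_hike", "none", "none", "rate_hike", "none"]
trends = ["down", "down", "up", "up", "down", "up"]
gain = information_gain(events, trends)
```

Here the event feature perfectly predicts the trend, so the gain equals the full entropy of the trend labels; a Random Forest repeatedly picks the highest-gain split when growing each tree.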