Towards Stock Market Data Mining Using Enriched Random Forests from Textual Resources and Technical Indicators

The present paper deals with a special Random Forest Data Mining technique, designed to alleviate the significant issue of high dimensionality in volatile and complex domains, such as stock market prediction. Since it is widely accepted that media affect the behavior of investors, information from both technical analysis and textual data from various on-line financial news resources is considered. Different experiments are carried out to evaluate different aspects of the problem, returning satisfactory results. The results show that the trading strategies guided by the proposed data mining approach generate higher profits than the buy-and-hold strategy, as well as those guided by the level-estimation based forecasts of standard linear regression models and other machine learning classifiers such as Support Vector Machines, ordinary Random Forests and Neural Networks.


Introduction
Stock market prediction has always attracted considerable attention from researchers. Whether any method can accurately predict stock market movement remains controversial, mainly because market dynamics are complex and volatile. Stock market research encompasses two main philosophies, the fundamental and the technical approach [1]. The former holds that stock price movement derives from a security's relative data. Fundamentalists believe that numeric information such as earnings, ratios, and management effectiveness can determine future forecasts. In technical analysis, market timing is believed to be the key. Technicians utilize charts and modeling techniques to identify trends in price and volume, relying on historical data in order to predict future outcomes. However, according to several researchers, the goal is not to question the predictability of financial time series but to discover a good model that is capable of describing the dynamics of the stock market.
There is a plethora of proposed methods for stock market prediction. The majority of them are strongly tied to structured, numerical databases and domain expertise rules. In the field of trading, most decision support tools focus on statistical analysis of past price records. Nevertheless, in recent studies, prediction is also based on textual data, under the rational assumption that the course of a stock price can be influenced by news articles, ranging from company releases and local politics to news about the economies of the superpowers [2].
However, unrestricted access to news information was not possible until the early 1990s. Nowadays, news is easily accessible, access to important data such as inside company information is relatively cheap, and estimations emerge from a vast pool of economists, statisticians, journalists, etc., through the World Wide Web. Despite the large amount of data, advances in Natural Language Processing and Knowledge Discovery from Data (also known as Data Mining) allow for effective computerized representation of unstructured document collections, analysis for pattern extraction, and discovery of relationships between document terms and time-stamped data streams of stock market quotes.
Nevertheless, when data grow in both the number of records and the number of features, numerous mining algorithms face significant complications, resulting in poor prediction ability. The aim of this study is to propose a potential solution to this problem, by considering the well-known Random Forests algorithm [3] and altering its construction phase with a Markov Blanket approach that discards irrelevant features, thus improving classification results. The importance of this study lies in the fact that technical analysis captures the event and not the cause of a change, while textual data may explain that cause. Moreover, since it is tedious for a human investor to read all the daily news concerning a company along with other financial information, a prediction system that can analyze such textual resources and find relationships with price movement at future time windows is beneficial.
The paper is structured as follows: Section 2 provides an overview of the literature concerning stock market prediction using Data Mining techniques. Section 3 describes the proposed Markov Blanket Random Forest approach. Section 4 provides an overview of our experimental design and discusses the evaluation outcome.

Previous Work
Given the numerous studies on traditional technical analysis, we shall focus on research that studies the influence of news articles on stock markets. Chang et al. [4] were among the first to confirm the reaction of the market to news articles. They showed that economic news always has a positive or negative effect on the number of traded stocks, using salient political and economic news as a proxy for public information. Klibanoff et al. [5] dealt with closed-end country fund prices and country-specific salient news, and stated that there is a positive relationship between trading volume and news. Similarly, Chan and Wei [6] found that news placed on the front page of the South China Post increases the return volatility in the Hong Kong stock market. Mitchell and Mulherin [7] used the daily number of Dow Jones headlines as a measure of public information and reported a positive impact of news on absolute price changes. Mittermayer [8] proposed a prediction system called NEWSCATS, which provides an estimate of the price after the publication of press releases. Schumaker and Chen [9] examined three different textual representation formalisms and studied their ability to predict discrete stock prices 20 minutes after an article release.

Markov Blanket Random Forests
A problem arises when the number of possible features is vast and the percentage of actually informative features is small: the performance of the base classifiers degrades. This phenomenon is particularly present in financial data sets, where most attributes represent technical indicators whose correlation to the true course of a stock is uncertain or unknown. Technically, in the case of a Random Forest classifier, the problem arises because, if simple random sampling is used for selecting the subset of m eligible features at each node, almost all of these subsets are likely to contain a predominance of non-informative features.
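To make the sampling argument concrete, the following minimal sketch computes the probability that a simple random sample of m features contains no informative feature at all, using the hypergeometric distribution. The function names and the counts in the usage line (620 features, 20 of them informative, m = 25) are illustrative assumptions, not figures taken from the experiments.

```python
from math import comb


def prob_no_informative(total_features: int, informative: int, m: int) -> float:
    """Probability that m features drawn without replacement contain
    none of the informative ones (hypergeometric distribution)."""
    if not 0 <= informative <= total_features:
        raise ValueError("informative must lie in [0, total_features]")
    if not 0 <= m <= total_features:
        raise ValueError("m must lie in [0, total_features]")
    return comb(total_features - informative, m) / comb(total_features, m)


def expected_informative(total_features: int, informative: int, m: int) -> float:
    """Expected number of informative features among the m sampled ones."""
    return m * informative / total_features


# Illustrative (assumed) counts: 620 features, 20 informative, m = 25.
p_empty = prob_no_informative(620, 20, 25)
avg_hits = expected_informative(620, 20, 25)
```

Under such assumed counts, a large share of the node-level subsets carries no informative feature, which is the degradation the paragraph above describes.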
The solution proposed in this paper is based on a feature selection and reasoning concept, namely the Markov Blanket of the class attribute. The identification of relevant variables is an essential component of the construction of decision support models and of computer-assisted discovery. In financial decision systems such as the task at hand, for example, elimination of redundant features could increase computational performance significantly. The problem of variable selection in financial domains is more pressing than ever, due to the recent emergence of many news portals, on-line financial services, etc. Similar cases are also common in biomedical engineering, computational biology, text categorization, information retrieval, mining of electronic medical records, consumer profile analysis, temporal modelling, and other domains [10]. Several researchers [11] have suggested, intuitively, that the Markov Blanket (MB) of the target variable t, denoted as MB(t), is a key concept for solving the variable selection problem. MB(t) is defined as the set of variables conditioned on which all other variables are probabilistically independent of t. Thus, knowledge of the values of the Markov Blanket variables should render all other variables superfluous for classifying t.

Bayesian Networks and Markov Blanket
In order to better capture the significant properties of a Markov Blanket, a brief introduction to Bayesian networks is included. Bayesian networks graphically represent the joint probability distribution of a set of random variables. A Bayesian network is composed of a qualitative portion (its structure) and a quantitative portion (its conditional probabilities). The structure $B_S$ is a directed acyclic graph where the nodes correspond to domain variables $x_1, \ldots, x_n$ and the arcs between nodes represent direct dependencies between the variables. Likewise, the absence of an arc between two nodes $x_i$ and $x_j$ represents that $x_j$ is independent of $x_i$ given its parents in $B_S$. Following the notation of Cooper and Herskovits [12], the set of parents of a node $x_i$ in $B_S$ is denoted $\pi_i$. The structure is annotated with a set of conditional probabilities ($B_P$), containing a term $P(x_i = X_i \mid \pi_i = \Pi_i)$ for each possible value $X_i$ of feature $x_i$ and each possible instantiation $\Pi_i$ of $\pi_i$. A Markov Blanket of a node $x_i$, denoted as $MB(x_i)$, is a minimal attribute set such that every other attribute is independent of $x_i$ given its Markov Blanket. Mathematically, the above statement translates into

$$x_i \perp x_k \mid MB(x_i), \qquad \forall\, x_k \notin MB(x_i) \cup \{x_i\},$$

where $\perp$ denotes the conditional independence of $x_i$ from $x_k$ given $MB(x_i)$. If $B_i$ and $B_j$ are two Bayesian networks that have the same probability distribution, then $MB_{B_i}(x_k) = MB_{B_j}(x_k)$ for any variable $x_k$. Certainly, MBs are not exclusive and may vary in size, but any given BN has a unique $MB(x_i)$ for any $x_i$, which is the set of parents, children and parents of children of $x_i$. In Fig. 1, a Bayesian network is depicted along with the Markov Blanket of a target node $x$, colored in blue. In terms of the dataset, feature $x$ is independent of all other features given its $MB(x) = \{U_i, U_j, Y_k, Y_l, Z_{km}, Z_{ln}\}$.
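When the network structure is known, the Markov Blanket of a node can be read directly off the graph as its parents, its children and the other parents of its children. The sketch below illustrates this on a small hypothetical network whose node names mirror those of Fig. 1; the exact arcs, and the extra nodes V and W, are assumptions made for the example.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the parents, children and children's other parents of `target`.

    `parents` maps every node of a directed acyclic graph to the set of
    its parent nodes.
    """
    if target not in parents:
        raise KeyError(f"unknown node: {target}")
    blanket = set(parents[target])                       # parents of the target
    children = {node for node, ps in parents.items() if target in ps}
    blanket |= children                                  # children of the target
    for child in children:
        blanket |= parents[child]                        # co-parents of each child
    blanket.discard(target)                              # the target is not in its own blanket
    return blanket


# Hypothetical structure: U are parents of x, Y are children, Z are co-parents.
dag = {
    "U_i": set(), "U_j": set(), "Z_km": set(), "Z_ln": set(), "W": set(),
    "x": {"U_i", "U_j"},
    "Y_k": {"x", "Z_km"},
    "Y_l": {"x", "Z_ln"},
    "V": {"Y_k", "W"},   # grandchild of x, outside the blanket
}
mb_x = markov_blanket(dag, "x")
```

In practice the structure is not given and the blanket has to be learned from data, which is a considerably harder task than this graph lookup.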

Random Forests
Random Forests, in general, are a combination of decision tree classifiers such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Given a training set X comprised of N instances, which belong to two classes, and F features, a Random Forest multi-way classifier Θ(x), where x is an input instance, consists of a number of decision trees, each grown using some form of randomization. The leaf nodes of each tree are labelled by estimates of the posterior distribution over the data class labels [13]. Each internal node contains a test that best splits the space of data to be classified. A new, unseen instance is classified by sending it down every tree and aggregating the reached leaf distributions. The process is described in Fig. 2. Each tree is grown as follows:
• If the number of cases in the training set is N, sample N cases at random but with replacement, from the original data. This sample will be the training set for growing the tree.
• If there are F input features, a number m<<F is specified such that at each node, m variables are selected at random out of the F and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
• Each tree is grown to the largest extent possible. Therefore, no pruning procedures are applied.
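The randomization steps listed above can be sketched with a few helper functions. This is a minimal illustration that deliberately omits the split search and the tree growth itself; the function names are hypothetical, and averaging the leaf distributions is only one assumed way of aggregating them.

```python
import random
from typing import Dict, List, Sequence


def bootstrap_indices(n_cases: int, rng: random.Random) -> List[int]:
    """Sample N case indices at random, with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def candidate_features(n_features: int, m: int, rng: random.Random) -> List[int]:
    """Pick the m << F features that are eligible for the split at one node."""
    if not 0 < m <= n_features:
        raise ValueError("m must satisfy 0 < m <= F")
    return rng.sample(range(n_features), m)


def aggregate(leaf_distributions: Sequence[Dict[str, float]]) -> str:
    """Average the class distributions reached in every tree and return
    the most probable class (assumed aggregation rule)."""
    totals: Dict[str, float] = {}
    for dist in leaf_distributions:
        for label, prob in dist.items():
            totals[label] = totals.get(label, 0.0) + prob / len(leaf_distributions)
    return max(totals, key=totals.get)


rng = random.Random(0)
train_idx = bootstrap_indices(100, rng)       # training set of one tree
node_feats = candidate_features(620, 25, rng)  # eligible features at one node
```

Because the bootstrap sample is drawn with replacement, some cases appear several times in `train_idx` while others are left out of that tree entirely.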

Markov Blanket Random Forests Implementation
Based on the existing implementations of Random Forests and taking our initial concerns on feature relevance into consideration, we propose a novel algorithm for classification using RF. The algorithm is entitled "Markov Blanket Random Forests" (MBRF), since the danger of selecting irrelevant and misleading features is remedied by using the Markov Blanket of the class node to provide the best splitting criteria for each tree. By selecting random samples and extracting the MB of the target node from them, the probability of a tree containing more informative features is increased. In the case of high-dimensional datasets, the diversity of the ensemble is not compromised, and the approach is more robust than other pre-filtering or weighting schemes. The algorithm consists of two distinct phases: the former regards the construction of the Markov Blanket and the latter deals with constructing the trees. Its basic procedure can be sketched in the following phases:
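One possible reading of the two phases, based only on the description above, is given in the following skeleton. It is a sketch under stated assumptions and not the authors' listing: the blanket learner `find_blanket` and the tree learner `grow_tree` are hypothetical callables supplied by the caller, estimating the blanket once per bootstrap sample is an assumption, and majority voting is an assumed aggregation rule.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Tree = Callable[[Row], str]
BlanketLearner = Callable[[List[Row], List[str]], List[int]]   # hypothetical
TreeLearner = Callable[[List[Row], List[str], List[int]], Tree]  # hypothetical


def build_mbrf(rows: List[Row], labels: List[str], n_trees: int,
               find_blanket: BlanketLearner, grow_tree: TreeLearner,
               seed: int = 0) -> List[Tuple[List[int], Tree]]:
    """Phase 1: estimate the Markov Blanket of the class on a bootstrap sample.
    Phase 2: grow a tree restricted to the features of that blanket."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]    # bootstrap sample
        sample_x = [rows[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        blanket = find_blanket(sample_x, sample_y)        # phase 1
        tree = grow_tree(sample_x, sample_y, blanket)     # phase 2
        forest.append((blanket, tree))
    return forest


def predict(forest: List[Tuple[List[int], Tree]], row: Row) -> str:
    """Majority vote over all trees (assumed aggregation rule)."""
    votes = [tree(row) for _, tree in forest]
    return max(set(votes), key=votes.count)
```

Keeping both learners as parameters leaves the choice of Markov Blanket discovery algorithm and of the splitting criterion open, since neither is fixed by the description above.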

Experimental Design and Evaluation
As mentioned earlier, articles containing financial news were combined with a plethora of technical indices in order to search for direct influence patterns of the former on the latter. More specifically, we focused on three heterogeneous stock securities from the Greek stock market (Athens Stock Exchange, .ATG): a major Greek bank (Piraeus Bank, .TPEIR), the main telecommunication provider of Greece (OTE, .OTE) and one of the biggest Greek airline companies (Aegean, .AEGN). We incorporated past data from the major European, Asian and American stock markets, as well as data from energy and metal commodities. Finally, for each of the aforementioned three stock securities, a variety of major technical indices was utilized. News was automatically extracted from the electronic versions of the leading Greek financial newspapers, i.e. "Naftemporiki" (www.naftemporiki.gr) and "Capital" (www.capital.gr). The time period for all collected data was from November 2007 to January 2010. The technical indices were calculated using the AnalyzerXL tool. Table 1 tabulates data regarding the three benchmark stocks and their corresponding collected articles, while Table 2 contains historical data of other main markets and commodities. Finally, Fig. 3 depicts a categorized list of the technical indices that were also taken into consideration.
Stock quotes were gathered on a per-day basis and articles were aligned according to their release date. In case an article was published on a Friday evening (after the closing of the Athens stock market) or during the weekend, it was considered as published on the following Monday. The textual analysis phase consisted of three activities: (a) removal of stop words (e.g. articles, special characters, etc.), (b) lemmatization of words using a Levenshtein distance based Greek lemmatizer [14], (c) removal of terms appearing less than 30 times within the complete article corpus and selection of the 150 most frequent of the remaining ones. Upon completion of the aforementioned phases, we asked a domain expert (a financial journalist) to annotate the terms according to their sentiment. More specifically, she annotated each word with a signed integer according to whether it carries a very positive (+2), positive (+1), neutral (0), negative (-1) or very negative (-2) sense. Examples of such terms are, respectively: κερδοφορία (profitability, +2), ισχυρή (powerful, +1), πορεία (course, 0), υποχώρηση (downgrading, -1) and κρίση (crisis, -2).
The predicted class attribute contained three discrete values, namely UP, STEADY and DOWN, assigned when the stock quote closed on the following day with a change of more than 1%, between -1% and 1%, or less than -1%, respectively. A window of 5 days was used in order to predict the class, resulting in a high-dimensional dataset of more than 620 features. Article as well as stock quote data were processed by our proposed methodology (MBRF), regular Random Forests (RF), Radial Basis Function neural networks (RBF) and a derivative of Support Vector Machines, namely Sequential Minimal Optimization (SMO), which can handle discrete values and acts similarly to regression. Since the latter Machine Learning algorithms do not reduce features by default, in order to compare the MBRF technique against them, a PCA approach was followed using the NMath library for .NET platforms (http://www.centerspace.net/products/nmath).
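The article alignment rule and the three-valued class attribute described above can be sketched as follows. The closing time of the market and the use of the current day's close as the reference price for the percentage change are assumptions made for the example; the function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

MARKET_CLOSE = time(17, 0)  # assumed closing time, not stated in the text


def effective_trading_day(published: datetime,
                          market_close: time = MARKET_CLOSE) -> date:
    """Map Friday-evening and weekend articles to the following Monday."""
    weekday = published.weekday()                 # Monday = 0 ... Sunday = 6
    if weekday == 4 and published.time() > market_close:
        return published.date() + timedelta(days=3)
    if weekday >= 5:
        return published.date() + timedelta(days=7 - weekday)
    return published.date()


def label_move(close_today: float, close_next: float,
               threshold: float = 0.01) -> str:
    """Return UP, STEADY or DOWN from the next-day relative change in close."""
    change = (close_next - close_today) / close_today   # assumed reference price
    if change > threshold:
        return "UP"
    if change < -threshold:
        return "DOWN"
    return "STEADY"


# A Friday-evening article is assigned to the next Monday.
aligned = effective_trading_day(datetime(2009, 1, 2, 18, 30))
```

The 1% threshold is kept as a parameter so that the width of the STEADY band can be varied without touching the labelling logic.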
Regarding the experimental design, two different approaches were followed. The former dealt with standard 10-fold cross-validation classification of the closing price movement of the stock quotes, using datasets with and without articles, in order to evaluate the impact of articles on the predictability of a stock quote. We used the F-measure metric for evaluation, which is the harmonic mean of precision and recall. Table 3 tabulates the F-measure score of all machine learning algorithms against linear regression (LR). From these outcomes, we could initially observe that combining information from both time series and textual data improves the performance of all methodologies. Furthermore, when only technical analysis data are used, SMO performs similarly to MBRF and both significantly outperform all other approaches, while when textual information is incorporated, MBRF is noticeably the best classification approach, a fact that could be attributed to the dimensionality reduction achieved by the Markov Blanket preprocessing step. According to Table 3, the performance of MBRF is among the highest reported, with the drawback of a very time- and resource-consuming training phase.
The latter experimental design was a simulated trading strategy, in an effort to further examine whether the MBRF model could practically be applied to generate higher profits than those earned by employing the traditional regression model or by simply following a buy-and-hold (passive) investment strategy. The operational details of the trading simulation are as follows: the investor is assumed to have 100,000€ to create a portfolio by selecting a balanced percentage of each of the three Greek stock quotes mentioned earlier. Each day, the investor could buy, sell or wait, according to the class prediction of the MBRF model.
We assume that transactional costs apply when buying or selling (0.335% and 0.35% respectively) and that a randomly chosen fraction between 5% and 10% of the current portfolio can be traded each day. The time period was set to the last 35 weekdays of the aforementioned dataset. As Fig. 4 depicts, the dashed line, which represents the portfolio budget for the MBRF investing strategy, clearly outperforms the solid line of the buy-and-hold investment strategy, by a mean factor of 12.5% to 26% for the first 2 weeks and of 16% to 48% for the remaining ones.
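The mechanics of such a simulation can be illustrated with a simplified sketch. It is not the simulation used in the experiments: it trades a single stock instead of a balanced three-stock portfolio, starts fully in cash, and maps UP to buy, DOWN to sell and STEADY to wait, all of which are assumptions. Only the cost rates, the 5% to 10% daily fraction and the initial budget come from the description above.

```python
import random
from typing import List

BUY_COST = 0.00335   # 0.335% charged when buying
SELL_COST = 0.0035   # 0.35% charged when selling


def simulate(prices: List[float], signals: List[str],
             cash: float = 100_000.0, seed: int = 0) -> float:
    """Follow daily UP/STEADY/DOWN signals on one stock and return the
    final portfolio value (cash plus holdings at the last price)."""
    if not prices or len(prices) != len(signals):
        raise ValueError("prices and signals must be non-empty and equally long")
    rng = random.Random(seed)
    shares = 0.0
    for price, signal in zip(prices, signals):
        fraction = rng.uniform(0.05, 0.10)             # share of portfolio traded today
        budget = fraction * (cash + shares * price)
        if signal == "UP":                              # buy, paying the purchase cost
            spend = min(budget, cash)
            shares += spend * (1 - BUY_COST) / price
            cash -= spend
        elif signal == "DOWN":                          # sell, paying the sale cost
            quantity = min(budget / price, shares)
            cash += quantity * price * (1 - SELL_COST)
            shares -= quantity
        # STEADY: wait, nothing is traded
    return cash + shares * prices[-1]


# With flat prices, every trade only loses its transaction cost.
flat_value = simulate([10.0, 10.0], ["UP", "DOWN"])
```

A buy-and-hold baseline for comparison would buy once on the first day and apply no further signals.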

Fig. 1. An example of a Bayesian Network with the Markov Blanket of node x.

Fig. 4. Plot of portfolio outcomes using the two different trading strategies.

Table 2. Market and commodities data.

Table 3. Classification performance in terms of F-measure.