A Deep Reinforcement Learning Approach for Automated Cryptocurrency Trading

Nowadays, Artificial Intelligence (AI) is changing our daily life in many application fields. Automatic trading has inspired a large number of field experts and scientists to develop innovative techniques and deploy cutting-edge technologies to trade different markets. In this context, cryptocurrency has renewed interest in the application of AI techniques for predicting the future price of a financial asset. In this work, Deep Reinforcement Learning is applied to trade bitcoin. More precisely, Double and Dueling Double Deep Q-learning Networks are compared over a period of almost four years. Two reward functions are also tested: Sharpe ratio and profit reward functions. The Double Deep Q-learning trading system based on the Sharpe ratio reward function proved to be the most profitable approach for trading bitcoin.


Introduction
Nowadays, Artificial Intelligence (AI) is reshaping our daily life. AI is the study and design of intelligent agents, where an agent is a system that perceives its environment and takes actions in order to maximize its chances of success. AI excels at interpreting signals and at real-time analytics, which underpin many different applications. For instance, AI is changing the way medical science was perceived just a few years ago. Autonomous machines play an increasingly important role in surgery, improving patient outcomes and reducing expensive hospital stays. Elsewhere, computer vision is improving diagnostic technologies and making them more accessible, while predictive algorithms are facilitating more rapid drug discovery. A less noble application is related to the financial sector, where AI is used to build automatic trading systems which are poised to foster a new financial technology transformation. Furthermore, the arrival of cryptocurrency has renewed interest in the application of AI techniques for predicting the future price of a financial asset (e.g., bitcoin).
In this context, Reinforcement Learning (RL) [6] [21] has demonstrated the potential to transform how classical trading systems work. RL is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing rewards and achieving the best outcomes [2] [17].
In this work, we investigate the performance of two different trading systems based on deep RL approaches: Double Deep Q-Network (D-DQN) [16] and Dueling Double Deep Q-Network (DD-DQN) [24]. The two trading systems are compared with a Deep Q-Network (DQN) [16].
The article is structured as follows: Section 2 provides a definition of cryptocurrency and bitcoin. Section 3 gives a short description of Reinforcement Learning. Section 4 introduces and describes the proposed Q-learning trading system. The main results are reported in Section 5, and Section 6 concludes the work.

Related Work
Deep Learning (DL) and Reinforcement Learning (RL) are viable approaches for market making. In recent years, the use of DL and RL has increased considerably, demonstrating the power of these techniques.
McNally, S. et al. (2018) [15] applied different Machine Learning (ML) techniques to the bitcoin cryptocurrency. More precisely, they compared a Recurrent Neural Network (RNN) and a Long Short Term Memory (LSTM) network against a more classical approach, the AutoRegressive Integrated Moving Average (ARIMA) model. RNN and LSTM outperformed ARIMA in a traditional classification setting.
Patel, Y. (2018) [19] proposed a multi-agent approach that operates at two different levels: (i) minute level (macro-agent) and (ii) order book level (micro-agent). The macro-agent is based on a Double Q-learning network composed of a Multi-Layer Perceptron (MLP), and the micro-agent is realized with a Dueling Double Q-learning network with a reward function based on the volume weighted average bitcoin price. The multi-agent did not outperform the simple macro-agent, but it obtained better results than uniform Buy and Hold and Momentum Investing techniques in terms of cumulative profits.
Whereas the previous works were applied only to bitcoin movements, Bu, S.-J. et al. (2018) [5] tested a hybrid approach (Boltzmann machine and Double Q-learning network) against LSTM, MLP and Convolutional Neural Network (CNN) models over eight cryptocurrencies. They used the ratio between the total value after investment and the initial value as evaluation score. The hybrid approach proved more profitable than its competitors, but also riskier and more unstable.
Alessandretti, L. and coauthors (2018) [1] and Jiang, Z. et al. (2017) [10] applied Artificial Intelligence (AI) approaches to portfolio management. In [1] the authors applied a gradient boosting decision tree (i.e., XGBoost) and an LSTM network to a cryptocurrency portfolio. Performance was evaluated considering the Sharpe ratio [20] and the geometric mean return. All proposed strategies produced a profit over the entire test period. Jiang, Z. et al. (2017) [10] applied a deterministic policy gradient using a direct reward function (average logarithmic return) for solving the portfolio management problem. The approach outperformed classical management techniques in terms of cumulative return, with the exception of a Passive Aggressive Mean Reversion technique.
In this work, the proposed trading systems based on deep RL approaches differ from previous techniques in the use of a reward function based on the Sharpe ratio. Furthermore, Double Q-learning and Dueling Double Q-learning networks are used as agents that interact with the financial market at a macro level.

Cryptocurrency and Bitcoin
A cryptocurrency can be seen as a digital or virtual currency that works as a medium of exchange. In short, it is a set of limited entries in a database that no one can change unless specific conditions are fulfilled. Bitcoin is one of the most established and discussed cryptocurrencies available today. Since its origination in 2009, bitcoin has attained the status of a digital commodity and its value is considered comparable to traditional currencies [7]. Bitcoin exchanges are verified as secure transactions by network nodes which use cryptographic techniques. They are recorded in a public distributed ledger called the blockchain [7].
Considering a specific time interval, bitcoin price information is represented by candlesticks, also known as an Open-High-Low-Close (OHLC) chart. A candle consists of four measurements for an asset during a period: the opening price at the start of the period, the highest and lowest prices within the period, and the closing price at the end of the period. The opening and closing prices of a candle are usually charted as a box, and the highest and lowest prices as the "wicks" above and below. Candles themselves trivially aggregate into larger candles. For instance, a 1-hour candle is easily derived by aggregating 60 candles of 1 minute.
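As a minimal sketch of the aggregation just described, the following Python function merges a list of consecutive candles into one larger candle. The `Candle` container and the function name `aggregate` are illustrative assumptions and are not taken from the original work.

```python
from typing import NamedTuple, Sequence


class Candle(NamedTuple):
    """One OHLC candle (hypothetical container, for illustration only)."""
    open: float
    high: float
    low: float
    close: float


def aggregate(candles: Sequence[Candle]) -> Candle:
    """Merge consecutive candles (oldest first) into a single larger candle."""
    if not candles:
        raise ValueError("at least one candle is required")
    return Candle(
        open=candles[0].open,                # opening price of the first period
        high=max(c.high for c in candles),   # highest price over all periods
        low=min(c.low for c in candles),     # lowest price over all periods
        close=candles[-1].close,             # closing price of the last period
    )
```

Passing 60 one-minute candles to `aggregate` would yield the corresponding 1-hour candle.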

Automated trading
Automated trading can be seen as an automated decision-making procedure. Usually, automated trading procedures aim at predicting whether a positive return will be realized in the near future. The automated trading procedure should decide whether to buy or sell the asset under consideration, or hold the current position.
At time step $t$, the automated trading procedure will then act based on the decision rule defined in Eq. 1.
$$a_t = \begin{cases} \text{buy} & \text{if } E[p_{t+h}] > p_t \\ \text{sell} & \text{if } E[p_{t+h}] < p_t \\ \text{hold} & \text{otherwise} \end{cases} \qquad (1)$$
Given the price of an asset at time $t$, $p_t$, the automated trading procedure buys if the expected price at time $t+h$, $E[p_{t+h}]$, is greater than $p_t$ and sells if $E[p_{t+h}]$ is lower than $p_t$; otherwise it takes no action (hold). Here $h$ is some positive number of time steps in the future [3].
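The decision rule can be written as a small Python function. This is only a sketch: the function name `decide` and the string labels are assumptions made for illustration, and how the expected price is obtained is left open.

```python
def decide(price_now: float, expected_price: float) -> str:
    """Return 'buy', 'sell' or 'hold' by comparing E[p_{t+h}] with p_t."""
    if expected_price > price_now:
        return "buy"    # price is expected to rise
    if expected_price < price_now:
        return "sell"   # price is expected to fall
    return "hold"       # no expected change, so no action is taken
```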
Throughout this work we make use of two common market orders: long and short. Long trades are the classic method of buying with the intention of profiting from a rising market, i.e., $E[p_{t+h}] > p_t$. Short trades are used with the intention of profiting from a falling market, i.e., $E[p_{t+h}] < p_t$. Two other orders are commonly used in defining trading strategies: stop-loss and take-profit. Both of them are used to buy or sell an asset when it reaches a particular price. Stop-loss is used to limit a possible loss. Take-profit is used to lock in a possible gain.

Reinforcement learning
Reinforcement learning (RL) can be seen as the formalization of an optimal policy capable of ensuring the maximization of the expected cumulative profit of an agent [6]. Throughout this section, we closely follow the description given in [9] [12] [18].
The agent interacts with the environment by executing actions and receiving observations and rewards. At each time step $t$, which ranges over a set of discrete time intervals, the agent selects an action $a_t$ from a set of legal actions $A$ at state $s_t \in S$, where $S$ is the set of possible states. Action selection is based on a policy, $\pi$. The policy is a description of the behaviour of the agent and tells the agent which actions should be selected for each possible state. As a result of each action, the agent receives a scalar reward $r_t \in \mathbb{R}$ and observes the next state $s_{t+1} \in S$. The transition probability of each possible next state $s_{t+1}$ is defined as $P(s_{t+1} \mid s_t, a_t)$, with $s_{t+1}, s_t \in S$ and $a_t \in A$. Similarly, the reward probability of each possible reward $r_t$ is defined as $P(r_t \mid s_t, a_t)$, where $s_t \in S$ and $a_t \in A$. Hence, the expected scalar reward received by executing action $a$ in the current state $s$ is $E_{P(r_t \mid s_t, a_t)}[r_t \mid s_t = s, a_t = a]$. This framework can be seen as a finite Markov Decision Process (MDP).
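The aim of the learning agent is to learn an optimal policy $\pi^*$, which defines the probability of selecting action $a$ in state $s$, so that the sum of the discounted rewards over time is maximized. The expected discounted return $R$ at time $t$ is defined as follows:
$$R_t = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \right] = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right] \qquad (2)$$
where $E[\cdot]$ is the expectation with respect to the reward distribution and $0 < \gamma < 1$ is called the discount factor. At this point a Q-value function, $Q^\pi(s, a)$, can be defined as follows:
$$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right] \qquad (3)$$
The Q-value, $Q^\pi(s, a)$, for an agent is the expected return achievable by starting from state $s \in S$ and performing action $a \in A$ following policy $\pi$. Eq. 3 satisfies a recursive property, so that an iterative update procedure can be used for the estimation of the Q-value function:
$$Q^\pi_{i+1}(s, a) = E_\pi\left[ r_t + \gamma Q^\pi_i(s_{t+1} = s', a_{t+1} = a') \mid s_t = s, a_t = a \right] \qquad (4)$$
for all $s, s' \in S$ and $a, a' \in A$.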
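To make the role of the discount factor concrete, the sketch below computes the discounted sum of a finite reward sequence. Truncating the infinite sum to a finite list, and the function name `discounted_return`, are assumptions made purely for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    total = 0.0
    # Work backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

The backward loop uses the same recursive property that underlies the iterative Q-value update: the return at time t equals the immediate reward plus the discounted return from t + 1.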
A reinforcement learning agent aims at finding the policy which achieves the greatest outcome. Hence, it must learn an optimal policy $\pi^*$ whose expected value is greater than or equal to that of all other policies, leading to an optimal Q-value $Q^*(s, a)$. In particular, the iterative update procedure for estimating the optimal Q-value function can be defined as in Eq. 5.
$$Q^*_{i+1}(s, a) = E\left[ r_t + \gamma \max_{a'} Q^*_i(s_{t+1} = s', a') \mid s_t = s, a_t = a \right] \qquad (5)$$
The iteration procedure converges to the optimal Q-value, $Q^*$, as $i \to \infty$ and is called the value iteration algorithm. One of the most popular value-based algorithms is the Q-learning algorithm [25]. The basic version of the Q-learning algorithm makes use of the Bellman equation for the Q-value function [4], whose unique solution is $Q^*(s, a)$:
$$Q^*(s, a) = (BQ^*)(s, a)$$
where $B$ is the Bellman operator mapping any function $K : S \times A \to \mathbb{R}$ into another function $S \times A \to \mathbb{R}$ and is defined as follows:
$$(BK)(s, a) = \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma \max_{a' \in A} K(s', a') \right)$$
where $T$ is the transition function giving the probability of going from $s$ to $s'$ given action $a$, and $R(s, a, s')$ denotes the associated reward. One general proof of convergence to the optimal value function is available [25] under the conditions that: (i) the state-action pairs are represented discretely, and (ii) all actions are repeatedly sampled in all states (which ensures sufficient exploration, hence not requiring access to the transition model).
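For the discrete (tabular) case covered by the convergence result, a single Q-learning step can be sketched as follows. The dictionary-based table, the learning rate `alpha = 0.1` and the function name are assumptions chosen for illustration; only the discount factor 0.98 and the three actions come from this work.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

ACTIONS = ("buy", "hold", "sell")
QTable = DefaultDict[Tuple[Hashable, str], float]


def q_learning_update(q: QTable, state: Hashable, action: str, reward: float,
                      next_state: Hashable, alpha: float = 0.1,
                      gamma: float = 0.98) -> float:
    """Apply one tabular Q-learning step and return the new Q(state, action)."""
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    # Sampled version of the Bellman target: r + gamma * max_a' Q(s', a').
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q_table: QTable = defaultdict(float)  # unseen state-action pairs start at 0
```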
When the state-action space is too large for a discrete representation, these conditions can no longer be met. In that context, a parametric value function $Q(s, a; \theta)$ is needed, where $\theta$ refers to some parameters that define the Q-values. Different Q-networks are available in the literature:
- Deep Q-Networks (DQNs): DQNs were introduced by Mnih et al. (2015) [16]. DQNs stabilize the training of action value function approximation with deep neural networks, in particular Convolutional Neural Networks (CNNs) [6], using experience replay [13] and a target network.
- Double Deep Q-Networks (D-DQNs): D-DQN improves on DQN by avoiding over-estimation. In D-DQN, the greedy policy is evaluated according to the online network, and the target network is used to estimate its value.
- Dueling Double Deep Q-Networks (DD-DQNs): DD-DQN [24] is based on a dueling network architecture that estimates the value function $V(s)$ and the associated advantage function $A(s, a) = Q(s, a) - V(s)$, and then combines them in order to estimate $Q(s, a)$. In DD-DQN, a CNN layer is followed by two streams of fully connected (FC) layers, used to estimate the value function and the advantage function separately; the two streams are then combined to estimate the action value function.
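For reference, the learning targets behind the first two variants and the dueling aggregation can be written as follows. This is a sketch in the notation commonly used in the DQN, Double DQN and dueling-network literature; the symbols $\theta^{-}$ (target-network parameters) and $\mathcal{A}$ (the action set, written this way to avoid a clash with the advantage function $A$) are introduced here as assumptions and do not appear in the text above.

```latex
% DQN: the target network both selects and evaluates the next action,
% which tends to over-estimate the action values.
y_t^{\mathrm{DQN}} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-})

% Double DQN: the online network (theta) selects the greedy action,
% the target network (theta^-) evaluates it.
y_t^{\mathrm{D\text{-}DQN}} = r_{t+1}
  + \gamma \, Q\bigl(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta^{-}\bigr)

% Dueling aggregation with the average advantage in place of the max operator.
Q(s, a) = V(s) + \Bigl( A(s, a)
  - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a') \Bigr)
```

Separating action selection from action evaluation is what reduces the over-estimation mentioned for D-DQN.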

Q-learning Trading System
The proposed Q-learning trading system is based on (i) D-DQN and (ii) DD-DQN. In both cases, an agent interacts with the financial market. Given a certain state of the financial market, the agent chooses the action $a$ (buy, hold or sell) to perform on one bitcoin unit. If a bitcoin is acquired, it is added to a wallet. A stop-loss (sl = −5%) and a take-profit (tp = +12%) are also applied to the wallet. For instance, if the wallet loses more than the stop-loss threshold (i.e., −5%), all open positions are closed.
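The wallet-level stop-loss and take-profit check can be sketched as below. The two thresholds come from the text; measuring the change relative to the wallet value at the time the positions were opened, treating the thresholds as inclusive, and the function name are assumptions made for illustration.

```python
STOP_LOSS = -0.05    # sl = -5%, as stated in the text
TAKE_PROFIT = 0.12   # tp = +12%, as stated in the text


def should_close_positions(entry_value: float, current_value: float) -> bool:
    """Return True when the wallet hits the stop-loss or the take-profit."""
    if entry_value <= 0:
        raise ValueError("entry_value must be positive")
    # Relative change of the wallet since the positions were opened (assumed).
    change = (current_value - entry_value) / entry_value
    return change <= STOP_LOSS or change >= TAKE_PROFIT
```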
The exploration-exploitation dilemma is of fundamental importance for deep RL techniques as well as for the proposed Q-learning trading system. Exploration concerns gathering information about the environment (i.e., transition and reward functions), while exploitation is about maximizing the expected return given the current knowledge. For this reason, the agent takes a random action with probability $\epsilon$ and follows the policy that is believed to be optimal with probability $1 - \epsilon$ ($\epsilon$-greedy technique). In the proposed trading system, $\epsilon_{initial} = 1$ is selected for the first observations ($n_{obs} = 300$) and $\epsilon$ is then set to a new value, $\epsilon_{new} = 0.12$. For a more realistic study, a transaction cost equal to 0.3% is applied to both long and short actions.
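A minimal sketch of this two-stage schedule is given below. The constants are those reported in the text; counting the first 300 observations as steps 0 to 299, the dictionary of Q-values and the function names are assumptions for illustration.

```python
import random

ACTIONS = ("buy", "hold", "sell")
N_OBS = 300          # number of initial, fully exploratory observations
EPS_INITIAL = 1.0    # epsilon during the first N_OBS observations
EPS_NEW = 0.12       # epsilon afterwards


def epsilon_at(step: int) -> float:
    """Exploration probability at a given (0-based) observation index."""
    return EPS_INITIAL if step < N_OBS else EPS_NEW


def select_action(q_values: dict, step: int, rng=random) -> str:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.choice(ACTIONS)                    # explore
    return max(ACTIONS, key=lambda a: q_values[a])    # exploit
```

With this schedule the agent acts purely at random for the first 300 observations and is greedy 88% of the time afterwards.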
The Q-learning trading system rewards the agent with two possible functions: (i) the Sharpe ratio [20], $s_{p_t} = \frac{\varrho_{p_t} - \varrho_f}{\sigma_{p_t}}$, where $\varrho_{p_t}$ is the return of the portfolio or merely the return of the asset, $\varrho_f$ is the risk-free rate ($\varrho_f = 0$ in our work) and $\sigma_{p_t}$ is the standard deviation of the portfolio's return; and (ii) a simple profit function, $g_{profit} = (p_t - p_{t-1})$ (i.e., the nominal return), where $p_t$ is the asset price at time $t$ and $p_{t-1}$ the asset price at time $t-1$. A separate trading strategy at time $t$ is defined for each of the two reward functions; in the first case, the strategy is conditioned on the sign of the Sharpe ratio ($s_{p_t} > 0$). Fig. 1 shows the Q-learning trading system based on a Double Deep Q-learning Network with the Sharpe ratio reward function.
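The two reward signals can be computed as in the sketch below. Estimating the portfolio return as the mean of a window of past returns, using the sample standard deviation, and returning 0 for a zero-variance window are all assumptions made for illustration; the text does not specify these details.

```python
from statistics import mean, stdev
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0) -> float:
    """Sharpe ratio of a window of returns (risk-free rate 0, as in the text)."""
    if len(returns) < 2:
        raise ValueError("at least two returns are needed")
    sigma = stdev(returns)    # sample standard deviation (assumption)
    if sigma == 0:
        return 0.0            # convention chosen here for a flat window
    return (mean(returns) - risk_free) / sigma


def profit_reward(price_now: float, price_prev: float) -> float:
    """Nominal return g_profit = p_t - p_{t-1}."""
    return price_now - price_prev
```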

The Q-learning trading system based on the D-DQN is composed of two CNN layers with 120 neurons each. In the case of DD-DQN, two CNN layers with 120 neurons each are followed by two streams of FC layers: the first, with 60 neurons, is dedicated to estimating the value function and the second, with 60 neurons, to estimating the advantage function. In both cases, the number of epochs is set to 40, as is the batch size. For weight optimization, the ADAM algorithm [11] is applied. The loss function is the Mean Squared Error, $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. The activation function is the Leaky Rectified Linear Unit (Leaky ReLU) function [14]. The discount factor, $\gamma$, is set to 0.98 in both D-DQN and DD-DQN.
A similar setting is also used to implement the trading strategies based on DQN.
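The loss and activation function used by the networks can be sketched in plain Python as follows. The negative slope of 0.01 for the Leaky ReLU is an assumption, since the text does not report the value used; the function names are illustrative.

```python
from typing import Sequence


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for x >= 0, a small linear slope otherwise."""
    return x if x >= 0 else negative_slope * x


def mean_squared_error(targets: Sequence[float],
                       predictions: Sequence[float]) -> float:
    """MSE = (1/n) * sum_i (y_i - y_hat_i)**2."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = ((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
    return sum(squared) / len(targets)
```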

Experimental Data and Results

Bitcoin historical data
The proposed Q-learning trading systems are tested on bitcoin historical data. The data can be found on the well-known Kaggle platform (www.kaggle.com). We considered the bitcoin price in US dollars from 1st December 2014 to 27th June 2018, sampled at 1-minute intervals. For each observation, the time stamp, OHLC (Open, High, Low, Close) values, volume in bitcoin, volume in US dollars, and weighted bitcoin price are collected. The dataset is composed of roughly 2 million rows and 8 variables. Based on the time stamp, the data is aggregated hourly, obtaining a final dataset with more than 30,000 observations and the same number of variables.

Results
The Q-learning trading system is tested with four different settings, obtained by combining the two networks with the two reward functions: ProfitD-DQN, SharpeD-DQN, ProfitDD-DQN and SharpeDD-DQN. These settings are compared with two DQN baselines, ProfitDQN and SharpeDQN.
Test 1. All Q-learning trading system settings are compared by sampling 10 different periods of size 4,000. For each period, 80% is dedicated to training and 20% to testing the performance. In Fig. 2, the cumulative average return (%) over the 10 test sets is reported. 95% confidence intervals around the mean are also included. The DD-DQN and D-DQN trading systems clearly outperform the simpler DQN system. On average, the best cumulative return (%) is reached by the SharpeD-DQN.
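The evaluation protocol of Test 1 can be sketched as follows. The text only states the number of periods, their size and the 80/20 split; drawing the start of each period uniformly at random, allowing periods to overlap, splitting each period chronologically and the function name are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sample_periods(n_observations: int, n_periods: int = 10,
                   period_size: int = 4000, train_fraction: float = 0.8,
                   seed: int = 0) -> List[Tuple[range, range]]:
    """Return (train_indices, test_indices) for randomly placed periods."""
    if period_size > n_observations:
        raise ValueError("period_size exceeds the number of observations")
    rng = random.Random(seed)
    split = round(period_size * train_fraction)   # 3,200 of 4,000 rows
    periods = []
    for _ in range(n_periods):
        start = rng.randrange(n_observations - period_size + 1)
        train = range(start, start + split)                # first 80%
        test = range(start + split, start + period_size)   # last 20%
        periods.append((train, test))
    return periods
```

Keeping the test rows strictly after the training rows within each period avoids evaluating the agent on prices that precede its training data.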
Table 1 summarizes the main statistical indicators. The trading systems based on DD-DQN and D-DQN reach a higher cumulative average return (%). In fact, the ProfitDQN and SharpeDQN obtain the worst results over all the test periods. Furthermore, DQN has the highest standard deviation, demonstrating high instability. SharpeD-DQN has the highest average return (5.81%) over all the test periods. It reaches a maximum return of 26.14% and a minimum return of -5.64%. The DD-DQN and D-DQN trading systems based on the profit reward function have comparable results.
From this preliminary analysis, the SharpeD-DQN proved to be the best Q-learning trading system.
Test 2. Given the previous results, the SharpeD-DQN is tested on the entire period (from 1st December 2014 to 27th June 2018). Observations from 1st December 2014 to 1st November 2017 are used by the SharpeD-DQN system to learn how to trade the cryptocurrency. After that period, the SharpeD-DQN system acted as an autonomous algorithmic trading system (from 2nd November 2017 to 26th June 2018). It had an average percentage return of almost 8% with a standard deviation of 2.77. In Fig. 3, the cumulative percentage return over the entire period is shown.

Conclusions and Future Work
In this work, the performance of different trading systems based on Deep Reinforcement Learning was tested on hourly cryptocurrency (i.e., bitcoin) prices. The trading systems were based on Double and Dueling Double Deep Q-learning Networks. Furthermore, these trading systems were compared with a simpler Deep Q-learning Network. Each of them was tested with two different reward functions. The first function was based on the Sharpe ratio, a measure of the risk-adjusted return on an investment, and the second function was related to profit. In total, six different Q-learning trading system settings were tested on bitcoin data from 1st December 2014 to 27th June 2018. Performance was evaluated in terms of percentage returns.
All systems produced a positive return (on average) for a set of shorter trading periods (different combinations of start and end dates for the trading activity). The trading system based on Double Q-learning and the Sharpe ratio reward function (SharpeD-DQN) achieved the largest return values. SharpeD-DQN was also tested over the entire considered period, producing a positive percentage return (average percentage return of 8%).
It is important to stress that this work has some limitations. First, a broader set of performance indicators should be used to compare the different approaches. Second, the proposed Deep Reinforcement Learning techniques should be compared with recent AI approaches for a more accurate comparison study. Third, a parameter optimization should be carried out to improve the performance of the learning techniques. Nevertheless, the presented methods were able to generate positive returns on all conducted tests. Extending the current analysis by considering these elements is a direction for future work.
A different yet promising approach is to study the impact of social media on the price fluctuations of bitcoin and other cryptocurrencies, and to incorporate news and public opinion into the Deep Reinforcement Learning approach. In addition, uncertainty estimation should be investigated, since uncertainty is essential for efficient reinforcement learning.
Lastly, the proposed approaches can be extended to anomaly detection. Following the work of Du, M. et al. (2017) [8], Q-learning approaches can be used to build a framework for online log anomaly detection and diagnosis. Such an approach could be a critical step towards building a secure and trustworthy anomaly detection system.

Fig. 1. Double Deep Q-learning trading system with Sharpe reward function.

Fig. 2. Average percentage returns over the 10 trading periods, i.e. different combinations of start and end dates for the trading activity.

1 - Double Deep Q-Network with a profit reward function (ProfitD-DQN);
2 - Double Deep Q-Network with Sharpe ratio reward function (SharpeD-DQN);
3 - Dueling Double Deep Q-Network with a profit reward function (ProfitDD-DQN);
4 - Dueling Double Deep Q-Network with Sharpe ratio reward function (SharpeDD-DQN).
The four settings are compared with a Deep Q-Network with profit reward function (ProfitDQN) and a Deep Q-Network with Sharpe ratio reward function (SharpeDQN).

Table 1. Average performance over the 10 trading periods.