The Hierarchical Continuous Pursuit Learning Automation for Large Numbers of Actions



Introduction
This paper deals with the well-trodden field of Learning Automata (LA). For decades, this field, initiated by Tsetlin [10], has been studied as a typical model for learning in random environments, and has served as the precursor for the area of reinforcement learning. Unlike other fields of Artificial Intelligence (AI), an LA, by definition, operates in random environments, where the "Teacher" can respond differently and randomly, for the same query, at different time instances. More specifically, an LA is an adaptive decision-making unit that learns the optimal action from among a set of actions offered by the Environment that it operates in. Without loss of generality, to render the problem non-trivial, the Environment is stochastic. At each iteration (or time step), the LA selects one action and communicates it to the Environment. This, in turn, stochastically triggers either a reward or a penalty as a response from the Environment. Based on the response and the knowledge acquired in the past iterations, the LA, either deterministically or stochastically, adjusts its action selection strategy. This is done so as to make a "wiser" decision in the next iteration. Thus the LA, even though it lacks complete knowledge about the Environment, is able to learn through repeated interactions with the Environment, and adapts itself, or "converges", to the optimal decision.
Although LA have been studied extensively [1, 13-15, 17] and have been applied in many fields [4, 9], designing LA when the number of actions involved, R, is large is extremely complex. The solution that we propose in this paper attempts to resolve this problem.
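The interaction between an LA and its Environment described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BernoulliEnvironment` and the method `respond` are hypothetical names introduced here, and the response convention (0 for a Reward, 1 for a Penalty) is the one used later in the algorithm's parameter list.

```python
import random


class BernoulliEnvironment:
    """Hypothetical stochastic Environment: action i is rewarded
    with a fixed but unknown probability reward_probs[i]."""

    def __init__(self, reward_probs, seed=None):
        self.reward_probs = list(reward_probs)
        self.rng = random.Random(seed)

    def respond(self, action):
        """Return 0 for a Reward and 1 for a Penalty."""
        rewarded = self.rng.random() < self.reward_probs[action]
        return 0 if rewarded else 1


# The same query can receive different answers at different time instants.
env = BernoulliEnvironment([0.2, 0.8], seed=1)
responses = [env.respond(1) for _ in range(10_000)]
reward_rate = responses.count(0) / len(responses)
```

Over many queries, `reward_rate` approaches the reward probability of the queried action, which is the quantity that estimator-based LA try to learn.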

Contributions of the Paper
The contributions of this paper are the following:
1. We propose a hierarchical LA strategy which superimposes the learning process on a tree structure. Unlike the traditional hierarchical schemes, we do not resort to Fixed Structure Stochastic Automata (FSSA) or Variable Structure Stochastic Automata (VSSA) to achieve the learning.
2. We propose a novel learning process that involves multi-level two-action Continuous Pursuit Algorithm (CPA) machines. The estimation and the interaction with the real-world Environment occur only at the leaf level.
3. We propose the process of trickling up the estimates, and of accomplishing the learning by only invoking learning between a node and its sibling.
4. The scheme that we have proposed is novel in that it never involves action probabilities that are below machine accuracy. It also involves estimates whose accuracies can easily be attained.
5. The scheme that we have proposed is ε-optimal in all random environments [12].
6. The proposed scheme is many times faster than the LA reported in the literature. As far as we know, it is the fastest and most accurate reported LA for environments with a large number of actions. To the best of our knowledge, no experiments have previously been reported in the field of LA for environments with so many actions, and in that sense, this is a pioneering venture that demonstrates the power of the scheme.

The HCPA LA

Rationale for Our Solution
The philosophy motivating our new scheme is to superimpose the actions onto a binary tree, in which the leaves are the actual actions themselves. Further, each internal node represents the best action in the entire subtree below that node. By performing comparisons between the actions in a pairwise manner, i.e., at the leaves of the tree, only the superior actions are trickled up towards the root. By doing this, one always deals with 2-action LA. Here, however, unlike the work of previous researchers [2], we do not resort to FSSA or traditional VSSA to differentiate between the various pairs of actions at the leaves. Rather, we shall use the 2-action continuous pursuit LA [16]. Since R = 2 at every level, the number of iterations required to achieve the estimation is considerably smaller. Further, the estimation that is achieved at the leaf level is all that is required for the entire tree; no estimation operations are required at the internal nodes.
A notable attempt to devise hierarchical LA is due to Papadimitriou [7]. Before we comment on this work, we mention that the Pursuit concept can be used in a Continuous or Discretized paradigm, and that the action probabilities can be changed in Reward-Penalty (RP), Reward-Inaction (RI) and Inaction-Penalty (IP) scenarios. Consequently, we would have six Pursuit variants: $CP_{RP}$, $DP_{RP}$, $CP_{RI}$, $DP_{RI}$, $CP_{IP}$ and $DP_{IP}$, and of these, Agache and Oommen [6] showed that the $DP_{RI}$ is the superior one. The author of [7] has used precisely this machine, and this is commendable. The differences between that work and the work that we have done here are, however, significant. First of all, they lie in the way that we have modeled the tree along which the actions have been placed. Secondly, the strategy by which we have trickled up the "maximum" estimate at every node is novel, and it does not require us to probe (query) the Environment at every node, implying that the interactions with the Environment occur only at the leaves. All of these lead to the superiority of our scheme over the recorded ones, demonstrated in experiments done for a much larger set of actions than what has been reported in the literature.

Construction of the Hierarchy
The search space for the binary tree alluded to above is constructed as follows. First of all, the hierarchy is organized as a balanced full binary tree with maximal depth K. For the sake of convenience and in the interest of mathematical formalism, we will use the same notation adopted in [3, 11], and index the nodes using both their depth in the tree and their relative order with respect to the nodes located at the same tree depth. The details of the hierarchy are described as follows.
1. Root node: The LA at the root of the hierarchy is the one at depth 0.
• We can informally say that $A_{\{k+1,2j-1\}}$ and $A_{\{k+1,2j\}}$ are the Left Child and the Right Child, respectively, of the parent LA $A_{\{k,j\}}$.
- The LA at depth $K-1$: The LA at depth $K-1$ (i.e., at the level just above the leaves) is responsible for choosing the action from the stochastic Environment.
• Observe that $\alpha_{\{K,j\}}$ is attached to (or associated with) its "parent LA".
4. At level K: Finally, at depth K, i.e., at the maximal depth of the tree, the nodes do not have children.
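The indexing convention above, in which node j at depth k has the children 2j − 1 and 2j at depth k + 1, can be illustrated with a small sketch. The helper names `children`, `parent` and `leaves_under` are hypothetical and are introduced only for illustration; indices are 1-based, as in the text.

```python
def children(k, j):
    """Left and Right Child of node (k, j), with j counted from 1."""
    return (k + 1, 2 * j - 1), (k + 1, 2 * j)


def parent(k, j):
    """Parent of node (k, j); the inverse of children()."""
    return (k - 1, (j + 1) // 2)


def leaves_under(k, j, K):
    """1-based indices of the leaves (depth K) in the subtree of node (k, j)."""
    span = 2 ** (K - k)  # number of leaves below a node at depth k
    return range((j - 1) * span + 1, j * span + 1)


# Example with K = 3 (eight actions at the leaves).
left, right = children(2, 3)            # ((3, 5), (3, 6))
subtree = list(leaves_under(1, 2, 3))   # [5, 6, 7, 8]
```

With this convention, a tree of maximal depth K holds 2^K actions at its leaves, and every internal node is reached from the root in exactly as many two-way choices as its depth.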

The Proposed Solution
At the bottom-most level, i.e., the level of the leaves, we invoke a two-action CPA to determine which is the superior action between two actions that are siblings at this level. To do this, we merely maintain running estimates of the reward probabilities of these two actions, and using this two-dimensional estimate vector and the corresponding two-action probability vector, the updating is achieved. The larger of these estimates is trickled up to their common parent, and this estimate is now compared with the corresponding reward probability estimate of its sibling, whose value was obtained from its own children. This process is now recursively repeated, using the estimates of the reward probabilities at this level and the probability vector at this level, whence the updates are performed. The same process continues up the tree to the root itself.
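The trickling-up of estimates described above amounts to repeatedly taking the maximum over sibling pairs. The following is a minimal sketch under the assumption that the leaf estimates are held in a plain list whose length is a power of two; the function name `trickle_up` is hypothetical.

```python
def trickle_up(leaf_estimates):
    """Return levels, where levels[k][j - 1] is the estimate attached to
    node (k, j); levels[-1] holds the leaves and levels[0] the root."""
    n = len(leaf_estimates)
    if n < 2 or n & (n - 1):
        raise ValueError("the number of leaves must be a power of two")
    levels = [list(leaf_estimates)]
    while len(levels[0]) > 1:
        below = levels[0]
        # Each parent inherits the larger estimate of its two children.
        above = [max(below[i], below[i + 1]) for i in range(0, len(below), 2)]
        levels.insert(0, above)
    return levels


levels = trickle_up([0.3, 0.7, 0.5, 0.2])
# levels[1] holds the estimates compared by the root LA;
# levels[0] holds the overall maximum.
```

In the scheme itself only the nodes on the visited path need to be refreshed at each iteration; the full recomputation here is used purely to keep the illustration short.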
- $P_{\{k,j\}} = [p_{\{k+1,2j-1\}}, p_{\{k+1,2j\}}]^T$ is the action probability vector of LA $A_{\{k,j\}}$.

Begin Algorithm HCPA
Parameters:
λ: The learning parameter, where 0 < λ < 1 and λ is close to zero.
$u_{\{K,2j-1\}}$, $u_{\{K,2j\}}$: The number of times $\alpha_{\{K,2j-1\}}$ and $\alpha_{\{K,2j\}}$, respectively, have been rewarded when selected.
$v_{\{K,2j-1\}}$, $v_{\{K,2j\}}$: The number of times $\alpha_{\{K,2j-1\}}$ and $\alpha_{\{K,2j\}}$, respectively, have actually been selected.
$\hat{d}_{\{K,2j-1\}}$, $\hat{d}_{\{K,2j\}}$: The estimates of the reward probabilities $d_{\{K,2j-1\}}$ and $d_{\{K,2j\}}$, computed as $\hat{d}_{\{K,j\}} = u_{\{K,j\}} / v_{\{K,j\}}$.
$\hat{D}$: The vector of the estimates $\{\hat{d}\}$.
m: The index of the optimal action.
h: The index of the greatest element of $\hat{D}$.
R: The response from the Environment, where R = 0 corresponds to a Reward, and R = 1 to a Penalty.
T: A Threshold, where T ≥ 1 − ε.
Initialization: Traditional Pursuit algorithms require that we choose each action a few times to initialize the estimates of the reward probabilities. This step is not crucial here, and so we have avoided it and assumed that the estimates of the reward probabilities are initialized to 0.5.
EndFor
Loop
- The LA at the root selects an action by randomly sampling as per its action probability vector.
- Let $j_1(t)$ be the index of the chosen action, where $j_1(t) \in \{1, 2\}$.
- The next LA, $A_{\{1,j_1(t)\}}$, is then activated; it in turn chooses an action and activates the next LA at level 2.
- The procedure continues recursively until the LA at level $K-1$ is reached.
- Let $A_{\{k,j_k(t)\}}$ be the set of activated LA, where $j_k(t)$ denotes the index of the LA activated at level k.
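2. k = K: Level K
- Update $\hat{d}_{\{K,j_K(t)\}}$ based on the response from the Environment at the leaf level, K: the counts $u_{\{K,j_K(t)\}}$ and $v_{\{K,j_K(t)\}}$ of the chosen leaf are updated with the response, and the estimate is recomputed as their ratio.
- For all other "leaf actions", where $j \in \{1, \ldots, 2^K\}$ and $j \neq j_K(t)$, the counts $u_{\{K,j\}}$ and $v_{\{K,j\}}$, and hence the estimates, remain unchanged.
3. Define the reward estimate for all other actions along the path from the root, 0 < k < K − 1, in a recursive manner, where the LA at any one level inherits the feedback from the LA at the next level: $\hat{d}_{\{k,j\}}(t) = \max(\hat{d}_{\{k+1,2j-1\}}(t), \hat{d}_{\{k+1,2j\}}(t))$.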

Else
$p_{\{k,j_h(t)\}}(t+1) = p_{\{k,j_h(t)\}}(t)$, and likewise the probability of the other action of the same LA is left unchanged.
EndIf
- For each $A_{\{k,j\}}$, if either of its action probabilities $p_{\{k+1,2j-1\}}$ and $p_{\{k+1,2j\}}$ surpasses a threshold T, where T is a positive number that is close to unity, the action probabilities for this LA stop updating, and its larger action probability jumps to unity.

EndLoop
End Algorithm HCPA
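To make the flow of the loop concrete, the following Python sketch puts the steps together. It is a sketch under stated assumptions and not the authors' implementation: the reward branch of the probability update did not survive in the listing above, so the standard reward-inaction pursuit rule p ← (1 − λ)p + λ is assumed for the action with the larger estimate; it is also assumed that every LA on the visited path is updated, that indices are 0-based, and that the function name `hcpa` and its arguments are hypothetical.

```python
import random


def hcpa(reward_probs, lam, threshold=0.99, max_iters=1_000_000, seed=None):
    """Hierarchical two-action pursuit sketch. Returns (leaf_index, iterations),
    or (None, max_iters) if the path from the root has not converged."""
    rng = random.Random(seed)
    n = len(reward_probs)
    K = n.bit_length() - 1
    if n < 2 or (1 << K) != n:
        raise ValueError("the number of actions must be a power of two")

    # p[k][j]: probability that the parent of node (k, j) selects it.
    # d[k][j]: reward estimate attached to node (k, j). Both start at 0.5.
    p = [None] + [[0.5] * (1 << k) for k in range(1, K + 1)]
    d = [None] + [[0.5] * (1 << k) for k in range(1, K + 1)]
    u = [0] * n  # number of times each leaf has been rewarded
    v = [0] * n  # number of times each leaf has been selected

    for t in range(1, max_iters + 1):
        # Step 1: descend from the root, sampling one child per level.
        path, j = [], 0
        for k in range(1, K + 1):
            left = 2 * j
            j = left if rng.random() < p[k][left] else left + 1
            path.append((k, j))

        # Step 2: query the Environment at the chosen leaf (0 = Reward).
        response = 0 if rng.random() < reward_probs[j] else 1
        u[j] += 1 - response
        v[j] += 1
        d[K][j] = u[j] / v[j]

        # Step 3: trickle the larger sibling estimate up the visited path.
        node = j
        for k in range(K - 1, 0, -1):
            node //= 2
            d[k][node] = max(d[k + 1][2 * node], d[k + 1][2 * node + 1])

        # Step 4 (assumed reward-inaction rule): on a Reward, each activated
        # LA pursues whichever of its two children has the larger estimate.
        if response == 0:
            for k, chosen in path:
                left = chosen - (chosen % 2)
                right = left + 1
                if p[k][left] in (0.0, 1.0):
                    continue  # this LA has already converged
                best = left if d[k][left] >= d[k][right] else right
                other = left + right - best
                p[k][best] = (1 - lam) * p[k][best] + lam
                if p[k][best] > threshold:
                    p[k][best] = 1.0  # jump to unity and stop updating
                p[k][other] = 1.0 - p[k][best]

        # Stop once every LA on the path preferred from the root has converged.
        j = 0
        for k in range(1, K + 1):
            left = 2 * j
            if p[k][left] == 1.0:
                j = left
            elif p[k][left + 1] == 1.0:
                j = left + 1
            else:
                break
        else:
            return j, t
    return None, max_iters
```

A call such as `hcpa([0.1, 0.9, 0.3, 0.2], lam=0.01, seed=0)` runs the sketch on a four-action Environment; with a small λ and well-separated reward probabilities it is expected to settle on the leaf with reward probability 0.9.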
The HCPA scheme proposed and described above has been shown to be ε-optimal in all random environments. The proofs are quite deep and intricate. However, due to space limitations, these theoretical results are omitted here. They are included in [12].

Experimental Results
To evaluate the performance of the LA-based schemes, we carried out extensive simulations for environments with a "large" number of actions, where the total number of actions was set to various values. The main aspect that we intended to demonstrate was that if the learning problem was tackled using traditional VSSA, the convergence would be both less accurate and very slow. The reason for this, as mentioned earlier, is that if the number of actions is large, many of the action probabilities would be small, implying that these actions would seldom be chosen. Thus, even if we invoked estimator-based LA, it would be unreasonable to assume that each action would be chosen "a large number of times". Further, the estimates would be correspondingly inaccurate. The HCPA resolves both of these issues.
The simulations that we conducted were intended to capture two important metrics, namely, the accuracy of the convergence of the HCPA, and the speed of its convergence. Our goal was also to compare its convergence with that of the existing LA.

The Data Sets for the Environment
The benchmark datasets reported in the existing literature had at most ten actions. In the absence of established benchmarks for larger numbers of actions, we have designed a set of Environments which can be used as benchmarks by other researchers. First of all, we determined the number of actions involved in the learning problem. To render the problem non-trivial, the total number of actions was initially configured to be 16, 32 and 64. Once the number of actions was set, the actual reward probabilities associated with the different actions were uniformly distributed in the interval between zero and unity. Understandably, the difficulty of the Environment increased with the number of actions. The reward probabilities associated with the configurations for 16 and 32 actions are the first 16 and 32 elements in Table 1, respectively. The reward probabilities of the configuration with 64 actions constitute the entire set given in Table 1.
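Nested benchmark Environments of this kind can be generated as in the sketch below. It assumes that "uniformly distributed" means drawn independently and uniformly at random from the unit interval; the function name `make_environments` and the seed are hypothetical, and the values produced are not those of Table 1.

```python
import random


def make_environments(sizes=(16, 32, 64), seed=0):
    """Draw reward probabilities for the largest Environment and reuse
    its first entries for the smaller configurations."""
    rng = random.Random(seed)
    full = [rng.random() for _ in range(max(sizes))]
    return {n: full[:n] for n in sizes}


envs = make_environments()
# envs[16] and envs[32] are prefixes of envs[64], as in the text.
```

Sharing a common prefix keeps the configurations comparable, since every smaller Environment is a sub-problem of the larger one.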

Convergence of the HCPA Algorithm
If λ is sufficiently small, the HCPA will converge to the action with the maximum reward probability. To observe the convergence of the algorithm with a minimum number of iterations, our task was to determine the optimal value of λ for the different configurations. The optimal λ is the maximum value of λ for which the LA consistently converges to the correct action. Obviously, the optimal λ varies with the configuration of the Environment. In this simulation, to find the optimal λ, we decreased the value of λ until we reached the first value for which the LA converged to the correct action in 200 consecutive experiments.
Based on our simulations, for the configuration with 64 actions, the optimal λ was 0.000051. In other words, with λ ≤ 0.000051, the system would consistently converge accurately. Similarly, the optimal values of λ for the configurations with 32 and 16 actions were 0.00085 and 0.0065, respectively. Understandably, the optimal λ increases as the Environment becomes less challenging.
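The search for the optimal λ described above can be sketched as a simple loop. The starting value, the geometric shrink factor and the callable `converges_correctly(lam, trial)` are assumptions made for illustration; the text states only that λ was decreased until 200 consecutive correct convergences were observed.

```python
def find_lambda(converges_correctly, start=0.5, shrink=0.9, runs=200, floor=1e-9):
    """Decrease lam until `runs` consecutive experiments converge correctly.

    converges_correctly(lam, trial) is a user-supplied callable that runs one
    experiment and returns True if the LA converged to the correct action.
    """
    lam = start
    while lam > floor:
        # all() stops at the first failure, so a bad lam is abandoned early.
        if all(converges_correctly(lam, trial) for trial in range(runs)):
            return lam
        lam *= shrink
    raise RuntimeError("no suitable learning parameter found above the floor")


# Deterministic stand-in for a real experiment, used only to show the call.
def toy_experiment(lam, trial):
    return lam <= 0.01


best_lam = find_lambda(toy_experiment, start=0.1, shrink=0.5)
```

Because the search only ever shrinks λ, the value returned is the largest one on the tested grid that passed all the runs, which matches the definition of the optimal λ up to the grid resolution.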

Average Convergence Iterations
To illustrate the average number of iterations before convergence, we present the simulation results of the experiments in Table 2. The standard deviations of the numbers of iterations are also included. To compare the HCPA with existing approaches, we include the simulation results for the $L_{R-I}$ and CPA machines in the same environments. The λ values utilized in the HCPA are the ones shown in Section 3.2, while the ones in the CPA are the optimal values found using the same approach explained in Section 3.2.
For each replication of the HCPA, we recorded the number of iterations at which all the LA along the correct path had converged, i.e., had action probabilities greater than or equal to 0.99. Similarly, for each trial of the CPA and the $L_{R-I}$, we recorded the number of iterations at which the LA had converged to the correct action with an action probability greater than or equal to 0.99. All the results presented in the table have been averaged over an ensemble of 400 independent replications using the optimal λ determined above. From Table 2, we can clearly see that the HCPA outperforms the CPA and the $L_{R-I}$ in general, especially when the number of actions is large. Thus, for example, for the 64-action environment, the $L_{R-I}$ required 644,234 iterations. The HCPA required less than 18% of this number of iterations, namely 115,295. These results are typical. This confirms the efficiency of the hierarchical structure when the number of actions increases.
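The ensemble statistics and the quoted ratio can be checked with a short sketch. The helper name `summarize` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used; the two iteration counts are the ones quoted above.

```python
from statistics import mean, stdev


def summarize(iteration_counts):
    """Mean and sample standard deviation over independent replications."""
    return mean(iteration_counts), stdev(iteration_counts)


# Share of the L_{R-I} iterations needed by the HCPA with 64 actions.
hcpa_iters, lri_iters = 115_295, 644_234
hcpa_share = 100 * hcpa_iters / lri_iters  # a little under 18 percent
```

In an actual experiment, `summarize` would be applied to the 400 per-replication iteration counts of each scheme to obtain the entries of a table such as Table 2.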

Environment with 128 actions
The HCPA was also tested on an environment with 128 actions, and as mentioned earlier, the testing of LA in environments with such a large number of actions is pioneering; it has not been reported in the literature. Rather than list the reward probabilities, we have plotted them in Figure 1.

Fig. 1: An example of a 128-action Environment.

In the case of the first environment plotted in Figure 1, the $L_{R-I}$ required 734,474 steps for absolute convergence, averaged over an ensemble of 400 trials. The CPA, on the other hand, required 543,529 steps, which represented a decrease of about 26%. Astonishingly, the HCPA needed only 266,257 steps. This implied an advantage of about 51% over the CPA and of almost 64% over the $L_{R-I}$. One can clearly see the advantage of the HCPA over the state-of-the-art.
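The percentages quoted above follow directly from the reported step counts, as the following sketch shows. The function name `reduction` is hypothetical; the three step counts are the ones given in the text.

```python
def reduction(baseline, improved):
    """Percentage decrease in the number of steps relative to a baseline."""
    return 100 * (1 - improved / baseline)


lri, cpa, hcpa_steps = 734_474, 543_529, 266_257

cpa_vs_lri = reduction(lri, cpa)           # about 26 percent
hcpa_vs_cpa = reduction(cpa, hcpa_steps)   # about 51 percent
hcpa_vs_lri = reduction(lri, hcpa_steps)   # almost 64 percent
```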

Conclusions
In this paper, we have pioneered a new paradigm for designing and implementing Learning Automata (LA) when the number of actions is large. Learning in environments of this type is particularly hard because the dimensionality of the action probability vector is correspondingly large, and consequently, most components of the vector will, after a relatively short time, have values that are smaller than the machine accuracy, implying that they will never be chosen. This means that the traditional LA will be sluggish and inaccurate, and it would be unreasonable to assume that each action would be chosen "a large number of times" if we invoked estimator-based LA. In this paper, we have pioneered a solution that extends the Continuous Pursuit Algorithm's (CPA's) paradigm to such large-actioned problem domains. The salient feature of our new solution is that it is hierarchical, where all the actions offered by the Environment reside as leaves of the hierarchy. Further, at every level, we merely require a two-action LA, which automatically resolves the problem of dealing with arbitrarily small action probabilities. Most importantly, since all the LA invoke the pursuit paradigm, the best action at every level trickles up towards the root. Thus, by invoking the property of the "max" operator, in which the maximum of numerous maxima is the overall maximum, the hierarchy of LA converges to the optimal action. The paper also reported experimental results that demonstrated the power of the scheme and its computational advantages.

Table 1: The reward probabilities of the 64 actions used in our experiments. The reward probabilities for the 16-action and 32-action configurations are the first 16 and 32 entries in the table, respectively.

Table 2: The simulation results obtained for various environments with different numbers of actions.