Introduction

This page compares and contrasts a number of Adaptive Heuristic Critic (AHC) algorithms, designed to explore a simulated 2D grid for a goal value. Learning is achieved with the use of a Temporal Difference (TD).

The first algorithm presented, performs a basic random search for the goal. The other algorithms build on this basic search, with the use a type of Tabu search which adds negative reinforcement to previously visited locations in order to avoid them, this information is then either discard or retained for subsequent runs. In addition two selection methods are explored, one using a strict method of selecting the highest immediate reward and a second using a variable stochastic element. The implementation of these algorithms and selection methods are then tested in simulation and an analysis made of the effectiveness of each.

The following chapter describes the Adaptive Heuristic Critic with Temporal Difference learning and gives an overview of Tabu search.

An Adaptive Heuristic Critic (AHC) is a type of reinforcement learning architecture. AHCs have been successfully applied to a number of problems, including the ‘inverted pendulum, Towers of Hanoi and foraging tasks’ Sutton (1992).

Reinforcement learning, as described by Kaelbling et al (1996), “is a way of programming agents by reward and punishment without needing to specify how the task is to be achieved”. Reinforcement learning is used in complex, and often partially observable situations where evaluating the system as a whole would be too time consuming, if not impossible. There are many differnt types of reinforcement learning architectures, however, all use feedback (negative, positve or both) as reward or punishment to signify how well the agent has performed in any given situation. This feedback is essential; “without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make” (Russell and Norvig, 2003).

The AHC approach to assigning reward is to evaluate the system at each step where possible, instead of evaluating the system as a whole at the end, “The Adaptive Heuristic Critic architecture uses a predictor of return (the long-term cumulative reward), not reward, to take non-immediate rewards into account” (Nehmzow, 2003). In this way the AHC is capable of applying feedback to the steps preceding the final state. Nehmzow (2003), reviews the current state of research in reinforcement learning, “One fundamental problem of reinforcement learning is the temporal credit assignment problem: because reinforcement is only received once the final goal state has been reached, it is hard to apportion credit to the actions that preceded the final, successful action”. This problem is often addressed by “learning an internal evaluation function which predicts the long-term reward of taking action A in state S” (Nehmzow, 2003). An AHC is a system, which uses heuristics to learn the internal evaluation function.

2.1. What’s the Temporal Difference

In AHC a temporal difference (TD) is used as an immediate learning technique. Temporal difference is often represented as TD(?) with ? being any whole number from 0 to infinity. A TD of 0 signifies that learning will be passed back to the previous state, with each increment of TD(n), the learning value is passed back to the previous n+1 steps.

Using an example of an agent repeatably searching for a goal, TD learning can be used to eventually predict in which direction the agent should move to find the goal from any location. “In effect, the learning system predicts reward, and uses this predicted reward for the learning system, until actual external reward is received.” (Nehmzow, 2003).

Increasing the TD value, in the majority of circumstances, increases the learning rate of the agent. However, the choice of what value TD to use depends on a number of issues. The first issue is that with a higher value TD more steps need to be remembered and passed back to previous steps, which uses more computational resources. This is a small issue in the simple grid exploration task, however in real world applications, such as autonomous robots, it is likely that much more data would need to be processed at each step. A second issue is that of premature convergence, with higher TD values converging on a sub-optimal solution. This issue is explained in more detail in the following section.

2.2. Convergence

Convergence, in the context of reinforcement learning, is where individual parts of an overall solution change until all parts are close to or equal than each other. In the best case the overall solution will converge and produce an optimal solution. If the convergence happens prematurely then this can result in a sub-optimal solution.

Premature convergence often occurs when strict rules are enforced. In the case of the AHC exploration example, premature convergence would result in an indirect route to the goal. A type of strict rule is the ‘greedy agent’, this is a type of selection, which always chooses the best option, “Repeated experiments show that the greedy agent very seldom converges to the optimal policy for the environment and sometimes converges to really horrendous policies.” (Russell and Norvig, 2003). A common way to reduce the chances of premature convergence is with the use of a stochastic element, such as a random number, in any decision processes. In the AHC exploration example, this could be used to make the agent go in the best direction the majority of the time, but to try a different direction some of the time.

2.3. Tabu Search

A Tabu search is a technique used for optimisation. Hertz et al (1995) describes the Tabu search as, “a method of keeping track of not only local information but also of some information related to the exploration process”. An essential component of Tabu search is the use of memory, “As soon as non-improving moves are possible, the risk of visiting again a solution and more generally cycling is present. This is the point where the use of memory is helpful to forbid moves which might lead to recently visited solutions.” (Hertz et al, 1995).

Tabu search is currently not an established technique in AHC. The benefits of using this technique are that an agent would have the ability to remember its previous steps, using either a short term or long term memory. This means, in the context of exploration, the memory could be used in order to stop the agent needlessly retracing its steps. This ability would come with the costs of greater computation and storage.

The following chapter describes the AHC algorithms implemented, two of which are based around a type of Tabu search.

Grid Lock Implementation

The AHC model was implemented in C# code (Code available on request), the application produced is named ‘Grid Lock’, as the gird over time resembles a bird’s eye view of a traffic jam.

A GUI is used to adjust the starting parameters in order to change between the different AHC algorithms. As shown in Appendix A, the grid is represented as a 2D image; it depicts the location of the agent, the starting position, the goal, grid cells containing reward and the grid cells previously explored, depending on the algorithm being used. In addition the grid can either be set to show an approximation or the actual utility value for each grid cell, by use of a varying colour or number. This produces a visual map of the paths containing the highest reward values. In order to test the efficiently of each algorithm, a graph is provided, which plots the results for each run and the number of steps taken to find the goal.

The algorithms which were implemented are presented in pseudo code in the following sections. These algorithms were decided on before any coding had taken place and are precisely implemented in the code.

3.1. The Agent

The agent is represented by a simple ‘grid explorer’, which is able to move north, east, south, west, within the confines of the grid. When the cell moved to by the agent contains the goal, or a reward value then a utility value is passed back to a number of the previous cells as denied by the temporal difference. It is important to note, that the agent does not know where it is on the grid in relation to anything else. The agent is only able to peak at the values of its neighbouring grid cells.

3.2. Exploration and Exploitation

The exploration element of the AHC grid simulation is involved with how the agent moves around the grid from cell-to-cell. Two separate exploration elements have been implemented, either of which can be selected during a simulation run. The first exploration type, the ‘greedy selection’, selects the highest available move, whereas the second type employs a variable stochastic element. These two exploration types are described in the next sections.

It is important to have a balance between the amount of exploration and the amount of exploitation, as it cannot be assumed that the agent will find the shortest path to the ‘goal’ first time. “Pure exploitation risks getting stuck in a rut. Pure exploration to improve one’s knowledge is of no use if one never puts that knowledge into practice.” (Russell and Norvig, 2003). The agent needs to use exploration to, not only improve its immediate reward, but to maximise its long-term rewards. This can often come at an initial cost of exploration steps, however as Russell and Norvig (2003) state, “With greater understanding, less exploration is necessary”.

The two selection methods are used to show the difference between using a definite method for selecting a move and using a varying method based on some stochastic decisions. The intent is to compare the two methods and show what effects a stochastic method for exploration has on the process of exploitation.

3.2.1 Greedy Selection

This method selects the highest neighbouring utility value that is available, if there are two or more the same then one is selected at random, if there are no cells with utility values then again one is chosen at random.

3.2.2 Roulette Wheel Selection

A selection method with a stochastic element has been implemented, which employs a technique often used for evolutionary algorithms, that of a ‘Roulette Wheel’ selection. This selection technique, as described by Fogel (2000), is used to proportionally divide a number of differently valued options, with the higher ranked options receiving a greater proportion of positions in the ‘Roulette Wheel’. A position is then chosen randomly from the ‘Roulette Wheel’. The idea is that higher ranked options receive a greater proportion of the wheel, giving those options a greater chance of being selected.

The options available to the agent at each step can differ depending on what moves have been made before. The selection works on the basis that at least one move will always be possible. The possible options, which can be available to the agent at any one time, and default percentage probability weights for each option, are:

• Choose the Highest Reward Grid Cell – 75%
• Choose a lower Reward Grid Cell – 10%
• Choose a Grid Cell not previously visited – 10%
• Choose a Grid Cell Previously Visited – 5%

The difference between this implementation of the ‘Roulette Wheel’ selection and that which is often used for evolutionary algorithms is these operate on a constant number of options, however in the case of an exploration selection the options can change. To overcome this problem the selection uses the proportional percentages based on which options are valid at each selection time.

3.2.3. Avoiding Previous Locations

For both selection techniques it is possible to set the simulation to either identify or overlook previously explored grid cells. By identifying previous grid cells it is possible to stop the agent from repeatably exploring cells with no utility values if a new step is available, this can be thought of as a type of negative reinforcement, ‘punishing’ previously explored locations. This process is based on a type of Tabu search, described in chapter 2, the algorithms for which are described in more detail in section 3.4.

3.3. Starting Positions

The starting position for the agent is a location on the grid where exploration starts. Three different starting position options have been implemented, these are:

• Start from the top left grid cell
• Start in any grid cell randomly
• Start in any grid cell at the edge of the grid randomly

The starting at any edge grid cell was a method used to produce a more repeatable set of results. As with the random start, the agent on one run may start next to the goal and on the next far away. It could be argued that if this were a real environment, then the majority of the time the agent would start some distance from the goal, thus starting at the edge of the grid is more realistic.

3.4. Algorithm Decisions

To investigate the effect different design decisions had on the average number of steps required for the agent to reach the goal, a number of algorithms were produced. These algorithms are described in the sections that follow and show the decision process made before each move is made. These algorithms are presented in pseudo code for readability purposes and accurately represent the implemented code. All the algorithms assume that the grid has been initialised with zero values, the goal set and the agent placed at random.

3.4.1. Roulette Wheel Selection

The move decision algorithms that are detailed in the proceeding sections, all use the ‘greedy’ selection method. However the ‘Roulette Wheel’ selection shown below can be used in its place, to provide the algorithms with a stochastic element.

``` Choose Random Number From 0 to Sum of Decision Probabilities

If (Number is between 0 and Highest Cell Probability)
Move to Highest Reward Cell
Else-If (Number is between Highest Cell Probability and Lower Cell Probability)
Move to a Lower Reward Cell
Else-If (Number is between Lower Cell Probability and New Cell Probability)
Move to a New Cell
Else-If (Number is between New Cell Probability and Previous Cell Probability)
Move to a Previously Explore Cell
End-If
```

3.4.2. No New Cell Marking

The basic algorithm, shown below, has been implemented to perform a simple search and move around the grid. The algorithm implements the ‘greedy’ selection method and will follow the highest utility path.

``` Do
Search Neighbouring Grid Cells
If (Number of Grid Cells containing Reward Values > 0)
If (Number of Cells with the Highest Reward == 1)
Move to Highest Reward Cell
Else
Move to a RANDOM Highest Reward Cell
End-If
Set Temporal Difference of previous n Grid Cells
Else
Move to Grid Cell at RANDOM
End-If
Until Goal Found
```

3.4.3 Mark New Cells and Remember

The mark and remember algorithm extends the use of the agents ability to mark a value on the grid. This algorithm is an extension of the one previous, the difference is the agent uses negative reinforcement to mark on the grid an ‘explored’ value to any ‘new’ cell which it moves to. This value is stored as a utility value and is overwritten by any subsequent greater rewards. As with the previous algorithm the agent favours higher reward cells and will always move in the direction of the highest neighbouring cell. As the ‘explored’ value is set at -100 and the grid is initialised with 0 values, no additional routines are needed to make the agent choose an unexplored cell.

This algorithm uses many of the concepts of a type of Tabu search, described in chapter 2, although the algorithm was formulated before any knowledge of the Tabu Search technique was known. The Tabu search method makes use of memory to remember previous steps, which as described by Hertz et al (1995), is often part of the agent. In this algorithm the agent contains no memory of the previously visited grid cells other than that used for temporal difference. Instead the grid is used to hold the values of the previously explored cells, which in this algorithm are stored between runs, so subsequent explorations use the previously explored information from all prior runs.

``` Do
Search Neighbouring Grid Cells
If (Number of Grid Cells containing Reward Values > 0)
If (Number of Cells with the Highest Reward == 1)
Move to Highest Reward Cell
Else
Move to a RANDOM Highest Reward Cell
End-If
Set Temporal Difference of previous n Grid Cell(s)
Else
If (All Grid Cells are Equal)
Move to Grid Cell at RANDOM
Else-If (Unexplored Grid Cells == 1)
Move to Unexplored Grid Cell
Else
Move to a RANDOM unexplored Grid Cell
End-If
End-If
If (Current grid cell == NOT_EXPLORED)
Current grid cell = EXPLORED
End-If
Until Goal Found
```

3.4.4 Mark New Cells and Reset

The final algorithm is the same as the previous, with the addition of a function call, which when the goal is found resets all grid cell values containing an explored value (-100) to an unexplored value (0). This algorithm and the one previous were designed to exploit the current abilities of the simple agent without adding any new abilities, such as memory or moving diagonally, in order to maximise its efficiency in finding the goal.

The following chapter analyses the results gained from the implementation of these algorithms.

Analysis of Algorithms

The following sub sections review the different algorithms produced and with the use of graphs, show the relative effectiveness of each. The graphs are produced from an average of 100 separate trials, to remove the majority of anomalies, which are an effect of using random searches.

The AHC simulation application produced provides a number of independently adjustable parameters, which can be set to produce a vast number of different configurations. This discussion will be limited to testing with TD(0), TD(1000), ‘greedy’ selection and a small number of different ‘Roulette Wheel’ values. In the tests TD(1000) is equivalent to passing reward back to every move the agent makes.

The tests described below, use the same initial settings. The explorer started at a random position at the edge of the grid each time, as this gives a fair distance from the goal and still allows for a number of random stating positions. The goal remained stationary for each run in the centre of the grid.

4.1. Temporal Difference Sizes

A number of trials were run with different TD values, these trials identified two important issues concerning large TD values. The first issue is that areas of the grid are not explored, which could potentially contain higher reward values. The second issue is concerned with indirect ‘concrete’ routes being defined. Figures 4.1 – 4.3 show the result of three separate tests. Figure 4.1 is the result of 20 runs to the goal, using a TD of 0, ‘greedy’ selection and marking the explored grid cells. Figure 4.2 is the same as Figure 4.1, except it uses a TD of 1000. Figure 4.3 is the same as Figure 4.2, except it uses ‘Roulette Wheel’ selection. It can be seen from Figure 4.2 that using a large TD value can result in areas of the grid not being explored, a potential problem of this is if multiple, different valued, goals were used then in this case the larger reward goal may never be found. This issue is addressed by using a stochastic element in the decision process, as implemented in the ‘Roulette Wheel’ selection. Figure 4.3 shows a trial using the ‘Roulette Wheel’ selection, it can be seen from this that by randomly choosing a different route some of the time, the agent is able to eventually explore the whole grid.

The second issue, concerning large TD values, is that of premature convergence resulting in unnecessarily long ‘concrete’ routes being defined. A number of tests were run (see Appendix B) which attempted to address this problem. These tests involved using the ‘Roulette Wheel’ selection method set with a number of different values and comparing these results with that of the ‘greedy’ selection. It can be seen from these graphs provided in Appendix B that using a stochastic element in the search had a small negative effect on the number of steps needed to find the goal. This was not the result expected. As mentioned in chapter 3, it was thought that a stochastic element would find new paths in the grid, which would reduce the average number of steps needed to find the goal.

It can be seen from these two sets of data that a stochastic element in the move selection, partly solves the two issues raised concerning large TD values. The ‘Roulette Wheel’ selection method is able to search almost the entire grid, finding all possible goals, whereas the ‘greedy’ selection can overlook large areas of potentially higher reward cells. However it has not be proven that a stochastic approach to exploring, finds better paths to the goal, in such away as to reduce the number of steps needed to reach the goal. However, it seems likely that if a form of negative reinforcement, described in chapter 2, were used, feeding back results of better paths found, then the agent would be able to learn and exploit these new paths, instead of only finding them. This method of negative reinforcement has not been implemented, due to time issues, however this could be seen as a natural progression to the results already achieved.

4.2. Marking Previously Visited Grid Cells

This section relates to the testing of the Tabu type search, described in chapter 3. Each test is used to highlight the difference between the agent exploring the grid with no concept of cells explored cells and of the agent exploring only new available cells. Each trial uses TD(0) and a grid of 16×16, again these tests were iterated 100 times to remove any anomalies. Figure 4.4 – No Marking of New Grid Cells Figure 4.5 – Marking New Grid Cells, Not Resetting after Goal Found Figure 4.6 – Marking New Grid Cells, Resetting after Goal Found

The resulting graphs from the, ‘marking explored location’ tests, are shown in Figures 4.4 – 4.6. These graphs show 100 runs of the agent finding the goal and the number of steps needed at each run (Shown by the red line). The graphs also shows, by means of the blue line, the number of unexplored grid cells at the start of each run.

From these results it can be seen that an agent marking previous explorer locations and using these to choose subsequent moves, initially finds the goal in less number of steps than an agent that is not. It can also be shown that after a number of runs, all three methods eventually converge on the same number of average steps needed to find the goal. In the case of the 16 x 16 grid this is near 80 runs, however more testing would be needed to show how this convergence relates to the size of the grid.

The graph in Figure 4.5, ‘not resetting after the goal is found’, shows the same value for run 1 as the graph in Figure 4.6. This is to be expected as for the first run, in both cases, the grid has not been explored thus no grid cells are marked. The graph of Figure 4.6 shows a large improvement from that of Figure 4.5 the gradient of the number of steps needed for each run is much smoother and it can be seen that from the first run to the last there is a continual improvement of the number of steps required to find the goal.

It can be seen in the graph of Figure 4.5, by means of the blue line, that for the first few runs the agent finds the goal in a decreasing number of steps. This continues until half of the grid has been explored, where at which point the number of steps increases. After approximately 20 runs all of the grid cells have been explored, consequently the remainder of the graph closely matches that of Figure 4.4. It can be concluded from this that, the initial marking and following of new locations has no effect on the length of routes produced. It could be reasoned that if multiple goals were present then this may have an effect on the routes produced as the agent is exploring the grid earlier the goals could be found sooner.

Conclusion

This discussion has explored various approaches to implementing an Adaptive Heuristic Critic (AHC). Three algorithms for searching a simple grid and two selection methods have been implemented and tested. Results have been presented, which show that an agent that is able to know which locations it has explored previously in a current run, is able to avoid retracing its steps and find the goal faster and more efficiently. It has also been shown that increasing the Temporal Difference size improves the learning rate, but can quickly define indirect routes.

The main issues of AHC have been addressed and methods have been presented, which can be used to optimise a simple grid exploration example. A number of important concerns have been raised from testing the implementation, which could be the subject of future work. The first is that of premature convergence, which resulted in indirect routes to the goal, as described in chapter 4. An approach, which may lead to a solution, would be to use negative reinforcement to complement the stochastic operation of the ‘Roulette Wheel’ selection. This could be done using the steps recorded in the TD as a short-term memory to compare how ‘good’ a path is at the current position with the path of a previous position. This information could be used to attribute negative reinforcement to an individual cell or even the entire sub-route.

The algorithms implemented have all been tested in simulation, however they could be extended to real world situations, for example a mobile robot using the ‘mark new locations’ algorithm. This could involve a robot agent producing a trail to show where it has been, much like an animal scent trail, and using this as part of an exploration decision process.

Appendices

Appendix A: Grid Lock GUI Appendix B: Results

The graphs shown below present the results of three tests using different selection methods.

‘Greedy’ Selection 500 Runs, 10 Trials and Temporal Difference = 1000

‘Roulette Wheel’ Selection 500 Runs, 10 Trials and Temporal Difference = 1000 500 Runs, 10 Trials and Temporal Difference = 1000

References

Sutton, S. (1992), Reinforcement Learning Architectures. Proceedings ISKIT ’92 International Symposium on Neural Information Processing. Fukuoka Japan.

Kaelbling, L., Littman, M., Moore, A. (1996), Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, pp 237-285.

Russel, S., Norvig, P. (2003), Artificial Intelligence A Modern Approach: International Edition. 2nd ed. New Jersey: Prentice Hall.

Nehmzow, U. (2003), Mobile Robotics A Practical Introduction. 2nd ed. London: Springer-Verlag.

Hertz, A., Taillard, E., Werra, D. (1995), A Tutorial on Tabu Search. Proceedings of Giornate di Lavoro AIRO’95. pp 12-24.

Fogel, D. (2000), Evolutionary Computation. 2nd ed. New York: IEEE Press.