A Platform for Implementation of Q-Learning Experiments

Francesco De Comité

1  Presentation

1.1  Introduction

This document presents a Platform for Implementing Q-Learning Experiments (PIQLE ) : it consists of a collection of Java packages written for :
  1. Concretely implement and test reinforcement learning algorithms.
  2. Code classic exercices found in Reinforcement Learning or Machine Learning books (as  [SB98], or  [Mit97] for example).
  3. Easily code new problems, focusing only on their representation and coding.
  4. Compare different algorithms, or different settings of a particular algorithm.
  5. Quickly implement and test new algorithms from their (theoretical) description.
  6. Verify performance of new algorithms.
  7. Find optimal, or almost optimal policies for a given problem.
  8. Being able to understand or explicit those policies.
  9. Creating, saving and using an agent applying this policy.
The program comes in two versions : The modularity of the program conception makes it easy to focus only on the part of interest for the user :

1.2  Where to find information

Please feel free to send me critics, remarks, bugs, ideas...

2  Getting started

At this point, everything must have been ok...

3  General Organisation

The packages architecture is composed of four main packages, several packages to concretely implement problems, and some technical other packages.

3.1  Main packages

3.1.1  Agents

An agent has only one thing to do : knowing its state, choose an action (Method agir()).

To make this choice, it first builds the list of all possible moves, and asks its adviser which one it must play. Following this advice, it plays this move, get a reward, and informs its adviser of : This is done in the protected method applyAction. The adviser, depending on its definition, can use this information to improve its subsequent choices (learning).

In addition, agents for two-players games are also informed of there opponent move.

3.1.2  Arbitres

Referees (arbitres in french) control the progress of an episode : There are two kinds of referees : The main method of both classes is episode, which returns : Both also retain the total reward of the last episode.

3.1.3  Algorithmes

All algorithms implements the Selectionneur interface, whose two main methods are :

3.2  The coding of the problem

Any problem is an instantiation of three classes : Action, State, Contraintes. Coding a game just needs to define those instantiations, the genericity of the rest of the program makes it immediate to use this coding with whatever algorithm you choose. As Q-Learning algorithms intensively use storing of values for (state,action) pairs, you must redefine hashCode and equals methods in Action and State instanciations.

4  One player puzzle coding example

The gambler game is describe in  [SB98], page 101 : starting with an initial capital of less than 100 $, a gambler can bet the amount he wants, provided its capital permits it. He then flips a coin, which falls on heads with probability p. If he wins, the amount he bet is added to its capital, otherwise he looses it. He wins when he arrives at 100 $, and loose when its capital falls to zero.

We just focus on the coding of this problem, letting the discussion on optimal strategies to the reader.

Please look at java files for more details.

4.1  Coding Action

Coding an action is just memorizing the current bet :

this is done in ActionGambler.java.

4.2  Coding State

Here again, coding is only the storage of an integer. This coding is in EtatGambler.java.

4.3  Coding Contraintes

A little bit mode code in GamblerGame.java : See GamblerTest.java to see an example of those classes' use : the data printed on the screen can be used to draw the curves in  [SB98], page 103.

5  A two players game coding example

Jenga is a two players skill game : the players must in turn take a block out of a tower, and put it on top of this tower. The player who make the tower fall looses.

If we consider that both players are of perfect skill, then the game becomes a Nim-like game, and the winning strategy can be expressed as a function of : See  [Zwi02] for an in-depth analysis.

5.1  Coding Action

There are only three static actions : we just have to give them a number.

5.2  Coding State

We mainly have to memorize the three integers X,Y,Z. We do not use the modulo 3 simplification : Q-Learning algorithms are strong enough to learn without this information.

The sample program JengaTest.java trains two Q-Learning players against each other. Program output shows that, after circa 10000 games, player two has found a strategy strong enough to beat its opponent. This result corresponds to Zwick's analysis.

Nevertheless, note that in order to obtain fast and strong convergence, we use geometrical decay of a, meaning that at each learning step, a is multiplied by a constant. This speed of decay does not fulfill the conditions theoretically needed for convergence of Q-Learning.

More reflexion would lead to finely tune alpha and alphaDecayPower, to reach convergence in a theoretically proven frame...

6  Coded games and puzzles

It is possible to define a graphical interface for one player games and puzzles. The default is to open a window displaying the current state (as defined by the toString method). See packages voitures or labyrinthes for examples of non-default graphical interface

6.1  Gambler

See section 4 : The gambler problem described in  [SB98], page 101. Aside from allowing to try different algorithms on it, the class can compute V* values by Value Iteration algorithm, and extract an optimal policy from thos values : this is done by implementing the interface ValueIteration : each time the set of states is enumerable, such functions can be implemented.

6.2  Jack's Car Rental

This problem, described and analysed in [SB98], page 98, also supports Value Iteration method. The sample program TestCarRenTal.java shows how to redraw the book's diagrams.

6.3  Mountain-Car Task

Described and analysed in [SB98], page 214. We do not use neither Sarsa algorithm, nor Tile Coding, because those features are not still implemented. Nevertheless, the program Mountain.java shows that we can solve the task. By choosing the algorithm, and tuning a, one can control the convergence rate.

The last part of the program Mountain.java produces data enabling to redraw the plots described in [SB98].

Please note again that convergence is obtained, despite the fact that the ai do not respect condition (2.8) page 39 in [SB98].

6.4  Jenga

See section 5 : a easy to code, easy to play, fast and short game with a optimal strategy. Can serve as benchmark for algorithms.

6.5  Labyrinthes

Labyrinthe is the french word for maze : rectangular two dimensional mazes, with walls and treasures (end cells). The design of the maze can be controlled in different ways : See MazeTest.java for a basic example : the maze is a 6x6 one, with a treasure at the center of the maze, surrounded by wall in 6 directions. Optimal solution is found in approximatively 150 steps.

Modifying the default graphical output, one could follow the learning process, by looking at the agent behaviour. This is done by redefining methods setGraphics and sendState in abstract class AbstractContraintesSolitaire from package environnement. Please have a look at Labyrinthe.java and LabAlice.java.

MazeTest.java is a toy-size example, the platform can handle much bigger mazes without problems.

Several implementations and definitions of mazes are implemented : Those different settings illustrate the difference between plant state and observed state, and thus the PIQLE ability to manipulate POMDP's.


Figure 1: A robot on treasure island





Figure 2: Alice's adventures in Legoland (see [AS04])





Figure 3: Alice's Adventures in Legoland ,part II (see [AS04])



6.6  Car race

The package voitures is an attempt of implementing more continuous schemes, will the general architecture of the project still allows only to handle discrete implementations. A mobile as to find its way inside an arena filled with obstacles. The reward depends on : Two types of arenas are provided, but one can easily define other configuration : Figures 4 and 5 show screenshoots of those two implementations.


Figure 4: A circular arena





Figure 5: An arena filled with obstacles



The robot perceives its environement by the mean of an elementary linear retina of m pixels, telling it how far are the obstacles in m directions. The angle of vision (between two retina's pixel) and the number of pixels are parameters and can be setted. The true distance between the robot and a wall (for a certain angle) is computed, and then disretized, using the notions of near, median,far. Those notions are themselves learned by the algorithm.

At this time (October 2005), this puzzle is the only one for which we defined a non-trivial graphical interface. See program TestCourse.java for an illustration.

6.7  Memory

Memory is a popular two-players game : cards are disposed face hidden on the table. At his turn, each player must flip two cards : if they match, he take them, otherwise he put them again, face hidden, on the board.

[ZP93] analysed this game, proving that there is a winning strategy.

The game is given in two flavors : one that try to maximize the number of cards won, the other trying to maximize the probability of winning.

6.8  Missionnaires

Missionaries and Cannibals is an old puzzle : three missionaries and three cannibals must cross a river, using a boat that can take only two people on board. Cannibals can never be in majority on a bank, as they would eat the clergymen.

The puzzle as a 11-states solution. Not too difficult to find, this implementation can serve as basic benchmark for candidate algorithms.

Generalizations (by means of modifying the parameters : boat capacity, number of people) are also possible, and solutions are very often easily found.

The game comes also in two flavors : one only generating diet moves (no one is eated), the other generating all moves, and giving negative reward when a move entails someone being eated.

6.9  Othello

The game is more difficult to learn than the others : the rules are implemented, algorithms can be apply, but at this time, I never succeeded in making it learn to play at a reasonable level...

6.10  Random Walk

A toy example from  [SB98], page 139 : mainly serves to compare Q-Learning and Value Iteration.

6.11  Tic-Tac-Toe

The game is neither too big, nor too difficult to program. Moves and games are quite limited, and building a good player is not hard.

See the two programs :

7  Coding new problems

Coding new problems consists mainly in implementing Interfaces Action, State and Contraintes.

The abstract class Abstract State defined some common methods shared by quite all instanciations of State.

Make Action and State implementations as basic as possible, defining the problem quite completely in the implementation of Contraintes.

8  Implemented algorithms

8.1  Parameters

Most of implemented algorithms are variations of reinforcement learning algorithms. It was the initial purpose of this platform, that is why agents are acting and receiving rewards.

Two others algorithms are provided, to serve as benchmarks or sparring partners: Another algorithm is coded in hardware, and can be seen into the class AgentTTT, which is a non-generic agent dedicated to tic-tac-toe.

All algorithms except the one in AgentTTT are generic algorithms, which can be used by any agent on any problem.

For the same algorithm, performances can vary from one problem to another, and you will have to tune the parameters to control their convergence.

Typical parameters are :

8.2  The algorithms

9  Implementing new algorithms

New algorithms must comply to the Selectionneur interface. This interface only contains four methods : As you can (or will) see, you don't have to know nothing about the specificity of the problem you want to study ...

10  Interface with Weka

Weka [WF00]consists mainly in a set of machine learning algorithms written in Java. The second version of our algorithm (piqleWeka.tgz) is able to build datasets in a format suitable for feeding Weka's algorithms. Please refer to Weka documentation for more information.

More precisely, each Q-Learning algorithm can extract a dataset from the (state,action,value) table it memorized. The coding of an example is the concatenation of the coding of state, the coding of action, and eventually the coding of the value.

There are three ways to extract a dataset from examples : Once the datasets are build, one can either save them into arff files, and study them into Weka graphical interface, or call weka classes inside one's own programs.

The sample programs mainly use the first way of using weka.

11  Conclusion

We have presented a framework that we think makes it easy to explore reinforcement learning topics, from the coding, testing, and comparaison of RL algorithms to the implementation of new problems and the discovery of their optimal strategies.

The genericity of the construction limits the amount of coding time, letting the user concentrates himself on the output of his programs.

New ideas on algorithms can be tested quickly, and the available program outputs make it easy to graphically visualize results.

12  Perspectives

References

[AS04]
M. Asadpour and Roland Siegwart. Cdompact q-learning optimized for micro-robots with processing and memory constraints. Robotics and Autonomous Systems, 48(1):49--61, August 2004.

[Mit97]
T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[PW94]
Jing Peng and Ronald J. Williams. Incremental multi-step q-learning. In International Conference on Machine Learning, pages 226--232, 1994.

[SB98]
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. A Bradford Book.

[Wat89]
C. J. Watkins. Learning from delayed rewards. PhD thesis, Cambridge university, 1989.

[WF00]
Ian H. Witten and Eibe Frank. Data Mining : Practical Machine Learning tools and Technics with Java Implementations. Morgan Kaufmann, 2000.

[ZP93]
Uri Zwick and Michael S. Paterson. The memory game. Theor. Comput. Sci., 110(1):169--196, 1993.

[Zwi02]
Uri Zwick. Jenga. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 243--246. Society for Industrial and Applied Mathematics, 2002.

This document was translated from LATEX by HEVEA.