This document presents a Platform for Implementing Q-Learning Experiments (PIQLE ) : it consists of a collection of Java packages written for :
Concretely implement and test reinforcement learning algorithms.
Code classic exercices found in Reinforcement Learning or Machine Learning books (as [SB98], or [Mit97] for example).
Easily code new problems, focusing only on their representation and coding.
Compare different algorithms, or different settings of a particular algorithm.
Quickly implement and test new algorithms from their (theoretical) description.
Verify performance of new algorithms.
Find optimal, or almost optimal policies for a given problem.
Being able to understand or explicit those policies.
Creating, saving and using an agent applying this policy.
The program comes in two versions :
A standalone version (piqle.tgz).
A version that uses Weka [WF00] to transform examples in Arff format, which allows to fill them into machine learning algorithms. See section 10 for more details.
The modularity of the program conception makes it easy to focus only on the part of interest for the user :
Trying to find or understand good policies for a problem needs only to concentrate on the coding of the problem, and the interpretation of episodes output, neglecting the implementation of the learning algorithms.
When comparing algorithms, one can use predefined problems to test those algorithms on the same benchmark.
When trying to build a good player for a given game, just choose the learning algorithm, let the program run until the result is judged good enough, then save the player.
You obtain a sources directory containing sample programs and package directories, and a docs directory containing the associated javadoc. File makedoc serves as an example/reminder to rebuild the doc.
Create a classes directory.
Choose the program you want to compile, for example sampleProgram1.java
Compile it : javac -d classes -sourcepath sources sources/sampleProgram1.java, assuming you are in the father directory of sources
An agent has only one thing to do : knowing its state, choose an action (Method agir()).
To make this choice, it first builds the list of all possible moves, and asks its adviser which one it must play. Following this advice, it plays this move, get a reward, and informs its adviser of :
Its preceeding state.
The action it took.
The state it is actually in.
The reward it received.
This is done in the protected method applyAction. The adviser, depending on its definition, can use this information to improve its subsequent choices (learning).
In addition, agents for two-players games are also informed of there opponent move.
Any problem is an instantiation of three classes : Action, State, Contraintes.
Coding a game just needs to define those instantiations, the genericity of the rest of the program makes it immediate to use this coding with whatever algorithm you choose.
Action : define a suitable coding of basic possible moves.
State : same as above.
Contraintes : here must be detailed things like initial state, passing from one state to another, testing whether the game is finished, and who won it, attributing rewards.
As Q-Learning algorithms intensively use storing of values for (state,action) pairs, you must redefine hashCode and equals methods in Action and State instanciations.
The gambler game is describe in [SB98], page 101 : starting with an initial capital of less than 100 $, a gambler can bet the amount he wants, provided its capital permits it.
He then flips a coin, which falls on heads with probability p. If he wins, the amount he bet is added to its capital, otherwise he looses it.
He wins when he arrives at 100 $, and loose when its capital falls to zero.
We just focus on the coding of this problem, letting the discussion on optimal strategies to the reader.
Jenga is a two players skill game : the players must in turn take a block out of a tower, and put it on top of this tower.
The player who make the tower fall looses.
If we consider that both players are of perfect skill, then the game becomes a Nim-like game, and the winning strategy can be expressed as a function of :
The number of still complete levels (modulo 3) : X.
The number of levels with two two adjacents blocks (modulo 3): Y.
We mainly have to memorize the three integers X,Y,Z. We do not use the modulo 3 simplification : Q-Learning algorithms are strong enough to learn without this information.
The sample program JengaTest.java trains two Q-Learning players against each other. Program output shows that, after circa 10000 games, player two has found a strategy strong enough to beat its opponent. This result corresponds to Zwick's analysis.
Nevertheless, note that in order to obtain fast and strong convergence, we use geometrical decay of a, meaning that at each learning step, a is multiplied by a constant. This speed of decay does not fulfill the conditions theoretically needed for convergence of Q-Learning.
More reflexion would lead to finely tune alpha and alphaDecayPower, to reach convergence in a theoretically proven frame...
It is possible to define a graphical interface for one player games and puzzles. The default is to open a window displaying the current state (as defined by the toString method). See packages voitures or labyrinthes for examples of non-default graphical interface
See section 4 : The gambler problem described in [SB98], page 101. Aside from allowing to try different algorithms on it, the class can compute V* values by Value Iteration algorithm, and extract an optimal policy from thos values : this is done by implementing the interface ValueIteration : each time the set of states is enumerable, such functions can be implemented.
This problem, described and analysed in [SB98], page 98, also supports Value Iteration method. The sample program TestCarRenTal.java shows how to redraw the book's diagrams.
Described and analysed in [SB98], page 214. We do not use neither Sarsa algorithm, nor Tile Coding, because those features are not still implemented. Nevertheless, the program Mountain.java shows that we can solve the task. By choosing the algorithm, and tuning a, one can control the convergence rate.
The last part of the program Mountain.java produces data enabling to redraw the plots described in [SB98].
Please note again that convergence is obtained, despite the fact that the ai do not respect condition (2.8) page 39 in [SB98].
Labyrinthe is the french word for maze : rectangular two dimensional mazes, with walls and treasures (end cells). The design of the maze can be controlled in different ways :
Randomly add walls and treasures.
Give a design to the Constructor.
See MazeTest.java for a basic example : the maze is a 6x6 one, with a treasure at the center of the maze, surrounded by wall in 6 directions. Optimal solution is found in approximatively 150 steps.
Modifying the default graphical output, one could follow the learning process, by looking at the agent behaviour. This is done by redefining methods setGraphics and sendState in abstract class AbstractContraintesSolitaire from package environnement. Please have a look at Labyrinthe.java and LabAlice.java.
MazeTest.java is a toy-size example, the platform can handle much bigger mazes without problems.
Several implementations and definitions of mazes are implemented :
Classes Labyrinthe, EtatLabyrinthe, Action Labyrinthe : the agent knows it position (as a set of coordinates) into the maze. Roughly speaking, the agent can use a kind of primitive GPS system.
Classes LabLocal,EtatLabLocal,ActionLabyrinthe : the agents knows its orientation, and what it sees in adjacent celles, in 8 directions. The agent has lost its GPS, but still has a compass...
ClassesLabAlice,EtatAlice,ActionALice : The unfortunate agent also looses its compass : it just knows what it sees in the four adjacent cells (forward, backward, left, right). This situation better simulates the behaviour of a mini robot using limited capacity sensors. See [AS04], sample program AliceMazeWeka.java, and algorithm SelectionneurQLINT as an example of using PIQLE to simulate a real robot behaviour.
Figures 2 and 3 are graphical representation of this kind of problems.
Classes LabAliceDistance,EtatAliceDistance,ActionAlice : the agent can still only see what surrounds it, but at a distance of two squares : its prowimity detector is more sensitive.
ClassesLabAliceDistanceContinu,EtatAliceDistanceContinu,ActionAlice : the robot can estimate the distance to the nearest wall : the result is still an integer.
Those different settings illustrate the difference between plant state and observed state, and thus the PIQLE ability to manipulate POMDP's.
Figure 1: A robot on treasure island
Figure 2: Alice's adventures in Legoland (see [AS04])
Figure 3: Alice's Adventures in Legoland ,part II (see [AS04])
The package voitures is an attempt of implementing more continuous schemes, will the general architecture of the project still allows only to handle discrete implementations. A mobile as to find its way inside an arena filled with obstacles. The reward depends on :
The time spend before the robot hits an obstacle.
The robot's speed.
Two types of arenas are provided, but one can easily define other configuration :
A circular circuit, made of two concentric circles : CircularCircuit.java.
A circular arena, filled with 9 bumpers :Pinball.java.
Figures 4 and 5 show screenshoots of those two implementations.
Figure 4: A circular arena
Figure 5: An arena filled with obstacles
The robot perceives its environement by the mean of an elementary linear retina of m pixels, telling it how far are the obstacles in m directions. The angle of vision (between two retina's pixel) and the number of pixels are parameters and can be setted.
The true distance between the robot and a wall (for a certain angle) is computed, and then disretized, using the notions of near, median,far. Those notions are themselves learned by the algorithm.
At this time (October 2005), this puzzle is the only one for which we defined a non-trivial graphical interface. See program TestCourse.java for an illustration.
Memory is a popular two-players game : cards are disposed face hidden on the table. At his turn, each player must flip two cards : if they match, he take them, otherwise he put them again, face hidden, on the board.
[ZP93] analysed this game, proving that there is a winning strategy.
The game is given in two flavors : one that try to maximize the number of cards won, the other trying to maximize the probability of winning.
Missionaries and Cannibals is an old puzzle : three missionaries and three cannibals must cross a river, using a boat that can take only two people on board. Cannibals can never be in majority on a bank, as they would eat the clergymen.
The puzzle as a 11-states solution. Not too difficult to find, this implementation can serve as basic benchmark for candidate algorithms.
Generalizations (by means of modifying the parameters : boat capacity, number of people) are also possible, and solutions are very often easily found.
The game comes also in two flavors : one only generating diet moves (no one is eated), the other generating all moves, and giving negative reward when a move entails someone being eated.
The game is more difficult to learn than the others : the rules are implemented, algorithms can be apply, but at this time, I never succeeded in making it learn to play at a reasonable level...
Most of implemented algorithms are variations of reinforcement learning algorithms. It was the initial purpose of this platform, that is why agents are acting and receiving rewards.
Two others algorithms are provided, to serve as benchmarks or sparring partners:
SelectionneurAleatoire which plays at random a valid move.
Gagnant (winner) : which plays a winning move if possible, otherwise plays at random.
Another algorithm is coded in hardware, and can be seen into the class AgentTTT, which is a non-generic agent dedicated to tic-tac-toe.
All algorithms except the one in AgentTTT are generic algorithms, which can be used by any agent on any problem.
For the same algorithm, performances can vary from one problem to another, and you will have to tune the parameters to control their convergence.
Typical parameters are :
Balance between exploration and exploitation : epsilon is the variable controlling this balance : you can initialize and modify it from outside the algorithms (method setEpsilon).
Another way to tune this balance...is to not tune it at all , and use
RouletteWheelSelection, available in some algorithms : an action is then selected according to its Q(s,a) value.
Learning rate : when present, its name is alpha : you can control its value, and the speed of its decay.
Discount rate : when present, it is called gamma, as usual in reinforcement learning.
SelectionneurHumain : human player with basic text interface.
SelectionneurMemoire : abstract class for basic Q-Learning algorithms. Its two implemented subclasses are :
SelectionneurQL : Standard Q-Learning algorithm ([SB98] p149), memorizing Q(s,a).
SelectionneurNN : The only difference with SelectionneurQL is that Q(s,a) are no longer memorized, but there values are learned and computed by a neural network.
SelectionneurNNSingle : very similar to SelectionneurNN, but the neural networks not even maintains a learning set : each new experience is used to learn, and then thrown away.
SelectionneurNN and SelectionneurNNSingle can be used when the (state,action) space is too big to fit in memory.
AbstractSelectionneurQLambda : abstract class to factorize the behaviour of both its actual implementations :
SelectionneurPeng : Q(l) algorithm as described by [PW94].
SelectionneurWatkins : Q(l) algorithm as described by [Wat89], and detailed in [SB98].
SelectionneurPengNN and SelectionneurWatkinsNNSingle are variations of the above algorithms, differing only in the methods used to memorize (or not) the Q(s,a) values. Please look at source code...
SelectionneurQLINT : an implementation of a compact version of Q-Learning, designed to be embedded in limited ressources micro-robots. See [AS04] for more details.
New algorithms must comply to the Selectionneur interface. This interface only contains four methods :
getChoix : Choose an action among a list of possible moves. The list of possible moves also contains the starting state.
apprend : the french word for learn : uses both states (star and end of the move), the action taken and the reward to update its knowledge.
extractDataset : uses its knowledge to build a learning set. If you do not plan to train an external neural network from this knowledge, you can return null. Note that this has nothing to do with already written neural network learners.
nouvelEpisode : some algorithms (Watkins or Peng, for example) have a initial work to do before the beginning of an episode : this is the place to perform this task.
As you can (or will) see, you don't have to know nothing about the specificity of the problem you want to study ...
Weka [WF00]consists mainly in a set of machine learning algorithms written in Java.
The second version of our algorithm (piqleWeka.tgz) is able to build datasets in a format suitable for feeding Weka's algorithms. Please refer to Weka documentation for more information.
More precisely, each Q-Learning algorithm can extract a dataset from the (state,action,value) table it memorized. The coding of an example is the concatenation of the coding of state, the coding of action, and eventually the coding of the value.
There are three ways to extract a dataset from examples :
public Instances extractArffDataSet(String relationName) : Each example gives rise to an Instance of the forme (State coding,Best action coding), where the class is supposed to be the best action.
public Instances extractArffDataSet(String relationName,boolean discrete) : build Instances of the form (State coding, Action coding,value coding), where the class is the value.
According to the value of discrete, value coding is either :
discrete (discrete=true) : with values good, mid, bad. Each class having the same cardinal. For use with machine learning algorithms asking for discrete classes.
continuous (discrete=false) : for use with regression algorithms.
Once the datasets are build, one can either save them into arff files, and study them into Weka graphical interface, or call weka classes inside one's own programs.
The sample programs mainly use the first way of using weka.
We have presented a framework that we think makes it easy to explore reinforcement learning topics, from the coding, testing, and comparaison of RL algorithms to the implementation of new problems and the discovery of their optimal strategies.
The genericity of the construction limits the amount of coding time, letting the user concentrates himself on the output of his programs.
New ideas on algorithms can be tested quickly, and the available program outputs make it easy to graphically visualize results.
Implementation of new algorithms : [SB98] algorithms are not all implemented...We also plan to implement and test algorithms recently appeared in litterature.
Implementation of new puzzles, games, or problems : as we began the job, it would be interesting to continue coding [SB98] examples.
Graphical interface : we began to implement basic graphical interfaces for one player games, here again in a generic way : at this time (october 2005), non-trivial graphical interfaces are written for mazes and car races.
M. Asadpour and Roland Siegwart.
Cdompact q-learning optimized for micro-robots with processing and
memory constraints.
Robotics and Autonomous Systems, 48(1):49--61, August 2004.
Uri Zwick.
Jenga.
In Proceedings of the thirteenth annual ACM-SIAM symposium on
Discrete algorithms, pages 243--246. Society for Industrial and Applied
Mathematics, 2002.