TORCS  1.3.9
The Open Racing Car Simulator
DiscretePolicy Class Reference

Discrete policies with reinforcement learning. More...

#include <policy.h>


Public Member Functions

 DiscretePolicy (int n_states, int n_actions, real alpha=0.1, real gamma=0.8, real lambda=0.8, bool softmax=false, real randomness=0.1, real init_eval=0.0)
 Create a new discrete policy. More...
 
virtual ~DiscretePolicy ()
 Kill the agent and free everything. More...
 
virtual void setLearningRate (real alpha)
 Set the learning rate. More...
 
virtual real getTDError ()
 Get the temporal difference error of the previous action. More...
 
virtual real getLastActionValue ()
 Get the value of the last action taken. More...
 
virtual int SelectAction (int s, real r, int forced_a=-1)
 Select an action a, given state s and reward from previous action. More...
 
virtual void Reset ()
 Use at the end of every episode, after agent has entered the absorbing state. More...
 
virtual void loadFile (char *f)
 Load policy from a file. More...
 
virtual void saveFile (char *f)
 Save policy to a file. More...
 
virtual void setQLearning ()
 Set the algorithm to QLearning mode. More...
 
virtual void setELearning ()
 Set the algorithm to ELearning mode. More...
 
virtual void setSarsa ()
 Set the algorithm to SARSA mode. More...
 
virtual bool useConfidenceEstimates (bool confidence, real zeta=0.01, bool confidence_eligibility=false)
 Set to use confidence estimates for action selection, with variance smoothing zeta. More...
 
virtual void setForcedLearning (bool forced)
 Set forced learning (force-feed actions) More...
 
virtual void setRandomness (real epsilon)
 Set randomness for action selection. Does not affect confidence mode. More...
 
virtual void setGamma (real gamma)
 Set the gamma of the sum to be maximised. More...
 
virtual void setPursuit (bool pursuit)
 Use Pursuit for action selection. More...
 
virtual void setReplacingTraces (bool replacing)
 Use replacing instead of accumulating eligibility traces. More...
 
virtual void useSoftmax (bool softmax)
 Set action selection to softmax. More...
 
virtual void setConfidenceDistribution (enum ConfidenceDistribution cd)
 Set the distribution for direct action sampling. More...
 
virtual void useGibbsConfidence (bool gibbs)
 Add Gibbs sampling for confidences. More...
 
virtual void useReliabilityEstimate (bool ri)
 Use the reliability estimate method for action selection. More...
 
virtual void saveState (FILE *f)
 Save the current evaluations in text format to a file. More...
 

Protected Member Functions

int confMax (real *Qs, real *vQs, real p=1.0)
 Confidence-based Gibbs sampling. More...
 
int confSample (real *Qs, real *vQs)
 Directly sample from action value distribution. More...
 
int softMax (real *Qs)
 Softmax Gibbs sampling. More...
 
int eGreedy (real *Qs)
 e-greedy sampling More...
 
int argMax (real *Qs)
 Get ID of maximum action. More...
 

Protected Attributes

enum LearningMethod learning_method
 learning method to use. More...
 
int n_states
 number of states More...
 
int n_actions
 number of actions More...
 
real ** Q
 state-action evaluation More...
 
real ** e
 eligibility trace More...
 
real * eval
 evaluation of current action More...
 
real * sample
 sampling output More...
 
real pQ
 previous Q More...
 
int ps
 previous state More...
 
int pa
 previous action More...
 
real r
 reward More...
 
real temp
 scratch More...
 
real tdError
 temporal difference error More...
 
bool smax
 softmax option More...
 
bool pursuit
 pursuit option More...
 
real ** P
 pursuit action probabilities More...
 
real gamma
 Future discount parameter. More...
 
real lambda
 Eligibility trace decay. More...
 
real alpha
 learning rate More...
 
real expected_r
 Expected reward. More...
 
real expected_V
 Expected state return. More...
 
int n_samples
 number of samples for above expected r and V More...
 
int min_el_state
 min state ID to search for eligibility More...
 
int max_el_state
 max state ID to search for eligibility More...
 
bool replacing_traces
 Replacing instead of accumulating traces. More...
 
bool forced_learning
 Force agent to take supplied action. More...
 
bool confidence
 Confidence estimates option. More...
 
bool confidence_eligibility
 Apply eligibility traces to confidence. More...
 
bool reliability_estimate
 reliability estimates option More...
 
enum ConfidenceDistribution confidence_distribution
 Distribution to use for confidence sampling. More...
 
bool confidence_uses_gibbs
 Additional gibbs sampling for confidence. More...
 
real zeta
 Confidence smoothing. More...
 
real ** vQ
 variance estimate for Q More...
 

Detailed Description

Discrete policies with reinforcement learning.

This class implements a discrete policy using the Sarsa( \(\lambda\)) or Q( \(\lambda\)) Reinforcement Learning algorithm. After creating an instance of the algorithm, where the number of actions and states are specified, you can call the SelectAction() method at every time step to dynamically select an action according to the given state. At the same time you must provide a reinforcement, which should be large (or positive) when the algorithm is doing well, and small (or negative) when it is not. The algorithm will try to maximise the total reward received.

Parameters:

n_states: The number of discrete states, or situations, that the agent can be in. You should create states that are relevant to the task.

n_actions: The number of different things the agent can do in every state. Currently we assume that all actions are usable in all states. However, if an action a_i is not usable in some state s, the client side of the code can map a_i in state s to some usable action a_j, so that when the agent selects action a_i in state s, the result is that of taking one of the usable actions. This makes the two actions equivalent in that state. Alternatively, when the agent selects an unusable action in a particular state, you could have a random outcome among the usable actions. The algorithm can take care of that, since it assumes no relationship between the same action at different states.

alpha: Learning rate. Controls how fast the evaluation is changed at every time step. Good values are between 0.01 and 0.1. Lower than 0.01 makes learning too slow, higher than 0.1 makes learning unstable, particularly for high gamma and lambda.

gamma: The algorithm will maximise the exponentially decaying sum of all future rewards, where the base of the exponent is gamma. If gamma=0, then the algorithm will always favour short-term rewards over long-term ones. Setting gamma close to 1 will make long-term rewards more important.

lambda: This controls how closely the expected value of reinforcement tracks the observed value of reinforcement. Another view is that it controls the eligibility traces e. Eligibility traces perform temporal credit assignment to previous actions/states. High values of lambda (near 1) can speed up learning. With lambda=0, only the evaluation of the currently selected state/action pair is updated. With lambda>0, the evaluations of all state/action pairs taken in the past are updated, with more recent pairs updated more strongly than pairs further in the past. Ultimately the optimal value of lambda will depend on the task.

softmax, randomness: If softmax is false, the algorithm selects the apparently best action most of the time, but with probability 'randomness' it selects a completely random action at each timestep. If softmax is true, the algorithm selects actions stochastically, with the probability of selecting each action proportional to how much better it seems to be than the others. A high randomness (>10.0) will create more or less equal probabilities for all actions, while a low randomness (<0.1) will make the system almost deterministic.

init_eval: This is the initial evaluation for actions. It pays to set this to optimistic values (i.e. higher than the reinforcement to be given), so that the algorithm is always 'disappointed' and tries to explore as many of the possible combinations as it can.

Member functions:

setQLearning(): Use the Q( \(\lambda\)) algorithm

setELearning(): Use the E( \(\lambda\)) algorithm

setSarsa(): Use the Sarsa( \(\lambda\)) algorithm

All the above algorithms attempt to approximate the optimal value function by using an update of the form

\[ Q_{t+1}(s,a) = Q_{t}(s,a) + \alpha (r_{t+1} + \gamma E\{Q(s'|\pi)\} - Q_{t}(s,a)). \]

The difference lies in how the expectation operator is approximated. In the case of SARSA, it is done by directly sampling the current policy, i.e. the expected value of Q for the next state is the current evaluation for the action actually taken in the next state. In Q-learning the expected value is that of the greedy action in the next state (with some minor complications to accommodate eligibility traces). E-learning is the most general case. The simplest implementation works by just replacing the single sample from Q with \(\sum_{b} Q(s',b) P(a'=b|s')\). This lies somewhere in between SARSA and Q-learning.

setPursuit (bool pursuit): If true, use pursuit methods to determine the best possible action. This enforces maximum exploration initially and maximum exploitation of estimates later. I am not sure of the convergence properties of this method, however, when used in conjunction with Sarsa or Q-learning.

useConfidenceEstimates(bool confidence, real zeta): Use confidence estimates for the estimated parameters. This allows automatic exploration-exploitation tradeoffs. The zeta parameter controls how smooth the confidence estimates are, with lower values giving smoother estimates (defaults to 0.01). Given estimates Q_1 and Q_2 for actions 1 and 2 respectively, we assume Laplacian distributions centered at Q_1 and Q_2, with variances v_1 and v_2. With Gibbs sampling, the probability of selecting action 1 is then

\[ P[1]=\exp(Q_1/\sqrt{v_1})/(\exp(Q_1/\sqrt{v_1})+\exp(Q_2/\sqrt{v_2})). \]

However it is possible to perform direct sampling.

setForcedLearning(bool forced): Force-feed actions. When true, SelectAction() uses the action supplied in its forced_a argument rather than selecting one itself.

Definition at line 144 of file policy.h.

Constructor & Destructor Documentation

◆ DiscretePolicy()

DiscretePolicy::DiscretePolicy ( int  n_states,
int  n_actions,
real  alpha = 0.1,
real  gamma = 0.8,
real  lambda = 0.8,
bool  softmax = false,
real  randomness = 0.1,
real  init_eval = 0.0 
)

Create a new discrete policy.

  • n_states Number of states for the agent
  • n_actions Number of actions.
  • alpha Learning rate.
  • gamma Discount parameter.
  • lambda Eligibility trace decay.
  • softmax Use softmax if true (can be overridden later)
  • randomness Amount of randomness.
  • init_eval Initial evaluation of actions.

Definition at line 42 of file policy.cpp.

◆ ~DiscretePolicy()

DiscretePolicy::~DiscretePolicy ( )
virtual

Kill the agent and free everything.

Delete policy.

Definition at line 155 of file policy.cpp.


Member Function Documentation

◆ argMax()

int DiscretePolicy::argMax ( real *  Qs)
protected

Get ID of maximum action.

Definition at line 816 of file policy.cpp.


◆ confMax()

int DiscretePolicy::confMax ( real *  Qs,
real *  vQs,
real  p = 1.0 
)
protected

Confidence-based Gibbs sampling.

Definition at line 715 of file policy.cpp.


◆ confSample()

int DiscretePolicy::confSample ( real *  Qs,
real *  vQs 
)
protected

Directly sample from action value distribution.

Definition at line 749 of file policy.cpp.


◆ eGreedy()

int DiscretePolicy::eGreedy ( real *  Qs)
protected

e-greedy sampling

Definition at line 802 of file policy.cpp.


◆ getLastActionValue()

virtual real DiscretePolicy::getLastActionValue ( )
inlinevirtual

Get the value of the last action taken.

Reimplemented in ANN_Policy.

Definition at line 195 of file policy.h.

◆ getTDError()

virtual real DiscretePolicy::getTDError ( )
inlinevirtual

Get the temporal difference error of the previous action.

Definition at line 193 of file policy.h.

◆ loadFile()

void DiscretePolicy::loadFile ( char *  f)
virtual

Load policy from a file.

Definition at line 484 of file policy.cpp.


◆ Reset()

void DiscretePolicy::Reset ( )
virtual

Use at the end of every episode, after agent has entered the absorbing state.

Reimplemented in ANN_Policy.

Definition at line 474 of file policy.cpp.

◆ saveFile()

void DiscretePolicy::saveFile ( char *  f)
virtual

Save policy to a file.

Definition at line 550 of file policy.cpp.

◆ saveState()

void DiscretePolicy::saveState ( FILE *  f)
virtual

Save the current evaluations in text format to a file.

The values are saved as triplets (Q, P, vQ). The columns are ordered by actions and the rows by state number.

Definition at line 128 of file policy.cpp.

◆ SelectAction()

int DiscretePolicy::SelectAction ( int  s,
real  r,
int  forced_a = -1 
)
virtual

Select an action a, given state s and reward from previous action.

The optional argument forced_a forces an action if setForcedLearning() has been called with true.

Two algorithms are implemented, both of which converge. One of them estimates the value of the current policy, while the other estimates that of the optimal policy.

Sarsa ( \(\lambda\)) algorithmic description:

  1. Take action \(a\), observe \(r, s'\)
  2. Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
  3. \(\delta = r + \gamma Q(s',a') - Q(s,a)\)
  4. \(e(s,a) = e(s,a)+ 1\), depending on trace settings
  5. for all \(s,a\) :

    \[ Q_{t}(s,a) = Q_{t-1}(s,a) + \alpha \delta e_{t}(s,a), \]

    where \(e_{t}(s,a) = \gamma \lambda e_{t-1}(s,a)\)
      end
    
  6. \(a = a'\) (we will take this action at the next step)
  7. \(s = s'\)

Watkins' Q( \(\lambda\)) algorithmic description:

  1. Take action \(a\), observe \(r\), \(s'\)
  2. Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
  3. \(a^* = \arg \max_b Q(s',b)\)
  4. \(\delta = r + \gamma Q(s',a^*) - Q(s,a)\)
  5. \(e(s,a) = e(s,a)+ 1\), depending on eligibility traces
  6. for all \(s,a\) :

    \[ Q(s,a) = Q(s,a)+\alpha \delta e(s,a) \]

    if \(a' = a^*\) then \(e(s,a) = \gamma \lambda e(s,a)\) else \(e(s,a) = 0\) end
  7. \(a = a'\) (we will take this action at the next step)
  8. \(s = s'\)

The most general algorithm is E-learning, currently under development, which is defined as follows:

  1. Take action \(a\), observe \(r\), \(s'\)
  2. Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
  3. \(\delta = r + \gamma E\{Q(s',a')|\pi\} - Q(s,a)\)
  4. \(e(s,a) = e(s,a)+ 1\), depending on eligibility traces
  5. for all \(s,a\) :

    \[ Q(s,a) = Q(s,a)+\alpha \delta e(s,a) \]

    \(e(s,a) = \gamma \lambda e(s,a) P(a|s,\pi)\)
  6. \(a = a'\) (we will take this action at the next step)
  7. \(s = s'\)

Note that we also cut off eligibility traces that have fallen below 0.1.

Definition at line 283 of file policy.cpp.


◆ setConfidenceDistribution()

void DiscretePolicy::setConfidenceDistribution ( enum ConfidenceDistribution  cd)
virtual

Set the distribution for direct action sampling.

Definition at line 684 of file policy.cpp.

◆ setELearning()

void DiscretePolicy::setELearning ( )
virtual

Set the algorithm to ELearning mode.

Definition at line 604 of file policy.cpp.

◆ setForcedLearning()

void DiscretePolicy::setForcedLearning ( bool  forced)
virtual

Set forced learning (force-feed actions)

Definition at line 639 of file policy.cpp.

◆ setGamma()

void DiscretePolicy::setGamma ( real  gamma)
virtual

Set the gamma of the sum to be maximised.

Definition at line 656 of file policy.cpp.

◆ setLearningRate()

virtual void DiscretePolicy::setLearningRate ( real  alpha)
inlinevirtual

Set the learning rate.

Definition at line 191 of file policy.h.

◆ setPursuit()

void DiscretePolicy::setPursuit ( bool  pursuit)
virtual

Use Pursuit for action selection.

Definition at line 618 of file policy.cpp.

◆ setQLearning()

void DiscretePolicy::setQLearning ( )
virtual

Set the algorithm to QLearning mode.

Definition at line 598 of file policy.cpp.

◆ setRandomness()

void DiscretePolicy::setRandomness ( real  epsilon)
virtual

Set randomness for action selection. Does not affect confidence mode.

Definition at line 645 of file policy.cpp.

◆ setReplacingTraces()

void DiscretePolicy::setReplacingTraces ( bool  replacing)
virtual

Use replacing instead of accumulating eligibility traces.

Definition at line 629 of file policy.cpp.

◆ setSarsa()

void DiscretePolicy::setSarsa ( )
virtual

Set the algorithm to SARSA mode.

A unified framework for action selection.

Definition at line 611 of file policy.cpp.

◆ softMax()

int DiscretePolicy::softMax ( real *  Qs)
protected

Softmax Gibbs sampling.

Definition at line 783 of file policy.cpp.


◆ useConfidenceEstimates()

bool DiscretePolicy::useConfidenceEstimates ( bool  confidence,
real  zeta = 0.01,
bool  confidence_eligibility = false 
)
virtual

Set to use confidence estimates for action selection, with variance smoothing zeta.

Variance smoothing currently uses a very simple method to estimate the variance.

Definition at line 580 of file policy.cpp.

◆ useGibbsConfidence()

void DiscretePolicy::useGibbsConfidence ( bool  gibbs)
virtual

Add Gibbs sampling for confidences.

This can be used in conjunction with any confidence distribution; however, it mostly makes sense for SINGULAR.

Definition at line 704 of file policy.cpp.

◆ useReliabilityEstimate()

void DiscretePolicy::useReliabilityEstimate ( bool  ri)
virtual

Use the reliability estimate method for action selection.

Definition at line 673 of file policy.cpp.

◆ useSoftmax()

void DiscretePolicy::useSoftmax ( bool  softmax)
virtual

Set action selection to softmax.

Definition at line 662 of file policy.cpp.

Member Data Documentation

◆ alpha

real DiscretePolicy::alpha
protected

learning rate

Definition at line 166 of file policy.h.

◆ confidence

bool DiscretePolicy::confidence
protected

Confidence estimates option.

Definition at line 174 of file policy.h.

◆ confidence_distribution

enum ConfidenceDistribution DiscretePolicy::confidence_distribution
protected

Distribution to use for confidence sampling.

Definition at line 177 of file policy.h.

◆ confidence_eligibility

bool DiscretePolicy::confidence_eligibility
protected

Apply eligibility traces to confidence.

Definition at line 175 of file policy.h.

◆ confidence_uses_gibbs

bool DiscretePolicy::confidence_uses_gibbs
protected

Additional gibbs sampling for confidence.

Definition at line 178 of file policy.h.

◆ e

real** DiscretePolicy::e
protected

eligibility trace

Definition at line 152 of file policy.h.

◆ eval

real* DiscretePolicy::eval
protected

evaluation of current action

Definition at line 153 of file policy.h.

◆ expected_r

real DiscretePolicy::expected_r
protected

Expected reward.

Definition at line 167 of file policy.h.

◆ expected_V

real DiscretePolicy::expected_V
protected

Expected state return.

Definition at line 168 of file policy.h.

◆ forced_learning

bool DiscretePolicy::forced_learning
protected

Force agent to take supplied action.

Definition at line 173 of file policy.h.

◆ gamma

real DiscretePolicy::gamma
protected

Future discount parameter.

Definition at line 164 of file policy.h.

◆ lambda

real DiscretePolicy::lambda
protected

Eligibility trace decay.

Definition at line 165 of file policy.h.

◆ learning_method

enum LearningMethod DiscretePolicy::learning_method
protected

learning method to use;

Definition at line 148 of file policy.h.

◆ max_el_state

int DiscretePolicy::max_el_state
protected

max state ID to search for eligibility

Definition at line 171 of file policy.h.

◆ min_el_state

int DiscretePolicy::min_el_state
protected

min state ID to search for eligibility

Definition at line 170 of file policy.h.

◆ n_actions

int DiscretePolicy::n_actions
protected

number of actions

Definition at line 150 of file policy.h.

◆ n_samples

int DiscretePolicy::n_samples
protected

number of samples for above expected r and V

Definition at line 169 of file policy.h.

◆ n_states

int DiscretePolicy::n_states
protected

number of states

Definition at line 149 of file policy.h.

◆ P

real** DiscretePolicy::P
protected

pursuit action probabilities

Definition at line 163 of file policy.h.

◆ pa

int DiscretePolicy::pa
protected

previous action

Definition at line 157 of file policy.h.

◆ pQ

real DiscretePolicy::pQ
protected

previous Q

Definition at line 155 of file policy.h.

◆ ps

int DiscretePolicy::ps
protected

previous state

Definition at line 156 of file policy.h.

◆ pursuit

bool DiscretePolicy::pursuit
protected

pursuit option

Definition at line 162 of file policy.h.

◆ Q

real** DiscretePolicy::Q
protected

state-action evaluation

Definition at line 151 of file policy.h.

◆ r

real DiscretePolicy::r
protected

reward

Definition at line 158 of file policy.h.

◆ reliability_estimate

bool DiscretePolicy::reliability_estimate
protected

reliability estimates option

Definition at line 176 of file policy.h.

◆ replacing_traces

bool DiscretePolicy::replacing_traces
protected

Replacing instead of accumulating traces.

Definition at line 172 of file policy.h.

◆ sample

real* DiscretePolicy::sample
protected

sampling output

Definition at line 154 of file policy.h.

◆ smax

bool DiscretePolicy::smax
protected

softmax option

Definition at line 161 of file policy.h.

◆ tdError

real DiscretePolicy::tdError
protected

temporal difference error

Definition at line 160 of file policy.h.

◆ temp

real DiscretePolicy::temp
protected

scratch

Definition at line 159 of file policy.h.

◆ vQ

real** DiscretePolicy::vQ
protected

variance estimate for Q

Definition at line 180 of file policy.h.

◆ zeta

real DiscretePolicy::zeta
protected

Confidence smoothing.

Definition at line 179 of file policy.h.


The documentation for this class was generated from the following files: