Discrete policies with reinforcement learning. More...

#include <policy.h>

Inheritance diagram for DiscretePolicy:

Public Member Functions
	DiscretePolicy (int n_states, int n_actions, real alpha=0.1, real gamma=0.8, real lambda=0.8, bool softmax=false, real randomness=0.1, real init_eval=0.0)
	Create a new discrete policy. More...

virtual	~DiscretePolicy ()
	Kill the agent and free everything. More...

virtual void	setLearningRate (real alpha)
	Set the learning rate. More...

virtual real	getTDError ()
	Get the temporal difference error of the previous action. More...

virtual real	getLastActionValue ()
	Get the vale of the last action taken. More...

virtual int	SelectAction (int s, real r, int forced_a=-1)
	Select an action a, given state s and reward from previous action. More...

virtual void	Reset ()
	Use at the end of every episode, after agent has entered the absorbing state. More...

virtual void	loadFile (char *f)
	Load policy from a file. More...

virtual void	saveFile (char *f)
	Save policy to a file. More...

virtual void	setQLearning ()
	Set the algorithm to QLearning mode. More...

virtual void	setELearning ()
	Set the algorithm to ELearning mode. More...

virtual void	setSarsa ()
	Set the algorithm to SARSA mode. More...

virtual bool	useConfidenceEstimates (bool confidence, real zeta=0.01, bool confidence_eligibility=false)
	Set to use confidence estimates for action selection, with variance smoothing zeta. More...

virtual void	setForcedLearning (bool forced)
	Set forced learning (force-feed actions) More...

virtual void	setRandomness (real epsilon)
	Set randomness for action selection. Does not affect confidence mode. More...

virtual void	setGamma (real gamma)
	Set the gamma of the sum to be maximised. More...

virtual void	setPursuit (bool pursuit)
	Use Pursuit for action selection. More...

virtual void	setReplacingTraces (bool replacing)
	Use Pursuit for action selection. More...

virtual void	useSoftmax (bool softmax)
	Set action selection to softmax. More...

virtual void	setConfidenceDistribution (enum ConfidenceDistribution cd)
	Set the distribution for direct action sampling. More...

virtual void	useGibbsConfidence (bool gibbs)
	Add Gibbs sampling for confidences. More...

virtual void	useReliabilityEstimate (bool ri)
	Use the reliability estimate method for action selection. More...

virtual void	saveState (FILE *f)
	Save the current evaluations in text format to a file. More...

Protected Member Functions
int	confMax (real Qs, real vQs, real p=1.0)
	Confidence-based Gibbs sampling. More...

int	confSample (real Qs, real vQs)
	Directly sample from action value distribution. More...

int	softMax (real *Qs)
	Softmax Gibbs sampling. More...

int	eGreedy (real *Qs)
	e-greedy sampling More...

int	argMax (real *Qs)
	Get ID of maximum action. More...

Protected Attributes
enum LearningMethod	learning_method
	learning method to use; More...

int	n_states
	number of states More...

int	n_actions
	number of actions More...

real **	Q
	state-action evaluation More...

real **	e
	eligibility trace More...

real *	eval
	evaluation of current aciton More...

real *	sample
	sampling output More...

real	pQ
	previous Q More...

int	ps
	previous state More...

int	pa
	previous action More...

real	r
	reward More...

real	temp
	scratch More...

real	tdError
	temporal difference error More...

bool	smax
	softmax option More...

bool	pursuit
	pursuit option More...

real **	P
	pursuit action probabilities More...

real	gamma
	Future discount parameter. More...

real	lambda
	Eligibility trace decay. More...

real	alpha
	learning rate More...

real	expected_r
	Expected reward. More...

real	expected_V
	Expected state return. More...

int	n_samples
	number of samples for above expected r and V More...

int	min_el_state
	min state ID to search for eligibility More...

int	max_el_state
	max state ID to search for eligibility More...

bool	replacing_traces
	Replacing instead of accumulating traces. More...

bool	forced_learning
	Force agent to take supplied action. More...

bool	confidence
	Confidence estimates option. More...

bool	confidence_eligibility
	Apply eligibility traces to confidence. More...

bool	reliability_estimate
	reliability estimates option More...

enum ConfidenceDistribution	confidence_distribution
	Distribution to use for confidence sampling. More...

bool	confidence_uses_gibbs
	Additional gibbs sampling for confidence. More...

real	zeta
	Confidence smoothing. More...

real **	vQ
	variance estimate for Q More...

Detailed Description

Discrete policies with reinforcement learning.

This class implements a discrete policy using the Sarsa( \(\lambda\)) or Q( \(\lambda\)) Reinforcement Learning algorithm. After creating an instance of the algorithm were the number of actions and states are specified, you can call the SelectAction() method at every time step to dynamically select an action according to the given state. At the same time you must provide a reinforcement, which should be large (or positive) for when the algorithm is doing well, and small (or negative) when it is not. The algorithm will try and maximise the total reward received.

Parameters:

n_states: The number of discrete states, or situations, that the agent is in. You should create states that are relevant to the task.

n_actions: The number of different things the agent can do at every state. Currently we assume that all actions are usable at all states. However if an action a_i is not usable in some state s, then you can make the client side of the code for state s map state a_i to some usable state a_j, meaning that when the agent selects action a_i in state s, the result would be that of taking one of the usable actions. This will make the two actions equivalent. Alternatively when the agent selects an unusable action at a particular state, you could have a random outcome of actions. The algorithm can take care of that, since it assumes no relationship between the same action at different states.

alpha: Learning rate. Controls how fast the evaluation is changed at every time step. Good values are between 0.01 and 0.1. Lower than 0.01 makes learning too slow, higher than 0.1 makes learning unstable, particularly for high gamma and lambda.

gamma: The algorithm will maximise the exponentially decaying sum of all future rewards, were the base of the exponent is gamma. If gamma=0, then the algorithm will always favour short-term rewards over long-term ones. Setting gamma close to 1 will make long-term rewards more important.

lambda: This controls how much the expected value of reinforcement is close to the observed value of reinforcement. Another view is that of it controlling the eligibility traces e. Eligibility traces perform temporal credit assignment to previous actions/states. High values of lambda (near 1) can speed up learning. With lambda=0, only the currently selected action/state pair evaluation is updated. With lambda>0 all state/action pair evaluations taken in the past are updated, with most recent pairs updated more than pairs further in the past. Ultimately the optimal value of lambda will depend on the task.

softmax, randomness: If this is false, then the algorithm selects the best possible action all the time, but with probability 'randomness' selects a completely random action at each timestep. If this is true, then the algorithm selects actions stochastically, with probability of selecting each action proportional to how better it seems to be than the others. A high randomness (>10.0) will create more or less equal probabilities for actions, while a low randomness (<0.1) will make the system almost deterministic.

init_eval: This is the initial evaluation for actions. It pays to set this to optimistic values (i.e. higher than the reinforcement to be given), so that the algorithm becomes always 'disappointed' and tries to explore as much of the possible combinations as it can.

Member functions:

setQLearning(): Use the Q( \(\lambda\)) algorithm

setELearning(): Use the E( \(\lambda\)) algorithm

setSarsa(): Use the Sarsa( \(\lambda\)) algorithm

All the above algorithms attempt to approximate the optimal value function by using an update of the form

\[ Q_{t+1}(s,a) = Q_{t}(s,a) + \alpha (r_{t+1} + \gamma E\{Q(s'|\pi)\} - Q_{t}(s,a)). \]

The difference lies in how the expectation operator is approximated. In the case of SARSA, it is done by directly sampling the current policy, i.e. the expected value of Q for the next state is the current evaluation for the action actually taken in the next state. In Q-learning the expected value is that of the greedy action in the next state (with some minor complications to accommodate eligibility traces). E-learning is the most general case. The simplest implementation works by just replacing the single sample from Q with \(\sum_{b} Q(s',b) P(a'=b|s')\). This lies somewhere in between SARSA and Q-learning.

setPursuit (bool pursuit): If true, use pursuit methods to determine the best possible action. This enforces maximum exploration initially, and maximum exploitation of estimates later. I am not sure of the convergence properties of this method, however, when used in conjuction with Sarsa or Q-learning.

useConfidenceEstimates(bool confidence, float zeta): Use confidence estimates for the estimated parameters. This allows automatic exploration-exploitation tradeoffs. The zeta parameter controls how smooth the estimates of the confidence are, lower values for smoother. (defaults to 0.01). Now, given estimates Q_1 and Q_2 for actions 1,2 respectively we assume a Laplacian distribution centered around Q_1 and Q_2, with variance equal to v_1 and v_2. With Gibbs sampling, the probability of selecting action Q_1 is then

\[ P[1]=exp(Q_1/\sqrt{v_1})/(exp(Q_1/\sqrt{v_1})+exp(Q_2/\sqrt{v2})) \]

.

However it is possible to perform direct sampling.

setForcedLearning(bool forced);

Definition at line 144 of file policy.h.

Constructor & Destructor Documentation

◆ DiscretePolicy()

DiscretePolicy::DiscretePolicy	(	int	n_states,
		int	n_actions,
		real	alpha = `0.1`,
		real	gamma = `0.8`,
		real	lambda = `0.8`,
		bool	softmax = `false`,
		real	randomness = `0.1`,
		real	init_eval = `0.0`
	)

Create a new discrete policy.

n_states Number of states for the agent
n_actions Number of actions.
alpha Learning rate.
gamma Discount parameter.
lambda Eligibility trace decay.
softmax Use softmax if true (can be overridden later)
randomness Amount of randomness.
init_eval Initial evaluation of actions.

Definition at line 42 of file policy.cpp.

◆ ~DiscretePolicy()

DiscretePolicy::~DiscretePolicy ( )

virtual

Kill the agent and free everything.

Delete policy.

Definition at line 155 of file policy.cpp.

Here is the call graph for this function:

Member Function Documentation

◆ argMax()

int DiscretePolicy::argMax ( real * Qs )

protected

Get ID of maximum action.

Definition at line 816 of file policy.cpp.

Here is the call graph for this function:

◆ confMax()

int DiscretePolicy::confMax	(	real *	Qs,
		real *	vQs,
		real	p = `1.0`
	)

protected

Confidence-based Gibbs sampling.

Definition at line 715 of file policy.cpp.

Here is the call graph for this function:

◆ confSample()

int DiscretePolicy::confSample	(	real *	Qs,
		real *	vQs
	)

protected

Directly sample from action value distribution.

Definition at line 749 of file policy.cpp.

Here is the call graph for this function:

◆ eGreedy()

int DiscretePolicy::eGreedy ( real * Qs )

protected

e-greedy sampling

Definition at line 802 of file policy.cpp.

Here is the call graph for this function:

◆ getLastActionValue()

virtual real DiscretePolicy::getLastActionValue ( )

inlinevirtual

Get the vale of the last action taken.

Reimplemented in ANN_Policy.

Definition at line 195 of file policy.h.

◆ getTDError()

virtual real DiscretePolicy::getTDError ( )

inlinevirtual

Get the temporal difference error of the previous action.

Definition at line 193 of file policy.h.

◆ loadFile()

void DiscretePolicy::loadFile ( char * f )

virtual

Load policy from a file.

Definition at line 484 of file policy.cpp.

Here is the call graph for this function:

◆ Reset()

void DiscretePolicy::Reset ( )

virtual

Use at the end of every episode, after agent has entered the absorbing state.

Reimplemented in ANN_Policy.

Definition at line 474 of file policy.cpp.

◆ saveFile()

void DiscretePolicy::saveFile ( char * f )

virtual

Save policy to a file.

Definition at line 550 of file policy.cpp.

◆ saveState()

void DiscretePolicy::saveState ( FILE * f )

virtual

Save the current evaluations in text format to a file.

The values are saved as triplets (Q, P, vQ). The columns are ordered by actions and the rows by state number.

Definition at line 128 of file policy.cpp.

◆ SelectAction()

int DiscretePolicy::SelectAction	(	int	s,
		real	r,
		int	forced_a = `-1`
	)

virtual

Select an action a, given state s and reward from previous action.

Optional argument a forces an action if setForcedLearning() has been called with true.

Two algorithms are implemented, both of which converge. One of them calculates the value of the current policy, while the other that of the optimal policy.

Sarsa ( \(\lambda\)) algorithmic description:

Take action \(a\), observe \(r, s'\)
Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
\(\delta = r + \gamma Q(s',a') - Q(s,a)\)
\(e(s,a) = e(s,a)+ 1\), depending on trace settings
for all \(s,a\) :
\[ Q_{t}(s,a) = Q_{t-1}(s,a) + \alpha \delta e_{t}(s,a), \]
where \(e_{t}(s,a) = \gamma \lambda e_{t-1}(s,a)\)
```
  end
```
\(a = a'\) (we will take this action at the next step)
\(s = s'\)

Watkins Q (l) algorithmic description:

Take action \(a\), observe \(r\), \(s'\)
Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
\(a* = \arg \max_b Q(s',b)\)

\(\delta = r + \gamma Q(s',a^*) - Q(s,a)\)
\(e(s,a) = e(s,a)+ 1\), depending on eligibility traces
for all \(s,a\) :
\[ Q(s,a) = Q(s,a)+\alpha \delta e(s,a) \]
if \((a'=a*)\) then \(e(s,a)\) = \(\gamma \lambda e(s,a)\) else \(e(s,a) = 0\) end
\(a = a'\) (we will take this action at the next step)
\(s = s'\)

The most general algorithm is E-learning, currently under development, which is defined as follows:

Take action \(a\), observe \(r\), \(s'\)
Choose \(a'\) from \(s'\) using some policy derived from \(Q\)
\(\delta = r + \gamma E{Q(s',a^*)|\pi} - Q(s,a)\)
\(e(s,a) = e(s,a)+ 1\), depending on eligibility traces
for all \(s,a\) :
\[ Q(s,a) = Q(s,a)+\alpha \delta e(s,a) \]
\(e(s,a)\) = \(\gamma \lambda e(s,a) P(a|s,\pi) \)
\(a = a'\) (we will take this action at the next step)
\(s = s'\)

Note that we also cut off the eligibility traces that have fallen below 0.1

Definition at line 283 of file policy.cpp.

Here is the call graph for this function:

◆ setConfidenceDistribution()

void DiscretePolicy::setConfidenceDistribution ( enum ConfidenceDistribution cd )

virtual

Set the distribution for direct action sampling.

Definition at line 684 of file policy.cpp.

◆ setELearning()

void DiscretePolicy::setELearning ( )

virtual

Set the algorithm to ELearning mode.

Definition at line 604 of file policy.cpp.

◆ setForcedLearning()

void DiscretePolicy::setForcedLearning ( bool forced )

virtual

Set forced learning (force-feed actions)

Definition at line 639 of file policy.cpp.

◆ setGamma()

void DiscretePolicy::setGamma ( real gamma )

virtual

Set the gamma of the sum to be maximised.

Definition at line 656 of file policy.cpp.

◆ setLearningRate()

virtual void DiscretePolicy::setLearningRate ( real alpha )

inlinevirtual

Set the learning rate.

Definition at line 191 of file policy.h.

◆ setPursuit()

void DiscretePolicy::setPursuit ( bool pursuit )

virtual

Use Pursuit for action selection.

Definition at line 618 of file policy.cpp.

◆ setQLearning()

void DiscretePolicy::setQLearning ( )

virtual

Set the algorithm to QLearning mode.

Definition at line 598 of file policy.cpp.

◆ setRandomness()

void DiscretePolicy::setRandomness ( real epsilon )

virtual

Set randomness for action selection. Does not affect confidence mode.

Definition at line 645 of file policy.cpp.

◆ setReplacingTraces()

void DiscretePolicy::setReplacingTraces ( bool replacing )

virtual

Use Pursuit for action selection.

Definition at line 629 of file policy.cpp.

◆ setSarsa()

void DiscretePolicy::setSarsa ( )

virtual

Set the algorithm to SARSA mode.

A unified framework for action selection.

Definition at line 611 of file policy.cpp.

◆ softMax()

int DiscretePolicy::softMax ( real * Qs )

protected

Softmax Gibbs sampling.

Definition at line 783 of file policy.cpp.

Here is the call graph for this function:

◆ useConfidenceEstimates()

bool DiscretePolicy::useConfidenceEstimates	(	bool	confidence,
		real	zeta = `0.01`,
		bool	confidence_eligibility = `false`
	)

virtual

Set to use confidence estimates for action selection, with variance smoothing zeta.

Variance smoothing currently uses a very simple method to estimate the variance.

Definition at line 580 of file policy.cpp.

◆ useGibbsConfidence()

void DiscretePolicy::useGibbsConfidence ( bool gibbs )

virtual

Add Gibbs sampling for confidences.

This can be used in conjuction with any confidence distribution, however it mostly makes sense for SINGULAR.

Definition at line 704 of file policy.cpp.

◆ useReliabilityEstimate()

void DiscretePolicy::useReliabilityEstimate ( bool ri )

virtual

Use the reliability estimate method for action selection.

Definition at line 673 of file policy.cpp.

◆ useSoftmax()

void DiscretePolicy::useSoftmax ( bool softmax )

virtual

Set action selection to softmax.

Definition at line 662 of file policy.cpp.

Member Data Documentation

◆ alpha

real DiscretePolicy::alpha

protected

learning rate

Definition at line 166 of file policy.h.

◆ confidence

bool DiscretePolicy::confidence

protected

Confidence estimates option.

Definition at line 174 of file policy.h.

◆ confidence_distribution

enum ConfidenceDistribution DiscretePolicy::confidence_distribution

protected

Distribution to use for confidence sampling.

Definition at line 177 of file policy.h.

◆ confidence_eligibility

bool DiscretePolicy::confidence_eligibility

protected

Apply eligibility traces to confidence.

Definition at line 175 of file policy.h.

◆ confidence_uses_gibbs

bool DiscretePolicy::confidence_uses_gibbs

protected

Additional gibbs sampling for confidence.

Definition at line 178 of file policy.h.

◆ e

real** DiscretePolicy::e

protected

eligibility trace

Definition at line 152 of file policy.h.

◆ eval

real* DiscretePolicy::eval

protected

evaluation of current aciton

Definition at line 153 of file policy.h.

◆ expected_r

real DiscretePolicy::expected_r

protected

Expected reward.

Definition at line 167 of file policy.h.

◆ expected_V

real DiscretePolicy::expected_V

protected

Expected state return.

Definition at line 168 of file policy.h.

◆ forced_learning

bool DiscretePolicy::forced_learning

protected

Force agent to take supplied action.

Definition at line 173 of file policy.h.

◆ gamma

real DiscretePolicy::gamma

protected

Future discount parameter.

Definition at line 164 of file policy.h.

◆ lambda

real DiscretePolicy::lambda

protected

Eligibility trace decay.

Definition at line 165 of file policy.h.

◆ learning_method

enum LearningMethod DiscretePolicy::learning_method

protected

learning method to use;

Definition at line 148 of file policy.h.

◆ max_el_state

int DiscretePolicy::max_el_state

protected

max state ID to search for eligibility

Definition at line 171 of file policy.h.

◆ min_el_state

int DiscretePolicy::min_el_state

protected

min state ID to search for eligibility

Definition at line 170 of file policy.h.

◆ n_actions

int DiscretePolicy::n_actions

protected

number of actions

Definition at line 150 of file policy.h.

◆ n_samples

int DiscretePolicy::n_samples

protected

number of samples for above expected r and V

Definition at line 169 of file policy.h.

◆ n_states

int DiscretePolicy::n_states

protected

number of states

Definition at line 149 of file policy.h.

◆ P

real** DiscretePolicy::P

protected

pursuit action probabilities

Definition at line 163 of file policy.h.

◆ pa

int DiscretePolicy::pa

protected

previous action

Definition at line 157 of file policy.h.

◆ pQ

real DiscretePolicy::pQ

protected

previous Q

Definition at line 155 of file policy.h.

◆ ps

int DiscretePolicy::ps

protected

previous state

Definition at line 156 of file policy.h.

◆ pursuit

bool DiscretePolicy::pursuit

protected

pursuit option

Definition at line 162 of file policy.h.

◆ Q

real** DiscretePolicy::Q

protected

state-action evaluation

Definition at line 151 of file policy.h.

◆ r

real DiscretePolicy::r

protected

reward

Definition at line 158 of file policy.h.

◆ reliability_estimate

bool DiscretePolicy::reliability_estimate

protected

reliability estimates option

Definition at line 176 of file policy.h.

◆ replacing_traces

bool DiscretePolicy::replacing_traces

protected

Replacing instead of accumulating traces.

Definition at line 172 of file policy.h.

◆ sample

real* DiscretePolicy::sample

protected

sampling output

Definition at line 154 of file policy.h.

◆ smax

bool DiscretePolicy::smax

protected

softmax option

Definition at line 161 of file policy.h.

◆ tdError

real DiscretePolicy::tdError

protected

temporal difference error

Definition at line 160 of file policy.h.

◆ temp

real DiscretePolicy::temp

protected

scratch

Definition at line 159 of file policy.h.

◆ vQ

real** DiscretePolicy::vQ

protected

variance estimate for Q

Definition at line 180 of file policy.h.

◆ zeta

real DiscretePolicy::zeta

protected

Confidence smoothing.

Definition at line 179 of file policy.h.

The documentation for this class was generated from the following files:

src/libs/learning/policy.h
src/libs/learning/policy.cpp

Public Member Functions

Protected Member Functions

Protected Attributes

Detailed Description

Constructor & Destructor Documentation

◆ DiscretePolicy()

◆ ~DiscretePolicy()

Member Function Documentation

◆ argMax()

◆ confMax()

◆ confSample()

◆ eGreedy()

◆ getLastActionValue()

◆ getTDError()

◆ loadFile()

◆ Reset()

◆ saveFile()

◆ saveState()

◆ SelectAction()

◆ setConfidenceDistribution()

◆ setELearning()

◆ setForcedLearning()

◆ setGamma()

◆ setLearningRate()

◆ setPursuit()

◆ setQLearning()

◆ setRandomness()

◆ setReplacingTraces()

◆ setSarsa()

◆ softMax()

◆ useConfidenceEstimates()

◆ useGibbsConfidence()

◆ useReliabilityEstimate()

◆ useSoftmax()

Member Data Documentation

◆ alpha

◆ confidence

◆ confidence_distribution

◆ confidence_eligibility

◆ confidence_uses_gibbs

◆ e

◆ eval

◆ expected_r

◆ expected_V

◆ forced_learning

◆ gamma

◆ lambda

◆ learning_method

◆ max_el_state

◆ min_el_state

◆ n_actions

◆ n_samples

◆ n_states

◆ P

◆ pa

◆ pQ

◆ ps

◆ pursuit

◆ Q

◆ r

◆ reliability_estimate

◆ replacing_traces

◆ sample

◆ smax

◆ tdError

◆ temp

◆ vQ

◆ zeta