# Natural Policy Gradient

`LyceumAI.NaturalPolicyGradient` — Type

```
NaturalPolicyGradient{DT<:AbstractFloat}(args...; kwargs...) -> NaturalPolicyGradient
NaturalPolicyGradient(args...; kwargs...) -> NaturalPolicyGradient
```

Construct an instance of `NaturalPolicyGradient` with `args` and `kwargs`, where `DT <: AbstractFloat` is the element type used for pre-allocated buffers, which defaults to `Float32`.

In the following explanation of the `NaturalPolicyGradient` constructor, we use the following notation/definitions:

- `dim_o = length(obsspace(env))`
- `dim_a = length(actionspace(env))`
- "terminal" (e.g. terminal observation) refers to timestep `T + 1` for a length-`T` trajectory.

**Arguments**

- `env_tconstructor`: a function with signature `env_tconstructor(n)` that returns `n` instances of `T`, where `T <: AbstractEnvironment`.
- `policy`: a function mapping observations to actions, with the following signatures:
    - `policy(obs::AbstractVector) --> action::AbstractVector`, where `size(obs) == (dim_o,)` and `size(action) == (dim_a,)`.
    - `policy(obs::AbstractMatrix) --> action::AbstractMatrix`, where `size(obs) == (dim_o, N)` and `size(action) == (dim_a, N)`.
- `value`: a function mapping observations to scalar rewards, with the following signatures:
    - `value(obs::AbstractVector) --> reward::Real`, where `size(obs) == (dim_o,)`.
    - `value(obs::AbstractMatrix) --> reward::AbstractVector`, where `size(obs) == (dim_o, N)` and `size(reward) == (N,)`.
- `valuefit!`: a function with signature `valuefit!(value, obs::AbstractMatrix, returns::AbstractVector)`, where `size(obs) == (dim_o, N)` and `size(returns) == (N,)`, that fits `value` to `obs` and `returns` (a minimal sketch of conforming callables follows this list).
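To make these signatures concrete, below is a minimal sketch of conforming `policy`, `value`, and `valuefit!` callables. The linear parameterization and the names `W` and `w` are purely illustrative and not part of LyceumAI; in practice these would typically be neural networks or other trainable function approximators.

```julia
using LinearAlgebra

dim_o, dim_a = 4, 2

# Hypothetical linear parameters; real applications would use neural networks.
W = randn(dim_a, dim_o)
w = randn(dim_o)

# policy: maps (dim_o,) -> (dim_a,) and (dim_o, N) -> (dim_a, N).
policy(obs::AbstractVector) = W * obs
policy(obs::AbstractMatrix) = W * obs

# value: maps (dim_o,) -> Real and (dim_o, N) -> (N,).
value(obs::AbstractVector) = dot(w, obs)
value(obs::AbstractMatrix) = obs' * w

# valuefit!: fits the value parameters to (obs, returns).
function valuefit!(value, obs::AbstractMatrix, returns::AbstractVector)
    w .= obs' \ returns   # ordinary least squares; mutates `w` in place
    return nothing
end
```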

**Keywords**

- `Hmax::Integer`: maximum trajectory length for environment rollouts.
- `N::Integer`: total number of data samples used for each policy gradient step.
- `Nmean::Integer`: total number of data samples used to evaluate the mean policy (i.e. the policy without stochastic noise). Mean rollouts are used only for evaluating `policy` and are not used to improve `policy` in any form.
- `norm_step_size::Real`: scaling applied to the gradient update after gradient normalization has occurred. Normalizing the gradient makes training far less sensitive to the choice of step size; see equation 5 of "Towards Generalization and Simplicity in Continuous Control" for details.
- `gamma::Real`: reward discount, applied as `gamma^(t - 1) * reward[t]` (see the first sketch after this list).
- `gaelambda::Real`: Generalized Advantage Estimation parameter, which balances bias and variance when computing advantages. See "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (Schulman et al.) for details.
- `max_cg_iter::Integer`: maximum number of Conjugate Gradient iterations when estimating `natural_gradient = alpha * inv(FIM) * gradient`, where `FIM` is the Fisher Information Matrix (see the conjugate-gradient sketch after this list).
- `cg_tol::Real`: numerical tolerance for Conjugate Gradient convergence.
- `whiten_advantages::Bool`: if `true`, apply statistical whitening to the computed advantages (resulting in `mean(advantages) ≈ 0 && std(advantages) ≈ 1`).
- `bootstrapped_nstep_returns::Bool`: if `true`, bootstrap the returns calculation starting from `value(terminal_observation)` instead of 0. See "Reinforcement Learning: An Introduction" by Sutton & Barto for further information.
- `value_feature_op`: a function with the following signatures that transforms environment observations into a set of "features" to be consumed by `value` and `valuefit!`:
    - `value_feature_op(observations::AbstractVector{<:AbstractMatrix}) --> AbstractMatrix`
    - `value_feature_op(terminal_observations::AbstractMatrix, trajlengths::Vector{<:Integer}) --> AbstractMatrix`

  Here `observations` is a vector containing the observations from each trajectory, `terminal_observations` has size `(dim_o, number_of_trajectories)`, and `trajlengths` contains the length of each trajectory (such that `trajlengths[i] == size(observations[i], 2)`).
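To make the `gamma` and `whiten_advantages` semantics concrete, here is a small illustrative sketch (not LyceumAI's internal implementation) of a discounted return-to-go calculation and of statistical whitening:

```julia
using Statistics

# Discounted return-to-go for one trajectory:
# returns[t] = sum(gamma^(k - t) * rewards[k] for k in t:T);
# returns[1] is the total discounted return sum(gamma^(t - 1) * reward[t]).
function discounted_returns(rewards::AbstractVector, gamma::Real)
    returns = similar(rewards, Float64)
    acc = 0.0
    for t in length(rewards):-1:1
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    end
    return returns
end

# Statistical whitening, as applied to the advantages when
# `whiten_advantages = true`: shift and scale to mean 0, std 1.
whiten(x::AbstractVector) = (x .- mean(x)) ./ (std(x) + eps())

advantages = whiten(discounted_returns(ones(5), 0.99))
# mean(advantages) ≈ 0 && std(advantages) ≈ 1
```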
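Likewise, the roles of `max_cg_iter` and `cg_tol` can be seen in a textbook conjugate-gradient solve of `FIM * x = gradient`, which avoids ever forming `inv(FIM)` explicitly. This is a generic sketch of the technique, not LyceumAI's implementation; in practice `A * p` would be computed as a Fisher-vector product rather than by materializing the FIM.

```julia
using LinearAlgebra

# Solve A * x = b for symmetric positive-definite A (here, the FIM),
# stopping after `maxiter` iterations or once the residual norm < `tol`.
function conjugate_gradient(A, b; maxiter::Integer = 20, tol::Real = 1e-10)
    x = zero(b)
    r = copy(b)              # residual b - A*x for the initial x = 0
    p = copy(r)              # search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = A * p
        alpha = rs / dot(p, Ap)
        x .+= alpha .* p
        r .-= alpha .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end

# Tiny usage example with a stand-in 2x2 "FIM":
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = conjugate_gradient(A, b; maxiter = 10, tol = 1e-8)  # ≈ A \ b
```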

When applying `NaturalPolicyGradient` to new continuous control tasks and environments, the following notes may be helpful:

- For two policies that both learn to complete a task satisfactorily, the larger one may not perform significantly better. A minimum amount of representational power is necessary, but larger networks may not offer quantitative benefits. The same goes for the value function approximator.
- `Hmax` needs to be long enough for the correct behavior to emerge, and `N` needs to be large enough that the agent samples useful data; both may be surprisingly small for simple tasks. These parameters are the main tunables when applying `NaturalPolicyGradient`.
- Assuming `Hmax` and `N` are appropriately chosen for the task, `norm_step_size` and `max_cg_iter` are the next most important parameters to consider when first testing `NaturalPolicyGradient` on new tasks. `gamma` interacts with `Hmax`, while the default value of `gaelambda` has been found empirically to be stable across a wide range of tasks.

For more details, see Algorithm 1 in "Towards Generalization and Simplicity in Continuous Control".
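
As a closing illustration, a hypothetical end-to-end construction might look like the sketch below. The positional argument order is assumed to follow the Arguments list above, `MyEnv` and the toy callables from the earlier sketches are placeholders, the keyword values are illustrative rather than recommended, and the iteration protocol (one policy-gradient step per iteration) is assumed from LyceumAI's published examples.

```julia
# Hypothetical usage sketch; see the caveats above.
env_tconstructor(n) = [MyEnv() for _ in 1:n]  # assumes MyEnv <: AbstractEnvironment

npg = NaturalPolicyGradient(
    env_tconstructor,
    policy,      # stochastic policy, e.g. a neural network
    value,       # value-function approximator
    valuefit!;   # fits `value` to observations and returns
    Hmax = 500,
    N = 10_000,
    norm_step_size = 0.05,
    gamma = 0.99,
)

# Assumed iteration protocol: each step performs one natural gradient update.
for (i, state) in enumerate(npg)
    i >= 100 && break
end
```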