Natural Policy Gradient

LyceumAI.NaturalPolicyGradient — Type
NaturalPolicyGradient{DT<:AbstractFloat}(args...; kwargs...) -> NaturalPolicyGradient
NaturalPolicyGradient(args...; kwargs...) -> NaturalPolicyGradient

Construct an instance of NaturalPolicyGradient with args and kwargs, where DT <: AbstractFloat (defaulting to Float32) is the element type used for pre-allocated buffers.

The description of the NaturalPolicyGradient constructor below uses the following notation/definitions:

  • dim_o = length(obsspace(env))
  • dim_a = length(actionspace(env))
  • "terminal" (e.g. terminal observation) refers to timestep T + 1 for a length T trajectory.

Arguments

  • env_tconstructor: a function with signature env_tconstructor(n) that returns n instances of T, where T <: AbstractEnvironment.
  • policy: a function mapping observations to actions, with the following signatures:
    • policy(obs::AbstractVector) --> action::AbstractVector, where size(obs) == (dim_o, ) and size(action) == (dim_a, ).
    • policy(obs::AbstractMatrix) --> action::AbstractMatrix, where size(obs) == (dim_o, N) and size(action) == (dim_a, N).
  • value: a function mapping observations to scalar rewards, with the following signatures:
    • value(obs::AbstractVector) --> reward::Real, where size(obs) == (dim_o, ).
    • value(obs::AbstractMatrix) --> reward::AbstractVector, where size(obs) == (dim_o, N) and size(reward) == (N, ).
  • valuefit!: a function with signature valuefit!(value, obs::AbstractMatrix, returns::AbstractVector), where size(obs) == (dim_o, N) and size(returns) == (N, ), that fits value to obs and returns.
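As a concrete illustration of the required signatures above, here is a minimal sketch defining a linear policy, a linear value function, and a least-squares valuefit!. Everything in this block (the dimensions dim_o and dim_a, the weight arrays W and w) is hypothetical; in practice one would use a trainable stochastic policy and a more expressive value approximator, and the sketch only demonstrates the expected input/output shapes.

```julia
using LinearAlgebra

dim_o, dim_a = 4, 2                          # hypothetical dimensions

# Linear policy providing both required methods (vector and matrix observations).
const W = 0.01f0 .* randn(Float32, dim_a, dim_o)
policy(obs::AbstractVector) = W * obs        # (dim_o,)   -> (dim_a,)
policy(obs::AbstractMatrix) = W * obs        # (dim_o, N) -> (dim_a, N)

# Linear value function with learnable weights `w`.
const w = zeros(Float32, dim_o)
value(obs::AbstractVector) = dot(w, obs)     # (dim_o,)   -> Real
value(obs::AbstractMatrix) = vec(obs' * w)   # (dim_o, N) -> (N,)

# Fit the value function to (obs, returns) by least squares, updating `w` in place.
function valuefit!(value, obs::AbstractMatrix, returns::AbstractVector)
    w .= obs' \ returns
    return value
end
```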

Keywords

  • Hmax::Integer: Maximum trajectory length for environment rollouts.
  • N::Integer: Total number of data samples used for each policy gradient step.
  • Nmean::Integer: Total number of data samples for the mean policy (i.e. without stochastic noise). Mean rollouts are used only for evaluating the policy and are not used to improve the policy in any form.
  • norm_step_size::Real: Scaling for the applied gradient update after gradient normalization has occurred. Normalization makes training far less sensitive to the choice of step size; see equation 5 in this paper for more details, and the schematic update after this list.
  • gamma::Real: Reward discount, applied as gamma^(t - 1) * reward[t].
  • gaelambda::Real: Generalized Advantage Estimation parameter, which balances bias and variance when computing advantages. See this paper for details, as well as the sketch after this list.
  • max_cg_iter::Integer: Maximum number of Conjugate Gradient iterations when estimating natural_gradient = alpha * inv(FIM) * gradient, where FIM is the Fisher Information Matrix.
  • cg_tol::Real: Numerical tolerance for Conjugate Gradient convergence.
  • whiten_advantages::Bool: if true, apply statistical whitening to the calculated advantages (resulting in mean(advantages) ≈ 0 && std(advantages) ≈ 1).
  • bootstrapped_nstep_returns::Bool: if true, bootstrap the returns calculation starting from value(terminal_observation) instead of 0. See "Reinforcement Learning" by Sutton & Barto for further information.
  • value_feature_op: a function that transforms environment observations into a set of "features" to be consumed by value and valuefit!, with the following signatures:
    • value_feature_op(observations::AbstractVector{<:AbstractMatrix}) --> AbstractMatrix
    • value_feature_op(terminal_observations::AbstractMatrix, trajlengths::Vector{<:Integer}) --> AbstractMatrix
    In the above signatures, observations is a vector of per-trajectory observation matrices, terminal_observations has size (dim_o, number_of_trajectories), and trajlengths contains the length of each trajectory (such that trajlengths[i] == size(observations[i], 2)).
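To make the roles of gamma, gaelambda, bootstrapped_nstep_returns, and whiten_advantages concrete, the sketch below shows the standard discounted-return and Generalized Advantage Estimation computations for a single trajectory. It is a generic illustration of the usual formulas with illustrative default values, not LyceumAI's internal implementation.

```julia
using Statistics

# Discounted returns for one length-T trajectory. If `bootstrap` is true, the
# recursion is seeded with the value of the terminal observation
# (cf. bootstrapped_nstep_returns); otherwise it is seeded with 0.
function discounted_returns(rewards::AbstractVector, terminal_value::Real;
                            gamma = 0.995, bootstrap = false)
    returns = similar(rewards, float(eltype(rewards)))
    running = bootstrap ? float(terminal_value) : 0.0
    for t in length(rewards):-1:1
        running = rewards[t] + gamma * running   # reward[t] is weighted by gamma^(t - 1) in returns[1]
        returns[t] = running
    end
    return returns
end

# Generalized Advantage Estimation. `values` has length T + 1; the last entry
# is the value of the terminal observation.
function gae_advantages(rewards::AbstractVector, values::AbstractVector;
                        gamma = 0.995, gaelambda = 0.97)
    T = length(rewards)
    adv = zeros(T)
    running = 0.0
    for t in T:-1:1
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * gaelambda * running
        adv[t] = running
    end
    return adv
end

# whiten_advantages: statistical whitening of the stacked advantages.
whiten(x) = (x .- mean(x)) ./ (std(x) + 1e-8)
```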
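Similarly, max_cg_iter and cg_tol control a truncated conjugate-gradient solve of FIM * x = gradient, which only requires FIM-vector products rather than the full Fisher Information Matrix, and norm_step_size scales the resulting update. The schematic below is a generic CG routine (hypothetical names) together with the normalized update of equation 5 in the paper cited at the end of this section; it is not LyceumAI's own code.

```julia
using LinearAlgebra

# Approximately solve F * x = g given only a product function `Fv(v) == F * v`,
# stopping after `max_cg_iter` iterations or once the residual norm drops below `cg_tol`.
function conjugate_gradient(Fv, g::AbstractVector; max_cg_iter = 12, cg_tol = 1e-10)
    x = zero(g)
    r = copy(g)                  # residual g - F*x, with x = 0 initially
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:max_cg_iter
        Fp = Fv(p)
        alpha = rs / dot(p, Fp)
        x .+= alpha .* p
        r .-= alpha .* Fp
        rs_new = dot(r, r)
        sqrt(rs_new) < cg_tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end

# Normalized natural-gradient update (equation 5 of the referenced paper):
#   ng     = conjugate_gradient(Fv, g; max_cg_iter = max_cg_iter, cg_tol = cg_tol)  # ≈ inv(FIM) * g
#   theta .+= sqrt(norm_step_size / dot(g, ng)) .* ng
```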

When applying NaturalPolicyGradient to new continuous control tasks and environments, the following notes may be helpful:

  1. For two policies that both learn to complete a task satisfactorily, the larger one may not perform significantly better. A minimum amount of representational power is necessary, but larger networks may not offer quantitative benefits. The same goes for the value function approximator.
  2. Hmax needs to be sufficiently long for the correct behavior to emerge; N needs to be sufficiently large that the agent samples useful data. They may also be surprisingly small for simple tasks. These parameters are the main tunables when applying NaturalPolicyGradient.
  3. Assuming Hmax and N are appropriately chosen for the task, norm_step_size and max_cg_iter are the next most important parameters to consider when first testing NaturalPolicyGradient on a new task. gamma interacts with Hmax, while the default value of gaelambda has been found empirically to be stable across a wide range of tasks.

For more details, see Algorithm 1 in Towards Generalization and Simplicity in Continuous Control.
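Putting the pieces together, a rough end-to-end sketch might look like the following. Here MyEnv and my_env_tconstructor are hypothetical placeholders for a user-defined AbstractEnvironment and its batched constructor, policy, value, and valuefit! are the sketches shown earlier, the positional argument order is assumed to match the Arguments list above, the hyperparameter values are purely illustrative, and the training loop assumes the iterator-style interface used in the LyceumAI examples.

```julia
# Hypothetical usage; check the LyceumAI examples for the exact training-loop API.
npg = NaturalPolicyGradient(
    n -> my_env_tconstructor(n),   # env_tconstructor: returns n instances of MyEnv
    policy,
    value,
    valuefit!;
    Hmax = 500,
    N = 5000,
    gamma = 0.995,
    gaelambda = 0.97,
    norm_step_size = 0.05,
)

# Each iteration performs one natural policy gradient step: rollouts,
# advantage estimation, the conjugate-gradient solve, and the normalized update.
for (i, state) in enumerate(npg)
    i >= 100 && break
end
```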