ECEN 689: RL: Reinforcement Learning
Assignment 4

This homework will introduce you to the implementation of policy gradient algorithms. The learning outcomes of this assignment are:

  1. Introduction to policy gradient variants.
  2. Introduction to the RL coding pipeline.
  3. Introduction to the deep RL framework PyTorch.

1 Review

Let $V = V_{\pi,0} = \mathbb{E}_{x_0 \sim \rho_0(\cdot)}\left[V_\pi(x_0)\right]$, where $V_\pi(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \mid x_0 = x\right]$. We have looked at the following variants of basic policy gradient algorithms in class:

$$\nabla_\theta V = \mathbb{E}_{\tau \sim P_{\mathrm{traj},\pi}}\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(x_t, a_t) \right], \qquad (1)$$

$$\nabla_\theta V = \frac{1}{1-\gamma}\, \mathbb{E}_{(x,a) \sim d_\pi}\left[ Q_\pi(x, a)\, \nabla_\theta \log \pi_\theta(x, a) \right], \qquad (2)$$

$$\nabla_\theta V = \mathbb{E}_{\tau \sim P_{\mathrm{traj},\pi}}\left[ \sum_{t=0}^{T} \gamma^t\, Q_\pi(x_t, a_t)\, \nabla_\theta \log \pi_\theta(x_t, a_t) \right], \qquad (3)$$

$$\nabla_\theta V = \frac{1}{1-\gamma}\, \mathbb{E}_{(x,a) \sim d_\pi}\left[ A_\pi(x, a)\, \nabla_\theta \log \pi_\theta(x, a) \right]. \qquad (4)$$

Here, $\tau = (x_0, a_0, x_1, a_1, \ldots, x_T, a_T)$, with $a_t \sim \pi_\theta(x_t, \cdot)$, $x_{t+1} \sim P(\cdot \mid x_t, a_t)$, and $x_0 \sim \rho_0(\cdot)$, represents one trajectory sampled from the distribution $P_{\mathrm{traj},\pi}$; in (2) and (4), $d_\pi$ denotes the discounted state-action occupancy measure induced by $\pi_\theta$. The cumulative discounted reward of trajectory $\tau$ is given by $R(\tau) = \sum_{t=0}^{T} \gamma^t R(x_t, a_t)$. In (1) we scale the gradient of the log of the policy by the cumulative reward accumulated in that trajectory. This method results in high variance in the policy gradient update. In (2) and (3) we exploit causality (rewards in the past do not affect the policy) to reduce variance. The variance in the policy gradient can be further reduced by subtracting a baseline (generally the value function), as in (4).
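As a worked illustration of the causality idea (not an additional required variant), one common sampling-based form of (3) replaces the unknown Q-value by the empirical reward-to-go of the sampled trajectory:

$$\nabla_\theta V \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(x_t^i, a_t^i) \left( \sum_{t'=t}^{T} \gamma^{t'} R(x_{t'}^i, a_{t'}^i) \right),$$

so each log-probability gradient is weighted only by rewards obtained from time $t$ onward, rather than by the full trajectory return $R(\tau)$ used in (1).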

Each variant above is an instance of the policy gradient algorithm. The difference stems from which quantity the expectation is taken over and how the quantities within the expectation are estimated. Since the equations above are expectations, they cannot be used directly. A sampling-based estimator of Eqn. (1) that can be coded up is:

$$\nabla_\theta V \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\tau^i) \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(x_t^i, a_t^i) \right) \right). \qquad (5)$$

The equation above is an empirical mean of scaled gradient terms. Similar sampling-based estimators can be written for all the other variants.
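As a concrete illustration of Eqn. (5), the following is a minimal PyTorch sketch of a surrogate loss whose gradient (obtained via `loss.backward()`) is the negative of the estimator, so that minimizing the loss performs gradient ascent on $V$. The `policy.log_prob` interface and the trajectory layout are assumptions made for this sketch, not the interfaces of the starter code.

```python
import torch

def reinforce_loss(policy, trajectories, gamma=0.99):
    """Surrogate loss corresponding to the estimator in Eqn. (5).

    Assumptions (for illustration only): `policy.log_prob(state, action)` returns a
    differentiable scalar log pi_theta(a | x), and `trajectories` is a list of N
    trajectories, each a list of (state, action, reward) tuples.
    """
    total = torch.tensor(0.0)
    for traj in trajectories:
        # R(tau): cumulative discounted reward of the whole trajectory.
        ret = sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj))
        # Sum over t of log pi_theta(a_t | x_t), kept differentiable w.r.t. theta.
        log_prob_sum = torch.stack([policy.log_prob(s, a) for (s, a, _) in traj]).sum()
        total = total + ret * log_prob_sum
    # Negative empirical mean: minimizing this ascends the policy gradient in (5).
    return -total / len(trajectories)
```

Calling `reinforce_loss(...).backward()` leaves an estimate of $-\nabla_\theta V$ in the parameters' `.grad` fields, after which a plain gradient step or a `torch.optim` optimizer completes the update.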

Notation: In the following, we use $s$ for the state instead of our regular notation $x$. This is to avoid confusion with one of the environments.

2 Environment

For this assignment, we will be studying policy gradient algorithms using the following two environments:

  1. Point-v0: This is a simple custom 2D reaching environment designed to help us understand the basics of policy gradient algorithms. At the beginning of each episode, a point agent is thrown at a random location on a 2D square (the state space is the set of points $\{(x, y) : -1 \le x \le 1,\ -1 \le y \le 1\}$). The goal of the agent is to move to the origin as quickly as possible. The action is a 2D continuous vector, $(dx, dy)$, that encodes the agent's movement along each coordinate direction. A multivariate Gaussian distribution with identity covariance matrix is used as the action distribution. The mean of the multivariate Gaussian distribution is given by $\mu = \theta^\top \bar{s}$, where $\bar{s}$ is the vector obtained by appending a 1 to the state $s$ (the 1 is appended to capture the bias term often used in machine learning implementations). This implies that the PDF of the policy is given by $\pi_\theta(a \mid s) = \frac{1}{2\pi} e^{-\frac{1}{2}(a - \mu)^\top I (a - \mu)}$. Note that the policy is linear in the randomly initialized parameters ($\theta \sim \mathcal{N}(0,\ 0.01,\ \mathrm{size} = (|A|, |S| + 1))$). For further details on the environment, look into the file pointenv.py. A sample rendering of the environment is provided in assets/Point-v0rendering. A minimal sketch of this Gaussian policy appears after this list.
  2. CartPole-v0: This environment from OpenAI Gym consists of a pole that is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. CartPole-v0 defines solving as getting an average reward of 195.0 over 100 consecutive episodes.
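The following is a minimal sketch of the Point-v0 linear-Gaussian policy described above, assuming a 2D state and a 2D action. The class name, the interpretation of 0.01 as the standard deviation of the initialization, and the orientation of $\theta$ (shape $(|A|, |S|+1)$, applied to $\bar{s}$) are assumptions made for illustration; the starter code's actual policy class may differ.

```python
import torch

class LinearGaussianPolicy:
    """pi_theta(a | s) = N(a; mean depending linearly on s_bar, I), with s_bar = [s, 1]."""

    def __init__(self, state_dim=2, action_dim=2):
        # theta ~ N(0, 0.01) of shape (|A|, |S| + 1), as in the environment description
        # (0.01 is treated here as the standard deviation of the initialization).
        self.theta = 0.01 * torch.randn(action_dim, state_dim + 1)
        self.theta.requires_grad_(True)

    def _mean(self, state):
        s_bar = torch.cat([state, torch.ones(1)])   # append 1 to capture the bias term
        return self.theta @ s_bar                   # mean mu of the Gaussian

    def _dist(self, state):
        mu = self._mean(state)
        return torch.distributions.MultivariateNormal(mu, torch.eye(mu.shape[0]))

    def sample(self, state):
        # Draw an action (dx, dy) from N(mu, I); used for rollouts, not differentiable.
        return self._dist(state).sample()

    def log_prob(self, state, action):
        # log pi_theta(a | s), differentiable w.r.t. theta for the policy gradient.
        return self._dist(state).log_prob(action)
```

For example, `policy = LinearGaussianPolicy()` followed by `a = policy.sample(torch.tensor([0.3, -0.5]))` draws an action, and `policy.log_prob(s, a)` is the quantity that plugs into the estimator sketched after Eqn. (5).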

3 Questions

3.1 Point-v0

  1. Recall the PDF of $\pi_\theta$ given in the environment description. Analytically compute a closed-form expression for the gradient of the term $\log \pi_\theta(a \mid s)$ for a given action and state, i.e., $\nabla_\theta \log \pi_\theta(a \mid s)$. Include the expression in your submission report.
  2. Implement the policy gradient variants specified in (1) and (3) for Point-v0. For this part you will have to update the following files:
     (a) the mainloop.updateparams method in main.py;
     (b) the estimatenetgrad method in solutions/pointmasssolutions.py.
     The method arguments can be changed as needed. Use a learning rate of 0.1 for this environment. Plot the training curves for each of these variants (discounted sum of rewards per iteration of the training procedure). Summarize your observations. How fast does the algorithm learn? What average return do we see after 100 iterations of the algorithm? Include a rendering of the final policy.
  3. Bonus: Do you think including a baseline would improve the performance of these algorithms? Try implementing a simple time-dependent baseline as follows: simply set the baseline to the average of all the returns ($R(\tau)$) among the trajectories collected from the previous iteration. If there are no trajectories long enough at a certain time step, we can set the baseline to 0.

3.2 CartPole-v0

  1. Implement the policy gradient described in (1) on the CartPole-v0 environment. Plot the undiscounted cumulative reward per episode. (Note: We do not expect you to solve CartPole-v0 using this approach.)
  2. Implement the policy gradient described in (3) on the CartPole-v0 environment. Plot the undiscounted cumulative reward per episode. (Note: We expect you to solve CartPole-v0 using this approach.)
  3. Implement the policy gradient described in (4) on the CartPole-v0 environment. Plot the undiscounted cumulative reward per episode. (Note: We expect you to solve CartPole-v0 using this approach.)
  4. Summarize your observations on the three approaches to implementing the REINFORCE algorithm.

4 Coding Pipeline

Summary: sample trajectories → estimate returns/advantages → train/update parameters. The basic code flow for this assignment (and for reinforcement learning algorithms in general) is illustrated in Fig. 1 and Fig. 2. We define an agent class that includes the capability to interact with the environment and generate batch data of the form $\{(s_t^i, a_t^i, r_t^i, s_{t+1}^i, \mathrm{mask}_t^i)\}$, $t = 0, \ldots, T$, $i = 1, \ldots, N$. (This step is covered by numbers (1) and (2) in Fig. 1.) The last term, mask, is a binary variable that marks the end of an episode: it is 0 only when the episode ends and takes the value 1 otherwise. That is, at each iteration of the algorithm the agent samples $N$ trajectories of $T$ time steps each. We include a single parameter for the batch size, called min-batch-size, in main.py. The variable batch defined in main.py takes the form shown in Fig. 2. In each iteration, upon collection of the batch data, the function updateparams is called to:

  1. estimate the returns/advantages (step (3) in Fig. 1),
  2. compute the policy gradient (steps (4) and (5) in Fig. 1),
  3. finally, update the policy parameters (step (6) in Fig. 1).
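As an illustration of step (3) and of how the mask is consumed, here is a minimal sketch of a discounted reward-to-go computation over a flat batch laid out as in Fig. 2. The function name and the flat tensor layout are assumptions for this sketch, not the starter code's API.

```python
import torch

def discounted_reward_to_go(rewards, masks, gamma=0.99):
    """Per-timestep discounted reward-to-go over a flat batch.

    `rewards` and `masks` are 1-D tensors laid out trajectory after trajectory,
    as in Fig. 2; mask[t] = 0 marks the last step of an episode, so the running
    sum is reset at episode boundaries when scanning backwards.
    """
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.shape[0])):
        # Zeroing via the mask prevents rewards from one episode leaking into the previous one.
        running = rewards[t] + gamma * masks[t] * running
        returns[t] = running
    return returns
```

These per-timestep returns (optionally with a baseline subtracted) are the quantities that would typically multiply $\nabla_\theta \log \pi_\theta$ inside the parameter update.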

5 Code Setup Instructions

The basic code is implemented in Python 3.9 and PyTorch 1.8.1 and uses OpenAI Gym. We can think of PyTorch as a wrapper around Python: it is as easy to code in PyTorch as in Python.

[Figure 1: Coding pipeline for policy gradient algorithms. The agent (policy and value) interacts with the environment (reward $R$, transitions $P$) to collect $(s, a, r, s', \mathrm{mask})$ tuples (steps (1)-(2)); $R$, $Q$, or $A$ estimates are computed from the batch (step (3)); a loss of the form $L = -\mathbb{E}[(\cdot)\, \log \pi]$ is estimated and its gradient is used to update the policy parameters (steps (4)-(6)).]
[Figure 2: Batch. The batch stacks, for each of the $N$ sampled trajectories, the per-timestep rows of state ($s_1, \ldots, s_T$), action ($a_1, \ldots, a_T$), reward ($r_1, \ldots, r_T$), next state, and mask (1 at every step except the final step of an episode, where it is 0).]

It is advisable to set up a virtual environment (such as an Anaconda environment) for solving this assignment.

Setting up a conda environment. A conda environment can be set up as follows:

  1. Anaconda installation:
     (a) Download the binaries from the Anaconda site (on a Linux machine, you can use curl or wget to do so: wget anacondabinaryname.sh).
     (b) Change the permissions on the binaries so that they can be executed (be careful with chmod and chown).
     (c) Execute the binaries (use: bash anacondabinaryname.sh).
  2. Conda environment creation with Python 3.9: conda create --name envname python=3.9. Use this conda cheat sheet for help.

Install the following inside your conda environment.

Gym installation. Use the following set of commands to install Gym:

  1. git clone https://github.com/openai/gym
  2. cd gym
  3. pip install -e .

PyTorch installation. Use conda install pytorch torchvision -c conda-forge to install PyTorch 1.8.1.
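A quick, optional sanity check of the installation is sketched below, assuming the classic Gym API (env.reset() returning an observation and env.step() returning a 4-tuple), which matches Gym releases from the PyTorch 1.8.1 era.

```python
# Run inside the activated conda environment to confirm the installs above worked.
import gym
import torch

print(torch.__version__)          # expect a 1.8.x version string

env = gym.make("CartPole-v0")     # the environment used in Section 3.2
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())
print(state, reward, done)
env.close()
```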

IMPORTANT: Remember that while solving coding assignments, Stack Overflow is your best friend. Please search on Stack Overflow when you hit roadblocks.

6 Submission Details

  1. Submit a zip file consisting of your completed code and a report that lists your findings and answers to all the questions above. Do not add your code to the report.
  2. Put the following files in a folder named LastNameFirstName:
     • All the code files.
     • A PDF report of your findings.
  3. Zip the folder in item 2 and submit it on eCampus.