Project 4: Deep Q-Network

In this project, you will be coding up the Deep Q-Network (DQN) algorithm and using it to create a policy for the same arm dynamics problem from the previous project. We will be using a 2-link arm with relatively easy-to-reach goals.

Project Setup

For this project, we will use Anaconda as our python virtual environment manager. Install Anaconda following the
instructions here.
After you successfully install Anaconda, check out Project 4 from the SVN server:

Install the virtual environment:

cd project4

conda create --name project4 --file spec-file.txt

You can then activate and deactivate the virtual environment anywhere in the terminal with:

conda activate project4

conda deactivate

Important: DO NOT install any other libraries or dependencies, or a different version of an already provided package. The autograder will give you a 0 if you import libraries that are not specified in spec-file.txt. If you are concerned you may have accidentally imported something that isn't in spec-file.txt, delete your conda environment, re-create it, and re-run your code to see if it still runs without error. You can also test your code on the MechTech lab machines, which don't have any additional libraries installed.

Starter Code Explanation

In addition to code you are already familiar with from the previous project (e.g. arm dynamics), we are providing an "Environment" in the ArmEnv class. The environment "wraps around" the arm dynamics and provides the key functions that an RL algorithm expects: reset(..) and step(..). Take a moment to familiarize yourself with these functions! Important notes:

● The ArmEnv expects an action similar to the one used previously: a vector with a torque for every arm joint. Thus, the native action space for this environment is high-dimensional and continuous. DQN will require an action space that is 1-dimensional and discrete. You will need to convert between these. For example, you can have an action space of [0, 1, 2], where each number just represents the identity of an action candidate, and a conversion dictionary {0: [-0.1, -0.1], 1: [0.1, 0.1], 2: [0, 0]}. Then, when the Q-network outputs action 1, it will be converted into [0.1, 0.1] and used by the environment. Note that this is just an example method to implement the conversion and you do not have to follow the same procedure (see the sketch after this list).

● The observation provided by the environment will comprise the same state vector as before, to which we append the current goal for the end-effector. Since your policy must learn to reach arbitrary goals, the goal must be provided as part of the observation.

● The maximum episode length of the environment is 200 steps. This should be used for both training and testing.

● The reward function of this environment is, by default, r(s, a) = -dist(pos_ee, goal)^2, which represents the negative square of the L2 distance between the current position of the end-effector and the goal position. Using this function, a reference solution for this project solves the task when it reaches returns of about -3 using a maximum episode length of 200.
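
As a concrete illustration of the points above, here is a minimal sketch of one episode of interaction with the environment. The action table, the helper name run_episode, and the assumption that step(..) follows the usual Gym-style return convention (observation, reward, done, info) are illustrative choices, not part of the starter code; check the ArmEnv source for the exact signatures.

import numpy as np

# Hypothetical mapping from discrete action indices to per-joint torques (2-link arm).
ACTION_TABLE = {0: [-0.1, -0.1], 1: [0.1, 0.1], 2: [0.0, 0.0]}

def run_episode(env, choose_discrete_action):
    # env: an ArmEnv instance; choose_discrete_action: any function mapping an
    # observation to a discrete action index (e.g. an epsilon-greedy policy).
    obs = env.reset()                 # observation = arm state with the goal appended
    episode_reward = 0.0
    for _ in range(200):              # maximum episode length is 200 steps
        discrete_action = choose_discrete_action(obs)
        continuous_action = np.array(ACTION_TABLE[discrete_action])
        obs, reward, done, info = env.step(continuous_action)
        episode_reward += reward
        if done:
            break
    return episode_reward

During training, choose_discrete_action would typically be an epsilon-greedy rule over your Q-network's outputs.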

Instructions

You must edit two classes, QNetwork and TrainDQN. Each is found in a file of the same name as the class. Details are below.

QNetwork

This class defines the architecture of your network. You must fill in the __init__(..) function, which defines your network, and the forward(..) function, which performs the forward pass.

Your action space should be discrete, with whatever cardinality you decide. The size of the output layer of your Q-Network should thus be the same as the cardinality of your action space. When selecting an action, a policy must choose the one that has the highest estimated Q-value for the current state. As part of the QNetwork class, we are providing the function select_discrete_action(..), which does exactly that.
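
For orientation, here is one minimal way such a network could look. The layer sizes, constructor arguments, and activation choice are placeholder assumptions; keep whatever signature and helper functions the starter QNetwork already defines (including the provided select_discrete_action(..)).

import torch.nn as nn

class QNetworkSketch(nn.Module):
    # Maps an observation (arm state + goal) to one Q-value per discrete action.
    def __init__(self, obs_dim, num_discrete_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, num_discrete_actions),  # one output per discrete action
        )

    def forward(self, obs):
        # obs: float tensor of shape (batch_size, obs_dim);
        # returns Q-values of shape (batch_size, num_discrete_actions).
        return self.net(obs)

Given such a forward pass, select_discrete_action(..) simply picks the action index with the highest Q-value for the current observation.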

The arm environment itself, however, expects a 2-dimensional, continuous action vector. Therefore, when it comes time to send an action to the environment, you must provide the kind of action the environment expects. It is your job to determine how to convert between the discrete action space of your Q-Network and the continuous action space of the arm. You do this by filling in the action_discrete_to_continuous(..) function in your QNetwork. You can expect to call the step function of the environment like this:

self.env.step(self.q_network.action_discrete_to_continuous(discrete_action))
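
One possible way to fill in that conversion is shown below, using a combinatorial discretization of per-joint torques. The torque magnitudes and the choice of 3 levels per joint are placeholders, and this method body is meant to live inside your QNetwork class; depending on exactly what ArmEnv expects, you may also need to reshape the returned vector.

import itertools
import numpy as np

def action_discrete_to_continuous(self, discrete_action):
    # Example discretization: every combination of {-0.1, 0.0, +0.1} per joint,
    # giving 3^2 = 9 discrete actions for the 2-link arm.
    torque_choices = [-0.1, 0.0, 0.1]
    table = list(itertools.product(torque_choices, repeat=2))
    return np.array(table[discrete_action])

Building the table inside the method keeps the sketch short; in practice you would likely precompute it once in __init__.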

TrainDQN

Here, you must fill in the train(..) function that actually trains your network.

We are providing a helper function called save_model(..) that will save the current Q-network. Use this as you see fit.

To set one network equal to another one, you can use code like this:

target_network.load_state_dict(self.q_network.state_dict())
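
To make the training structure concrete, below is a heavily simplified sketch of a single DQN update step. The replay-buffer batch format, discount factor, loss choice, and function name are all assumptions of this sketch, not requirements of the starter code.

import torch
import torch.nn.functional as F

def dqn_update(q_network, target_network, optimizer, batch, gamma=0.99):
    # batch is assumed to be a tuple of tensors:
    #   obs (B, obs_dim), actions (B,), rewards (B,), next_obs (B, obs_dim),
    #   dones (B,), where 1.0 marks a terminal transition.
    obs, actions, rewards, next_obs, dones = batch

    # Q-values of the actions that were actually taken.
    q_values = q_network(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped targets computed with the (periodically synced) target network.
    with torch.no_grad():
        next_q = target_network(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Inside train(..), you would call something like this on minibatches sampled from a replay buffer, re-sync the target network every so many updates using the load_state_dict call shown above, and periodically checkpoint with save_model(..).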

If you would like to be graded with a specific seed for the random number generators, make sure to change the default seed in the initialization of the TrainDQN class.
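
If reproducibility matters for your runs, a common pattern is to seed every random number generator you use from that same value (assuming the starter code does not already do this for you):

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed all random number generators that can affect training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)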

Grading

The script enjoy_dqn.py can be used to test your code. If you run it without any arguments, it will train your Q-Network from scratch, then test it. This is how we will run your code for grading:

python3 enjoy_dqn.py --time_limit 420

While developing, you can also test a pre-saved model, like so:

python3 enjoy_dqn.py --model_path models/2022-04-10_12-04-17/episode_000100_reward_-114/q_network.pth

Note that this functionality is provided purely to help you develop. When grading, we will NOT be using pre-saved models. In this project, we will limit the time for training to 7 minutes. The time constraints are already implemented in both enjoy_dqn.py and train_dqn.py (please see the __main__ of each file).

You can pass the --gui flag to either script (train_dqn or enjoy_dqn) and then you will also see what the policy is doing. This will greatly slow down training, though.

The grader will run five episodes, each with a different goal. For each episode, we will compute the total reward. If your episode reward is higher than what we consider to be an easy target for that goal, you will get 2 points. If it is also higher than what we consider to be a more difficult target, you will receive an additional point. The max is thus 3 points for each goal, for a total of 15 for the project.