Assignment 4: CS7641 – Machine Learning
Saad Khan November 29, 2015
The purpose of this assignment is to apply some of the techniques learned from reinforcement learning to make decisions i.e. to analyze work of an agent from a machine learning perspective. To be specific, the task is to explore Markov Decision Processes (MDPs) that are different from one another in the sense that first problem consists of large number of states (in this case more than a thousand) while the other has very few states (less than hundred). The analysis performed in this assignment is based on two planning algorithms and one learning algorithm. Details of these are in the following sections.
The implementation of the MDPs along with the analysis was done using code written in Java that used the BURLAP Reinforcement Learning package. The three algorithms used were value iteration and policy iteration from the planning section of BURLAP along with Q-learning from the learning section. Computation times measurements all throughout this report are average of three runs for each algorithmic configuration.
The names I have given to the problems covered in this assignment are ’Maze Solving Problem’ and the ’Sherpa Rescue Problem’. Following sections will briefly introduce these MDPs and analysis will come later.
3.1 Maze Solving Problem
This MDP is a simple maze solution implementation which can practically be applied to bots in computer games, specifically first person shooting (FPS) games. FPS bot artificial intelligence (AI) generally consists of optimal path-finding, picking up and using objects as clues to solve missions. For simplicity, we will only focus on path-finding aspect of these games in this assignment by making an agent find the solution to a maze and get to the end of a mission as soon as possible.
For this problem, to demonstrate the performance of planning and learning algorithms, I have considered a fairly large maze (Maze A) with dimensions of 59 x 59 cells and consisting of 1799 unique states. To analyze performance of the algorithms for different number of states, along with this primary maze (Maze A) I have used two other variants as shown in Figure 1. Maze B has the same dimensions as Maze A but has more wider maze paths and as result has 2398 states. On the other hand, Maze C has around 30% lesser states, at 799, but has the same path width as Maze A. Although, performance of these mazes cannot be compared directly but we will see how they fair against each other.
For each maze and the algorithm tested, the bot is initialized at [0,0] at the bottom left corner of the maze and termination condition is set at top right corner, i.e. [58, 58] for the two larger mazes and [38, 38] for the smaller one. The bot has to make decisions when it reaches an intersection, i.e. a point where it has a choice of going north, south, east or west (the four possible actions the robot might take). These directions are probabilistically set with the probability of success in the intended direction at 0.8 and 0.2 otherwise. GridWorldDomain, available in BURLAP, was used to set the domain as Map and a simple reward function was implemented with goal/termination state reward as 0 while a small negative reward of -0.1 for all other states. Setting up the reward function this way along with the terminal function of single goal (TFGoalCondition) encourages the bot to perform path-finding in a way so that it tries to reach the end of the maze as quickly as possible.
All of the above mazes were generated using the link : http://www.delorie.com/game-room/mazes/genmaze.cgi which was sug- gested by a class mate on piazza. The generated mazes were than manipulated in Excel (to convert to either 1s for walls) or 0s (for paths) before being passed to the BURLAP domain constructor.
3.2 Sherpa Rescue Problem
Most deaths on Mt.Everest and other high peaks are attributed to avalanches, injuries from falls or ice collapse into crevasses. Search and rescue missions to find the casualties are sent but not all of the bodies are located and brought back down, in fact on these missions even the rescue team members have a greater probability of not returning back safely. These search and rescue teams mostly consist of local Tibetans/Nepalese known as Sherpa people. As an avid hiker and a mountain lover, I thought of a scenario where an agent(s) could replace the search and rescue teams for recovering the dead bodies from up there rather than sending human teams with greater risk of loosing lives again. Here an MDP problem can be thought of by considering one such robot which would perform this task. For simplicity, we will only focus on the search part of the team where an agent climbs up the mountain from the south eastern part up to the summit in search of casualties.
To make this a little interesting, I have tried to implement this as a simple grid seen from a birds eye view of a mountain with summit [G] in the center along with few crevasses [red cells] and rock faces [no go black cells]. The primary grid and its variants are shown in Figure 2. To explore different states, I have named them Everest (as the primary grid) with 9 x 9 grid and 71 states (goal state at [4, 4]), K2 with 7 x 7 grid and 43 states (goal state at [3, 3]) and Denali with 5 x 5 grid and 21 states (goal state at [2,2]). Again for simplicity, I have considered these peaks to be in 2D excluding effects of altitude variation.
For this problem the bot will have a choice of going north, south, east or west to reach the summit. These moves are probabilistically set at 0.8 in the desired direction and 0.2 in any undesired direction. A reward function with goal reward of 0 is set up along with small negative reward of -0.04 for all other states except states which are marked red as crevasses. For these states, a harsh negative reward of -2 is set in order for the agent to avoid slipping into a crevasse. Goal is to safely search up till the summit as quickly as possible without falling into any crevasses.
All of these grids representing the three peaks were created manually and the crevasses [red cells] and rock face cliffs [black cells] were placed arbitrarily.
4 Algorithms 4.1 Value Iteration
Value iteration (sometimes called a ’backward function’) is a planning algorithm that computes an optimal MDP policy in a reverse fashion as it starts from the ’goal’ state and works backward to the initial state refining the estimate of the best policy and value. The process of propagating backwards is never ending so an arbitrary end point is chosen to stop the process when the algorithm
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: email@example.com 微信:itcsdx