# Machine Learning Problem Description: Grid World

This is an introductory artificial-intelligence project.

Problem description: Grid World is a 2D rectangular grid of size (Nrows, Ncolumns) with an agent starting off at one grid cell, moving from cell to cell through the grid, and eventually exiting after collecting a reward. This grid environment is described as follows:

State space: Grid World has Nrows × Ncolumns distinct states. We use s to denote the state. The agent starts in the bottom-left cell (row 1, column 1, marked as a green cell). There exist one or more terminal states (blue cells) that can be located anywhere in the grid (except the bottom-left cell). There may also be walls (red cells) that the agent cannot move to. An instance of this grid is shown in the figure below.

Actions: At every non-terminal state, the agent can either walk or run in any of the four directions (Up, Down, Left, and Right), which results in 8 possible actions: Walk Up, Walk Down, Walk Left, Walk Right, Run Up, Run Down, Run Left, Run Right. At the terminal state, the only possible action is Exit. We use A(s) to denote the set of all possible actions at state s.
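The state and action sets above can be sketched as follows. This is a minimal illustration; the helper names (`actions`, `MOVES`, `DIRECTIONS`) are my own, and only the action strings themselves come from the spec:

```python
# A minimal sketch of A(s) for Grid World. States are (row, column) tuples
# with row 1 at the bottom, as in the problem description.
MOVES = ["Walk", "Run"]
DIRECTIONS = ["Up", "Down", "Left", "Right"]

def actions(state, terminals, walls):
    """Return A(s): only Exit at a terminal state, nothing at a wall
    (walls are never reached; the printed action there is "None"),
    and the 8 walk/run actions everywhere else."""
    if state in terminals:
        return ["Exit"]
    if state in walls:
        return []
    return [m + " " + d for m in MOVES for d in DIRECTIONS]
```

For a terminal at (3,3), `actions((3, 3), {(3, 3): 10.0}, set())` yields `["Exit"]`, while any ordinary cell gets all 8 move actions.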

Transition model: Grid World is stochastic because the actions can be unreliable. In this environment, action Walk X (X can be Up, Down, Left, or Right) moves the agent one cell in the X direction with probability pwalk, but with probabilities 0.5(1 − pwalk) and 0.5(1 − pwalk), it moves the agent one cell at angles of 90° and −90° to the direction X, respectively. Furthermore, action Run X moves the agent two cells in the X direction with probability prun, but with probabilities 0.5(1 − prun) and 0.5(1 − prun), it moves the agent two cells at angles of 90° and −90° to the direction X, respectively.
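This transition model can be sketched in code. All names here are my own; one detail is an inference rather than an explicit rule: the Run Right example below (where the wall at (4,2) blocks a two-cell move from (4,1)) suggests that a Run fails if *any* cell along the way is a wall, not just the destination, and the sketch assumes that:

```python
# Sketch of the Grid World transition model. Rows grow upward, so "Up" is +1 row.
DIRS = {"Up": (1, 0), "Down": (-1, 0), "Left": (0, -1), "Right": (0, 1)}
# Perpendicular directions at +90 / -90 degrees to each heading.
PERP = {"Up": ("Left", "Right"), "Down": ("Right", "Left"),
        "Left": ("Down", "Up"), "Right": ("Up", "Down")}

def resolve(state, direction, step, walls, n_rows, n_cols):
    """Move `step` cells in `direction`; if any traversed cell is a wall or
    off-grid, the move fails and the agent stays put (assumption: a Run is
    blocked by an intermediate wall, consistent with the examples below)."""
    dr, dc = DIRS[direction]
    path = [(state[0] + k * dr, state[1] + k * dc) for k in range(1, step + 1)]
    for r, c in path:
        if not (1 <= r <= n_rows and 1 <= c <= n_cols) or (r, c) in walls:
            return state
    return path[-1]

def transitions(state, action, p_walk, p_run, walls, n_rows, n_cols):
    """P(s' | s, a) as a dict {next_state: probability}."""
    kind, x = action.split()                      # e.g. "Walk Up"
    step, p = (1, p_walk) if kind == "Walk" else (2, p_run)
    dist = {}
    for direction, prob in [(x, p),
                            (PERP[x][0], 0.5 * (1 - p)),
                            (PERP[x][1], 0.5 * (1 - p))]:
        s2 = resolve(state, direction, step, walls, n_rows, n_cols)
        dist[s2] = dist.get(s2, 0.0) + prob       # merge outcomes that coincide
    return dist
```

Merging coinciding outcomes is what produces totals like prun + 0.5(1 − prun) for staying in place when two of the three possible moves are blocked.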

If moving in a particular direction causes the agent to bump into a wall, the movement fails, and the agent stays in the same cell (i, j). We write P(s′ | s, a) to denote the probability of reaching state s′ if action a is done in state s. The following examples illustrate the environment dynamics:

- Assume that the agent chooses action Walk Up at (4,4) as shown in the figure below. This action moves the agent to (5,4) with probability pwalk, but with probability 0.5(1 − pwalk), it moves the agent right to (4,5), and with probability 0.5(1 − pwalk), it moves the agent left to (4,3).
- Assume that the agent chooses action Run Up at (4,4) as shown in the figure below. This action moves the agent two cells up, but because it causes the agent to bump into a wall, the agent stays at (4,4) with probability prun. With probability 0.5(1 − prun), the agent moves two cells right to (4,6). Finally, this action moves the agent two cells left with probability 0.5(1 − prun), but because of the wall at (4,2), the agent stays at (4,4). Hence, as a result of this action, the agent moves to (4,6) with probability 0.5(1 − prun) and stays at (4,4) with probability prun + 0.5(1 − prun).

- Assume that the agent chooses action Walk Right at (4,1) as shown in the figure below. Then, the agent moves to (5,1) and (3,1), each with probability 0.5(1 − pwalk), and stays at (4,1) with probability pwalk.

- Assume that the agent chooses action Run Right at (4,1) as shown in the figure below. Then, the agent moves to (2,1) with probability 0.5(1 − prun) and stays at (4,1) with probability prun + 0.5(1 − prun).

Rewards: When the agent takes action a in state s, it receives a reward, R(s, a). For all non-terminal states s:

- R(s, Walk X) = Rwalk (a constant).
- R(s, Run X) = Rrun (a constant).

Furthermore, if there are K terminal states, the reward in the k-th terminal state s is R(s, Exit) = Rterminal(k) (a constant). The agent prefers receiving its reward as soon as possible. Therefore, it applies a discount factor, γ, to future rewards. For example, if it runs once and then exits from the k-th terminal state, it will receive a total reward of Rrun + γ Rterminal(k). If it instead walks twice and then exits from the k-th terminal state, it will receive a total reward of Rwalk + γ Rwalk + γ² Rterminal(k).
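The two worked totals above can be checked with a one-line helper (the reward and γ values here are placeholders of my own, not from the spec):

```python
def discounted_return(rewards, gamma):
    """Sum of r_t * gamma**t over a sequence of per-step rewards."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Run once, then exit:   R_run + gamma * R_terminal
# Walk twice, then exit: R_walk + gamma * R_walk + gamma**2 * R_terminal
```

With Rwalk = −0.1, Rterminal = 10, and γ = 0.9, walking twice then exiting gives −0.1 + 0.9·(−0.1) + 0.81·10, exactly the second formula above.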

In this assignment, you need to find the action that the agent should take at each state to maximize its expected reward. You can use any algorithm for this homework.

## Implementation Notes
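Since any algorithm is allowed, one standard choice is value iteration. The sketch below is a generic version under my own MDP encoding (`transition(s, a)` returns a `{next_state: probability}` dict, and an Exit action maps to an empty dict); it also breaks ties using the preference order given in the implementation notes:

```python
# Tie-breaking order from the implementation notes.
PREFERENCE = ["Walk Up", "Walk Down", "Walk Left", "Walk Right",
              "Run Up", "Run Down", "Run Left", "Run Right"]

def best_action(q):
    """Pick the max-Q action from a {action: Q-value} dict, breaking ties
    by the preference order (Exit and None states are handled separately)."""
    best = max(q.values())
    for a in PREFERENCE:
        if a in q and q[a] == best:
            return a

def value_iteration(states, actions, transition, reward, gamma, tol=1e-9):
    """Generic value iteration. transition(s, a) -> {s': p};
    reward(s, a) -> float; an Exit action yields an empty successor dict."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:                      # wall state: value stays 0
                continue
            best = max(reward(s, a) +
                       gamma * sum(p * V[s2]
                                   for s2, p in transition(s, a).items())
                       for a in acts)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Note that γ may equal 1 here (the spec allows the closed interval [0,1]), in which case convergence needs care; this is a sketch, not a complete solution.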

1. Ties between the possible actions are broken based on the following preference order: Walk Up > Walk Down > Walk Left > Walk Right > Run Up > Run Down > Run Left > Run Right.

2. The grid will contain at most 1,000,000 cells.
3. As the wall states are never reached, it does not matter what the optimal actions are at these states. However, for consistency, you should print None as the optimal action for a wall state.

## File Formats

Input format: The file input.txt is structured as follows:

- The first line includes two positive integers separated by a comma, where the first one is the number of rows (Nrows) of the grid, and the second one is the number of columns (Ncolumns) of the grid.
- The next line is a number (greater than or equal to zero) that specifies the number of wall cells in the grid.
- The following lines (one per wall) each include 2 values separated by a comma. The two values in each line indicate the row and column numbers of the grid where a wall is located.
- The next line is a number (greater than or equal to one) that specifies the number of terminal states in the grid.
- The following lines (one per terminal state) each include 3 values separated by a comma. The first two values in the k-th line indicate the row and column numbers of the grid where the k-th terminal state is located, and the third value (a float) denotes Rterminal(k).
- The next line contains two probability values separated by a comma, where the first value is pwalk and the second value is prun.
- The next line contains two floating-point numbers separated by a comma, where the first value is Rwalk and the second value is Rrun.
- The last line contains the discount factor γ, which is a number in the interval [0,1].
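A parser for this layout might look like the sketch below. The line ordering follows the sequence in which the format section describes the fields (the original text is partially garbled, so treat that ordering as an inference to verify against the provided sample inputs):

```python
def parse_input(text):
    """Parse the input.txt format described above. Returns
    (n_rows, n_cols, walls, terminals, p_walk, p_run, r_walk, r_run, gamma)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    n_rows, n_cols = map(int, lines[0].split(","))
    n_walls = int(lines[1])
    walls = {tuple(map(int, lines[2 + i].split(",")))
             for i in range(n_walls)}
    k = int(lines[2 + n_walls])
    terminals = {}
    for i in range(n_walls + 3, n_walls + 3 + k):
        r, c, reward = lines[i].split(",")
        terminals[(int(r), int(c))] = float(reward)
    p_walk, p_run = map(float, lines[n_walls + 3 + k].split(","))
    r_walk, r_run = map(float, lines[n_walls + 4 + k].split(","))
    gamma = float(lines[n_walls + 5 + k])
    return n_rows, n_cols, walls, terminals, p_walk, p_run, r_walk, r_run, gamma
```

For example, a 3×4 grid with one wall at (2,2), one terminal at (3,4) worth 10.0, pwalk = 0.8, prun = 0.7, Rwalk = −0.1, Rrun = −0.2, and γ = 0.9 would be the 8-line file `3,4 / 1 / 2,2 / 1 / 3,4,10.0 / 0.8,0.7 / -0.1,-0.2 / 0.9` (slashes standing in for newlines).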

Output format: The file output.txt contains Nrows lines where each line includes Ncolumns strings separated by a comma. The j-th string on the i-th line should indicate the optimal action (one of Walk Up, Walk Down, Walk Left, Walk Right, Run Up, Run Down, Run Left, or Run Right for non-terminal states; Exit for terminal states; or None for wall states) at state (Nrows − i + 1, j).

## Grading

Your homework submission will be tested as follows: Your program should take no command-line arguments. It should read a text file called input.txt in the current directory that contains a problem definition. It should write a file output.txt with your solution. The formats for files input.txt and output.txt are specified above. End-of-line convention is Unix (because Vocareum is a Unix system).

During grading, 50 test cases will be used. If your homework submission passes all 50 test cases, you receive 100 points. For each test case that fails, you will lose 2 points. Note that if your code does not compile, or somehow fails to load and parse input.txt, or writes an incorrectly formatted output.txt, or no output.txt at all, or OuTpUt.TxT, you will get zero points.

## Sample Test Cases

You are provided with 3 sample test cases and outputs. Please find them in the Homework 3 folder.

## Homework Rules

1. You must use Python 2.7 to implement your homework assignment. You are allowed to use standard libraries only. You have to implement any other functions or methods by yourself.
2. You need to create a file named hw3cs561s2018.py. When you submit the homework on labs.vocareum.com, the following command will be executed: `python hw3cs561s2018.py`
3. Your code must create a file named output.txt and print its output there. For each test case, the grading script will put an input.txt file in your work folder, run your program (which reads input.txt), and check the output.txt file generated by your code. The grading script will replace the files automatically, so you do NOT need to do anything for that part.
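The row mapping in the output format (file line i holds grid row Nrows − i + 1, i.e. the top grid row comes first) is easy to get backwards. A small formatting helper, with names of my own choosing:

```python
def format_output(policy, n_rows, n_cols):
    """policy: dict (row, col) -> action string for every cell.
    File line i corresponds to grid row n_rows - i + 1 (top row first)."""
    out_lines = []
    for i in range(1, n_rows + 1):
        grid_row = n_rows - i + 1
        out_lines.append(",".join(policy[(grid_row, j)]
                                  for j in range(1, n_cols + 1)))
    return "\n".join(out_lines)
```

On a 2×2 grid, the actions for row 2 (the top row) appear on the first output line and row 1 (the agent's starting row) on the second.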