AIMA: Artificial Intelligence: A Modern Approach

AIMA-exercises is an open-source community of students, instructors, and developers. Anyone can add an exercise, suggest answers to existing questions, or simply help us improve the platform. We accept contributions on this GitHub repository.

Exercise 21.2

Chapter complex-decisions-chapter defined a proper policy for an MDP as one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model for which its policy $\pi$ is improper even if $\pi$ is proper for the true MDP; with such models, the POLICY-EVALUATION step may fail if $\gamma = 1$. Show that this problem cannot arise if POLICY-EVALUATION is applied to the learned model only at the end of a trial.
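To see why POLICY-EVALUATION can fail, recall that for a fixed policy it solves the linear system $U = R + \gamma T U$, i.e. $(I - \gamma T)U = R$, where $T$ is the learned transition matrix over non-terminal states. If the learned model makes $\pi$ improper, some state never reaches a terminal state, and with $\gamma = 1$ the matrix $I - T$ becomes singular. The following sketch illustrates this with a hypothetical two-state MDP (one non-terminal state, one terminal state); the specific rewards and probabilities are invented for illustration:

```python
import numpy as np

# Hypothetical example: state 0 is non-terminal, state 1 is terminal.
# Under the true model, pi moves from state 0 to the terminal state
# with probability 1, so pi is proper.
R = np.array([-0.04])   # reward in the non-terminal state (illustrative value)
gamma = 1.0

# True transition matrix restricted to non-terminal states:
# state 0 never returns to itself (all mass goes to the terminal state).
T_true = np.array([[0.0]])

# After a few unlucky trials, the passive ADP agent might estimate a
# self-loop at state 0 with probability 1 -- an improper model for pi.
T_learned = np.array([[1.0]])

# Policy evaluation solves (I - gamma * T) U = R.
A_true = np.eye(1) - gamma * T_true
U_true = np.linalg.solve(A_true, R)
print(U_true)                       # finite utility under the proper model

A_learned = np.eye(1) - gamma * T_learned
print(np.linalg.det(A_learned))    # 0.0: singular system, no solution
```

With $\gamma = 1$ and the improper learned model, the system has no (finite) solution, so POLICY-EVALUATION fails; with the proper true model the utilities are well defined. Waiting until the end of a trial avoids the problem because every state observed during a completed trial has, by construction, a recorded path to a terminal state in the learned model.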
