Hello Peers, today we are going to share **all weeks' assessment and quiz answers** of the **Sample-based Learning Methods** course launched by **Coursera**, totally **free of cost** ✅. This is a **certification course** for every interested student.

In case you didn't find this course for free, then you can **apply for financial aid** to get this course totally free.

*Check out this article – "How to Apply for Financial Aid?"*

**About Coursera**

**Coursera**, one of the **world's biggest learning platforms**, offers thousands of free courses for students. These courses come from various recognized universities, where industry experts and professors teach in a clear and understandable way.

Here, you will find the **Sample-based Learning Methods Coursera Exam Answers** in **bold color**, which are given below.

These answers were updated recently and are **100% correct ✅** answers for all weeks' assessments and the final exam of **Sample-based Learning Methods**, a **Coursera free certification course**.

Use "Ctrl+F" to find any question's answer. On mobile, just tap the three dots in your browser and you will get a "Find" option there. Use this option to jump to any question.

**About Sample-based Learning Methods Coursera Course**

In this course you will discover numerous algorithms that can develop near-optimal policies through trial-and-error interaction with the environment, learning from the agent's own experience. The striking thing about learning from actual experience is that it requires no prior knowledge of the environment's dynamics, yet can still attain optimal behavior. We will discuss intuitively simple but powerful Monte Carlo methods, as well as temporal difference learning methods including Q-learning. We will wrap up this course by looking at methods that combine model-based planning (as in dynamic programming) with temporal difference updates to drastically speed up learning.

When this course is over, you will be able to:

- Recognize Monte Carlo and Temporal-Difference Learning as two methods for estimating value functions from samples of experience.
- Recognize the value of exploration when using sampled experience rather than sweeps from dynamic programming within a model.
- Recognize the relationships between TD, Dynamic Programming, and Monte Carlo.
- Use the TD algorithm to implement and apply value function estimation
- Put Expected Sarsa and Q-learning (two TD methods for control) into practice and apply them
- Recognize the distinction between on-policy and off-policy control
- Gain knowledge about planning through simulated experience (as opposed to classic planning strategies)
- Use Dyna, a model-based approach to RL that relies on simulated experience.
- Carry out an empirical investigation to determine whether using Dyna improves sample efficiency.
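The TD value-estimation objective above can be illustrated with a minimal tabular sketch (our own toy encoding of episodes as (state, reward, next-state) transitions, not course code):

```python
def td0_prediction(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: move V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = {}
    for episode in episodes:
        # episode: list of (state, reward, next_state) transitions;
        # next_state is None when the episode terminates (terminal value 0).
        for s, r, s_next in episode:
            v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
    return V
```

Run on repeated copies of a fixed episode, the estimates converge to the true expected returns, no model of the dynamics required.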

**SKILLS YOU WILL GAIN**

- Artificial Intelligence (AI)
- Machine Learning
- Reinforcement Learning
- Function Approximation
- Intelligent Systems

**Course Apply Link – Sample-based Learning Methods Coursera**

**Sample-based Learning Methods Coursera Quiz Answers**

### Week 01 Quiz Answers

Q1. Which approach ensures continual (never-ending) exploration? (**Select all that apply**)

- Exploring starts
- On-policy learning with a **deterministic** policy
- On-policy learning with an *ϵ*-soft policy
- Off-policy learning with an *ϵ*-soft behavior policy and a **deterministic** target policy
- Off-policy learning with an *ϵ*-soft target policy and a **deterministic** behavior policy

Q2. When can Monte Carlo methods, as defined in the course, be applied? (Select all that apply)

- When the problem is **continuing** and given a batch of data containing sequences of states, actions, and rewards
- When the problem is **continuing** and there is a model that produces samples of the next state and reward
- When the problem is **episodic** and given a batch of data containing sample episodes (sequences of states, actions, and rewards)
- When the problem is **episodic** and there is a model that produces samples of the next state and reward

Q3. Which of the following learning settings are examples of off-policy learning? (Select all that apply)

- Learning the optimal policy while continuing to explore
- Learning from data generated by a human expert

Q4. If a trajectory starts at time t and ends at time T, what is its relative probability under the target policy π and the behavior policy b?

Hint: pay attention to the time subscripts of A and S in the answers below.

Hint: sums and products are not the same thing!

- \prod_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}
- \sum_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}
- \frac{\pi(A_{T-1}\mid S_{T-1})}{b(A_{T-1}\mid S_{T-1})}
- \frac{\pi(A_t\mid S_t)}{b(A_t\mid S_t)}
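Q4 hinges on the product-vs-sum distinction: the relative probability of a trajectory is the product of the per-step probability ratios. A hypothetical sketch, with policies encoded as nested dicts of action probabilities (our own encoding, not course code):

```python
def importance_sampling_ratio(trajectory, pi, b):
    """Relative trajectory probability under pi vs. b: the product (not sum!)
    of the per-step ratios pi(A_k | S_k) / b(A_k | S_k), k = t, ..., T-1."""
    rho = 1.0
    for s, a in trajectory:  # (S_k, A_k) pairs
        rho *= pi[s][a] / b[s][a]
    return rho

# Hypothetical policies: pi always goes left, b is uniform random.
pi = {"s": {"left": 1.0, "right": 0.0}}
b = {"s": {"left": 0.5, "right": 0.5}}
```

Two "left" steps give a ratio of (1.0/0.5)² = 4, while any trajectory containing an action the target policy never takes gets ratio 0.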

Q5. When is it possible to determine a policy that is greedy with respect to the value functions v_π, q_π for the policy π? (Select all that apply)

- When state values v_π and a model are available
- When state values v_π are available but no model is available
- When action values q_π and a model are available
- When action values q_π are available but no model is available

Q6. Monte Carlo methods in Reinforcement Learning work by…

Hint: recall we used the term *sweep* in dynamic programming to discuss updating all the states systematically. This is **not** the same as visiting a state.

- Performing **sweeps** through the state set
- Averaging sample returns
- Averaging sample rewards
- **Planning** with a model of the environment

Q7. Suppose the state s has been visited three times, with corresponding returns 8, 4, and 3. What is the current Monte Carlo estimate for the value of s?

- 3
- 15
- 5
- 3.5
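The Monte Carlo estimate is simply the average of the returns observed from a state; taking the three returns in Q7 to be 8, 4, and 3:

```python
returns = [8, 4, 3]                        # returns observed from state s
mc_estimate = sum(returns) / len(returns)  # 15 / 3 = 5.0
```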

Q8. When does Monte Carlo prediction perform its first update?

- After the first time step
- After every state is visited at least once
- At the end of the first episode

Q9. In Monte Carlo prediction of state-values, **memory** requirements depend on: (Select all that apply)

Hint: think of the two data structures used in the algorithm

- The number of states
- The number of possible actions in each state
- The length of episodes

Q10. In an *ϵ*-greedy policy over \mathcal{A} actions, what is the probability of the highest-valued action if there are no other actions with the same value?

- 1-\epsilon
- \epsilon
- 1-\epsilon+\frac{\epsilon}{\mathcal{A}}
- \frac{\epsilon}{\mathcal{A}}

**Quiz Answers Sample Based Learning Week 1:** *https://technorj.com/wp-content/uploads/2022/07/Quiz_-Graded-Quiz.pdf*
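The reasoning behind Q10: every action receives an equal ϵ/𝒜 share of the exploration mass, and the greedy action additionally receives the remaining 1 − ϵ. A small sketch (our own helper function, not course code):

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under an epsilon-greedy policy:
    the greedy action gets 1 - epsilon + epsilon/|A|,
    every other action gets epsilon/|A|."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n      # exploration mass, split equally
    probs[greedy] += 1.0 - epsilon  # leftover mass goes to the greedy action
    return probs
```

For example, with three actions and ϵ = 0.3, the greedy action gets 1 − 0.3 + 0.3/3 = 0.8 and each other action gets 0.1.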

### Week 02 Quiz Answers

Q1. TD(0) is a solution method for:

- Control
- Prediction

Q2. Which of the following methods use bootstrapping? (Select all that apply)

- Dynamic Programming
- Monte Carlo
- TD(0)

Q3. Which of the following is the correct characterization of Dynamic Programming (DP) and Temporal Difference (TD) methods?

- Both TD methods and DP methods require a model: the dynamics function p.
- Neither TD methods nor DP methods require a model: the dynamics function p.
- TD methods require a model, the dynamics function p, but Monte-Carlo methods do not.
- DP methods require a model, the dynamics function p, but TD methods do not.

Q4. Match the algorithm name to its correct update (**select all that apply**)

- TD(0): V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]
- Monte Carlo: V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]
- TD(0): V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
- Monte Carlo: V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

Q5. Which of the following correctly describe Temporal Difference (TD) and Monte-Carlo (MC) methods?

- TD methods can be used in *continuing* tasks.
- MC methods can be used in *continuing* tasks.
- TD methods can be used in *episodic* tasks.
- MC methods can be used in *episodic* tasks.

Q6. In an episodic setting, we might have different updates depending on whether the next state is terminal or non-terminal. Which of the following TD error calculations are correct?

- S_{t+1} is non-terminal: \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
- S_{t+1} is non-terminal: \delta_t = R_{t+1} - V(S_t)
- S_{t+1} is terminal: \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), with V(S_{t+1}) = 0
- S_{t+1} is terminal: \delta_t = R_{t+1} - V(S_t)

Q7. Suppose we have current estimates for the value of two states, V(A) = 1.0 and V(B) = 1.0, in an episodic setting. We observe the following trajectory: A, 0, B, 1, B, 0, T, where T is a terminal state. Apply TD(0) with step-size α = 1 and discount factor γ = 0.5. What are the value estimates for state A and state B at the end of the episode?

- (1.0, 1.0)
- (0.5, 0)
- (0, 1.5)
- (1, 0)
- (0, 0)
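Q7's arithmetic can be replayed step by step with the TD(0) update (a sketch using our own transition encoding, not course code):

```python
def td0_episode(V, transitions, alpha, gamma):
    """Apply the TD(0) update V(S_t) += alpha * (target - V(S_t)) along one
    episode; next_state None marks the terminal state, whose value is 0."""
    for s, r, s_next in transitions:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])
    return V

V = {"A": 1.0, "B": 1.0}
# Trajectory A, 0, B, 1, B, 0, T encoded as (S_t, R_{t+1}, S_{t+1}) triples
td0_episode(V, [("A", 0, "B"), ("B", 1, "B"), ("B", 0, None)],
            alpha=1.0, gamma=0.5)
# V is now {"A": 0.5, "B": 0.0}
```

Because α = 1, each update moves the estimate all the way to its target: A goes to 0 + 0.5·1 = 0.5, B first goes to 1 + 0.5·1 = 1.5, and the terminal transition then drives B to 0.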

Q8. Which of the following pairs is the correct characterization of the targets used in TD(0) and Monte Carlo?

- TD(0): High Variance Target, Monte Carlo: High Variance Target
- TD(0): High Variance Target, Monte Carlo: Low Variance Target
- TD(0): Low Variance Target, Monte Carlo: High Variance Target
- TD(0): Low Variance Target, Monte Carlo: Low Variance Target

Q9. Suppose you observe the following episodes of the form (State, Reward, …) from a Markov Decision Process with states A and B:

| Episodes |
|---|
| A, 0, B, 0 |
| B, 1 |
| B, 1 |
| B, 1 |
| B, 0 |
| B, 0 |
| B, 1 |
| B, 0 |

What would batch Monte Carlo methods give for the estimates V(A) and V(B)? What would batch TD(0) give for the estimates V(A) and V(B)? Use a discount factor γ of 1.

For batch MC: compute the average of the returns observed from each state. For batch TD: you can start with state B; what is its expected return? Then figure out V(A) using the temporal difference equation: V(S_t) = E[R_{t+1} + \gamma V(S_{t+1})].

Answers are provided in the following format:

- V^\text{batch-MC}(A) is the value of state A under Monte Carlo learning
- V^\text{batch-MC}(B) is the value of state B under Monte Carlo learning
- V^\text{batch-TD}(A) is the value of state A under TD learning
- V^\text{batch-TD}(B) is the value of state B under TD learning

Hint: review example 6.3 in Sutton and Barto; this question is the same, just with different numbers.

- V^\text{batch-MC}(A)=0, V^\text{batch-MC}(B)=0.5, V^\text{batch-TD}(A)=0.5, V^\text{batch-TD}(B)=0.5
- V^\text{batch-MC}(A)=0, V^\text{batch-MC}(B)=0.5, V^\text{batch-TD}(A)=0, V^\text{batch-TD}(B)=0.5
- V^\text{batch-MC}(A)=0, V^\text{batch-MC}(B)=0.5, V^\text{batch-TD}(A)=0, V^\text{batch-TD}(B)=0
- V^\text{batch-MC}(A)=0, V^\text{batch-MC}(B)=0.5, V^\text{batch-TD}(A)=1.5, V^\text{batch-TD}(B)=0.5
- V^\text{batch-MC}(A)=0.5, V^\text{batch-MC}(B)=0.5, V^\text{batch-TD}(A)=0.5, V^\text{batch-TD}(B)=0.5
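The batch values in Q9 can be double-checked numerically. The sketch below (our own encoding of the eight episodes, γ = 1) averages observed returns for batch MC, and iterates the expected one-step update to reach the batch TD(0) fixed point V(s) = E[R + V(S')]:

```python
# The eight episodes as (state, reward) sequences; gamma = 1 throughout.
episodes = [[("A", 0), ("B", 0)],
            [("B", 1)], [("B", 1)], [("B", 1)], [("B", 0)],
            [("B", 0)], [("B", 1)], [("B", 0)]]

def batch_mc(episodes):
    """Batch MC: average the return observed from each visit to a state."""
    returns = {}
    for ep in episodes:
        rewards = [r for _, r in ep]
        for i, (s, _) in enumerate(ep):
            returns.setdefault(s, []).append(sum(rewards[i:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}

def batch_td(episodes):
    """Batch TD(0) fixed point: V(s) = E[R + V(S')], terminal value 0."""
    V = {s: 0.0 for ep in episodes for s, _ in ep}
    for _ in range(100):  # iterate the expected update to convergence
        totals = {s: [0.0, 0] for s in V}
        for ep in episodes:
            for i, (s, r) in enumerate(ep):
                nxt = V[ep[i + 1][0]] if i + 1 < len(ep) else 0.0
                totals[s][0] += r + nxt
                totals[s][1] += 1
        V = {s: tot / n for s, (tot, n) in totals.items()}
    return V
```

The two methods disagree on V(A): batch MC only sees A's single observed return, while batch TD exploits the Markov structure and bootstraps from V(B).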

Q10. True or False: “Both TD(0) and Monte-Carlo (MC) methods converge to the true value function asymptotically, given that the environment is Markovian.”

- True
- False

Q11. Which of the following pairs is the correct characterization of the TD(0) and Monte-Carlo (MC) methods?

- Both TD(0) and MC are offline methods.
- Both TD(0) and MC are online methods.
- TD(0) is an online method while MC is an offline method.
- MC is an online method while TD(0) is an offline method.

**Sample-based Learning Methods Coursera Quiz Answers Week 2:** *https://technorj.com/wp-content/uploads/2022/07/Week-2.pdf*

### Week 03 Quiz Answers

Q1. What is the target policy in Q-learning?

- *ϵ*-greedy with respect to the current action-value estimates
- Greedy with respect to the current action-value estimates

Q2. Which Bellman equation is the basis for the Q-learning update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values

Q3. Which Bellman equation is the basis for the Sarsa update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values

Q4. Which Bellman equation is the basis for the Expected Sarsa update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values
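Questions 2–4 come down to which Bellman equation supplies the update target. A comparative sketch of the three targets, using a hypothetical dict-of-dicts encoding of Q and the policy (not course code):

```python
def sarsa_target(Q, s2, a2, r, gamma):
    """Sarsa: bootstrap from the action actually taken next
    (sample of the Bellman equation for action values)."""
    return r + gamma * Q[s2][a2]

def q_learning_target(Q, s2, r, gamma):
    """Q-learning: bootstrap from the maximal next action
    (sample of the Bellman optimality equation for action values)."""
    return r + gamma * max(Q[s2].values())

def expected_sarsa_target(Q, s2, policy, r, gamma):
    """Expected Sarsa: bootstrap from the expectation over next actions
    under the policy (Bellman equation for action values, expectation
    computed exactly rather than sampled)."""
    return r + gamma * sum(policy[s2][a] * Q[s2][a] for a in Q[s2])
```

Expected Sarsa's sum over actions is why its update costs more computation per step (Q5) but has a lower-variance target than Sarsa's single sampled next action (Q6).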

Q5. Which algorithm’s update requires more computation per step?

- Expected Sarsa
- Sarsa

Q6. Which algorithm has a higher variance target?

- Expected Sarsa
- Sarsa

Q7. Q-learning does not learn about the outcomes of exploratory actions.

- True
- False

Q8. Sarsa, Q-learning, and Expected Sarsa have similar targets on a transition to a terminal state.

- True
- False

Q9. Sarsa needs to wait until the end of an episode before performing its update.

- True
- False

**Sample-based Learning Methods Coursera Quiz Answers Week 3:** https://technorj.com/wp-content/uploads/2022/07/week-3.pdf

### Week 04 Quiz Answers

Q1. Which of the following are the most accurate characterizations of sample models and distribution models? (Select all that apply)

- Both sample models and distribution models can be used to obtain a possible next state and reward, given the current state and action.
- A distribution model can be used as a sample model.
- A sample model can be used to compute the probability of all possible trajectories in an episodic task based on the current state and action.
- A sample model can be used to obtain a possible next state and reward given the current state and action, whereas a distribution model can only be used to compute the probability of this next state and reward given the current state and action.

Q2. Which of the following statements are TRUE for Dyna architecture? (Select all that apply)

- Real experience can be used to improve the value function and policy
- Simulated experience can be used to improve the model
- Real experience can be used to improve the model
- Simulated experience can be used to improve the value function and policy

Q3. Mark all the statements that are TRUE for the tabular Dyna-Q algorithm. (Select all that apply)

- The memory requirements for the model in case of a deterministic environment are quadratic in the number of states
- The environment is assumed to be deterministic.
- The algorithm **cannot** be extended to stochastic environments.
- For a given state-action pair, the model predicts the next state and reward.

Q4. Which of the following statements are TRUE? (Select all that apply)

- Model-based methods often suffer more from bias than model-free methods, because of inaccuracies in the model.
- Model-based methods like Dyna typically require more memory than model-free methods like Q-learning.
- When compared with model-free methods, model-based methods are relatively more sample efficient. They can achieve a comparable performance with comparatively fewer environmental interactions.
- The amount of computation per interaction with the environment is larger in the Dyna-Q algorithm (with non-zero planning steps) as compared to the Q-learning algorithm.

Q5. Which of the following is generally the most computationally expensive step of the Dyna-Q algorithm? Assume N>1 planning steps are being performed (e.g., N=20).

- Model learning (step e)
- Direct RL (step d)
- Action selection (step b)
- Planning (Indirect RL; step f)
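The steps named in Q5 can be seen in a minimal tabular Dyna-Q sketch (our own simplified encoding, assuming a deterministic model, as the tabular algorithm does):

```python
import random

def dyna_q_step(Q, model, s, a, r, s2, alpha, gamma, n_planning, rng=random):
    """One Dyna-Q iteration: direct RL (step d), model learning (step e),
    and planning (step f). s2 is None on a terminal transition."""
    def q_update(s, a, r, s2):
        best_next = max(Q[s2].values()) if s2 is not None else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

    q_update(s, a, r, s2)          # (d) direct RL: one Q-learning update
    model[(s, a)] = (r, s2)        # (e) model learning: remember the outcome
    for _ in range(n_planning):    # (f) planning: replay simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)
```

With N planning steps per real interaction, step (f) performs N extra Q-updates, which is why planning dominates the per-step computation for N = 20, exactly the trade-off Q4 and Q5 are probing.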

Q6. What are some possible reasons for a learned model to be inaccurate? (Select all that apply)

- The agent’s policy has changed significantly from the beginning of training.
- There is too much exploration (e.g., epsilon in epsilon-greedy exploration is set to a high value like 0.5)
- The environment has changed.
- The transition dynamics of the environment are stochastic, and only a few transitions have been experienced.

Q7. In search control, which of the following methods is likely to make a Dyna agent perform better in problems with a large number of states (like the rod maneuvering problem in Chapter 8 of the textbook)? Recall that search control is the process that selects the starting states and actions in planning. Also recall the navigation example in the video lectures in which a large number of wasteful updates were being made because of the basic search control procedure in the Dyna-Q algorithm. (Select the best option)

- Select state-action pairs uniformly at random from all previously experienced pairs.
- Start backwards from state-action pairs that have had a non-zero update (e.g., from the state right beside a goal state). This avoids the otherwise wasteful computations from state-action pairs which have had no updates.
- Start with state-action pairs enumerated in a fixed order (e.g., in a gridworld, states top-left to bottom-right, actions up, down, left, right)
- All of these are equally good/bad.

Q8. In the lectures, we saw how the Dyna-Q+ agent found the newly-opened shortcut in the shortcut maze, whereas the Dyna-Q agent didn’t. Which of the following implications drawn from the figure are TRUE? (Select all that apply)

- The Dyna-Q+ agent performs better than the Dyna-Q agent even in the first half of the experiment because of the increased exploration.
- The Dyna-Q agent can never discover shortcuts (i.e., when the environment changes to become better than it was before).
- The difference between Dyna-Q+ and Dyna-Q narrowed slightly over the first part of the experiment. This is because the Dyna-Q+ agent keeps exploring even when the environment isn’t changing.
- None of the above are true.

Q9. Consider the gridworld depicted in the diagram below. There are four actions corresponding to up, down, right, and left movements. Marked is the path taken by an agent in a single episode, ending at a location of high reward, marked by the G. In this example the values were all zero at the start of the episode, and all rewards were zero during the episode except for a positive reward at G.

- Now, which of the following figures best depicts the action values that would have increased by the end of the episode, using *one*-step Sarsa and *500*-step-planning Dyna-Q? (Select the best option)

Q10. Which of the following are planning methods? (Select all that apply)

- Dyna-Q
- Expected Sarsa
- Value Iteration
- Q-learning

**Sample-based Learning Methods Coursera Quiz Answers Week 4:** *https://technorj.com/wp-content/uploads/2022/07/week-4.pdf*

**Conclusion**

Hopefully, this article helped you find all the **week, final assessment, and peer-graded assessment answers of the Sample-based Learning Methods Coursera quiz** and grab some premium knowledge with less effort. If this article really helped you, make sure to share it with your friends on social media and let them know about this training. You can also check out our other course answers. So stay with us, guys; we will share a lot more free courses and their exam/quiz solutions, and follow our Techno-RJ **Blog** for more updates.