Hello Peers, today we are sharing the answers to all weekly assessments and quizzes for the Fundamentals of Reinforcement Learning course offered by Coursera, free of cost. This is a certification course open to every interested student.
In case you can't find this course for free, you can apply for financial aid to take it at no cost.
Check out this article – "How to Apply for Financial Aid?"
About Coursera
Coursera, one of the largest online learning platforms, offers thousands of free courses for students. These courses come from various recognized universities, where industry experts and professors teach in a clear and understandable way.
Here, you will find the Fundamentals of Reinforcement Learning exam answers, given below in bold color.
These answers were updated recently and are 100% correct answers for all weekly assessments and the final exam of the Fundamentals of Reinforcement Learning free certification course from Coursera.
Use Ctrl+F to find any question's answer. Mobile users can tap the three dots in their browser and use the "Find" option there to search for any question.
About Fundamentals of Reinforcement Learning Course
Reinforcement Learning is a subfield of Machine Learning and a general formalism for AI and automated decision-making. This course will teach you about statistical learning techniques where an agent takes actions and interacts with the world. Understanding the importance and challenges of learning agents that make decisions is very important today, as more and more companies are interested in interactive agents and intelligent decision-making.
This course shows you how Reinforcement Learning works and how it can be used. When this course is over, you'll be able to:
- Formalize problems as Markov Decision Processes
- Know basic exploration methods and the tradeoff between exploration and exploitation
- Learn about value functions as a general tool for making the best decisions.
- Understand how to use dynamic programming as a good way to solve an industrial control problem.
This course will teach you the main ideas behind Reinforcement Learning, which are the basis for both old and new algorithms in RL. After you finish this course, you’ll be able to use RL to solve real-world problems where you know or can figure out the MDP.
This is the first course in the Specialization in Reinforcement Learning.
WHAT YOU WILL LEARN
- Formalize problems as Markov Decision Processes
- Understand basic exploration methods and the exploration/exploitation tradeoff
- Understanding value functions, as a general-purpose tool for optimal decision-making
- Know how to implement dynamic programming as an efficient solution approach to an industrial control problem
SKILLS YOU WILL GAIN
- Artificial Intelligence (AI)
- Machine Learning
- Reinforcement Learning
- Function Approximation
- Intelligent Systems
Course Apply Link – Fundamentals of Reinforcement Learning
Fundamentals of Reinforcement Learning Quiz Answers
Week 1 Quiz Answers
Quiz 1: Sequential Decision-Making
Q1. What is the incremental rule (sample average) for action values?
- Q_{n+1}= Q_n + \frac{1}{n} [R_n + Q_n]
- Q_{n+1}= Q_n - \frac{1}{n} [R_n - Q_n]
- Q_{n+1}= Q_n + \frac{1}{n} [R_n - Q_n]
- Q_{n+1}= Q_n + \frac{1}{n} [Q_n]
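For a hands-on feel, here is a minimal Python sketch of the incremental sample-average rule from Q1, Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n]; the reward sequence is purely hypothetical:

```python
# Incremental sample-average update: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards observed for one action
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = q + (1.0 / n) * (r - q)  # nudge the estimate toward the newest reward
print(q)  # 1.0 -- identical to the plain average of the rewards
```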
Q2. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.
q_{n+1}=q_n+\alpha_n[R_n -q_n]
Given the estimate update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?
- 1.0
- 1/2
- 1/8
- 1 / (t - 1)
Q3. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.
q_{n+1}=q_n+\alpha_n[R_n -q_n]
Given the estimate update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?
- 1 / (t - 1)
- 1/2
- 1/8
- 1.0
Q4. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.
q_{n+1}=q_n+\alpha_n[R_n -q_n]
Given the estimate update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?
- 1.0
- 1/8
- 1/2
- 1 / (t - 1)
Q5. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.
q_{n+1}=q_n+\alpha_n[R_n -q_n]
Given the estimate update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?
- 1.0
- 1/2
- 1/8
- 1 / (t - 1)
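The plots for Q2-Q5 are not reproduced above, but a small sketch like the one below (with a made-up reward stream) shows how each candidate step size behaves: a step size of 1.0 jumps straight to the latest reward, constant fractions track a moving target with more or less smoothing, and a decaying 1/n step converges to the long-run sample average.

```python
# How the step size alpha shapes the update q <- q + alpha * (R - q).
# The reward stream is hypothetical; the quiz plots are not reproduced here.
rewards = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0]

def run(alpha_fn, q0=0.0):
    q, history = q0, []
    for n, r in enumerate(rewards, start=1):
        q = q + alpha_fn(n) * (r - q)
        history.append(round(q, 3))
    return history

print("alpha = 1.0 :", run(lambda n: 1.0))      # estimate equals the latest reward
print("alpha = 1/2 :", run(lambda n: 0.5))      # tracks recent rewards, smoothed
print("alpha = 1/8 :", run(lambda n: 0.125))    # drifts slowly toward the target
print("alpha = 1/n :", run(lambda n: 1.0 / n))  # converges to the sample average
```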
Q6. What is the exploration/exploitation tradeoff?
- The agent wants to explore to get more accurate estimates of its values. The agent also wants to exploit to get more reward. The agent cannot, however, choose to do both simultaneously.
- The agent wants to explore the environment to learn as much about it as possible about the various actions. That way once it knows every arm's true value it can choose the best one for the rest of the time.
- The agent wants to maximize the amount of reward it receives over its lifetime. To do so it needs to avoid the action it believes is worst to exploit what it knows about the environment. However to discover which arm is truly worst it needs to explore different actions which potentially will lead it to take the worst action at times.
Q7. Why did epsilon of 0.1 perform better over 1000 steps than epsilon of 0.01?
- The 0.01 agent did not explore enough. Thus it ended up selecting a suboptimal arm for longer.
- The 0.01 agent explored too much causing the arm to choose a bad action too often.
- Epsilon of 0.1 is the optimal value for epsilon in general.
Q8. If exploration is so great why did epsilon of 0.0 (a greedy agent) perform better than epsilon of 0.4?
- Epsilon of 0.0 is greedy, thus it will always choose the optimal arm.
- Epsilon of 0.4 doesn't explore often enough to find the optimal action.
- Epsilon of 0.4 explores too often that it takes many sub-optimal actions causing it to do worse over the long term.
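Questions 6-8 are about the exploration/exploitation tradeoff. As a rough, self-contained illustration (not the course's own testbed), the sketch below runs an epsilon-greedy agent on a hypothetical 10-armed bandit so you can compare epsilon values like 0.0, 0.01, 0.1, and 0.4 over 1000 steps yourself:

```python
import random

def run_bandit(epsilon, steps=1000, n_arms=10, seed=0):
    """Average reward of one epsilon-greedy agent on one randomly drawn bandit."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0, 1) for _ in range(n_arms)]  # hidden arm means
    q = [0.0] * n_arms  # action-value estimates
    n = [0] * n_arms    # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                   # explore a random arm
        else:
            a = max(range(n_arms), key=lambda i: q[i])  # exploit the current best
        r = rng.gauss(true_values[a], 1)                # noisy reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                       # incremental sample average
        total += r
    return total / steps

for eps in (0.0, 0.01, 0.1, 0.4):
    print(f"epsilon={eps}: average reward over 1000 steps = {run_bandit(eps):.2f}")
```

On a single seed the ordering can vary; averaging over many independent runs is what produces the typical result referenced in Q7 and Q8.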
Week 2 Quiz Answers
Quiz 1: MDPs Quiz Answers
Q1. The learner and decision maker is the _.
- Environment
- Reward
- State
- Agent
Q2. At each time step the agent takes an _.
- Action
- State
- Environment
- Reward
Q3. Imagine the agent is learning in an episodic problem. Which of the following is true?
- The number of steps in an episode is always the same.
- The number of steps in an episode is stochastic: each episode can have a different number of steps.
- The agent takes the same action at each step during an episode.
Q4. If the reward is always +1, what is the sum of the discounted infinite return when \gamma < 1?
G_t=\sum_{k=0}^{\infty} \gamma^{k}R_{t+k+1}
- G_t=\frac{1}{1-\gamma}
- G_t=\frac{\gamma}{1-\gamma}
- Infinity.
- G_t=1*\gamma^k
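For Q4, the return is a geometric series, since every reward is +1:

G_t = \sum_{k=0}^{\infty} \gamma^{k} \cdot 1 = 1 + \gamma + \gamma^{2} + \cdots = \frac{1}{1-\gamma}, \quad 0 \le \gamma < 1.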
Q5. How does the magnitude of the discount factor (\gamma) affect learning?
- With a larger discount factor the agent is more far-sighted and considers rewards farther into the future.
- The magnitude of the discount factor has no effect on the agent.
- With a smaller discount factor the agent is more far-sighted and considers rewards farther into the future.
Q6. Suppose \gamma=0.8 and we observe the following sequence of rewards: R_1 = -3, R_2 = 5, R_3 = 2, R_4 = 7, and R_5 = 1, with T=5. What is G_0? Hint: Work backwards and recall that G_t = R_{t+1} + \gamma G_{t+1}.
- 12
- -3
- 8.24
- 11.592
- 6.2736
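Following the hint in Q6, you can work backwards from G_5 = 0 (the episode ends at T = 5). A few lines of Python reproduce the arithmetic:

```python
# Work backwards through G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0 and T = 5.
gamma = 0.8
rewards = {1: -3, 2: 5, 3: 2, 4: 7, 5: 1}  # R_1 ... R_5 from the question

G = 0.0                              # G_5 = 0 at termination
for t in range(4, -1, -1):           # t = 4, 3, 2, 1, 0
    G = rewards[t + 1] + gamma * G   # G_t = R_{t+1} + gamma * G_{t+1}
print(G)                             # approximately 6.2736
```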
Q7. What does MDP stand for?
- Markov Decision Protocol
- Markov Decision Process
- Markov Deterministic Policy
- Meaningful Decision Process
Q8. Suppose reinforcement learning is being applied to determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals). The actions in such an application might be target temperatures and target stirring rates that are passed to lower-level control systems that, in turn, directly activate heating elements and motors to attain the targets. The states are likely to be thermocouple and other sensory readings, perhaps filtered and delayed, plus symbolic inputs representing the ingredients in the vat and the target chemical. The rewards might be moment-by-moment measures of the rate at which the useful chemical is produced by the bioreactor.
Notice that here each state is a list, or vector, of sensor readings and symbolic inputs, and each action is a vector consisting of a target temperature and a stirring rate.
Is this a valid MDP?
- Yes. Assuming the state captures the relevant sensory information (including historical values to account for sensor delays). It is typical of reinforcement learning tasks to have states and actions with such structured representations; the states might be constructed by processing the raw sensor information in a variety of ways.
- No. If the instantaneous sensor readings are non-Markov it is not an MDP: we cannot construct a state different from the sensor readings available on the current time-step.
Q9. Case 1: Imagine that you are a vision system. When you are first turned on for the day, an image floods into your camera. You can see lots of things, but not all things. You canβt see objects that are occluded, and of course you canβt see objects that are behind you. After seeing that first scene, do you have access to the Markov state of the environment?
Case 2: Imagine that the vision system never worked properly: it always returned the same static image, forever. Would you have access to the Markov state then? (Hint: Reason about P(S_{t+1} | S_t, \dots, S_0) when S_{t+1} = \text{AllWhitePixels}.)
- You have access to the Markov state in both Case 1 and 2.
- You have access to the Markov state in Case 1, but you donβt have access to the Markov state in Case 2.
- You donβt have access to the Markov state in Case 1, but you do have access to the Markov state in Case 2.
- You donβt have access to the Markov state in both Case 1 and 2.
Q10. What is the reward hypothesis?
- That all of what we mean by goals and purposes can be well thought of as the minimization of the expected value of the cumulative sum of a received scalar signal (called reward)
- That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
- Ignore rewards and find other signals.
- Always take the action that gives you the best reward at that point.
Q11. Imagine an agent in a maze-like gridworld. You would like the agent to find the goal as quickly as possible. You give the agent a reward of +1 when it reaches the goal, and the discount rate is 1.0, because this is an episodic task. When you run the agent, it finds the goal but does not seem to care how long it takes to complete each episode. How could you fix this? (Select all that apply)
- Give the agent a reward of 0 at every time step so it wants to leave.
- Set a discount rate less than 1 and greater than 0, like 0.9.
- Give the agent -1 at each time step.
- Give the agent a reward of +1 at every time step.
Q12. When may you want to formulate a problem as episodic?
- When the agent-environment interaction does not naturally break into sequences. Each new episode begins independently of how the previous episode ended.
- When the agent-environment interaction naturally breaks into sequences. Each sequence begins independently of how the episode ended.
Week 3 Quiz Answers
Quiz 1: [Practice] Value Functions and Bellman Equations Quiz Answers
Q1. A policy is a function which maps _ to _.
- Actions to probability distributions over values.
- Actions to probabilities.
- States to values.
- States to probability distributions over actions.
- States to actions.
Q2. The term "backup" most closely resembles the term _ in meaning.
- Value
- Update
- Diagram
Q3. At least one deterministic optimal policy exists in every Markov decision process.
- False
- True
Q4. The optimal state-value function:
- Is not guaranteed to be unique, even in finite Markov decision processes.
- Is unique in every finite Markov decision process.
Q5. Does adding a constant to all rewards change the set of optimal policies in episodic tasks?
- Yes, adding a constant to all rewards changes the set of optimal policies.
- No, as long as the relative differences between rewards remain the same, the set of optimal policies is the same.
Q6. Does adding a constant to all rewards change the set of optimal policies in continuing tasks?
- Yes, adding a constant to all rewards changes the set of optimal policies.
- No, as long as the relative differences between rewards remain the same, the set of optimal policies is the same.
Q7. Select the equation that correctly relates v_{\ast} to q_{\ast}. Assume \pi is the uniform random policy.
- v_{\ast}(s) = \max_a q_{\ast}(s, a)
- v_{\ast}(s) = \sum_{a, r, s'} \pi(a | s) p(s', r | s, a) [r + q_{\ast}(s')]
- v_{\ast}(s) = \sum_{a, r, s'} \pi(a | s) p(s', r | s, a) [r + \gamma q_{\ast}(s')]
- v_{\ast}(s) = \sum_{a, r, s'} \pi(a | s) p(s', r | s, a) q_{\ast}(s')
Q8. Select the equation that correctly relates q_{\ast} to v_{\ast} using the four-argument function p.
- q_{\ast}(s, a) = \sum_{s', r} p(s', r | s, a) [r + v_{\ast}(s')]
- q_{\ast}(s, a) = \sum_{s', r} p(s', r | s, a) \gamma [r + v_{\ast}(s')]
- q_{\ast}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\ast}(s')]
Q9. Write a policy \pi_{\ast} in terms of q_{\ast}.
- \pi_{\ast}(a|s) = q_{\ast}(s, a)
- \pi_{\ast}(a|s) = \max_{a'} q_{\ast}(s, a')
- \pi_{\ast}(a|s) = 1 \text{ if } a = \arg\max_{a'} q_{\ast}(s, a'), \text{ else } 0
Q10. Give an equation for some \pi_{\ast} in terms of v_{\ast} and the four-argument p.
- \pi_{\ast}(a|s) = \max_{a'} \sum_{s', r} p(s', r | s, a') [r + \gamma v_{\ast}(s')]
- \pi_{\ast}(a|s) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\ast}(s')]
- \pi_{\ast}(a|s) = 1 \text{ if } v_{\ast}(s) = \max_{a'} \sum_{s', r} p(s', r | s, a') [r + \gamma v_{\ast}(s')], \text{ else } 0
- \pi_{\ast}(a|s) = 1 \text{ if } v_{\ast}(s) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\ast}(s')], \text{ else } 0
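Questions 7-10 all revolve around the same one-step lookahead with the four-argument p. Below is a minimal sketch of extracting a greedy policy from v_{\ast}; the dictionary format p[s][a] -> list of (next state, reward, probability) triples and the two-state example are assumptions for illustration, not anything defined by the course:

```python
# One-step lookahead: q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v[s'])
def greedy_policy(v, p, gamma=0.9):
    policy = {}
    for s, actions in p.items():
        q = {
            a: sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(q, key=q.get)  # pi(s) = argmax_a q(s, a)
    return policy

# Tiny hypothetical two-state MDP: "go" from A earns +1 and moves to B.
p = {
    "A": {"stay": [("A", 0.0, 1.0)], "go": [("B", 1.0, 1.0)]},
    "B": {"stay": [("B", 0.0, 1.0)], "go": [("A", 0.0, 1.0)]},
}
v = {"A": 1.0, "B": 0.0}
print(greedy_policy(v, p))  # {'A': 'go', 'B': 'go'}
```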
Quiz 2: Value Functions and Bellman Equations Quiz Answers
Q1. A function which maps _ to _ is a value function. [Select all that apply]
- Values to states.
- State-action pairs to expected returns.
- States to expected returns.
- Values to actions.
Q2. Consider the continuing Markov decision process shown below. The only decision to be made is in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, \pi_{\text{left}} and \pi_{\text{right}}. Indicate the optimal policies if \gamma = 0? If \gamma = 0.9? If \gamma = 0.5? [Select all that apply]
- For \gamma = 0.9, \pi_{\text{left}}
- For \gamma = 0, \pi_{\text{left}}
- For \gamma = 0.9, \pi_{\text{right}}
- For \gamma = 0, \pi_{\text{right}}
- For \gamma = 0.5, \pi_{\text{left}}
- For \gamma = 0.5, \pi_{\text{right}}
Q3. Every finite Markov decision process has __. [Select all that apply]
- A stochastic optimal policy
- A unique optimal policy
- A deterministic optimal policy
- A unique optimal value function
Q4. The _ of the reward for each state-action pair, the dynamics function p, and the policy \pi is _ to characterize the value function v_{\pi}. (Remember that the value of a policy \pi at state s is v_{\pi}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [ r + \gamma v_{\pi}(s') ].)
- Mean; sufficient
- Distribution; necessary
Q5. The Bellman equation for a given policy \pi: [Select all that apply]
- Holds only when the policy is greedy with respect to the value function.
- Expresses the improved policy in terms of the existing policy.
- Expresses state values v(s) in terms of state values of successor states.
Q6. An optimal policy:
- Is not guaranteed to be unique, even in finite Markov decision processes.
- Is unique in every Markov decision process.
- Is unique in every finite Markov decision process.
Q7. The Bellman optimality equation for v_{\ast}: [Select all that apply]
- Expresses state values v_{\ast}(s) in terms of state values of successor states.
- Holds when the policy is greedy with respect to the value function.
- Expresses the improved policy in terms of the existing policy.
- Holds for v_{\pi}, the value function of an arbitrary policy \pi.
- Holds for the optimal state value function.
Q8. Give an equation for v_{\pi}
Q10. Let r(s,a) be the expected reward for taking action a in state s, as defined in equation 3.5 of the textbook. Which of the following are valid ways to re-express the Bellman equations, using this expected reward function? [Select all that apply]
- v_{\ast}(s) = \max_a [r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{\ast}(s')]
- q_{\pi}(s, a) = r(s, a) + \gamma \sum_{s'} \sum_{a'} p(s' | s, a) \pi(a' | s') q_{\pi}(s', a')
- v_{\pi}(s) = \sum_a \pi(a | s) [r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{\pi}(s')]
- q_{\ast}(s, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) \max_{a'} q_{\ast}(s', a')
Q11. Consider an episodic MDP with one state and two actions (left and right). The left action has stochastic reward 1 with probability p and 3 with probability 1-p. The right action has stochastic reward 0 with probability q and 10 with probability 1-q. What relationship between p and q makes the actions equally optimal?
- 7 + 3p = -10q
- 7 + 3p = 10q
- 7 + 2p = 10q
- 13 + 3p = -10q
- 13 + 2p = 10q
- 13 + 2p = -10q
- 13 + 3p = 10q
- 7 + 2p = -10q
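For Q11, a quick check is to equate the expected one-step rewards of the two actions (with a single state, the return is just the immediate reward):

\mathbb{E}[R_{\text{left}}] = 1 \cdot p + 3(1-p) = 3 - 2p, \qquad \mathbb{E}[R_{\text{right}}] = 0 \cdot q + 10(1-q) = 10 - 10q

Setting 3 - 2p = 10 - 10q and rearranging gives 10q = 7 + 2p, i.e. 7 + 2p = 10q.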
Week 4 Quiz Answers
Quiz 1: Dynamic Programming Quiz Answers
Q1. The value of any state under an optimal policy is _ the value of that state under a non-optimal policy. [Select all that apply]
- Strictly greater than
- Greater than or equal to
- Strictly less than
- Less than or equal to
Q2. If a policy is greedy with respect to the value function for the equiprobable random policy, then it is guaranteed to be an optimal policy.
- True
- False
Q3. Let v_{\pi}
- True
- False
Q4. What is the relationship between value iteration and policy iteration? [Select all that apply]
- Value iteration is a special case of policy iteration.
- Policy iteration is a special case of value iteration.
- Value iteration and policy iteration are both special cases of generalized policy iteration.
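To see why value iteration is a special case of generalized policy iteration, here is a minimal sketch, assuming the same hypothetical p[s][a] -> (next state, reward, probability) dictionary format used earlier: each sweep replaces full policy evaluation with a single greedy (max) backup.

```python
# Value iteration: v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
def value_iteration(p, gamma=0.9, theta=1e-6):
    v = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s, actions in p.items():
            best = max(
                sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:  # stop once a full sweep changes values only slightly
            return v
```

Truncating evaluation to one greedy sweep and immediately improving is exactly the sense in which both methods are instances of generalized policy iteration.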
Q5. The word synchronous means "at the same time". The word asynchronous means "not at the same time". A dynamic programming algorithm is: [Select all that apply]
- Asynchronous, if it does not update all states at each iteration.
- Synchronous, if it systematically sweeps the entire state space at each iteration.
- Asynchronous, if it updates some states more than others.
Q6. All Generalized Policy Iteration algorithms are synchronous.
- True
- False
Q7. Which of the following is true?
- Synchronous methods generally scale to large state spaces better than asynchronous methods.
- Asynchronous methods generally scale to large state spaces better than synchronous methods.
Q8. Why are dynamic programming algorithms considered planning methods? [Select all that apply]
- They compute optimal value functions.
- They learn from trial and error interaction.
- They use a model to improve the policy.
Q9. Consider the undiscounted, episodic MDP below. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The right half of the figure shows the value of each state under the equiprobable random policy. If \pi is the equiprobable random policy, what is q(7, down)?
- q(7, down) = -14
- q(7, down) = -20
- q(7, down) = -21
- q(7, down) = -15
Q10. Consider the undiscounted, episodic MDP below. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The right half of the figure shows the value of each state under the equiprobable random policy. If \pi is the equiprobable random policy, what is v(15)? Hint: Recall the Bellman equation v(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v(s')].
- v(15) = -25
- v(15) = -22
- v(15) = -24
- v(15) = -23
- v(15) = -21
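Q9 and Q10 rely on a gridworld figure that is not reproduced here. As a hedged sketch of where such state values come from, the following runs iterative policy evaluation for the equiprobable random policy on the standard undiscounted 4x4 gridworld from the textbook (states numbered 0-15 here, terminal corners, reward -1 per step); the exact figure and state numbering in the quiz may differ:

```python
# Iterative policy evaluation on an assumed 4x4 gridworld (Sutton & Barto, Example 4.1).
# Equiprobable random policy, reward -1 per step, undiscounted, terminal corner states.
def evaluate_random_policy(theta=1e-6):
    v = [0.0] * 16
    terminals = {0, 15}
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # up, down, right, left

    def step(s, move):
        row, col = divmod(s, 4)
        nrow, ncol = row + move[0], col + move[1]
        if 0 <= nrow < 4 and 0 <= ncol < 4:
            return nrow * 4 + ncol
        return s  # moves off the grid leave the state unchanged

    while True:
        delta = 0.0
        for s in range(16):
            if s in terminals:
                continue
            new_v = sum(0.25 * (-1.0 + v[step(s, m)]) for m in moves)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return [round(x, 1) for x in v]

print(evaluate_random_policy())  # reproduces the familiar -14 / -18 / -20 / -22 pattern
```

Because transitions are deterministic and undiscounted with reward -1, an action value such as q(s, down) is simply -1 plus the value of the state reached by moving down, which is the lookup Q9 asks for.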
Conclusion
Hopefully, this article helped you find all the weekly quiz, final assessment, and peer-graded assessment answers for the Fundamentals of Reinforcement Learning course on Coursera and pick up some premium knowledge with less effort. If this article really helped you in any way, then make sure to share it with your friends on social media and let them know about this amazing course. You can also check out our other course answers. So stay with us, guys; we will share many more free courses and their exam/quiz solutions, and follow our Techno-RJ Blog for more updates.