Hello Peers, today we are going to share **all weeks' assessment and quiz answers** of the **Probabilistic Graphical Models 2: Inference** course launched by **Coursera**, totally **free of cost**. This is a **certification course** for every interested student.

In case you didn't find this course for free, then you can **apply for financial aid** to get this course totally free.

*Check out this article – "How to Apply for Financial Aid?"*

**About Coursera**

**Coursera**, one of the **world's biggest learning platforms**, offers thousands of free courses for students. These courses come from various recognized universities, where industry experts and professors teach in a clear and understandable way.

Here, you will find the **Probabilistic Graphical Models 2: Inference Exam Answers** in **bold color**, which are given below.

These answers are updated recently and are the **100% correct** answers for all weeks, assessments, and the final exam of **Probabilistic Graphical Models 2: Inference** from the **Coursera Free Certification Course.**

Use "Ctrl+F" to find any question's answer. For mobile users, just tap the three dots in your browser and you will get a "Find" option there. Use this option to jump to any question's answer.

**About Probabilistic Graphical Models 2: Inference Course**

Joint (multivariate) distributions over large numbers of random variables with interactions are encoded in **probabilistic graphical models (PGMs)**, which provide a robust framework for modeling probability in general. These representations rely on ideas from **probability theory, graph algorithms, machine learning**, and other areas that lie at the crossroads of statistics and computer science.

The most cutting-edge techniques in fields as diverse as **medical diagnosis, image analysis, speech recognition, NLP**, and many others are founded on these ideas. In addition, they serve as a foundation for many machine learning problems.

This is the second of the three courses that make up the whole curriculum. While the first course covered representation, this one gets into the topic of probabilistic inference, or how a **PGM** may be used to provide answers to various queries. Even though **PGMs** typically describe distributions with very high dimensions, they are structured in a way that makes it possible to answer questions efficiently. This course introduces both exact and approximate techniques for various inference tasks and explores the appropriate contexts in which each should be used. As part of the (strongly suggested) honors track, students complete two practical programming tasks in which they implement key routines from some of the most popular exact and approximate algorithms in use today.

**SKILLS YOU WILL GAIN**

- Inference
- Gibbs Sampling
- Markov Chain Monte Carlo (MCMC)
- Belief Propagation

**Course Apply Link – Probabilistic Graphical Models 2: Inference**

**Probabilistic Graphical Models 2: Inference Quiz Answers**

### Week 1 Quiz Answers

#### Quiz 1: Variable Elimination

Q1. Intermediate Factors. Consider running variable elimination on the following Bayesian network over binary variables. Which of the nodes, if eliminated first, results in the largest intermediate factor? By largest factor, we mean the factor with the largest number of entries.

- X_3
- X_5
- X_2
- X_4

Q2. Elimination Orderings. Which of the following characteristics of the variable elimination algorithm are affected by the choice of elimination ordering? You may select 1 or more options.

- Runtime of the algorithm
- Which marginals can be computed correctly
- Memory usage of the algorithm
- Size of the largest intermediate factor

Q3. Marginalization. Suppose we run variable elimination on a Bayesian network where we eliminate all the variables in the network. What number will the algorithm produce?

Enter answer here

Q4. Marginalization. Suppose we run variable elimination on a Markov network where we eliminate all the variables in the network. What number will the algorithm produce?

- 1/Z, where Z is the partition function for the network.
- Z, the partition function for the network.
- A positive number, not necessarily between 0 and 1, which depends on the structure of the network.
- A positive number, always between 0 and 1, which depends on the structure of the network.
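As a sanity check for Q3 and Q4, here is a minimal Python sketch (using a made-up two-variable network, not one from the quiz) contrasting the two cases: summing out every variable of a Bayesian network yields exactly 1, while summing out every variable of an unnormalized Markov network yields its partition function Z.

```python
import itertools

# Hypothetical two-variable example: A -> B, both binary.
# Bayesian network CPDs: P(A) and P(B | A), keyed as (b, a).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}

# Eliminating (summing out) every variable of a BN gives exactly 1,
# because the joint distribution is already normalized.
bn_total = sum(p_a[a] * p_b_given_a[(b, a)]
               for a, b in itertools.product([0, 1], repeat=2))

# An arbitrary unnormalized Markov network factor over the same pair.
phi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

# Summing out every variable of an MN gives the partition function Z,
# a positive number that need not lie between 0 and 1.
mn_total = sum(phi[(a, b)] for a, b in itertools.product([0, 1], repeat=2))

print(bn_total)  # 1.0
print(mn_total)  # 5.0
```

The Markov-network total (here 5.0) depends on the factor entries, which is exactly why Z carries information while the Bayesian-network total never does.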

Q5. Intermediate Factors. If we perform variable elimination on the graph shown below with the variable ordering B,A,C,F,E,D, what is the intermediate factor produced by the third step (just before summing out C)?

- ψ(C,D,E,F)
- ψ(A,B,C,D,F)
- ψ(C,D,F)
- ψ(C,F)

Q6. Induced Graphs. If we perform variable elimination on the graph shown below with the variable ordering B,A,C,F,E,D, what is the induced graph for the run?

- None of these

Q7. *Time Complexity of Variable Elimination. Consider a Bayesian network taking the form of a chain of n variables, X_1 → X_2 → ⋯ → X_n, where each of the X_i can take on k values. What is the computational cost of running variable elimination on this network?

- O(nk^3)
- O(kn^2)
- O(k^n)
- O(nk^2)
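The chain case above can be checked with a small sketch (using randomly generated, hypothetical CPDs): eliminating the variables down the chain, each intermediate factor involves at most two variables, so each step costs O(k²) and the whole run costs O(nk²).

```python
import random

random.seed(0)
n, k = 6, 3  # chain X_1 -> ... -> X_6, each variable takes k values

# cpds[i][(x_prev, x)] = P(X_{i+1} = x | X_i = x_prev); cpds[0] is P(X_1).
cpds = [{(0, x): random.random() for x in range(k)}]
for i in range(1, n):
    cpds.append({(xp, x): random.random() for xp in range(k) for x in range(k)})
# Normalize each CPD so it sums to 1 for every parent assignment.
for i, cpd in enumerate(cpds):
    for xp in range(k if i > 0 else 1):
        z = sum(cpd[(xp, x)] for x in range(k))
        for x in range(k):
            cpd[(xp, x)] /= z

# Eliminate X_1, X_2, ... in order; the running message is a factor
# over a single variable, and each product touches only two variables.
largest_factor = 0
message = {x: cpds[0][(0, x)] for x in range(k)}
for i in range(1, n):
    largest_factor = max(largest_factor, 2)  # intermediate factor scope
    message = {x: sum(message[xp] * cpds[i][(xp, x)] for xp in range(k))
               for x in range(k)}

print(sum(message.values()))   # sums to 1, since this is a BN
print(largest_factor)          # intermediate factors never exceed 2 variables
```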

Q8. Time Complexity of Variable Elimination. Suppose we eliminate all the variables in a Markov network using the variable elimination algorithm. Which of the following could affect the runtime of the algorithm? You may select 1 or more options.

- Number of factors in the network
- Number of values each variable can take
- The values of the factor entries (assuming that all entries are still positive)

Q9. Intermediate Factors. If we perform variable elimination on the graph shown below with the variable ordering F,E,D,C,B,A, what is the intermediate factor produced by the third step (just before summing out D)?

- ψ(B,C,D,E,F)
- ψ(B,C,D,E)
- ψ(B,C,D)
- ψ(B,C)
- ψ(A,B,C,D)

### Week 2 Quiz Answers

#### Quiz 1: Message Passing in Cluster Graphs

Q1. Cluster Graph Construction. Consider the pairwise MRF, H, shown below with potentials over {A,B}, {B,C}, {A,D}, {B,E}, {C,F}, {D,E} and {E,F}.

Which of the following is/are valid cluster graph(s) for H? (A cluster graph is valid if it satisfies the running intersection property and family preservation. You may select 1 or more options).

Q2. Message Passing in a Cluster Graph.

Suppose we wish to perform inference over the Markov network M as shown below. Each of the variables X_i is binary, and the only potentials in the network are the pairwise potentials φ_{i,j}(X_i, X_j), with one potential for each pair of variables X_i, X_j connected by an edge in M. Which of the following expressions correctly computes the message δ_{3→6} that cluster C_3 will send to cluster C_6 during belief propagation? Assume that the variables in the sepsets are equal to the intersection of the variables in the adjacent cliques.

Q3. Message Passing Computation. Consider the Markov network M from the previous question. If the initial factors in the Markov network M are of the form shown in the table below, regardless of the specific values of i, j (we basically wish to encourage variables that are connected by an edge to share the same assignment), compute the message δ_{3→6}, assuming that it is the first message passed during loopy belief propagation. Assume that the messages are all initialized to the all-ones message, i.e., all the entries are initially set to 1.

Separate the entries of the message with spaces. Order the entries by lexicographic variable order (for example, if the message is over one variable X_i, list its values in increasing order of assignment).

Enter answer here

Q4. *Extracting Marginals at Convergence. Given that you can renormalize the messages at any point during belief propagation and still obtain correct marginals, consider the message δ_{3→6} that you computed. Use this observation to compute the final (and possibly approximate) marginal probability P(X_4 = 1, X_5 = 1) (X_4 and X_5 are the variables in the previous question) in cluster C_6 at convergence (as extracted from the cluster beliefs), giving your answer to 2 decimal places.

Enter answer here

Q5. Family Preservation. Suppose we have a factor P(A | C) that we wish to include in our sum-product message passing inference. We should:

- Assign the factor to all cliques that contain A or C
- Assign the factor to one clique that contains both A and C
- Assign the factor to one clique that contains A or C
- Assign the factor to all cliques that contain both A and C
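To make the mechanics of a single cluster-to-cluster message concrete, here is a minimal sketch. The two-cluster layout and the "agreement" potential values are assumptions for illustration only; they are not the quiz's (unshown) factor table.

```python
# Hypothetical pairwise "agreement" potential over two binary variables:
# it rewards x_i == x_j, as in the quiz's description.
def phi(xi, xj):
    return 10.0 if xi == xj else 1.0

# Suppose cluster C_a holds phi over (X1, X2) and its sepset with a
# neighbor C_b is {X2}. With all incoming messages initialized to 1,
# the first message to C_b just sums the cluster potential over the
# non-sepset variable X1.
delta_a_to_b = {}
for x2 in (0, 1):
    delta_a_to_b[x2] = sum(phi(x1, x2) for x1 in (0, 1))

print(delta_a_to_b)  # {0: 11.0, 1: 11.0}
```

Note the message is uniform here: with no evidence and a symmetric potential, summing out X1 gives the same mass for either value of X2.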

#### Quiz 2: Clique Tree Algorithm

Q1. Message Ordering. In the clique tree below, which of the following starting message-passing orders is/are valid? (Note: these are not necessarily full sweeps that result in calibration. You may select 1 or more options.)

- C_1→C_2, C_2→C_3, C_5→C_3, C_3→C_4
- C_4→C_3, C_3→C_2, C_2→C_1
- C_4→C_3, C_5→C_3, C_2→C_3
- C_1→C_2, C_2→C_3, C_3→C_4, C_3→C_5

Q2. Message Passing in a Clique Tree. In the clique tree above, what is the correct form of the message from clique 3 to clique 2, δ_{3→2}?

- ∑_{B,D,G,H} ψ_3(C_3) × δ_{4→3} × δ_{5→3}
- ∑_{G,H} ψ_3(C_3) × δ_{4→3} × δ_{5→3}
- ∑_{B,D} ψ_3(C_3) × δ_{4→3} × δ_{5→3}
- ∑_{B,D} ψ_3(C_3) × ∑_{D,H} (ψ_4(C_4) × δ_{4→3}) × ∑_{B,H} (ψ_5(C_5) × δ_{5→3})

Q3. Clique Tree Properties. Consider the following Markov network over the potentials φ_{A,B}, φ_{B,C}, φ_{A,D}, φ_{B,E}, φ_{C,F}, φ_{D,E}, φ_{E,F}.

Which of the following properties are necessary for a valid clique tree for the above network, but are NOT satisfied by this graph:

You may select 1 or more options.

- No loops
- Running intersection property
- Node degree less than or equal to 2
- Family preservation

Q4. Cluster Graphs vs. Clique Trees. Suppose that we ran sum-product message passing on a cluster graph G for a Markov network M and that the algorithm converged. Which of the following statements is true only if G is a clique tree and is not necessarily true otherwise?

- G is calibrated.
- If there are E edges in G, there exists a message ordering that guarantees convergence after passing 2E messages.
- All the options are true for cluster graphs in general.
- The sepsets in G are the product of the two messages passed between the clusters adjacent to the sepset.
- The beliefs and sepsets of G can be used to compute the joint distribution defined by the factors of M.

Q5. Clique Tree Calibration. Which of the following is true? You may select more than one option.

- If there exists a pair of adjacent cliques that are max-calibrated, then a clique tree is max-calibrated.
- After we complete one upward pass of the max-sum message passing algorithm, the clique tree is max-calibrated.
- If a clique tree is max-calibrated, then all pairs of cliques are max-calibrated.
- If a clique tree is max-calibrated, then all pairs of adjacent cliques are max-calibrated.

### Week 3 Quiz Answers

#### Quiz 1: MAP Message Passing

Q1. **Real-World Applications of MAP Estimation.** Suppose that you are in charge of setting up a soccer league for a bunch of kindergarten kids, and your job is to split the N children into K teams. The parents are very controlling and also uptight about which friends their kids associate with. So some of them bribe you to set up the teams in certain ways.

The parents' bribes can take two forms: For some children i, the parent says "I will pay you A_{ij} dollars if you put my kid i on the same team as kid j"; in other cases, the parent of child i says "I will pay you B_i dollars if you put my kid on team k." In our notation, this translates to the factor f_{i,j}(x_i, x_j) = A_{ij} · 1{x_i = x_j} or g_i(x_i) = B_i · 1{x_i = k}, respectively, where x_i is the assigned team of child i and 1{} is the indicator function. More formally, if we define x_i to be the assigned team of child i, the amount of money you get for the first type of bribe will be f_{i,j}(x_i, x_j).

Being greedy and devoid of morality, you want to make as much money as possible from these bribes. What are you trying to find?

- argmax_x̄ ∑_i g_i(x_i) + ∑_{i,j} f_{i,j}(x_i, x_j)
- argmax_x̄ ∏_i g_i(x_i) · ∏_{i,j} f_{i,j}(x_i, x_j)
- argmax_x̄ ∑_i g_i(x_i)
- argmax_x̄ ∏_i g_i(x_i)

Q2. ***Decoding MAP Assignments.** You want to find the optimal solution to the above problem using a clique tree over a set of factors φ. How could you accomplish this such that you are guaranteed to find the optimal solution? (Ignore issues of tractability, and assume that if you specify a set of factors φ, you will be given a valid clique tree of minimum tree width.)

- Set φ_{i,j} = f_{i,j}, φ_i = g_i, get the clique tree over this set of factors, run max-sum message passing on this clique tree, and decode the marginals.
- Set φ_{i,j} = f_{i,j}, φ_i = g_i, get the clique tree, run sum-product message passing, and decode the marginals.
- Set φ_{i,j} = exp(f_{i,j}), φ_i = exp(g_i), get the clique tree over this set of factors, run max-sum message passing on this clique tree, and decode the marginals.
- Set φ_{i,j} = exp(f_{i,j}), φ_i = exp(g_i), get the clique tree, run sum-product message passing, and decode the marginals.
- The optimal solution is not guaranteed to be found in this manner using clique trees.
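The exp trick that appears in these options can be sanity-checked numerically: because exp is monotonic, the assignment maximizing the sum of bribes equals the assignment maximizing the product of exponentiated factors. Below is a toy instance with made-up bribe values (2 kids, 2 teams); the numbers are assumptions, not from the quiz.

```python
import itertools
import math

# Toy instance (made-up numbers): 2 kids, 2 teams.
A = {(0, 1): 5.0}               # bribe if kid 0 and kid 1 share a team
B = {0: {0: 1.0}, 1: {1: 3.0}}  # bribe if kid i is put on team k

def f(i, j, xi, xj):
    return A.get((i, j), 0.0) if xi == xj else 0.0

def g(i, xi):
    return B.get(i, {}).get(xi, 0.0)

def score_sum(x):   # total money collected: sum of all bribes paid out
    return sum(g(i, x[i]) for i in range(2)) + f(0, 1, x[0], x[1])

def score_prod(x):  # product of exponentiated factors
    return math.exp(score_sum(x))  # exp turns the sum into a product

assignments = list(itertools.product([0, 1], repeat=2))
best_by_sum = max(assignments, key=score_sum)
best_by_prod = max(assignments, key=score_prod)
print(best_by_sum, best_by_prod)  # the two argmaxes coincide
```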

#### Quiz 2: Sampling Methods

Q1. Forward Sampling. One strategy for obtaining an estimate of the conditional probability P(y | e) is to use forward sampling to estimate P(y, e) and P(e) separately and then compute the ratio. We can use the Hoeffding bound to obtain a bound on both the numerator and the denominator. Assume M is large. When does the resulting bound provide meaningful guarantees? Think about the difference between the true value and our estimate. Recall that we need M ≥ ln(2/δ) / (2ε²) samples to get an additive error bound ε that holds with probability 1 − δ for our estimate.

- It always provides meaningful guarantees.
- It provides a meaningful guarantee, but only when δ is small relative to P(e) and P(y, e).
- It provides a meaningful guarantee, but only when ε is small relative to P(e) and P(y, e).
- It never provides a meaningful guarantee.
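The Hoeffding sample bound mentioned above is easy to compute directly, and doing so shows why ε must be small relative to the probability being estimated: shrinking ε to match a rare event drives the required sample count up sharply. The specific P(e) ≈ 0.01 scenario below is an illustrative assumption.

```python
import math

def hoeffding_samples(eps, delta):
    """Samples needed so an empirical mean of [0, 1] variables is within
    eps of the truth with probability at least 1 - delta (Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# If P(e) is around 0.01, an additive error of eps = 0.1 is useless:
# the estimate of P(e) could be off by 10x its own magnitude. We need
# eps well below P(e), which inflates M quadratically in 1/eps.
m_loose = hoeffding_samples(eps=0.1, delta=0.05)    # eps >> P(e)
m_tight = hoeffding_samples(eps=0.001, delta=0.05)  # eps << P(e)
print(m_loose, m_tight)  # 185 vs. 1844440
```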

Q2. Rejecting Samples. Consider the process of rejection sampling to generate samples from the posterior distribution P(X | e). If we want to obtain M samples, what is the expected number of samples that would need to be drawn from P(X)?

- M · (1 − P(e))
- M · P(e)
- M / P(e)
- M / (1 − P(e))
- M · P(X | e)
- M · (1 − P(X | e))
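An empirical check of the rejection-sampling cost: drawing until M samples are accepted takes about M / P(e) draws on average. The network is reduced here to a single hypothetical evidence variable with P(e) = 0.25, which is all that matters for the counting argument.

```python
import random

random.seed(42)

# Hypothetical network: binary evidence variable E with P(e) = 0.25.
p_e = 0.25
M = 2000  # number of accepted samples we want

draws = 0
accepted = 0
while accepted < M:
    draws += 1
    # Forward-sample the network; keep the sample only if E matches
    # the evidence, which happens with probability P(e).
    if random.random() < p_e:
        accepted += 1

expected = M / p_e  # rejection sampling needs about M / P(e) draws
print(draws, expected)
```

The observed draw count concentrates around M / P(e) = 8000; the smaller P(e) is, the more wasteful rejection sampling becomes.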

Q3. Stationary Distributions. Consider the simple Markov chain shown in the figure below. By definition, a stationary distribution π for this chain must satisfy which of the following properties? You may select 1 or more options.

Q4. *Gibbs Sampling in a Bayesian Network. Suppose we have the Bayesian network shown in the image below.

If we are sampling the variable X_23 as a substep of Gibbs sampling, what is the closed-form equation for the distribution we should use over the value x'_23? By closed form, we mean that all computations such as summations are tractable and that we have access to all terms without requiring extra computation.

- P(x'_23 | x_22, x_24) · P(x_15 | x'_23, x_14, x_9, x_25)
- P(x'_23 | x_−23), where x_−23 denotes all variables except x_23
- P(x'_23 | x_22, x_24)
- P(x'_23 | x_22, x_24) · P(x_15 | x'_23, x_14, x_9, x_25), divided by ∑_{x''_9, x''_14, x''_22, x''_24, x''_25} P(x'_23 | x''_22, x''_24) · P(x''_15 | x'_23, x''_14, x''_9, x''_25)
- P(x'_23 | x_22, x_24) · P(x_15 | x'_23, x_14, x_9, x_25), divided by ∑_{x''_23} P(x''_23 | x_22, x_24) · P(x_15 | x''_23, x_14, x_9, x_25)

Q5. Gibbs Sampling. Suppose we are running the Gibbs sampling algorithm on the Bayesian network X → Y → Z. If the current sample is ⟨x_0, y_0, z_0⟩ and we are resampling Y, from which distribution should the new value y_1 be drawn?

- P(y_1 | x_0, z_0)
- P(x_0, z_0 | y_1)
- P(x_0, y_1, z_0)
- P(y_1 | x_0)
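For the chain X → Y → Z, the Gibbs resampling distribution P(Y | x_0, z_0) is proportional to P(Y | x_0) · P(z_0 | Y), since those are the only CPDs that mention Y. A minimal sketch with made-up CPD values (not from the quiz):

```python
# Made-up CPDs for the chain X -> Y -> Z, all variables binary.
p_y_given_x = {0: [0.9, 0.1], 1: [0.3, 0.7]}  # p_y_given_x[x][y]
p_z_given_y = {0: [0.8, 0.2], 1: [0.4, 0.6]}  # p_z_given_y[y][z]

def gibbs_dist_for_y(x0, z0):
    # P(Y | x0, z0) is proportional to P(Y | x0) * P(z0 | Y):
    # only the factors containing Y survive; everything else cancels.
    unnorm = [p_y_given_x[x0][y] * p_z_given_y[y][z0] for y in (0, 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

dist = gibbs_dist_for_y(x0=0, z0=1)
print(dist)  # a proper distribution over y in {0, 1}
```

With these numbers the result is [0.75, 0.25]: conditioning on both neighbors of Y in the chain is exactly what distinguishes the correct answer from P(y_1 | x_0) alone.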

Q6. Collecting Samples. Assume we have a Markov chain that we have run for a sufficient burn-in time, and we now wish to collect samples and use them to estimate the probability of some event. Can we use consecutive samples from the chain to do so?

- No, once we collect one sample, we have to continue running the chain in order to "re-mix" it before we get another sample.
- Yes, and if we collect m consecutive samples, we can use the Hoeffding bound to provide (high-probability) bounds on the error in our estimated probability.
- Yes, that would give a correct estimate of the probability. However, we cannot apply the Hoeffding bound to estimate the error in our estimate.
- No, Markov chains are only good for one sample; we have to restart the chain (and burn-in) before we can collect another sample.

Q7. Markov Chain Mixing. Which of the following classes of chains would you expect to have the shortest mixing time in general?

- Markov chains for networks with nearly deterministic potentials.
- Markov chains with distinct regions in the state space that are connected by low probability transitions.
- Markov chains with many distinct and peaked probability modes.
- Markov chains where state spaces are well connected and transitions between states have high probabilities.

#### Quiz 3: Sampling Methods PA Quiz

Q1. This quiz is a companion quiz to the Sampling Methods Programming Assignment. Please refer to the writeup for the programming assignment for instructions on how to complete this quiz.

Let's run an experiment using our Gibbs sampling method. As before, use the toy image network and set the on-diagonal weight of the pairwise factor (in ConstructToyNetwork.m) to be 1.0 and the off-diagonal weight to be 0.1. Now run Gibbs sampling a few times, first initializing the state to be all 1's and then initializing the state to be all 2's. What effect does the initial assignment have on the accuracy of Gibbs sampling? Why does this effect occur?

- The initial state is not an important factor in our result as Gibbs can make large moves of multiple variables to quickly escape this bad state.
- The initial state has a significant impact on the result of our sampling, which makes sense as strong correlation makes mixing time long and we remain close to the initial assignment for a long time.
- The initial state has a significant impact on the result as, though our chain mixes quickly, it will mix to a distribution far from the actual distribution and close to the initial assignment.
- The initial state has a significant impact on the result of our sampling as Gibbs will never switch variables because the pairwise potentials enforce strong agreement so we are in a local optima.

Q2. Set the on-diagonal weight of our toy image network to 1 and off-diagonal weight to .2. Now visualize multiple runs with each of Gibbs, MHUniform, Swendsen-Wang variant 1, and Swendsen-Wang variant 2 using VisualizeMCMCMarginals.m (see TestToy.m for how to do this). How do the mixing times of these chains compare? How do the final marginals compare to the exact marginals? Why?

- The Swendsen-Wang variants outperform the other approaches, with faster mixing and better final marginals. This is likely due to the block-flipping nature of Swendsen-Wang which allows us to flip blocks and quickly mix in environments with strong agreeing potentials.
- All variants perform poorly in the case of strong pairwise potentials. All algorithms are subject to positive feedback loops with the tight loops in our grid and strong pairwise agreement potentials, preventing appropriate mixing.
- Having strong pairwise potentials enforcing agreement is not a problem for any of these sampling methods, and all perform equally well, mixing quickly and ending up close to the final marginals.
- Gibbs outperforms the other variants in this instance. Gibbs has some issues with strong pairwise potentials, but is not nearly as bad as MH where blocks end up stuck with the same level so we cannot mix appropriately.

Q3. Set the on-diagonal weight of our toy image network to .5 and the off-diagonal weight to .5. Now visualize multiple runs with each of Gibbs, MHUniform, Swendsen-Wang variant 1, and Swendsen-Wang variant 2 using VisualizeMCMCMarginals.m (see TestToy.m for how to do this). How do the mixing times of these chains compare? How do the final marginals compare to the exact marginals? Why?

- All variants perform equally well. They all mix quickly and have very low variance throughout their runs, remaining close to the true marginals. This is because the pairwise marginals do not force us into preferring agreement when we should not.
- Gibbs and MHUniform perform very well and are somewhat better than the Swendsen-Wang variants. This is because the first two variants use local moves, so the local marginals remain consistently close to the true marginals, while SW allows big swings over multiple variables that perturb the distribution.
- Gibbs performs poorly relative to the other variants, exhibiting slower mixing time and marginals further from the exact ones. This difference is likely due to Gibbs's strong global dependence, which prevents it from acting appropriately unless all variables are relatively well synced to their true marginals.
- Swendsen-Wang outperforms the other variants, though all perform relatively well. SW is better because its larger block moves allow for faster mixing and mean it reaches marginal estimates closer to the true marginals faster.

Q4. When creating our proposal distribution for Swendsen-Wang, what happens if you set all the q_{i,j} to 0?

- Switching q_{i,j} to 0 is equivalent to a randomized variant of Gibbs sampling where we are allowed to take a random, rather than fixed, order.
- Switching q_{i,j} to 0 is equivalent to MH-Uniform.
- Switching q_{i,j} to 0 is equivalent to the first variant of Swendsen-Wang.
- Switching q_{i,j} to 0 leaves us without a valid proposal distribution and is not a feasible sampling algorithm.

#### Quiz 4: Inference in Temporal Models

Q1. Unrolling DBNs. Which independencies hold in the unrolled network for the following 2-TBN for all t?

(Hint: it may be helpful to draw the unrolled DBN for several slices.)

- (Weather^t ⊥ Velocity^t | Weather^(t−1), Obs^(1…t))
- (Weather^t ⊥ Velocity^t | Obs^(1…t))
- None of these
- (Weather^t ⊥ Location^t | Velocity^t, Obs^(1…t))
- (Failure^t ⊥ Location^t | Obs^(1…t))
- (Failure^t ⊥ Velocity^t | Obs^(1…t))

Q2. *Limitations of Inference in DBNs. What makes inference in DBNs difficult?

- Standard clique tree inference cannot be applied to a DBN
- As t grows large, we generally lose independencies of the form (X^(t) ⊥ Y^(t) | …)
- As t grows large, we generally lose all independencies in the ground network
- In many networks, maintaining an exact belief state over the variables requires a full joint distribution over all variables in each time slice

Q3. Entanglement in DBNs. Which of the following are consequences of entanglement in Dynamic Bayesian Networks over discrete variables?

- The belief state never factorizes.
- All variables in the unrolled DBN become correlated.
- The size of an exact representation of the belief state is exponentially large in the number of variables.
- The size of an exact representation of the belief state is quadratic in the number of variables.
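The exponential-size option above can be made concrete with a tiny sketch: a belief state that does not factorize must store one probability per joint assignment of the per-slice variables, i.e. 2^n entries for n binary variables (n = 10 here is an arbitrary illustrative choice).

```python
import itertools

# A belief state over n binary per-slice variables that does not
# factorize needs one entry per joint assignment: 2**n entries.
n = 10
assignments = list(itertools.product((0, 1), repeat=n))
belief_state_size = len(assignments)
print(belief_state_size)  # 1024 == 2**10
```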

### Week 5 Quiz Answers

#### Quiz 1: Inference Final Exam

Q1. Reparameterization. Suppose we have a calibrated clique tree T and a calibrated cluster graph G for the same Markov network, and have thrown away the original factors. Now we wish to reconstruct the joint distribution over all the variables in the network only from the beliefs and sepsets. Is it possible for us to do so from the beliefs and sepsets in T? Separately, is it possible for us to do so from the beliefs and sepsets in G?

- It is possible in G but not in T.
- It is possible in both T and G.
- It is not possible in T or G.
- It is possible in T but not in G.

Q2. *Markov Network Construction. Consider the unrolled network for the plate model shown below, where we have n students and m courses. Assume that we have observed the grade of all students in all courses. In general, what does a pairwise Markov network that is a minimal I-map for the conditional distribution look like? (Hint: the factors in the network are the CPDs reduced by the observed grades. We are interested in modeling the conditional distribution, so we do not need to explicitly include the Grade variables in this new network. Instead, we model their effect by appropriately choosing the factor values in the new network.)

- A fully connected graph with instantiations of the Difficulty and Intelligence variables.
- Impossible to tell without more information on the exact grades observed.
- A fully connected bipartite graph where instantiations of the Difficulty variables are on one side and instantiations of the Intelligence variables are on the other side.
- A graph over instantiations of the Difficulty variables and instantiations of the Intelligence variables, not necessarily bipartite; there could be edges between different Difficulty variables, and there could also be edges between different Intelligence variables.
- A bipartite graph where instantiations of the Difficulty variables are on one side and instantiations of the Intelligence variables are on the other side. In general, this graph will not be fully connected.

Q3. **Clique Tree Construction. Consider a pairwise Markov network that consists of a bipartite graph with m variables on one side and n on the other. The graph is fully connected, in that each of the m variables on one side is connected to all and only the n variables on the other side. Define the size of a clique to be the number of variables in the clique. There exists a clique tree T* for this pairwise Markov network such that the size of the largest clique in T* is the smallest amongst all possible clique trees for this network. What is the size of the largest sepset in T*?

Note: if you're wondering why we would ever care about this, remember that the complexity of inference depends on the number of entries in the largest factor produced in the course of message passing, which in turn is affected by the size of the largest clique in the network, amongst other things.

Hint: Use the relationship between sepsets and conditional independence to derive a lower bound for the size of the largest sepset, then construct a clique tree that achieves this bound.

- max(m,n)+1
- mn
- m+n
- min(m,n)+1
- min(m,n)
- mn+1
- max(m,n)
- m+n+1

Q4. Uses of Variable Elimination. Which of the following quantities can be computed using the sum-product variable elimination algorithm? (In the options, let X be a set of query variables, and E be a set of evidence variables in the respective networks.) You may select 1 or more options.

- P(X) in a Markov network
- The partition function for a Markov network
- P(X) in a Bayesian network
- The most likely assignment to the variables in a Bayesian network.

Q5. *Time Complexity of Variable Elimination. Consider a Bayesian network taking the form of a chain of n variables, X_1 → X_2 → ⋯ → X_n, where each of the X_i can take on k values. Assume we eliminate the X_i starting from X_2, going to X_3, …, X_n, and then back to X_1. What is the computational cost of running variable elimination with this ordering?

- O(nk)
- O(kn^2)
- O(k^n)
- O(nk^3)

Q6. *Numerical Issues in Belief Propagation. In practice, one of the issues that arises when we propagate messages in a clique tree is that when we multiply many small numbers, we quickly run into the precision limits of floating-point numbers, resulting in arithmetic underflow. One possible approach for addressing this problem is to renormalize each message, as it's passed, such that its entries sum to 1. Assume that we do not store the renormalization factor at each step. Which of the following statements describes the consequence of this approach?

- We will be unable to extract the partition function, but the variable marginals that are obtained from renormalizing the beliefs at each clique will still be correct.
- This does not change the results of the algorithm: when the clique tree is calibrated, we can obtain from it both the partition function and the correct marginals.
- This renormalization will give rise to incorrect marginals at calibration.
- Calibration will not even be achieved using this scheme.
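The trade-off in Q6 can be demonstrated on a tiny two-clique tree (a hypothetical chain A – B – C with made-up potentials): renormalizing the message leaves the clique marginals untouched, but the sum of the clique belief, which would otherwise equal the partition function, changes.

```python
# Two-clique tree for a chain A - B - C (all binary):
# C1 = {A,B} with psi1, C2 = {B,C} with psi2; sepset {B}.
psi1 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # keyed (a, b)
psi2 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}  # keyed (b, c)

def msg_1_to_2(normalize):
    delta = {b: sum(psi1[(a, b)] for a in (0, 1)) for b in (0, 1)}
    if normalize:  # renormalize the message, discarding the constant
        s = sum(delta.values())
        delta = {b: d / s for b, d in delta.items()}
    return delta

def belief2(normalize):
    d = msg_1_to_2(normalize)
    return {(b, c): psi2[(b, c)] * d[b] for b in (0, 1) for c in (0, 1)}

def marginals(bel):  # renormalize the clique belief, then read off P(C)
    z = sum(bel.values())
    return [sum(v for (b, c), v in bel.items() if c == cv) / z for cv in (0, 1)]

# Marginals agree whether or not the message was renormalized...
print(marginals(belief2(False)), marginals(belief2(True)))
# ...but the partition function is only recoverable from the raw beliefs.
print(sum(belief2(False).values()), sum(belief2(True).values()))
```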

Q7. Convergence in Belief Propagation. Suppose we ran belief propagation on a cluster graph G and a clique tree T for the same Markov network, which is a perfect map for a distribution P. Assume that both G and T are valid, i.e., they satisfy family preservation and the running intersection property. Which of the following statements regarding the algorithm are true? You may select 1 or more options.

- Assuming the algorithm converges, if a variable X appears in two clusters in G, the marginals P(X) computed from the two cluster beliefs must agree.
- If the algorithm converges, the final clique beliefs in T, when renormalized to sum to 1, are true marginals of P.
- If the algorithm converges, the final cluster beliefs in G, when renormalized to sum to 1, are true marginals of P.
- Assuming the algorithm converges, if a variable X appears in two cliques in T, the marginals P(X) computed from the two clique beliefs must agree.

Q8. Metropolis-Hastings Algorithm. Assume we have an n × n grid-structured MRF over the variables X_{i,j}. Let X_i = {X_{i,1}, …, X_{i,n}} and X_−i = X − X_i. Consider the following instance of the Metropolis-Hastings algorithm: at each step, we take our current assignment x_−i and use exact inference to compute the conditional probability P(X_i | x_−i). We then sample x'_i from this posterior distribution and use that as our proposal. What is the correct acceptance probability for this proposal?

Hint: what is the relationship between this and Gibbs sampling?
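Following the hint: when the MH proposal is the exact Gibbs conditional, the acceptance ratio cancels to 1 for every proposed move. A tiny numeric check with a made-up two-variable distribution (an assumption for illustration, not the quiz's grid MRF):

```python
import itertools

# Made-up joint over two binary variables (unnormalized is fine for ratios).
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def cond_x1(x2):
    """P(X1 | x2): the Gibbs conditional used as the MH proposal."""
    z = sum(p[(v, x2)] for v in (0, 1))
    return {v: p[(v, x2)] / z for v in (0, 1)}

def mh_acceptance(x, x_new):
    # A(x -> x') = min(1, P(x') Q(x | x') / (P(x) Q(x' | x)))
    q_fwd = cond_x1(x[1])[x_new[0]]   # propose new X1 given the shared x2
    q_bwd = cond_x1(x_new[1])[x[0]]   # reverse proposal probability
    return min(1.0, (p[x_new] * q_bwd) / (p[x] * q_fwd))

# For every pair of states differing only in X1, acceptance is exactly 1:
# the proposal already samples from the correct conditional.
accs = [mh_acceptance((a, b), (1 - a, b))
        for a, b in itertools.product((0, 1), repeat=2)]
print(accs)  # all entries equal 1 (up to float rounding)
```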

Q9. *Value of Information. In the influence diagram on the right, when does performing LabTest have value? That is, when would you want to observe the LabTest variable?

Hint: Think about when information is valuable in making a decision.

- When there is some treatment t such that V(D, t) is different for different diseases D.
- When there is some disease d such that argmax_t V(d, t) ≠ argmax_t ∑_d P(d) V(d, t).
- When there is some lab value l such that argmax_t ∑_d P(d | l) V(d, t) ≠ argmax_t ∑_d P(d) V(d, t).
- When P(D | L) is different from P(D).

Q10. *Belief Propagation.

Say you had a probability distribution P_Φ encoded in a set of factors Φ, and that you constructed a loopy cluster graph C to do inference in it. While you were performing loopy belief propagation on this graph, lightning struck and your computer shut down; to your horror, when you booted it back up, the only information you could recover was the graph structure C and the cluster beliefs at the current iteration. (For each cluster, the cluster belief is its initial potential multiplied by all incoming messages. You don't have access to the sepset beliefs, the messages, or the original factors Φ.) Assume the lightning struck before you had finished, i.e., the graph is not yet calibrated. Can you still recover the original distribution P_Φ from this? Why?

- We can reconstruct the original distribution by taking the product of cluster beliefs and normalizing it.
- We can reconstruct the (unnormalized) original distribution by taking the ratio of the product of cluster beliefs to sepset beliefs, and the sepset beliefs can be obtained by marginalizing the cluster beliefs.
- We can't reconstruct the (unnormalized) original distribution because we don't have the sepset beliefs to compute the ratio of the product of cluster beliefs to sepset beliefs.
- We can't reconstruct the original distribution because we were performing loopy belief propagation, and the reparameterization property doesn't hold when it's loopy.

**Conclusion**

Hopefully, this article will be useful for you to find all the **week, final assessment, and peer-graded assessment answers of the Probabilistic Graphical Models 2: Inference quiz on Coursera**, and to grab some premium knowledge with less effort. If this article really helped you in any way, then make sure to share it with your friends on social media and let them know about this amazing course. You can also check out our other course answers. So stay with us, guys; we will share a lot more free courses and their exam/quiz solutions, and follow our Techno-RJ **Blog** for more updates.