# Machine Learning: Clustering & Retrieval Coursera Quiz Answers 2022 [💯Correct Answer]

Hello peers! Today we are sharing the assessment and quiz answers for every week of the Machine Learning: Clustering & Retrieval course on Coursera, free of cost ✅✅✅. This is a certification course open to every interested student.

If you cannot find this course for free, you can apply for financial aid to take it at no cost.

Coursera is a large online learning platform that offers many courses, some of them free. The courses come from recognized universities and are taught by professors and industry experts in a clear, accessible way.

Below you will find the Machine Learning: Clustering & Retrieval exam answers, marked in bold.

These answers were updated recently and cover every weekly assessment and the final exam of the Machine Learning: Clustering & Retrieval free certification course from Coursera. They are 100% correct ✅.

Use “Ctrl+F” to find the answer to any question. On mobile, tap the three dots in your browser and choose the “Find” option to search for any question.

## About Machine Learning: Clustering & Retrieval Course

In this third case study, locating related documents, you will investigate similarity-based retrieval techniques. This course will also look at structured representations for representing documents in a corpus, such as clustering and mixed membership models like latent Dirichlet allocation (LDA). You will use expectation maximization (EM) to discover how to cluster documents and scale the approaches using MapReduce.

Course Apply Link – Machine Learning: Clustering & Retrieval

## Machine Learning: Clustering & Retrieval Quiz Answers

#### Quiz 1: Representations and metrics

Question 1: Consider three data points with two features as follows: Among the three points, which two are closest to each other in terms of having the smallest Euclidean distance?

• A and B
• A and C
• B and C

Question 2: Consider three data points with two features as follows: Among the three points, which two are closest to each other in terms of having the largest cosine similarity (or equivalently, smallest cosine distance)?

• A and B
• A and C
• B and C

Question 3: Consider the following two sentences.

• Sentence 1: The quick brown fox jumps over the lazy dog.
• Sentence 2: A quick brown dog outpaces a quick fox.

Compute the Euclidean distance using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]

where # word is the count of that word in the document.
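As a rough illustration of the recipe above, the following Python sketch builds the word counts and computes the Euclidean distance between the two sentences. The helper names `word_counts` and `euclidean_distance` are made up for this example and do not come from the course materials.

```python
import math
import string
from collections import Counter


def word_counts(sentence):
    """Lowercase the text, strip punctuation, and count the tokens."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two sparse word-count vectors."""
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))


s1 = word_counts("The quick brown fox jumps over the lazy dog.")
s2 = word_counts("A quick brown dog outpaces a quick fox.")
distance = euclidean_distance(s1, s2)
print(round(distance, 3))  # 3.606
```

For these two sentences the squared count differences sum to 13, so the distance is $\sqrt{13} \approx 3.606$.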

Question 4: Consider the following two sentences.

• Sentence 1: The quick brown fox jumps over the lazy dog.
• Sentence 2: A quick brown dog outpaces a quick fox.

Recall that

$$\text{cosine distance} = 1 - \text{cosine similarity} = 1 - \frac{x^T y}{\|x\| \, \|y\|}$$

Compute the cosine distance between sentence 1 and sentence 2 using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]

where # word is the count of that word in the document.
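The cosine distance can be sketched the same way. This is a minimal example that assumes plain word counts with no TF-IDF weighting; the helper names `word_counts` and `cosine_distance` are hypothetical.

```python
import math
import string
from collections import Counter


def word_counts(sentence):
    """Lowercase the text, strip punctuation, and count the tokens."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


def cosine_distance(counts_a, counts_b):
    """1 - (a . b) / (||a|| * ||b||) for two sparse word-count vectors."""
    # Words missing from counts_b contribute 0 to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


x = word_counts("The quick brown fox jumps over the lazy dog.")
y = word_counts("A quick brown dog outpaces a quick fox.")
cos_dist = cosine_distance(x, y)
print(round(cos_dist, 3))  # 0.565
```

Here $x^T y = 5$, $\|x\|^2 = 11$ and $\|y\|^2 = 12$, so the cosine distance is $1 - 5/\sqrt{132} \approx 0.565$.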

Question 5: (True/False) For positive features, cosine similarity is always between 0 and 1.

• True
• False

Question 6: Which of the following does not describe the word count document representation? (Note: this is different from TF-IDF document representation.)

• Ignores the order of the words
• Assigns a high score to a frequently occurring word
• Penalizes words that appear in every document

Question 1: Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

Question 2: Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?

• Between Obama and Biden
• Between Obama and Bush
• Between Biden and Bush

Question 3: Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama’s page. Which of the following is NOT one of the 10 words?

• the
• presidential
• in
• act
• his

Question 4: Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

Question 5: Compute the Euclidean distance between TF-IDF features of Obama and Biden. Round your answer to 3 decimal places. Use American-style decimals (e.g. 110.921).

#### Quiz 3: KD-trees

Question 1: Which of the following is not true about KD-trees?

• It divides the feature space into nested axis-aligned boxes.
• It can be used only for approximate nearest neighbor search but not for exact nearest neighbor search.
• It prunes parts of the feature space away from consideration by inspecting smallest possible distances that can be achieved.
• The query time scales sublinearly with the number of data points and exponentially with the number of dimensions.
• It works best in low to medium-dimension settings.

Question 2: Questions 2, 3, 4, and 5 involve training a KD-tree on the following dataset:

Train a KD-tree by hand as follows:

• First split using X1 and then using X2. Alternate between X1 and X2 in order.
• Use the “middle-of-the-range” heuristic for each split: take the midpoint between the maximum and minimum of the coordinates of the member points.
• Keep subdividing until every leaf node contains two or fewer data points.

What is the split value used for the first split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).
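The construction procedure described in this question can be sketched in Python as follows. The quiz dataset is not reproduced on this page, so the six points below are invented purely for illustration. Sending points that lie exactly on the split value to the right child is also an assumption of this sketch.

```python
def build_kdtree(points, depth=0, leaf_size=2):
    """Build a KD-tree over 2-D points with middle-of-the-range splits."""
    # Stop when the node is small enough (or cannot be split any further).
    if len(points) <= leaf_size or len(set(points)) == 1:
        return {"leaf": True, "points": points}
    axis = depth % 2  # alternate X1, X2, X1, ...
    coords = [p[axis] for p in points]
    split = (min(coords) + max(coords)) / 2.0  # midpoint of the range
    left = [p for p in points if p[axis] < split]
    right = [p for p in points if p[axis] >= split]
    return {
        "leaf": False,
        "axis": axis,
        "split": split,
        "left": build_kdtree(left, depth + 1, leaf_size),
        "right": build_kdtree(right, depth + 1, leaf_size),
    }


def find_leaf(tree, query):
    """Walk down the tree to the leaf whose box contains the query point."""
    node = tree
    while not node["leaf"]:
        side = "left" if query[node["axis"]] < node["split"] else "right"
        node = node[side]
    return node["points"]


# Hypothetical data, NOT the quiz dataset.
points = [(-1.0, 0.5), (2.0, 3.0), (0.5, -2.0), (4.0, 1.0), (-3.0, 2.5), (1.5, 0.0)]
tree = build_kdtree(points)
leaf_points = find_leaf(tree, (-3.0, 1.5))
print(tree["split"], leaf_points)
```

With these made-up points, X1 ranges from -3 to 4, so the first split falls at 0.5. The points sharing a leaf with the query are the starting candidates for the exact search, before any backtracking.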

Question 3: Refer to Question 2 for context.

What is the split value used for the second split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).

Question 4: Refer to Question 2 for context.

Given a query point (-3, 1.5), which of the data points belong to the same leaf node as the query point? Choose all that apply.

• Data point 1
• Data point 2
• Data point 3
• Data point 4
• Data point 5
• Data point 6

Question 5: Refer to Question 2 for context.

Perform backtracking with the query point (-3, 1.5) to perform exact nearest neighbor search. Which of the data points would be pruned from the search? Choose all that apply.

Hint: Assume that each node in the KD-tree remembers the tight bound on the coordinates of its member points, as follows:

• Data point 1
• Data point 2
• Data point 3
• Data point 4
• Data point 5
• Data point 6

#### Quiz 4: Locality Sensitive Hashing

Question 1: (True/False) Like KD-trees, Locality Sensitive Hashing lets us compute exact nearest neighbors while inspecting only a fraction of the data points in the training set.

• True
• False

Question 2: (True/False) Given two data points with high cosine similarity, the probability that a randomly drawn line would separate the two points is small.

• True
• False

Question 3: (True/False) The true nearest neighbor of the query is guaranteed to fall into the same bin as the query.

• True
• False

Question 4: (True/False) Locality Sensitive Hashing is more efficient than KD-trees in high-dimensional settings.

• True
• False

Question 5: Suppose you trained an LSH model and performed a lookup using the bin index of the query. You notice that the candidates returned are not at all similar to the query item. Which of the following changes would not produce a more relevant list of candidates?

• Use multiple tables.
• Increase the number of random lines/hyperplanes.
• Inspect more neighboring bins to the bin containing the query.
• Decrease the number of random lines/hyperplanes.
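To make the role of the random lines/hyperplanes concrete, here is a minimal sketch of sign-based hashing for cosine similarity. It assumes Gaussian random hyperplanes through the origin. The function names and the example vector are hypothetical and are not taken from the course notebook.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def bin_bits(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    ]


def bin_index(bits):
    """Read the bit list as a binary number, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


hyperplanes = make_hyperplanes(num_planes=4, dim=3, seed=0)
vector = [1.0, 2.0, -0.5]  # hypothetical data point
bits = bin_bits(vector, hyperplanes)
# Rescaling a vector keeps its direction, so it lands in the same bin.
scaled_bits = bin_bits([2.0 * v for v in vector], hyperplanes)
print(bits, bin_index(bits))
```

Each extra hyperplane doubles the number of possible bins, so the bins shrink and hold fewer candidates. Using fewer hyperplanes has the opposite effect.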

#### Quiz 5: Implementing Locality Sensitive Hashing from scratch

Question 1: What is the document ID of Barack Obama’s article?

Question 2: Which bin contains Barack Obama’s article? Enter its integer index.

Question 3: Examine the bit representations of the bins containing Barack Obama and Joe Biden. In how many places do they agree?

• 16 out of 16 places (Barack Obama and Joe Biden fall into the same bin)
• 15 out of 16 places
• 13 out of 16 places
• 11 out of 16 places
• 9 out of 16 places

Question 4: Refer to the section “Effect of nearby bin search”. What was the smallest search radius that yielded the correct nearest neighbor for Obama, namely Joe Biden?

Question 5: Suppose our goal was to produce 10 approximate nearest neighbors whose average distance from the query document is within 0.01 of the average for the true 10 nearest neighbors. For Barack Obama, the true 10 nearest neighbors are on average about 0.77. What was the smallest search radius for Barack Obama that produced an average distance of 0.78 or better?
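The idea of a search radius can be illustrated with a short sketch. It assumes a bin is identified by a tuple of bits and that "radius r" means all bins whose bit pattern differs from the query's in exactly r places. Both helper names are hypothetical.

```python
from itertools import combinations


def agreement(bits_a, bits_b):
    """Number of positions where two bit vectors agree."""
    return sum(a == b for a, b in zip(bits_a, bits_b))


def nearby_bins(query_bits, search_radius):
    """Yield every bit vector that differs from query_bits in exactly search_radius places."""
    for positions in combinations(range(len(query_bits)), search_radius):
        flipped = list(query_bits)
        for p in positions:
            flipped[p] = 1 - flipped[p]
        yield tuple(flipped)


query = (0,) * 16  # hypothetical 16-bit bin
radius_two = list(nearby_bins(query, 2))
print(len(radius_two))  # 120, i.e. 16 choose 2
print(agreement(query, radius_two[0]))  # 14
```

The number of bins to inspect grows combinatorially with the radius, which is the cost paid for better recall.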

#### Quiz 1: k-means

Question 1: (True/False) k-means always converges to a local optimum.

• True
• False

Question 2: (True/False) The clustering objective is non-increasing throughout a run of k-means.

• True
• False

Question 3: (True/False) Running k-means with a larger value of k always enables a lower possible final objective value than running k-means with smaller k.

• True
• False

Question 4: (True/False) Any initialization of the centroids in k-means is just as good as any other.

• True
• False

Question 5: (True/False) Initializing centroids using k-means++ guarantees convergence to a global optimum.

• True
• False

Question 6: (True/False) Initializing centroids using k-means++ costs more than random initialization in the beginning, but can pay off eventually by speeding up convergence.

• True
• False

Question 7: (True/False) Using k-means++ can only influence the number of iterations to convergence, not the quality of the final assignments (i.e., objective value at convergence).

• True
• False

Question 8: Consider the following dataset:

Perform k-means with k=2 until the cluster assignment does not change between successive iterations. Use the following initialization for the centroids:

Which of the five data points changed its cluster assignment most often during the k-means run?

• Data point 1
• Data point 2
• Data point 3
• Data point 4
• Data point 5
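For readers who want to trace a k-means run by hand, the sketch below alternates the assignment and centroid-update steps until the assignments stop changing. The dataset and initial centroids from the quiz are not reproduced on this page, so the values here are invented for illustration only.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_clusters(points, centroids):
    """Assign each point to the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == k]
        if members:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            updated.append(old)
    return updated


def heterogeneity(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


def kmeans(points, centroids, max_iter=100):
    assignments, history = None, []
    for _ in range(max_iter):
        new_assignments = assign_clusters(points, centroids)
        if new_assignments == assignments:  # converged
            break
        assignments = new_assignments
        centroids = update_centroids(points, assignments, centroids)
        history.append(heterogeneity(points, assignments, centroids))
    return assignments, centroids, history


# Hypothetical data and initialization, NOT the quiz values.
points = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (4.0, 4.0), (5.0, 5.0)]
assignments, centroids, history = kmeans(points, [(0.0, 0.0), (1.0, 1.0)])
print(assignments, centroids, history)
```

Recording the assignments at every iteration, rather than only the final ones, would show which point switches clusters most often.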

Question 9: Suppose we initialize k-means with the following centroids

Which of the following best describes the cluster assignment in the first iteration of k-means?

#### Quiz 2: Clustering text data with K-means

Question 1: (True/False) The clustering objective (heterogeneity) is non-increasing for this example.

• True
• False

Question 2: Let’s step back from this particular example. If the clustering objective (heterogeneity) would ever increase when running K-means, that would indicate: (choose one)

• K-means algorithm got stuck in a bad local minimum
• There is a bug in the K-means code
• All data points consist of exact duplicates
• Nothing is wrong. The objective should generally go down sooner or later.

Question 3: Refer to the output of K-means for K=3 and seed=0. Which of the three clusters contains the greatest number of data points in the end?

• Cluster #0
• Cluster #1
• Cluster #2

Question 4: Another way to capture the effect of changing initialization is to look at the distribution of cluster assignments. Compute the size (# of member data points) of clusters for each of the multiple runs of K-means.

Look at the size of the largest cluster (most # of member data points) across multiple runs, with seeds 0, 20000, …, 120000. What is the maximum value this quantity takes?

Question 5: Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the greatest number of articles?

• Cluster 0: artists, books, him/his
• Cluster 4: music, orchestra, symphony
• Cluster 5: female figures from various fields
• Cluster 7: law, courts, justice

Question 6: Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the fewest articles?

• Cluster 1: film, theater, tv, actor
• Cluster 3: elections, ministers
• Cluster 6: composers, songwriters, singers, music producers
• Cluster 7: law, courts, justice
• Cluster 8: football

Question 7: Another sign of too large K is having lots of small clusters. Look at the distribution of cluster sizes (by number of member data points). How many of the 100 clusters have fewer than 236 articles, i.e. 0.4% of the dataset?

#### Quiz 3: MapReduce for k-means

Question 1: Suppose we are operating on a 1D vector. Which of the following operations is not data parallel over the vector elements?

• Add a constant to every element.
• Multiply the vector by a constant.
• Increment the vector by another vector of the same dimension.
• Compute the average of the elements.
• Compute the sign of each element.

Question 2: (True/False) A single mapper call can emit multiple (key,value) pairs.

• True
• False

Question 3: (True/False) More than one reducer can emit (key,value) pairs with the same key simultaneously.

• True
• False

Question 4: (True/False) Suppose we are running k-means using MapReduce. Some mappers may be launched for a new k-means iteration even if some reducers from the previous iteration are still running.

• True
• False

Question 5: Consider the following list of binary operations. Which can be used for the reduce step of MapReduce? Choose all that apply.

Hint: The reduce step requires a binary operator that satisfies both of the following conditions.

• Commutative: $OP(x_1, x_2) = OP(x_2, x_1)$
• Associative: $OP(OP(x_1, x_2), x_3) = OP(x_1, OP(x_2, x_3))$

• $OP_1(x_1, x_2) = \max(x_1, x_2)$
• $OP_2(x_1, x_2) = x_1 + x_2 - 2$
• $OP_3(x_1, x_2) = 3x_1 + 2x_2$
• $OP_4(x_1, x_2) = x_1^2 + x_2$
• $OP_5(x_1, x_2) = (x_1 + x_2)/2$
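The two conditions in the hint can be spot-checked numerically. The sketch below assumes the garbled options read OP3 = 3·x1 + 2·x2 and OP4 = x1² + x2. A check over sample values can only disprove a property; a pass suggests the property holds but does not prove it.

```python
from itertools import product

# Assumed reading of the five candidate operators.
operators = {
    "OP1": lambda a, b: max(a, b),
    "OP2": lambda a, b: a + b - 2,
    "OP3": lambda a, b: 3 * a + 2 * b,
    "OP4": lambda a, b: a ** 2 + b,
    "OP5": lambda a, b: (a + b) / 2,
}

samples = [-2, 0, 1, 3, 5]  # arbitrary test values


def is_commutative(op):
    return all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))


def is_associative(op):
    return all(
        op(op(a, b), c) == op(a, op(b, c)) for a, b, c in product(samples, repeat=3)
    )


results = {
    name: (is_commutative(op), is_associative(op)) for name, op in operators.items()
}
for name, (commutative, associative) in results.items():
    print(name, "commutative:", commutative, "associative:", associative)
```

An operator is usable in the reduce step only if both checks pass. For the ones that pass here, the algebra confirms it: max is commutative and associative, and $x_1 + x_2 - 2$ yields $x_1 + x_2 + x_3 - 4$ under either grouping.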

#### Quiz 1: EM for Gaussian mixtures

Question 1: (True/False) While the EM algorithm maintains uncertainty about the cluster assignment for each observation via soft assignments, the model assumes that every observation comes from only one cluster.

• True
• False

Question 2: (True/False) In high dimensions, the EM algorithm runs the risk of setting cluster variances to zero.

• True
• False

Question 3: In the EM algorithm, what do the E step and M step represent, respectively?

• Estimate cluster responsibilities, Maximize likelihood over parameters
• Estimate likelihood over parameters, Maximize cluster responsibilities
• Estimate number of parameters, Maximize likelihood over parameters
• Estimate likelihood over parameters, Maximize number of parameters

Question 4: Suppose we have data that come from a mixture of 6 Gaussians (i.e., that is the true data structure). Which model would we expect to have the highest log-likelihood after fitting via the EM algorithm?

• A mixture of Gaussians with 2 component clusters
• A mixture of Gaussians with 4 component clusters
• A mixture of Gaussians with 6 component clusters
• A mixture of Gaussians with 7 component clusters
• A mixture of Gaussians with 10 component clusters

Question 5: Which of the following correctly describes the differences between EM for mixtures of Gaussians and k-means? Choose all that apply.

• k-means often gets stuck in a local minimum, while EM tends not to
• EM is better at capturing clusters of different sizes and orientations
• EM is better at capturing clusters with overlaps
• EM is less prone to overfitting than k-means
• k-means is equivalent to running EM with infinitesimally small diagonal covariances.

Question 6: Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

Which is the most probable cluster for data point 3?

• Cluster A
• Cluster B
• Cluster C

Question 7: Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

Suppose also that the data points are as follows:

Let us compute the new mean for Cluster A. What is the Z coordinate of the new mean? Round your answer to 3 decimal places.
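In the M-step, the new mean of a cluster is the responsibility-weighted average of the data points, $\mu_k = \sum_i r_{ik} x_i / \sum_i r_{ik}$. The sketch below applies that formula. The responsibility matrix and data points from the quiz are not reproduced on this page, so the numbers are made up for illustration.

```python
def weighted_mean(points, responsibilities):
    """Responsibility-weighted mean of the points for one cluster."""
    total = sum(responsibilities)
    dim = len(points[0])
    return tuple(
        sum(r * p[d] for r, p in zip(responsibilities, points)) / total
        for d in range(dim)
    )


# Hypothetical responsibility matrix: one row per data point,
# one column per cluster (A, B, C). Each row sums to 1.
resp = [
    [0.50, 0.30, 0.20],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
# Hypothetical data points with coordinates (X, Y, Z).
points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 10.0)]

resp_a = [row[0] for row in resp]  # column for Cluster A
mean_a = weighted_mean(points, resp_a)
print(mean_a)  # (3.25, 4.25, 5.5)

# Most probable cluster for data point 3 (index 2): the largest entry in its row.
best_cluster = "ABC"[max(range(3), key=lambda k: resp[2][k])]
print(best_cluster)  # C
```

The Z coordinate asked for in the question is the last entry of the returned tuple.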

Question 8: Which of the following contour plots describes a Gaussian distribution with diagonal covariance? Choose all that apply.

• (1)
• (2)
• (3)
• (4)
• (5)

Question 9: Suppose we initialize EM for mixtures of Gaussians (using full covariance matrices) with the following clusters:

Which of the following best describes the updated clusters after the first iteration of EM?

#### Quiz 2: Implementing EM for Gaussian mixtures

Question 1: What is the weight that EM assigns to the first component after running the above codeblock? Round your answer to 3 decimal places.

Question 2: Using the same set of results, obtain the mean that EM assigns the second component. What is the mean in the first dimension? Round your answer to 3 decimal places.

Question 3: Using the same set of results, obtain the covariance that EM assigns the third component. What is the variance in the first dimension? Round your answer to 3 decimal places.

Question 4: Is the loglikelihood plot monotonically increasing, monotonically decreasing, or neither?

• Monotonically increasing
• Monotonically decreasing
• Neither

Question 5: Calculate the likelihood (score) of the first image in our data set (img) under each Gaussian component through a call to multivariate_normal.pdf. Given these values, what cluster assignment should we make for this image?

• Cluster 0
• Cluster 1
• Cluster 2
• Cluster 3

Question 6: Four of the following images are not in the list of top 5 images in the first cluster. Choose these four.

#### Quiz 3: Clustering text data with Gaussian mixtures

Question 1: Select all the topics that have a cluster in the model created above.

• Baseball
• Soccer/football
• Music
• Politics
• Law
• Finance

Question 2: Try fitting EM with the random initial parameters you created above. What is the final loglikelihood that the algorithm converges to? Choose the range that contains this value.

• Less than 2.2e9
• Between 2.2e9 and 2.3e9
• Between 2.3e9 and 2.4e9
• Between 2.4e9 and 2.5e9
• Greater than 2.5e9

Question 3: Is the final loglikelihood larger or smaller than the final loglikelihood we obtained above when initializing EM with the results from running k-means?

• Initializing EM with k-means led to a larger final loglikelihood
• Initializing EM with k-means led to a smaller final loglikelihood

Question 4: For the above model, out_random_init, use the visualize_EM_clusters method you created above. Are the clusters more or less interpretable than the ones found after initializing using k-means?

• More interpretable
• Less interpretable

#### Quiz 1: Latent Dirichlet Allocation

Question 1: (True/False) According to the assumptions of LDA, each document in the corpus contains words about a single topic.

• True
• False

Question 2: (True/False) Using LDA to analyze a set of documents is an example of a supervised learning task.

• True
• False

Question 3: (True/False) When training an LDA model, changing the ordering of words in a document does not affect the overall joint probability.

• True
• False

Question 4: (True/False) Suppose in a trained LDA model two documents have no topics in common (i.e., one document has 0 weight on any topic with non-zero weight in the other document). As a result, a single word in the vocabulary cannot have high probability of occurring in both documents.

• True
• False

Question 5: (True/False) Topic models are guaranteed to produce weights on words that are coherent and easily interpretable by humans.

• True
• False

#### Quiz 2: Learning LDA model via Gibbs sampling

Question 1: (True/False) Each iteration of Gibbs sampling for Bayesian inference in topic models is guaranteed to yield a higher joint model probability than the previous sample.

• True
• False

Question 2: (Check all that are true) Bayesian methods such as Gibbs sampling can be advantageous because they

• Account for uncertainty over parameters when making predictions
• Are faster than methods such as EM
• Maximize the log probability of the data under the model
• Regularize parameter estimates to avoid extreme values

Question 3: For the standard LDA model discussed in the lectures, how many parameters are required to represent the distributions defining the topics?

• [# unique words]
• [# unique words] * [# topics]
• [# documents] * [# unique words]
• [# documents] * [# topics]

Question 4: Suppose we have a collection of documents, and we are focusing our analysis to the use of the following 10 words. We ran several iterations of collapsed Gibbs sampling for an LDA model with K=2 topics and alpha=10.0 and gamma=0.1 (with notation as in the collapsed Gibbs sampling lecture). The corpus-wide assignments at our most recent collapsed Gibbs iteration are summarized in the following table of counts:

We also have a single document $i$ with the following topic assignments for each word:

Suppose we want to re-compute the topic assignment for the word “manager”. To sample a new topic, we need to compute several terms to determine how much the document likes each topic, and how much each topic likes the word “manager”. The following questions will all relate to this situation.

First, using the notation in the slides, what is the value of $m_{\text{manager}, 1}$ (i.e., the number of times the word “manager” has been assigned to topic 1)?

Question 5: Consider the situation described in Question 4.

What is the value of $\sum_w m_{w, 1}$, where the sum is taken over all words in the vocabulary?

Question 6: Consider the situation described in Question 4.

Following the notation in the slides, what is the value of $n_{i, 1}$ for this document $i$ (i.e., the number of words in document $i$ assigned to topic 1)?

Question 7: In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $n_{i, 2}$?

Question 8: In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $m_{\text{manager}, 2}$?
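The decrement-then-resample step can be sketched as follows. The count tables from the quiz are not reproduced on this page, so every count below is invented. The sketch also assumes the usual collapsed Gibbs conditional, $p(z = k) \propto \frac{n_{i,k} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{w,k} + \gamma}{\sum_{w'} m_{w',k} + V\gamma}$, which should be checked against the lecture slides. Topics are indexed from 0 in the code, so the quiz's topic 2 is index 1.

```python
def decrement(doc_topic, word_topic, topic_totals, word, topic):
    """Remove one word's current topic assignment from all count tables."""
    doc_topic[topic] -= 1          # n_{i,k}
    word_topic[word][topic] -= 1   # m_{word,k}
    topic_totals[topic] -= 1       # sum over words of m_{w,k}


def topic_probabilities(doc_topic, word_topic, topic_totals, word,
                        alpha, gamma, vocab_size):
    """Normalized probability of each topic for the removed word."""
    num_topics = len(doc_topic)
    remaining = sum(doc_topic)  # N_i - 1 once the word has been removed
    weights = []
    for k in range(num_topics):
        doc_likes_topic = (doc_topic[k] + alpha) / (remaining + num_topics * alpha)
        topic_likes_word = (word_topic[word][k] + gamma) / (
            topic_totals[k] + vocab_size * gamma
        )
        weights.append(doc_likes_topic * topic_likes_word)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical counts, NOT the quiz tables.
doc_topic = [3, 2]                    # words of document i per topic
word_topic = {"manager": [10, 5]}     # corpus-wide counts for one word
topic_totals = [50, 40]               # corpus-wide words per topic

decrement(doc_topic, word_topic, topic_totals, "manager", topic=1)
probs = topic_probabilities(doc_topic, word_topic, topic_totals, "manager",
                            alpha=10.0, gamma=0.1, vocab_size=10)
print(doc_topic, word_topic["manager"], topic_totals, probs)
```

A new topic for the word would then be drawn at random according to `probs`, and the three count tables incremented for the sampled topic.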

#### Quiz 3: Modeling text topics with Latent Dirichlet Allocation

Question 1: Identify the top 3 most probable words for the first topic.

• institute
• university
• president
• board
• game
• coach

Question 2: What is the sum of the probabilities assigned to the top 50 words in the 3rd topic? Round your answer to 3 decimal places.

Question 3: What is the topic most closely associated with the article about former US President George W. Bush? Use the average results from 100 topic predictions.

Question 4: What are the top 3 topics corresponding to the article about English football (soccer) player Steven Gerrard? Use the average results from 100 topic predictions.

• international athletics
• team sports
• general music
• Great Britain and Australia
• science and research

Question 5: What was the value of alpha used to fit our original topic model?

Question 6: What was the value of gamma used to fit our original topic model? Remember that Turi Create uses “beta” instead of “gamma” to refer to the hyperparameter that influences topic distributions over words.

Question 7: How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the low alpha model? Use the average results from 100 topic predictions.

Question 8: How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the high alpha model? Use the average results from 100 topic predictions.

Question 9: For each topic of the low gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from Turi Create with the cdf_cutoff argument.)

Question 10: For each topic of the high gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from Turi Create with the cdf_cutoff argument).
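The quantity asked for in Questions 9 and 10 can be illustrated without Turi Create. The sketch below counts how many of a topic's most probable words are needed before the cumulative probability reaches a cutoff. The two topic distributions are invented, and this sketch counts the word that crosses the cutoff, which may differ at the boundary from how `cdf_cutoff` behaves in `get_topics()`.

```python
def words_to_reach(probabilities, cutoff=0.5):
    """Number of top words needed for cumulative probability >= cutoff."""
    total = 0.0
    ranked = sorted(probabilities, reverse=True)
    for count, p in enumerate(ranked, start=1):
        total += p
        if total >= cutoff:
            return count
    return len(ranked)


# Hypothetical word distributions for two topics.
peaked_topic = [0.4, 0.3, 0.2, 0.1]   # mass concentrated on a few words
flat_topic = [0.125] * 8              # mass spread evenly

counts = [words_to_reach(t) for t in (peaked_topic, flat_topic)]
average = sum(counts) / len(counts)
print(counts, average)  # [2, 4] 3.0
```

A flatter topic needs more words to reach the same total probability, which is the effect the low-gamma and high-gamma comparison is meant to expose.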

#### Quiz 1: Modeling text data with a hierarchy of clusters

Question 1: Which diagram best describes the hierarchy right after splitting the athletes cluster?

Question 2: Let us bipartition the clusters female figures and politicians & government officials. Which diagram best describes the resulting hierarchy of clusters for the non-athletes?

Note: The clusters for the athletes and artists are not shown, to save space.

## Case Study: Finding Similar Documents

A reader is interested in a specific news story, and you want to recommend similar stories. What is the right notion of similarity? What if there are millions of other documents? Each time you want to retrieve a document, do you need to search through all the others? How do you group similar documents together? How do you discover new and emerging topics that the documents cover?

Learning Objectives: By the end of this course, you will be able to:

• Build a document retrieval system based on k-nearest neighbors.
• Identify various similarity metrics for text data.
• Use KD-trees to reduce computations in k-nearest neighbor search.
• Use locality-sensitive hashing to compute approximate nearest neighbors.
• Compare and contrast supervised and unsupervised learning tasks.
• Use k-means to cluster documents by topic.
• Explain how to parallelize k-means using MapReduce.
• Examine probabilistic clustering approaches based on mixture models.
• Fit a Gaussian mixture model using expectation maximization (EM).
• Perform mixed membership modeling using latent Dirichlet allocation (LDA).
• Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
• Compare and contrast initialization techniques for non-convex optimization objectives.
• Implement these techniques in Python.

SKILLS YOU WILL GAIN

• Data Clustering Algorithms
• K-Means Clustering
• Machine Learning
• K-D Tree

### Conclusion  