diff --git a/dev/lablecture_day4_clustering.key b/dev/lablecture_day4_clustering.key index e6cea5a..4d9f54d 100755 Binary files a/dev/lablecture_day4_clustering.key and b/dev/lablecture_day4_clustering.key differ diff --git a/docs/index.html b/docs/index.html index 5e57a26..bd169ef 100644 --- a/docs/index.html +++ b/docs/index.html @@ -314,7 +314,7 @@

Day 4

13:00 - 14:00 Take-home exam simulation - +Exam 14:00 - 16:00 diff --git a/docs/lab/lab_day3_pca.html b/docs/lab/lab_day3_pca.html index b731759..4b4e0ab 100644 --- a/docs/lab/lab_day3_pca.html +++ b/docs/lab/lab_day3_pca.html @@ -192,7 +192,7 @@

On this page

Download datasets here or from Canvas.

R script: Code

-

Lab Lecture

+

Lab Lecture: Slides

Exercise 1: Food

In the first exercise, we explore a low-dimensional dataset, Food.txt.

@@ -985,7 +985,7 @@

Exercise 3 R script: [Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab2_pca.R) -Lab Lecture +Lab Lecture: [Slides](presentation/Lab_pca.pdf) diff --git a/docs/lab/lab_day4_clustering.html b/docs/lab/lab_day4_clustering.html index aafac1e..4b81ddb 100644 --- a/docs/lab/lab_day4_clustering.html +++ b/docs/lab/lab_day4_clustering.html @@ -153,17 +153,24 @@

On this page

@@ -191,35 +198,215 @@

On this page

Download datasets here, or from Canvas.

R script: Code

-

Presentation: Slides

-
-

Exercise 1: NCI60

-

We look at the NCI60 data again. First load the dataset.

+

Presentation: Slides

+
+

Exercise 1: Food

+

We use the same Food.txt data to illustrate two concepts: hierarchical clustering and heatmaps.

+

This is not a genomics dataset, but we use it for teaching purposes because it is easy to interpret.

+

Let us load the dataset.

+
+
food <- read.table('data/Food.txt', header=T)
+# we change the name from pulses to a more common name, legume
+colnames(food)[7] <- 'Legume'
+head(food) # print first 6 lines 
+
+
               Meat Pigs Eggs Milk Fish Cereals Legume Fruit
+Albania        10.1  1.4  0.5  8.9  0.2    42.3    5.5   1.7
+Austria         8.9 14.0  4.3 19.9  2.1    28.0    1.3   4.3
+Belg.Lux.      13.5  9.3  4.1 17.5  4.5    26.6    2.1   4.0
+Bulgaria        7.8  6.0  1.6  8.3  1.2    56.7    3.7   4.2
+Czechoslovakia  9.7 11.4  2.8 12.5  2.0    34.3    1.1   4.0
+Denmark        10.6 10.8  3.7 25.0  9.9    21.9    0.7   2.4
+
+
+

We scale the data (sometimes called standardizing or normalizing) so that each column (feature, variable) has mean 0 and variance 1. We call the scaled data food_s.

-
library(ISLR)
-# or, load('data/NCI60.RData')
-nci.labs <- NCI60$labs # Sample labels (tissue type)
-nci.data <- NCI60$data # Gene expression data set
+
food_s <- scale(food)
+head(food_s) # print first 6 lines
+
+
                      Meat       Pigs       Eggs        Milk        Fish
+Albania         0.08126490 -1.8299828 -2.2437259 -1.15570645 -1.20028213
+Austria        -0.27725673  1.6636208  1.2335002  0.39161231 -0.64187467
+Belg.Lux.       1.09707621  0.3604512  1.0504883  0.05401549  0.06348211
+Bulgaria       -0.60590157 -0.5545403 -1.2371605 -1.24010566 -0.90638347
+Czechoslovakia -0.03824231  0.9427184 -0.1390890 -0.64931122 -0.67126454
+Denmark         0.23064892  0.7763564  0.6844645  1.10900556  1.65053488
+                  Cereals     Legume       Fruit
+Albania         0.9159176  1.2227536 -1.35040507
+Austria        -0.3870690 -0.8923886  0.09091397
+Belg.Lux.      -0.5146342 -0.4895043 -0.07539207
+Bulgaria        2.2280161  0.3162641  0.03547862
+Czechoslovakia  0.1869740 -0.9931096 -0.07539207
+Denmark        -0.9428885 -1.1945517 -0.96235764
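If you want to convince yourself that the scaling did what we described, one quick check is to look at the column means and standard deviations of food_s. A small sketch using only base R:

# each column of the scaled data should have mean (approximately) 0 and standard deviation 1
round(colMeans(food_s), digits = 2)
round(apply(food_s, 2, sd), digits = 2)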
+
+
+
+

Distances

+

To do hierarchical clustering, the most convenient command is hclust(). As input it needs the pairwise distances between the subjects (patients, or in this example, countries). We do this on the scaled data.

+

The command to compute pairwise distances is dist(). By default, it computes the Euclidean distance (details optional). The Euclidean distance is probably the most commonly used metric, but there are others; see ?dist() for more options.

+

We can present the pairwise distances in matrix format. You can see that this matrix is symmetric, with 0 on the diagonal - which should be intuitive: the distance between A and B is the same as between B and A, and the distance between A and itself is 0.

+
+
# compute distance
+food_dist <- dist(food_s)
+# round(food_dist, digits = 2) # try this yourself to see what it does
+# alternatively, look at this as a matrix
+food_dist_matrix <- as.matrix(food_dist)
+round(food_dist_matrix[1:5, 1:5], digits = 2) # first 5 row 5 col
+
+
               Albania Austria Belg.Lux. Bulgaria Czechoslovakia
+Albania           0.00    5.95      5.13     2.77           4.44
+Austria           5.95    0.00      2.11     4.71           1.98
+Belg.Lux.         5.13    2.11      0.00     4.45           2.20
+Bulgaria          2.77    4.71      4.45     0.00           3.17
+Czechoslovakia    4.44    1.98      2.20     3.17           0.00
+
+
+
+
+
+ +
+
+Optional: Euclidean distance +
+
+
+

You can check that the Euclidean distance between Albania and Austria is indeed 5.95. This distance is the square root of the sum of squared differences between the two subjects across all their measurements.

+
+
+
round(food_s[1:2,], digits = 2) # we only keep first 2 digits
+
+
         Meat  Pigs  Eggs  Milk  Fish Cereals Legume Fruit
+Albania  0.08 -1.83 -2.24 -1.16 -1.20    0.92   1.22 -1.35
+Austria -0.28  1.66  1.23  0.39 -0.64   -0.39  -0.89  0.09
+
+
# take the data for two countries each 
+albania <- round(food_s[1,], digits = 2)
+austria <- round(food_s[2,], digits = 2)
+# compute difference between each col
+d <- albania - austria
+d
+
+
   Meat    Pigs    Eggs    Milk    Fish Cereals  Legume   Fruit 
+   0.36   -3.49   -3.47   -1.55   -0.56    1.31    2.11   -1.44 
+
+
# euclidean distance: square each element, sum together, and take a square root
+sqrt(sum(d^2)) 
+
+
[1] 5.942096
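Euclidean is not the only option: dist() supports other metrics through its method argument. A small sketch showing the Manhattan distance (sum of absolute differences) for the same countries; food_dist_man is just an illustrative name:

# Manhattan distance: sum of absolute differences, instead of the root of summed squares
food_dist_man <- dist(food_s, method = "manhattan")
round(as.matrix(food_dist_man)[1:5, 1:5], digits = 2)
# for Albania vs Austria this equals sum(abs(food_s[1,] - food_s[2,]))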
+
+
+

Hierarchical clustering

+

Now that we have computed the distance food_dist, we plug it into the clustering algorithm, hclust().

+

We try the complete linkage method by specifying method = 'complete'. The result is saved as hc.complete. You can visualize it, and add the country names as labels to make it easier to read.

+
+
hc.complete <- hclust(food_dist, method="complete")
+plot(hc.complete, labels=rownames(food), main="Complete Linkage", xlab="", sub="")
+
+

+
+
+
+
+
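The dendrogram itself does not assign cluster labels. If you want, say, three groups of countries, you can cut the tree with cutree(), which is covered in more detail in Exercise 2. A small sketch; food_clusters is just an illustrative name:

# cut the complete-linkage dendrogram into 3 clusters
food_clusters <- cutree(hc.complete, k = 3)
table(food_clusters)   # how many countries in each cluster
# food_clusters        # uncomment to see the cluster label of each country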

Linkage, dissimilarity, scaling

+

Hierarchical clustering is a class of methods, and there are a variety of options to set.

+
    +
  • Linkage (by setting method inside hclust()): complete, single, average
  • Dissimilarity: Euclidean, correlation, …
  • Scaling: scaled data (mean 0, variance 1) or unscaled, original data
+

There is no definitive guide to which combination works best, so you can try them out and see which makes the most sense. Again, in unsupervised learning the data do not have outcome labels, so the interpretation is left to the domain experts.

+
+
# single linkage
+hc.single <- hclust(food_dist, method="single")
+plot(hc.single, labels=rownames(food), main="Single Linkage", xlab="", sub="")
+
+

+
+
# average linkage
+hc.average <- hclust(food_dist, method="average")
+plot(hc.average, labels=rownames(food), main="Average Linkage", xlab="", sub="")
+
+

+
+
+
+
# unscaled data, complete linkage
+hc.unscaled <- hclust(dist(food), method="complete")
+plot(hc.unscaled, labels=rownames(food), main="Complete linkage with unscaled features", xlab="", sub="")
+
+

+
+
+
+
# correlation as dissimilarity, rather than Euclidean distance
+dd <- as.dist(1-cor(t(food_s))) # compute the metric
+hc.corr <- hclust(dd, method="complete") # cluster
+plot(hc.corr, labels=rownames(food), main="Complete linkage with correlation-based distance", xlab="", sub="")
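Beyond eyeballing the dendrograms, one way to compare these options is to cut the trees into the same number of clusters and cross-tabulate the assignments. A small sketch with three clusters, using the objects created above:

# do complete and average (or single) linkage group the same countries together?
k <- 3
table(complete = cutree(hc.complete, k), average = cutree(hc.average, k))
table(complete = cutree(hc.complete, k), single  = cutree(hc.single, k))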
+
+

+
+
+
+
+

Heatmap

+

A heatmap is a visualization tool that plots similar values in similar colors, so that you can visually identify any patterns. It can also be combined with hierarchical clustering - this is actually the default behaviour: dendrograms are displayed for both rows and columns.

+
+
# make heatmap on the scaled data
+heatmap(food_s)
+
+

+
+
+

To preserve the original ordering of the columns and rows, you can specify Rowv = NA, Colv = NA.

+
+
# no clustering for row or col, this preserves the original ordering
+heatmap(food_s, Rowv = NA, Colv = NA)
+
+

+
+
+

You can also cluster only the rows (or only the columns).

+
+
# only clustering for row
+heatmap(food_s, Colv = NA)
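By default, heatmap() builds its dendrograms with dist() and hclust() using their default settings. If you prefer another combination, for example the correlation-based dissimilarity from the previous section, you can pass your own functions via the distfun and hclustfun arguments; a small sketch:

# heatmap with correlation-based dissimilarity and complete linkage
heatmap(food_s,
        distfun   = function(x) as.dist(1 - cor(t(x))),
        hclustfun = function(d) hclust(d, method = "complete"))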
+
+

+
+
+
+
+
+

Exercise 2: NCI60

+

We look at the NCI60 data again. First load the dataset.

+
+
library(ISLR)
+# or, load('data/NCI60.RData')
+nci.labs <- NCI60$labs # Sample labels (tissue type)
+nci.data <- NCI60$data # Gene expression data set
+
+
+

Hierarchical clustering

We start by scaling the data and calculating the distance matrix (using the Euclidean distance), and then investigate different linkage methods.

-
# Scale the data to zero mean and unit variance:
-sd.data <- scale(nci.data)
-
-# Calculate the distance matrix 
-# equivalent: data.dist <- dist(sd.data, method="euclidean")
-data.dist <- dist(sd.data)
+
# Scale the data to zero mean and unit variance:
+sd.data <- scale(nci.data)
+
+# Calculate the distance matrix 
+# equivalent: data.dist <- dist(sd.data, method="euclidean")
+data.dist <- dist(sd.data)

Next we perform hierarchical clustering with the distance matrix as input. The function we use is hclust(). We specify the linkage method to be complete.

Once the result is saved in the hc.complete object, you can plot the dendrogram.

-
# Perform clustering
-hc.complete <- hclust(data.dist, method="complete")
-
-# names(hc.complete)
-plot(hc.complete, labels=nci.labs, main="Complete Linkage", xlab="", sub="")
+
# Perform clustering
+hc.complete <- hclust(data.dist, method="complete")
+
+# names(hc.complete)
+plot(hc.complete, labels=nci.labs, main="Complete Linkage", xlab="", sub="")

@@ -227,30 +414,30 @@

Hierarchical clust

The object hc.complete contains a lot of information. To extract it, you can use the $ operator.

Refer to the documentation for hclust() (?hclust) to see a complete list of the output components and how to use the function.

-
hc.complete$dist.method # distance method
+
hc.complete$dist.method # distance method
[1] "euclidean"
-
# hc.complete$merge  # order of aggregations of samples / clusters
-# hc.complete$height # distance at which aggregations happen
-# hc.complete$labels # labels (numeric, since we don't know the original categories!)
-# hc.complete$method
-# hc.complete$call
+
# hc.complete$merge  # order of aggregations of samples / clusters
+# hc.complete$height # distance at which aggregations happen
+# hc.complete$labels # labels (numeric, since we don't know the original categories!)
+# hc.complete$method
+# hc.complete$call

We can try different linkage methods and see how the clustering results differ. Change the method argument in the function, and plot the results.

-
hc.average <- hclust(data.dist, method="average")
-hc.single <- hclust(data.dist, method="single")
-
-plot(hc.average, labels=nci.labs, main="Average Linkage", xlab="", sub="")
-plot(hc.single, labels=nci.labs,  main="Single Linkage", xlab="", sub="")
+
hc.average <- hclust(data.dist, method="average")
+hc.single <- hclust(data.dist, method="single")
+
+plot(hc.average, labels=nci.labs, main="Average Linkage", xlab="", sub="")
+plot(hc.single, labels=nci.labs,  main="Single Linkage", xlab="", sub="")

Now we focus on complete linkage only.

First, we use cutree() to compare the results when the data are separated into either 2 or 4 clusters.

-
# Compare 2 clusters and 4 clusters:
-hc.clusters <- cutree(hc.complete, c(2, 4))
-head(hc.clusters) # print first 6 results
+
# Compare 2 clusters and 4 clusters:
+hc.clusters <- cutree(hc.complete, c(2, 4))
+head(hc.clusters) # print first 6 results
   2 4
 V1 1 1
@@ -260,8 +447,8 @@ 

Hierarchical clust V5 1 2 V6 1 2

-
# cross tabulation
-table(hc.clusters[,"2"], hc.clusters[,"4"])
+
# cross tabulation
+table(hc.clusters[,"2"], hc.clusters[,"4"])
   
      1  2  3  4
@@ -271,25 +458,25 @@ 

Hierarchical clust

It is more straightforward to check the results with a dendrogram.

-
# visualize the cuts
-# how do you know where to draw the line? check height
-heights <- hc.complete$height
-tail(heights, 4)  # print the last 4
+
# visualize the cuts
+# how do you know where to draw the line? check height
+heights <- hc.complete$height
+tail(heights, 4)  # print the last 4
[1] 137.5633 141.2472 142.9218 162.2074
-
plot(hc.complete, labels=nci.labs, main="Complete Linkage", xlab="", sub="")
-abline(h=140, col="red")  # 4 clusters
-abline(h=150, col="blue") # 2 clusters
+
plot(hc.complete, labels=nci.labs, main="Complete Linkage", xlab="", sub="")
+abline(h=140, col="red")  # 4 clusters
+abline(h=150, col="blue") # 2 clusters
-

+

The height values are easy to interpret: each one is the distance at which two clusters are merged into one. For example, the largest value (162.2) is where the final two groups are merged - if you check the figure, it is exactly the height of the horizontal line joining them. Similarly, 142.9 is where three groups became two, and 141.2 is where four groups became three. A line drawn at height 140 therefore separates the four clusters.
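Instead of reading the number of clusters off the figure, you can also cut the tree at a chosen height with cutree() and count the resulting groups; a small sketch using the heights above:

# cutting between 142.9 and 162.2 leaves 2 clusters; cutting just below 141.2 leaves 4
table(cutree(hc.complete, h = 150))
table(cutree(hc.complete, h = 140))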

How are the labels distributed between clusters? We can focus on the 4-cluster case, and use table() to list which cancer types end up in which of the four clusters.

For example, breast cancer appears in all but the 3rd cluster; melanoma appears only in the first cluster; and so on.

-
table(hc.clusters[,"4"], nci.labs)
+
table(hc.clusters[,"4"], nci.labs)
   nci.labs
     BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro
@@ -308,16 +495,16 @@ 

Hierarchical clust

Finally, we see what happens if we use unscaled data instead of scaled data, or if we use a correlation-based distance metric instead of the Euclidean distance.

Compare the dendrograms: How different are the resulting clusterings? Do you recognise subclusters that are consistent?

-
# Compare scaled data versus non-scaled data:
-hc.unscaled <- hclust(dist(nci.data), method="complete")
-plot(hc.unscaled, labels=nci.labs, main="Complete linkage with unscaled features", xlab="", sub="")
+
# Compare scaled data versus non-scaled data:
+hc.unscaled <- hclust(dist(nci.data), method="complete")
+plot(hc.unscaled, labels=nci.labs, main="Complete linkage with unscaled features", xlab="", sub="")

-
# Compare Euclidean distance with correlation-based distance:
-dd <- as.dist(1-cor(t(sd.data)))
-hc.corr <- hclust(dd, method="complete")
-plot(hc.corr, labels=nci.labs, main="Complete linkage with correlation-based distance", xlab="", sub="")
+
# Compare Euclidean distance with correlation-based distance:
+dd <- as.dist(1-cor(t(sd.data)))
+hc.corr <- hclust(dd, method="complete")
+plot(hc.corr, labels=nci.labs, main="Complete linkage with correlation-based distance", xlab="", sub="")
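To make the comparison concrete, you can cut both trees into the same number of clusters and cross-tabulate the assignments; a small sketch with four clusters:

# overlap between the Euclidean-based and correlation-based clusterings
table(euclidean   = cutree(hc.complete, k = 4),
      correlation = cutree(hc.corr, k = 4))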

@@ -329,9 +516,9 @@

K-means clustering

In contrast to hierarchical clustering, which requires a distance matrix as input, K-means takes the data matrix itself. The data matrix can be scaled (centered, with unit variance) or unscaled.

In this example we use the scaled data computed before, sd.data.

-
set.seed(4) # set random seed
-km.out4 <- kmeans(sd.data, centers = 4, nstart=20)
-km.out4$cluster
+
set.seed(4) # set random seed
+km.out4 <- kmeans(sd.data, centers = 4, nstart=20)
+km.out4$cluster
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
@@ -346,10 +533,10 @@ 

K-means clustering

Read the help file ?kmeans to understand what the argument nstart=20 does. Comparing an analysis with nstart=20 versus nstart=1 demonstrates how the cluster results can be improved if we allow more evaluations with different randomly chosen starting centroids.

Set a different random seed, say 3 (as long as it’s different from the one you used before), and run the analysis again. This time we use a different nstart.

-
# different starting centroids improve the clustering:
-set.seed(3)
-km.out <- kmeans(sd.data, centers = 4, nstart=1)
-km.out$cluster
+
# different starting centroids improve the clustering:
+set.seed(3)
+km.out <- kmeans(sd.data, centers = 4, nstart=1)
+km.out$cluster # cluster label
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
@@ -360,18 +547,14 @@ 

K-means clustering

V61 V62 V63 V64 2 2 2 2
-
km.out$tot.withinss
-
-
[1] 344566.9
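If you want a numeric summary of the difference, you can compare the objective that K-means minimizes, the total within-cluster sum of squares, between the two fits above (km.out used nstart=1, km.out4 used nstart=20); a small sketch:

# total within-cluster sum of squares: smaller is better
km.out$tot.withinss   # nstart = 1
km.out4$tot.withinss  # nstart = 20, typically at least as good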
-

Compare with hierarchical clustering

-
# we can directly compare the k-means result (along rows)
-# with the hierarchical clustering result (along columns)
-table(km.out4$cluster, hc.clusters[,"4"], deparse.level=2)
+
# we can directly compare the k-means result (along rows)
+# with the hierarchical clustering result (along columns)
+table(km.out4$cluster, hc.clusters[,"4"], deparse.level=2)
               hc.clusters[, "4"]
 km.out4$cluster  1  2  3  4
@@ -387,49 +570,49 @@ 

Compa

Visualize clusters

We can visualise the K-means clustering results of high-dimensional data by using PCA for dimension reduction. We plot the first two principal components and colour the data points (= individual cell lines) by their assigned cluster from K-means.

-
# first, run PCA again on the NCI60 data
-pr.out <- prcomp(nci.data, scale=TRUE)
-
-# more cluster options
-km.out2 <- kmeans(sd.data, 2, nstart=20)
-km.out3 <- kmeans(sd.data, 3, nstart=20)
-
-# we can now visualise the K-Means results by labelling the data points
-# in a plot of the scores of the first 2 principal components:
-par(mfrow=c(1,3))
-plot(pr.out$x[,1:2], col=(km.out2$cluster+1), main="K-Means with K=2",
-     xlab="PC 1", ylab="PC 2", pch=20)
-plot(pr.out$x[,1:2], col=(km.out3$cluster+1), main="K-Means with K=3",
-     xlab="PC 1",  ylab="PC 2", pch=20)
-plot(pr.out$x[,1:2], col=(km.out4$cluster+1), main="K-Means with K=4",
-     xlab="PC 1", ylab="PC 2", pch=20)
+
# first, run PCA again on the NCI60 data
+pr.out <- prcomp(nci.data, scale=TRUE)
+
+# more cluster options
+km.out2 <- kmeans(sd.data, 2, nstart=20)
+km.out3 <- kmeans(sd.data, 3, nstart=20)
+
+# we can now visualise the K-Means results by labelling the data points
+# in a plot of the scores of the first 2 principal components:
+par(mfrow=c(1,3))
+plot(pr.out$x[,1:2], col=(km.out2$cluster+1), main="K-Means with K=2",
+     xlab="PC 1", ylab="PC 2", pch=20)
+plot(pr.out$x[,1:2], col=(km.out3$cluster+1), main="K-Means with K=3",
+     xlab="PC 1",  ylab="PC 2", pch=20)
+plot(pr.out$x[,1:2], col=(km.out4$cluster+1), main="K-Means with K=4",
+     xlab="PC 1", ylab="PC 2", pch=20)

Compare with the plot from yesterday's Exercise 2 (left panel), where the points are coloured by cancer type. The clusters from K-means seem to partition the data into groups reasonably well.

-
par(mfrow=c(1,1))
-
-Cols=function(vec){
-  cols=rainbow(length(unique(vec)))
-  return(cols[as.numeric(as.factor(vec))])
-}
-plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19,xlab="PC 1",ylab=" PC 2")
-legend('topleft', col=rainbow(length(unique(nci.labs))), legend=unique(nci.labs), bty='n', lwd=2, cex=.6)
+
par(mfrow=c(1,1))
+
+Cols=function(vec){
+  cols=rainbow(length(unique(vec)))
+  return(cols[as.numeric(as.factor(vec))])
+}
+plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19,xlab="PC 1",ylab=" PC 2")
+legend('topleft', col=rainbow(length(unique(nci.labs))), legend=unique(nci.labs), bty='n', lwd=2, cex=.6)
-

+

-
-

Heatmap

+
+

Heatmap

A heatmap is another way to visualize the clusters in the data. We use the principal components rather than the raw data, as the PCs already explain a large amount of the variability in the more than 6000 features.

Similar values are presented with similar colors.

-
## We use the scores of the PCA on the NCI60 data, to reduce dimension
-scores <- pr.out$x
-scores[1:5, 1:5] # first 5 pc, first 5 measurements
+
## We use the scores of the PCA on the NCI60 data, to reduce dimension
+scores <- pr.out$x
+scores[1:5, 1:5] # first 5 pc, first 5 measurements
         PC1       PC2         PC3         PC4        PC5
 V1 -19.68245  3.527748  -9.7354382   0.8177816 -12.511081
@@ -438,33 +621,33 @@ 

Heatmap

V4 -42.48098 -9.691742 -0.8830921 -3.4180227 -41.938370 V5 -54.98387 -5.158121 -20.9291076 -15.7253986 -10.361364
-
#  default choices
-heatmap(pr.out$x)
+
#  default choices
+heatmap(pr.out$x)

You can remove the dendrogram for the PCs, keeping only the one for the cancer types. Now you see that the PCs keep their original order from 1 to 64.

-
# hc.corr is the result from hclust. check the section on hierarchical clustering
-heatmap(pr.out$x, Rowv = as.dendrogram(hc.corr), Colv = NA)
+
# hc.corr is the result from hclust. check the section on hierarchical clustering
+heatmap(pr.out$x, Rowv = as.dendrogram(hc.corr), Colv = NA)

You can also reduce the number of PCs, and add titles to the plot and the y-axis.

-
par(cex.main = .7)
-heatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA,
-        labRow = nci.labs, main = 'Heatmap of the scores of the first 40 PCs on the NCI60 data')
+
par(cex.main = .7)
+heatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA,
+        labRow = nci.labs, main = 'Heatmap of the scores of the first 40 PCs on the NCI60 data')

-
-

Exercise 2: Gene expression data

+
+

Exercise 3: Gene expression data

(CH12Ex13 from statistical learning)

We use the Ch12Ex13.csv data to repeat some of the clustering analysis we did.

The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

@@ -472,36 +655,36 @@

Exercise 2

Carry out both hierarchical clustering and K-means clustering. You should choose the most meaningful number of clusters (think about how many groups of patients we have!). Compare the results.

Note: remember that the data has genes on the rows and patients on the columns. You need to transpose the data so that the patients are on the rows and the genes on the columns.

-
# load in the data using read.csv(). You will need to select header=F.
-data <- read.csv("data/Ch12Ex13.csv", header=FALSE)
-dim(data)
+
# load in the data using read.csv(). You will need to select header=F.
+data <- read.csv("data/Ch12Ex13.csv", header=FALSE)
+dim(data)
[1] 1000   40
-
# transpose the data, so that we have each row is one patient (subject)
-data <- t(data) 
+
# transpose the data, so that we have each row is one patient (subject)
+data <- t(data) 

Now the first 20 rows are measurements from healthy patients (group 0), and rows 21-40 are from the diseased patients (group 1). We can encode this information in a vector like this.

-
true.groups <- c( rep(0,20), rep(1,20))
+
true.groups <- c( rep(0,20), rep(1,20))
-
-

Hierarchical clustering

+
+

Hierarchical clustering

You can use different linkage options and distance metrics of your choosing. For example, with complete linkage the code looks like this.

-
data.dist <- dist(data) # need to compute the distance matrix
-hclust.df <- hclust(data.dist, method="complete" )
-#alternatives:
-#hclust.df <- hclust( D, method="average" )
-#hclust.df <- hclust( D, method="single" )
+
data.dist <- dist(data) # need to compute the distance matrix
+hclust.df <- hclust(data.dist, method="complete" )
+#alternatives:
+#hclust.df <- hclust( D, method="average" )
+#hclust.df <- hclust( D, method="single" )

We can cut the tree into 2 clusters with cutree(). Then do a cross tabulation of the true labels and the clustering results: how well do they correspond?

-
# find the clusters
-predicted <- cutree( hclust.df, k=2 )
-
-# How well does our clustering predict health vs. diseased
-table(predicted, true.groups )
+
# find the clusters
+predicted <- cutree( hclust.df, k=2 )
+
+# How well does our clustering predict health vs. diseased
+table(predicted, true.groups )
         true.groups
 predicted  0  1
@@ -514,9 +697,9 @@ 

Hierarchical clu

K-means

Now you can use K-means to identify 2 clusters.

-
predicted.kmean <- kmeans(data, 2, nstart=20)$cluster
-# agreement with true label
-table(predicted.kmean, true.groups )
+
predicted.kmean <- kmeans(data, 2, nstart=20)$cluster
+# agreement with true label
+table(predicted.kmean, true.groups )
               true.groups
 predicted.kmean  0  1
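You can also check how well the two clustering methods agree with each other (rather than with the true labels) by cross-tabulating their assignments; a small sketch using the objects defined above:

# agreement between hierarchical clustering and K-means cluster labels
table(predicted, predicted.kmean)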
@@ -817,386 +1000,565 @@ 

K-means

} }); diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-complete-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-complete-1.png new file mode 100644 index 0000000..3a8a155 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-complete-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-cor-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-cor-1.png new file mode 100644 index 0000000..c6a0472 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-cor-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap1-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap1-1.png new file mode 100644 index 0000000..a5b73b3 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap1-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap2-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap2-1.png new file mode 100644 index 0000000..29197c8 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap2-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap3-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap3-1.png new file mode 100644 index 0000000..99dac6e Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-heatmap3-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-1.png new file mode 100644 index 0000000..8d417d8 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-2.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-2.png new file mode 100644 index 0000000..f9a8fa2 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-linkage-2.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/hc-food-unscaled-1.png b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-unscaled-1.png new file mode 100644 index 0000000..d0de0f0 Binary files /dev/null and b/docs/lab/lab_day4_clustering_files/figure-html/hc-food-unscaled-1.png differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-keeppc-1.png b/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-keeppc-1.png deleted file mode 100644 index 4278cb2..0000000 Binary files a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-keeppc-1.png and /dev/null differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plot2-1.png b/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plot2-1.png deleted file mode 100644 index 960cf47..0000000 Binary files a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plot2-1.png and /dev/null differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plotpc-1.png b/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plotpc-1.png deleted file mode 100644 index bfc6f02..0000000 Binary files a/docs/lab/lab_day4_clustering_files/figure-html/pca-nci-plotpc-1.png and /dev/null differ diff --git a/docs/lab/lab_day4_clustering_files/figure-html/clust-nci60-cutree-1.png b/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-18-1.png similarity index 100% rename from 
docs/lab/lab_day4_clustering_files/figure-html/clust-nci60-cutree-1.png rename to docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-18-1.png diff --git a/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-14-1.png b/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-25-1.png similarity index 100% rename from docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-14-1.png rename to docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-25-1.png diff --git a/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-7-1.png b/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-7-1.png deleted file mode 100644 index 6b9de14..0000000 Binary files a/docs/lab/lab_day4_clustering_files/figure-html/unnamed-chunk-7-1.png and /dev/null differ diff --git a/docs/lab/overview.html b/docs/lab/overview.html index 7090ed8..0eef877 100644 --- a/docs/lab/overview.html +++ b/docs/lab/overview.html @@ -245,7 +245,7 @@

Lab notes and R sc Day 4 Day 4: Clustering Code - +Slides @@ -579,7 +579,7 @@

Lab notes and R sc | Day 2 | [Day 2: Multiple testing](lab_day2_testing.qmd) | [Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab1.R), [Code (solution)](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab1_exercise_solution.R) | | | Day 3 | [Day 3: Principal Component Analysis](lab_day3_pca.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab2_pca.R) | [Slides](presentation/Lab_pca.pdf) | -| Day 4 | [Day 4: Clustering](lab_day4_clustering.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R) | | +| Day 4 | [Day 4: Clustering](lab_day4_clustering.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R) | [Slides](presentation/Lab_clustering.pdf)|

diff --git a/docs/lab/presentation/Lab_clustering.pdf b/docs/lab/presentation/Lab_clustering.pdf new file mode 100644 index 0000000..ea019d2 Binary files /dev/null and b/docs/lab/presentation/Lab_clustering.pdf differ diff --git a/docs/search.json b/docs/search.json index 0cb8c14..155b23f 100644 --- a/docs/search.json +++ b/docs/search.json @@ -46,7 +46,7 @@ "href": "index.html", "title": "MED3007: Statistical Principles in Genomics", "section": "", - "text": "Welcome!\nThis is the course website for MED3007: Statistical Principles in Genomics at the Faculty of Medicine, University of Oslo.\nThe course website is developed by the instructors of the course, hosted for free and public accessible on Github. Course material can be found in the github repository.\n\nStructure\n\nGet Started provides some information about software installation, data download and code.\nR Lab and Code hosts the lab session exercises and code.\n\n\n\n\n\n\n\nOfficial course webpage\n\n\n\nPlease refer to the official course page by University of Oslo for information related to application, evaluation and other administrative matters.\n\n\n\n\nPreparation before the course starts\nYou should have R installed on your computer before the course. More on preparation please read here.\nIf you have trouble setting it up with your own laptop, you can also use the PC in the PC lab.\n\n\n\nSchedule\nYou can find the time and place on the official course schedule at the University of Oslo course website.\nIf there is a conflict of information, please refer to the official schedule.\n\nDay 1\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nIntroduction to the course\nLecture\n\n\n10:00 - 11:30\nLab: Introduction to R and Rstudio\nLab: intro to R, Code\n\n\n13:00 - 14:00\nData screening and multiple testing\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 2\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Data screening and multiple testing\n\n\n\n10:00 - 11:30\nExercises\nLab: multiple testing, Code, Code (solution)\n\n\n13:00 - 14:00\nData visualization, dimensional reduction\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 3\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Data visualization, dimensional reduction\n\n\n\n10:00 - 11:30\nExercises\nLab: PCA, Code\n\n\n13:00 - 14:00\nClustering\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 4\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Clustering and heatmaps\n\n\n\n10:00 - 11:30\nExercises\nLab: clustering, Code\n\n\n13:00 - 14:00\nTake-home exam simulation\n\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\nPapers\n\nPaper I: Cappelletti 2022\nPaper II: Ankill 2022" + "text": "Welcome!\nThis is the course website for MED3007: Statistical Principles in Genomics at the Faculty of Medicine, University of Oslo.\nThe course website is developed by the instructors of the course, hosted for free and public accessible on Github. Course material can be found in the github repository.\n\nStructure\n\nGet Started provides some information about software installation, data download and code.\nR Lab and Code hosts the lab session exercises and code.\n\n\n\n\n\n\n\nOfficial course webpage\n\n\n\nPlease refer to the official course page by University of Oslo for information related to application, evaluation and other administrative matters.\n\n\n\n\nPreparation before the course starts\nYou should have R installed on your computer before the course. 
More on preparation please read here.\nIf you have trouble setting it up with your own laptop, you can also use the PC in the PC lab.\n\n\n\nSchedule\nYou can find the time and place on the official course schedule at the University of Oslo course website.\nIf there is a conflict of information, please refer to the official schedule.\n\nDay 1\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nIntroduction to the course\nLecture\n\n\n10:00 - 11:30\nLab: Introduction to R and Rstudio\nLab: intro to R, Code\n\n\n13:00 - 14:00\nData screening and multiple testing\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 2\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Data screening and multiple testing\n\n\n\n10:00 - 11:30\nExercises\nLab: multiple testing, Code, Code (solution)\n\n\n13:00 - 14:00\nData visualization, dimensional reduction\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 3\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Data visualization, dimensional reduction\n\n\n\n10:00 - 11:30\nExercises\nLab: PCA, Code\n\n\n13:00 - 14:00\nClustering\nLecture\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\n\n\nDay 4\n\n\n\nTime\nTopic\nCourse material\n\n\n\n\n9:00 - 9:45\nLab: Clustering and heatmaps\n\n\n\n10:00 - 11:30\nExercises\nLab: clustering, Code\n\n\n13:00 - 14:00\nTake-home exam simulation\nExam\n\n\n14:00 - 16:00\nQ&A\n\n\n\n\nPapers\n\nPaper I: Cappelletti 2022\nPaper II: Ankill 2022" }, { "objectID": "about.html", @@ -186,14 +186,14 @@ "href": "lab/overview.html#lab-notes-and-r-scripts", "title": "R Lab: Overview", "section": "Lab notes and R scripts", - "text": "Lab notes and R scripts\n\n\n\nDay\nLab notes\nR script\nPresentation\n\n\n\n\nDay 1\nDay 1: Introduction to R\nCode\nSlides\n\n\n\n(supplement): Navigate RStudio and workspace\n\n\n\n\nDay 2\nDay 2: Multiple testing\nCode, Code (solution)\n\n\n\nDay 3\nDay 3: Principal Component Analysis\nCode\nSlides\n\n\n\n\n\n\n\n\nDay 4\nDay 4: Clustering\nCode" + "text": "Lab notes and R scripts\n\n\n\nDay\nLab notes\nR script\nPresentation\n\n\n\n\nDay 1\nDay 1: Introduction to R\nCode\nSlides\n\n\n\n(supplement): Navigate RStudio and workspace\n\n\n\n\nDay 2\nDay 2: Multiple testing\nCode, Code (solution)\n\n\n\nDay 3\nDay 3: Principal Component Analysis\nCode\nSlides\n\n\n\n\n\n\n\n\nDay 4\nDay 4: Clustering\nCode\nSlides" }, { "objectID": "lab/lab_day3_pca.html", "href": "lab/lab_day3_pca.html", "title": "R Lab (day 3): Data exploration, Principal Component Analysis", "section": "", - "text": "Download datasets here or from Canvas.\nR script: Code\nLab Lecture" + "text": "Download datasets here or from Canvas.\nR script: Code\nLab Lecture: Slides" }, { "objectID": "lab/lab_day3_pca.html#nci60", @@ -383,5 +383,26 @@ "title": "R Lab (day 4): Clustering", "section": "Exercise 2: Gene expression data", "text": "Exercise 2: Gene expression data\n(CH12Ex13 from statistical learning)\nWe use the Ch12Ex13.csv data to repeat some of the clustering analysis we did.\nThe first 20 samples are from healthy patients, while the second 20 are from a diseased group.\nLoad in the data using read.csv(). You will need to select header=F. Alternatively: load in the data using “Import dataset” in the upper right window, and click “no” on the “Heading” option.\nCarry out both hierarchical clustering and K-means clustering. You should choose the most meaningful number of clusters (think about how many groups of patients we have!). 
Compare the results.\nNote: remember that the data has genes on the rows and patients on the columns. You need to transpose the data so that the orders are reversed.\n\n# load in the data using read.csv(). You will need to select header=F.\ndata <- read.csv(\"data/Ch12Ex13.csv\", header=FALSE)\ndim(data)\n\n[1] 1000 40\n\n# transpose the data, so that we have each row is one patient (subject)\ndata <- t(data) \n\nNow the first 20 rows are measurements from healthy patients (group 0), and 21-50 rows are the disease patients (group 1). We can denote this information in a vector like this.\n\ntrue.groups <- c( rep(0,20), rep(1,20))\n\n\nHierarchical clustering\nYou can use different linkage options and distance metrics of your choosing. For example, with complete linkage the code is like this.\n\ndata.dist <- dist(data) # need to compute the distance matrix\nhclust.df <- hclust(data.dist, method=\"complete\" )\n#alternatives:\n#hclust.df <- hclust( D, method=\"average\" )\n#hclust.df <- hclust( D, method=\"single\" )\n\nWe can keep 2 clusters with cutree. Then do a cross tabulation of the true labels and clustered results: how well do they correspond?\n\n# find the clusters\npredicted <- cutree( hclust.df, k=2 )\n\n# How well does our clustering predict health vs. diseased\ntable(predicted, true.groups )\n\n true.groups\npredicted 0 1\n 1 20 0\n 2 0 20\n\n\n\n\nK-means\nNow you can use K-means to identify 2 clusters.\n\npredicted.kmean <- kmeans(data, 2, nstart=20)$cluster\n# agreement with true label\ntable(predicted.kmean, true.groups )\n\n true.groups\npredicted.kmean 0 1\n 1 20 0\n 2 0 20\n\n\nBoth methods seem to do work decently for the task." + }, + { + "objectID": "lab/lab_day4_clustering.html#exercise-1-food", + "href": "lab/lab_day4_clustering.html#exercise-1-food", + "title": "R Lab (day 4): Clustering", + "section": "Exercise 1: Food", + "text": "Exercise 1: Food\nWe use the same Food.txt data to illustrate two concepts: hierarchical clustering, and heatmap.\nThis is not a genomics dataset, but for the ease of interpretability, we use it for teaching purposes.\nLet us load the dataset.\n\nfood <- read.table('data/Food.txt', header=T)\n# we change the name from pulses to a more common name, legume\ncolnames(food)[7] <- 'Legume'\nhead(food) # print first 6 lines \n\n Meat Pigs Eggs Milk Fish Cereals Legume Fruit\nAlbania 10.1 1.4 0.5 8.9 0.2 42.3 5.5 1.7\nAustria 8.9 14.0 4.3 19.9 2.1 28.0 1.3 4.3\nBelg.Lux. 13.5 9.3 4.1 17.5 4.5 26.6 2.1 4.0\nBulgaria 7.8 6.0 1.6 8.3 1.2 56.7 3.7 4.2\nCzechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 1.1 4.0\nDenmark 10.6 10.8 3.7 25.0 9.9 21.9 0.7 2.4\n\n\nWe scale the data (also called standardize, or normalize sometimes) so that each column (feature, variable) has 0 mean and 1 variance. We call the scaled data food_s.\n\nfood_s <- scale(food)\nhead(food_s) # print first 6 lines\n\n Meat Pigs Eggs Milk Fish\nAlbania 0.08126490 -1.8299828 -2.2437259 -1.15570645 -1.20028213\nAustria -0.27725673 1.6636208 1.2335002 0.39161231 -0.64187467\nBelg.Lux. 1.09707621 0.3604512 1.0504883 0.05401549 0.06348211\nBulgaria -0.60590157 -0.5545403 -1.2371605 -1.24010566 -0.90638347\nCzechoslovakia -0.03824231 0.9427184 -0.1390890 -0.64931122 -0.67126454\nDenmark 0.23064892 0.7763564 0.6844645 1.10900556 1.65053488\n Cereals Legume Fruit\nAlbania 0.9159176 1.2227536 -1.35040507\nAustria -0.3870690 -0.8923886 0.09091397\nBelg.Lux. 
-0.5146342 -0.4895043 -0.07539207\nBulgaria 2.2280161 0.3162641 0.03547862\nCzechoslovakia 0.1869740 -0.9931096 -0.07539207\nDenmark -0.9428885 -1.1945517 -0.96235764\n\n\n\nDistances\nTo do hierarchical clustering, the most convenient command is hclust(). As input you would need a distance between the subjects (patients, or countries in this example). We do it on the scaled data.\nThe command to compute pair-wise distance is dist(). By default, the distance being computed is the Euclidean distance (details optional). Euclidean distance is possibly the most commonly used metric, but there are others. See ?dist() to find out more options.\nWe can present the pair-wise distances in a matrix format. You can see that this matrix is symmetric, with 0 on the diagonal - this should be intuitive: the distance between A - B is the same as B - A, and the distance between A and itself is 0.\n\n# compute distance\nfood_dist <- dist(food_s)\n# round(food_dist, digits = 2) # try this yourself to see what it does\n# alternatively, look at this as a matrix\nfood_dist_matrix <- as.matrix(food_dist)\nround(food_dist_matrix[1:5, 1:5], digits = 2) # first 5 row 5 col\n\n Albania Austria Belg.Lux. Bulgaria Czechoslovakia\nAlbania 0.00 5.95 5.13 2.77 4.44\nAustria 5.95 0.00 2.11 4.71 1.98\nBelg.Lux. 5.13 2.11 0.00 4.45 2.20\nBulgaria 2.77 4.71 4.45 0.00 3.17\nCzechoslovakia 4.44 1.98 2.20 3.17 0.00\n\n\n\n\n\n\n\n\nOptional: Euclidean disance\n\n\n\nYou can check the Euclidean distance between Albania and Austria is indeed 5.95. This distance is the square root of the sum of squared differences between two subjects in all their measurements.\n\n\n\nround(food_s[1:2,], digits = 2) # we only keep first 2 digits\n\n Meat Pigs Eggs Milk Fish Cereals Legume Fruit\nAlbania 0.08 -1.83 -2.24 -1.16 -1.20 0.92 1.22 -1.35\nAustria -0.28 1.66 1.23 0.39 -0.64 -0.39 -0.89 0.09\n\n# take the data for two countries each \nalbania <- round(food_s[1,], digits = 2)\naustria <- round(food_s[2,], digits = 2)\n# compute difference between each col\nd <- albania - austria\nd\n\n Meat Pigs Eggs Milk Fish Cereals Legume Fruit \n 0.36 -3.49 -3.47 -1.55 -0.56 1.31 2.11 -1.44 \n\n# euclidean distance: square each element, sum together, and take a square root\nsqrt(sum(d^2)) \n\n[1] 5.942096\n\n\n\n\nHierarchical clustering\nNow that we have computed the distance food_dist, we plug it in the clustering algorithm, hclust().\nWe try the complete linkage method, by specifying method = 'complete'. The result is saved as hc.complete. You can visualize it, and add label of the country names to make it easier to read.\n\nhc.complete <- hclust(food_dist, method=\"complete\")\nplot(hc.complete, labels=rownames(food), main=\"Complete Linkage\", xlab=\"\", sub=\"\")\n\n\n\n\n\n\nLinkage, dissimilarity, scaling\nHierarchical clustering is a class of methods, and there are a variety of options to set.\n\nLinkage (by seting method inside hclust()): complete, single, average\nDissimilarity: Euclidean, correlation, …\nScaling: scaled data (mean 0 variance 1) or unscaled, original data\n\nThere is no definite guide on which combination works the best, hence you can try them out and see what could make most sense. 
Again, in unsupervised learning data do not have outcome labels, so the interpretation is left for the domain experts to make.\n\n# single linkage\nhc.single <- hclust(food_dist, method=\"single\")\nplot(hc.single, labels=rownames(food), main=\"Single Linkage\", xlab=\"\", sub=\"\")\n\n\n\n# average linkage\nhc.average <- hclust(food_dist, method=\"average\")\nplot(hc.average, labels=rownames(food), main=\"Average Linkage\", xlab=\"\", sub=\"\")\n\n\n\n\n\n# unscaled data, complete linkage\nhc.unscaled <- hclust(dist(food), method=\"complete\")\nplot(hc.unscaled, labels=rownames(food), main=\"Complete linkage with unscaled features\", xlab=\"\", sub=\"\")\n\n\n\n\n\n# correlation as dissimiarity, rather than euclidean distance\ndd <- as.dist(1-cor(t(food_s))) # compute the metric\nhc.corr <- hclust(dd, method=\"complete\") # cluster\nplot(hc.corr, labels=rownames(food), main=\"Complete linkage with correlation-based distance\", xlab=\"\", sub=\"\")\n\n\n\n\n\n\nHeatmap\nHeatmap is a visualization tool to plot data of similar values in similar colors, so that you can identify visualy if there is any pattern. It can also be combined with hierarchical clustering - this is actually the default outcome: dendrograms are displayed for both rows and column.\n\n# make heatmap on the scaled data\nheatmap(food_s)\n\n\n\n\nTo preserve the original ordering of the columns and rows, you can specify Rowv = NA, Colv = NA.\n\n# no clustering for row or col, this preserves the original ordering\nheatmap(food_s, Rowv = NA, Colv = NA)\n\n\n\n\nCan also only do clustering for row only (or column only).\n\n# only clustering for row\nheatmap(food_s, Colv = NA)" + }, + { + "objectID": "lab/lab_day4_clustering.html#exercise-2-nci60", + "href": "lab/lab_day4_clustering.html#exercise-2-nci60", + "title": "R Lab (day 4): Clustering", + "section": "Exercise 2: NCI60", + "text": "Exercise 2: NCI60\nWe look at the NCI60 data again. First load the dataset.\n\nlibrary(ISLR)\n# or, load('data/NCI60.RData')\nnci.labs <- NCI60$labs # Sample labels (tissue type)\nnci.data <- NCI60$data # Gene expression data set\n\n\nHierarchical clustering\nWe start by scaling the data, and calculate the distance matrix (using the Euclidean distance), and then investigate different linkage methods.\n\n# Scale the data to zero mean and unit variance:\nsd.data <- scale(nci.data)\n\n# Calculate the distance matrix \n# equivalent: data.dist <- dist(sd.data, method=\"euclidean\")\ndata.dist <- dist(sd.data)\n\nNext we perform hierarchical clustering with distance matrix as input. The function we use is hclust(). We specify the linkage method to be complete.\nOnce the result is saved in hc.complete object, you can plot the dendrogram.\n\n# Perform clustering\nhc.complete <- hclust(data.dist, method=\"complete\")\n\n# names(hc.complete)\nplot(hc.complete, labels=nci.labs, main=\"Complete Linkage\", xlab=\"\", sub=\"\")\n\n\n\n\nThe object hc.complete contains a lot of information. To get the information, you can use the $ operator.\nYou should refer to the documentation for hclust() to see a complete list of output. 
Use ?hclust to get the documentation on how to use the function.\n\nhc.complete$dist.method # distance method\n\n[1] \"euclidean\"\n\n# hc.complete$merge # order of aggregations of samples / clusters\n# hc.complete$height # distance at which aggregations happen\n# hc.complete$labels # labels (numeric, since we don't know the original categories!)\n# hc.complete$method\n# hc.complete$call\n\nWe can try different linkage methods and see how the clustering results differ. Change the method argument in the function, and plot the results.\n\nhc.average <- hclust(data.dist, method=\"average\")\nhc.single <- hclust(data.dist, method=\"single\")\n\nplot(hc.average, labels=nci.labs, main=\"Average Linkage\", xlab=\"\", sub=\"\")\nplot(hc.single, labels=nci.labs, main=\"Single Linkage\", xlab=\"\", sub=\"\")\n\nNow we focus on complete linkage only.\nFirst, we use cutree() to compare the results when the data are separated into either 2 or 4 clusters.\n\n# Compare 2 clusters and 4 clusters:\nhc.clusters <- cutree(hc.complete, c(2, 4))\nhead(hc.clusters) # print first 6 results\n\n 2 4\nV1 1 1\nV2 1 1\nV3 1 1\nV4 1 1\nV5 1 2\nV6 1 2\n\n# cross tabulation\ntable(hc.clusters[,\"2\"], hc.clusters[,\"4\"])\n\n \n 1 2 3 4\n 1 40 7 0 0\n 2 0 0 8 9\n\n\nIt is more straightforward to check the results with a dendrogram.\n\n# visualize the cuts\n# how do you know where to draw the line? check height\nheights <- hc.complete$height\ntail(heights, 4) # print the last 4\n\n[1] 137.5633 141.2472 142.9218 162.2074\n\nplot(hc.complete, labels=nci.labs, main=\"Complete Linkage\", xlab=\"\", sub=\"\")\nabline(h=140, col=\"red\") # 4 clusters\nabline(h=150, col=\"blue\") # 2 clusters\n\n\n\n\nThe way to interpret the height variable is simple: it is where two clusters are merged into one. For example, the largest cluster corresponds to the last value of height (162.2) - if you check the figure, it is exactly where the horizontal line is merging the two groups. Similarly, 142.9 is where three groups became two, 141.2 is where four groups became three. If you draw a line at 140, it points out the four clusters.\nHow are the labels distributed between clusters? We can focus on 4 cluster situation, and use table() to list out which cancer types is merged in which of the four clusters.\nFor example, breast cancer appears in all but 3rd cluster; melanoma only appears in the first clsuter; so on so forth.\n\ntable(hc.clusters[,\"4\"], nci.labs)\n\n nci.labs\n BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro\n 1 2 3 2 0 0 0 0 0\n 2 3 2 0 0 0 0 0 0\n 3 0 0 0 1 1 6 0 0\n 4 2 0 5 0 0 0 1 1\n nci.labs\n MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN\n 1 8 8 6 2 8 1\n 2 0 1 0 0 1 0\n 3 0 0 0 0 0 0\n 4 0 0 0 0 0 0\n\n\nFinally, we see what happens if we use unscaled data instead of scaled data, or if we use a correlation-based distance metric instead of the Euclidean distance.\nCompare the dendrograms: How different are the resulting clusterings? 
Do you recognise subclusters that are consistent?\n\n# Compare scaled data versus non-scaled data:\nhc.unscaled <- hclust(dist(nci.data), method=\"complete\")\nplot(hc.unscaled, labels=nci.labs, main=\"Complete linkage with unscaled features\", xlab=\"\", sub=\"\")\n\n\n\n# Compare Euclidean distance with correlation-based distance:\ndd <- as.dist(1-cor(t(sd.data)))\nhc.corr <- hclust(dd, method=\"complete\")\nplot(hc.corr, labels=nci.labs, main=\"Complete linkage with correlation-based distance\", xlab=\"\", sub=\"\")\n\n\n\n\n\n\nK-means clustering\nIn this section we explore the K-means clustering on the same dataset.\nIn contrast to the hierarchical clustering which requires a distance as input, with K-means you would provide the data matrix. The data matrix can be scaled (centered and with unit variance), or unscaled.\nIn this example we use scaled data computed from before, sd.data.\n\nset.seed(4) # set random seed\nkm.out4 <- kmeans(sd.data, centers = 4, nstart=20)\nkm.out4$cluster\n\n V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 \n 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 \nV21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 \n 4 4 4 1 1 4 1 4 1 4 4 1 1 3 3 3 3 3 3 3 \nV41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 \n 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 \nV61 V62 V63 V64 \n 2 2 2 2 \n\n\nRead the help file ?kmeans to understand what the argument nstart=20 does. Comparing an analysis with nstart=20 versus nstart=1 demonstrates how the cluster results can be improved if we allow more evaluations with different randomly chosen starting centroids.\nSet a different random seed, say 3 (as long as it’s different from the one you used before), and run the analysis again. This time we use a different nstart\n\n# different starting centroids improve the clustering:\nset.seed(3)\nkm.out <- kmeans(sd.data, centers = 4, nstart=1)\nkm.out$cluster # cluster label\n\n V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 \n 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 \nV21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 \n 1 1 1 3 3 1 3 1 3 1 1 3 3 4 4 4 4 4 4 4 \nV41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 \n 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 \nV61 V62 V63 V64 \n 2 2 2 2 \n\n\n\n\nCompare with hierarchical clustering\n\n# we can directly compare the k-means result (along rows)\n# with the hierarchical clustering result (along columns)\ntable(km.out4$cluster, hc.clusters[,\"4\"], deparse.level=2)\n\n hc.clusters[, \"4\"]\nkm.out4$cluster 1 2 3 4\n 1 11 0 0 9\n 2 9 0 0 0\n 3 0 0 8 0\n 4 20 7 0 0\n\n\nFrom the results, you can see that the results are slightly different between the two methods. Keep in mind that in unsupervised learning you do not have the real outcome label (such as the cancer types here), so you need to try a few different methods and compare the outputs, and make interpretations accordingly.\n\n\nVisualize clusters\nWe can visualise the K-means clustering results of high-dimensional data by using PCA for dimension reduction. 
We plot the first two principal components and colour the data points (= individual cell lines) by their assigned cluster from K-means.\n\n# first, run PCA again on the NCI60 data\npr.out <- prcomp(nci.data, scale=TRUE)\n\n# more cluster options\nkm.out2 <- kmeans(sd.data, 2, nstart=20)\nkm.out3 <- kmeans(sd.data, 3, nstart=20)\n\n# we can now visualise the K-Means results by labelling the data points\n# in a plot of the scores of the first 2 principal components:\npar(mfrow=c(1,3))\nplot(pr.out$x[,1:2], col=(km.out2$cluster+1), main=\"K-Means with K=2\",\n xlab=\"PC 1\", ylab=\"PC 2\", pch=20)\nplot(pr.out$x[,1:2], col=(km.out3$cluster+1), main=\"K-Means with K=3\",\n xlab=\"PC 1\", ylab=\"PC 2\", pch=20)\nplot(pr.out$x[,1:2], col=(km.out4$cluster+1), main=\"K-Means with K=4\",\n xlab=\"PC 1\", ylab=\"PC 2\", pch=20)\n\n\n\n\nCompare with the plot from Exercise 2 yesterday (left panel) along with the cancer type labels. The clusters from K-means seem to correspond decently to partition the data into groups.\n\npar(mfrow=c(1,1))\n\nCols=function(vec){\n cols=rainbow(length(unique(vec)))\n return(cols[as.numeric(as.factor(vec))])\n}\nplot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19,xlab=\"PC 1\",ylab=\" PC 2\")\nlegend('topleft', col=rainbow(length(unique(nci.labs))), legend=unique(nci.labs), bty='n', lwd=2, cex=.6)\n\n\n\n\n\n\nHeatmap\nA heatmap is another way to visualize the clusters from the data. We use the principal components rather than the raw data, as PCs are already explaining a large amount of variability in the over 6000 features.\nSimilar values are presented with similar colors.\n\n## We use the scores of the PCA on the NCI60 data, to reduce dimension\nscores <- pr.out$x\nscores[1:5, 1:5] # first 5 pc, first 5 measurements\n\n PC1 PC2 PC3 PC4 PC5\nV1 -19.68245 3.527748 -9.7354382 0.8177816 -12.511081\nV2 -22.90812 6.390938 -13.3725378 -5.5911088 -7.972471\nV3 -27.24077 2.445809 -3.5053437 1.3311502 -12.466296\nV4 -42.48098 -9.691742 -0.8830921 -3.4180227 -41.938370\nV5 -54.98387 -5.158121 -20.9291076 -15.7253986 -10.361364\n\n# default choices\nheatmap(pr.out$x)\n\n\n\n\nYou can remove the dendrogram on the PCs, only keeping the ones for cancer types. Now you see that the PCs have kept their original order from 1 to 64.\n\n# hc.corr is the result from hclust. check the section on hierarchical clustering\nheatmap(pr.out$x, Rowv = as.dendrogram(hc.corr), Colv = NA)\n\n\n\n\nYou can also reduce the number of PCs, and add titles to the plot annd y-axis.\n\npar(cex.main = .7)\nheatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA,\n labRow = nci.labs, main = 'Heatmap of the scores of the first 40 PCs on the NCI60 data')" + }, + { + "objectID": "lab/lab_day4_clustering.html#exercise-3-gene-expression-data", + "href": "lab/lab_day4_clustering.html#exercise-3-gene-expression-data", + "title": "R Lab (day 4): Clustering", + "section": "Exercise 3: Gene expression data", + "text": "Exercise 3: Gene expression data\n(CH12Ex13 from statistical learning)\nWe use the Ch12Ex13.csv data to repeat some of the clustering analysis we did.\nThe first 20 samples are from healthy patients, while the second 20 are from a diseased group.\nLoad in the data using read.csv(). You will need to select header=F. Alternatively: load in the data using “Import dataset” in the upper right window, and click “no” on the “Heading” option.\nCarry out both hierarchical clustering and K-means clustering. 
You should choose the most meaningful number of clusters (think about how many groups of patients we have!). Compare the results.\nNote: remember that the data has genes on the rows and patients on the columns. You need to transpose the data so that the orders are reversed.\n\n# load in the data using read.csv(). You will need to select header=F.\ndata <- read.csv(\"data/Ch12Ex13.csv\", header=FALSE)\ndim(data)\n\n[1] 1000 40\n\n# transpose the data, so that we have each row is one patient (subject)\ndata <- t(data) \n\nNow the first 20 rows are measurements from healthy patients (group 0), and 21-50 rows are the disease patients (group 1). We can denote this information in a vector like this.\n\ntrue.groups <- c( rep(0,20), rep(1,20))\n\n\nHierarchical clustering\nYou can use different linkage options and distance metrics of your choosing. For example, with complete linkage the code is like this.\n\ndata.dist <- dist(data) # need to compute the distance matrix\nhclust.df <- hclust(data.dist, method=\"complete\" )\n#alternatives:\n#hclust.df <- hclust( D, method=\"average\" )\n#hclust.df <- hclust( D, method=\"single\" )\n\nWe can keep 2 clusters with cutree. Then do a cross tabulation of the true labels and clustered results: how well do they correspond?\n\n# find the clusters\npredicted <- cutree( hclust.df, k=2 )\n\n# How well does our clustering predict health vs. diseased\ntable(predicted, true.groups )\n\n true.groups\npredicted 0 1\n 1 20 0\n 2 0 20\n\n\n\n\nK-means\nNow you can use K-means to identify 2 clusters.\n\npredicted.kmean <- kmeans(data, 2, nstart=20)$cluster\n# agreement with true label\ntable(predicted.kmean, true.groups )\n\n true.groups\npredicted.kmean 0 1\n 1 20 0\n 2 0 20\n\n\nBoth methods seem to do work decently for the task." } ] \ No newline at end of file diff --git a/exam/MED3007_exam_data_V21.Rdata b/exam/MED3007_exam_data_V21.Rdata new file mode 100644 index 0000000..dbd3fa8 Binary files /dev/null and b/exam/MED3007_exam_data_V21.Rdata differ diff --git a/exam/MED3007_exam_simulation.R b/exam/MED3007_exam_simulation.R new file mode 100644 index 0000000..4b53e4b --- /dev/null +++ b/exam/MED3007_exam_simulation.R @@ -0,0 +1,198 @@ +###------------------------------### +### MED3007: Exam simulation ### +###------------------------------### + +## Load exam data set using "load()" +load("data/MED3007_exam_data_V21.Rdata") + +dim(clin) +dim(expr) +head(clin) + +## Remember: we want to have the variables on the columns in our datasets, i.e. we want +## there to be 72 rows (the number of patients) for both datasets "expr" and "clin". +## For "expr" we should have 7129 columns, since they are the genes, a.k.a our variables. 
+ +# Use function t() to transpose the dataset so that it is on the correct form +data <- t(expr) # rename "expr" to "data" + +# Make group variable (here that is whether a patient has ALL or AML) +groups <- clin$ALL.AML + +## There are 2 groups: one possible analysis -> testing +##----------------------------------------------------- + +# Apply t-test to all genes in each of the groups using "apply" and "t.test" -> histogram +alpha <- 0.05 +pval.ttest <- apply(data, 2, function(x){t.test(x[which(groups=="ALL")], x[which(groups=="AML")])$p.value}) +hist(pval.ttest) + +# Simple way of calculation adjusted p-values (using p.adjust()): +pval.fwer <- p.adjust(pval.ttest, method = "bonferroni") +pval.fdr <- p.adjust(pval.ttest, method = "BH") + +# Number of significant p-values after Bonferroni correction +sum(pval.fwer < alpha) # conservative + +# Number of significant p-values after BH correction +sum(pval.fdr < alpha) # less conservative + +# Let us find the significant genes after correction +sign.genes.Bonf <- which(pval.fwer < alpha) +sign.genes.BH <- which(pval.fdr < alpha) + +# compute the group means using "apply" and "tapply" +genes.gr.means <- apply(data, 2, function(x){tapply(x, groups, mean)}) + +# Plot of the overall group means with significant genes +plot(genes.gr.means[1,], xlab = 'Genes', main = 'Significant genes, group means', + ylab = 'Mean expression', ylim = range(genes.gr.means)) +points(genes.gr.means[2,], col=2) +points(sign.genes.BH, genes.gr.means[1,sign.genes.BH], col=4, pch=4, lwd=2) +points(sign.genes.BH,genes.gr.means[2,sign.genes.BH], col=4, pch=4, lwd=2) +points(sign.genes.Bonf, genes.gr.means[1,sign.genes.Bonf], col=5, pch=4, lwd=2) +points(sign.genes.Bonf,genes.gr.means[2,sign.genes.Bonf], col=5, pch=4, lwd=2) +abline(h=0) +legend(6500, 25000, col=5:4, bty='n', lwd=2, pt.lwd=.5, legend = c('Bonferroni','B-H'), cex=.6) # add legend + +# a slightly better plot: standardized group mean differences, with significant genes +genes.gr.diff <- apply(genes.gr.means, 2, diff)/apply(data, 2, sd) +plot(genes.gr.diff, xlab = 'Genes', main = 'Significant genes, standardized group differences', + ylab = 'difference in group means / std dev') +points(sign.genes.BH,genes.gr.diff[sign.genes.BH], col=4, pch=4, lwd=2) +points(sign.genes.Bonf,genes.gr.diff[sign.genes.Bonf], col=5, pch=4, lwd=2) +abline(h=0) +legend(6500, 1.8, col=5:4, bty='n', lwd=2, pt.lwd=.5, legend = c('Bonferroni','B-H'), cex=.6) # add legend + + + + +## There are 2 groups: another possible analysis -> clustering +##-------------------------------------------------------- + +# Calculate the distance matrix, here with the Euclidean distance +data.dist <- dist(data, method = "euclidean") + +# Perform clustering with different linkage methods: +hc.complete <- hclust(data.dist, method="complete") +hc.average <- hclust(data.dist, method="average") +hc.single <- hclust(data.dist, method="single") + +## Compare the dendrograms and comment on the results in light of how +## the different linkage methods work. +par(mfrow=c(1,3)) +plot(hc.single, labels=groups, main="Single Linkage", xlab="", sub="") +plot(hc.complete, labels=groups, main="Complete Linkage", xlab="", sub="") +plot(hc.average, labels=groups, main="Average Linkage", xlab="", sub="") + +# How to decide between linkage methods? We look for the most "compact" clusters, where +# the division between the new branches are a bit "balanced". 
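+
+# (Optional sketch, added for illustration) Besides re-drawing the dendrograms with
+# rect.hclust() below, a quick numeric check is to cut each tree into 2 groups and
+# compare the cluster sizes; a linkage that puts almost everything into one cluster is
+# usually less informative than one that gives reasonably balanced groups.
+sapply(list(single = hc.single, complete = hc.complete, average = hc.average),
+       function(hc) table(cutree(hc, k = 2)))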
+par(mfrow=c(1,3)) +plot(hc.single, main='Euclidian single', xlab='', labels=F, sub='') +rect.hclust(hc.single, k=2) +rect.hclust(hc.single, k=3) +rect.hclust(hc.single, k=4) +plot(hc.complete, main='Euclidean complete',xlab='', labels=F, sub='') +rect.hclust(hc.complete, k=2) +rect.hclust(hc.complete, k=3) +rect.hclust(hc.complete, k=4) +plot(hc.average, main='Euclidean average',xlab='', labels=F, sub='') +rect.hclust(hc.average, k=2) +rect.hclust(hc.average, k=3) +rect.hclust(hc.average, k=4) + +## The different linkages seems to give very different results! +## Complete linkage seems to give 2 clusters/groups, which we know to be the truth. +## It is also reassuring to see that when we try to divide the patients into more than 2 groups, +## we do not gain anything. Although the other two linkage methods seems to say there +## are no groups at all, only one big group. +## Let's see if we detect the true groups for the clustering result from complete linkage. + +# We use the function "cutree" to "cut" the dendrogram into both k=2 and k=3 groups +cluster.ec <- cutree(hc.complete, k=c(2,3)) # complete linkage, euclidian + +# How are the true labels distributed between clusters: +table(cluster.ec[,"2"], groups) # look only at k=2 + +# Hierarchical clustering was not able to detect the groups very well.. Let's try K-Means. + +## K-means clustering +set.seed(4) +km.out2 <- kmeans(data, 2, nstart=20) # 2 clusters +km.out3 <- kmeans(data, 3, nstart=20) # 3 clusters + +# We can directly compare the k-means result (along rows) +# with the hierarchical clustering result (along columns) +table(km.out2$cluster, cluster.ec[,"2"], deparse.level=2) +table(km.out3$cluster, cluster.ec[,"3"], deparse.level=2) +# Not too bad, but still some disagreement + +# How are the true labels distributed between clusters: +table(km.out2$cluster, groups) + +# Not very good here either, maybe even worse than hierarchcial clustering..? + + +## Also dimensional reduction via PCA can be an option +##---------------------------------------------------- + +pr.out <- prcomp(data, scale=TRUE) + +# Plot also the proportion of variance explained +# Proportion of variance explained +pr.var <- pr.out$sdev^2 +pve <- pr.var/sum(pr.var) +pve <- 100*pve +par(mfrow=c(1,2)) +plot(pve, type="o", ylab="PVE", xlab="Principal Component", col="blue") +plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="red") + +# Can also try to plot only the first 10 components in the first plot to zoom in a bit +par(mfrow=c(1,1)) +plot(1:10,pve[1:10], type="o", ylab="PVE", xlab="Principal Component", col="blue") +# -> the first two components seem to explain a lot +# alternative: selecting the number of components by having a threshold on pve, f.ex 80% or 70% + +# Helper function for colors (for the different groups) +Cols=function(vec){ + cols=rainbow(length(unique(vec))) + return(cols[as.numeric(as.factor(vec))]) +} + +# Plot the different components (color for each group) +par(mfrow=c(1,2)) +plot(pr.out$x[,1:2], col=Cols(groups), pch=19, xlab="PC 1", ylab=" PC 2") +legend('bottomright', col=rainbow(length(unique(groups))), legend=paste('group ',unique(groups),sep=''), + bty='n', lwd=1, pt.lwd=.3, cex=.6) +plot(pr.out$x[,c(1,3)], col=Cols(groups), pch=19,xlab="PC 1",ylab=" PC 3") + +# Still not very clear, but the sub-plot on the right seems to show a small tendency towards +# two groups..? 
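+
+# (Optional sketch, added for illustration) Before moving on: to put a rough number on
+# "how well did the clustering above recover the two groups", one can turn a 2x2
+# confusion table into an agreement rate. Cluster labels are arbitrary (cluster 1 can be
+# either ALL or AML), so we take the better of the two possible label matchings.
+agreement <- function(tab) max(sum(diag(tab)), tab[1,2] + tab[2,1]) / sum(tab)
+agreement(table(cluster.ec[,"2"], groups))  # hierarchical clustering vs. true labels
+agreement(table(km.out2$cluster, groups))   # k-means vs. true labels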
+ +# Extra: hierarchical clustering *after* a PCA + +# Hierarchical clustering on the SCORES (weights) of the first 3 PCs +data.dist.pca3 <- dist(pr.out$x[,1:3], method = 'manhattan') +hclust.df <- hclust(data.dist.pca3, method="complete" ) + +par(mfrow=c(1,1)) +plot(hclust.df, labels=groups) +rect.hclust(hclust.df, k=4) + +predicted <- cutree(hclust.df, k=2) +table(predicted, groups) + +# K-means clustering on the SCORES (weights) of the first 3 PCs +predicted.kmean <- kmeans(pr.out$x[,1:3], 2, nstart=20)$cluster +table(predicted.kmean, groups) + +## Possible comment / conclusion: +## -> this last clustering on the PCA is explaining +## the Leukemia groups much better! + +## Remember: you do NOT need to do all of these analyses on the exam! One is enough. + +## Final note: why are the methods not working so well on this dataset?? These types of results are +## actually quite common, since real datasets are often quite messy and does not necessarily have +## a clear grouping structures. So: report what you see, even if the results are not "pretty". + diff --git a/exam/MED3007_exam_v21.pdf b/exam/MED3007_exam_v21.pdf new file mode 100644 index 0000000..f314d93 Binary files /dev/null and b/exam/MED3007_exam_v21.pdf differ diff --git a/exam/MED3007_summary.R b/exam/MED3007_summary.R new file mode 100644 index 0000000..f9ccfbf --- /dev/null +++ b/exam/MED3007_summary.R @@ -0,0 +1,248 @@ +###----------------------------------------------### +### MED3007: Summary of methods with RStudio ### +###----------------------------------------------### + +### OUTLINE +###--------- + +### 0. Statistical testing +### 1. Multiple testing with correction (Lab 1) +### 2. Principal component analysis (Lab 2) +### 3. Clustering (Lab 3) +### 3.1. Hierarchical clustering +### 3.2. K-means clustering + +# Set the working directory: +setwd("~/Dropbox_UiO/Dropbox/MED3007_2023/day 5") + +###-------------------------### +### 0. STATISTICAL TESTING ### +###-------------------------### + +library(readxl) # reading from excel can be done using the "readxl" package, so we need to load it first +data_exc <- read_excel("data/Testfil_Rcourse.xlsx") +data_exc <- as.data.frame(data_exc) # remember to always force your excel-dataset to be a dataframe! + +# A single t-test can be performed the following way (given that it is normally distributed) +out <- t.test(vitD_v1 ~ gender, data_exc) +# ..where we test if vitD_v1 is different across gender groups +out$p.value # print out the p-value from the t-test +out$conf.int # can also get the confidence interval +out + +# However, there are other tests you can do with similar syntax: +?wilcox.test() # nonparametric (non-normal data) +?chisq.test() +# .. and so on + +# Summary statistics +summary(data_exc) + + +###----------------------### +### 1. MULTIPLE TESTING ### +###----------------------### + +## Steps: +## 1. Load data +## 2. Make sure you have the variables (that you want to test) column-wise (look at the data!!) +## 3. Compute the p-values for all the variables you want to test. Nice to visualize it with a histogram! +## 4. To do the correction: +## a) Bonferroni: compute new (stricter) significance level: alpha/k, k=number of tests. +## You keep the original p-values < alpha/k (these are considered significant under Bonf. correction) +## b) B-H: sort the p-values from small to large, compute the thresholds for each p-value. +## You keep the original p-values < threshold. 
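+
+# (Optional toy sketch, added for illustration) A tiny made-up example of step 4b above,
+# showing what p.adjust(..., method = "BH") does behind the scenes. The p-values below
+# are invented purely for illustration.
+p.toy  <- c(0.001, 0.008, 0.039, 0.041, 0.30)  # already sorted, k = 5 tests
+bh.thr <- (1:5) / 5 * 0.05                     # B-H thresholds: (i/k) * alpha
+p.toy <= bh.thr                                # strictly, B-H keeps all p-values up to
+                                               # the largest index that passes
+p.adjust(p.toy, method = "BH") < 0.05          # the built-in adjustment gives the same decision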
+ +# Load in the data +data_csv <- read.csv("data/Ch10Ex11.csv", header=F) # data contains 40 samples of 1000 genes + +# Always good to have a look at the data +View(data_csv) + +# Remember to check if the variables (i.e genes) are on the columns! +dim(data_csv) + +# Need to transform the data so we have the variables as columns +data_csv <- t(data_csv) + +# Grouping variable: the samples were grouped into healthy and diseased patients +groups <- c(rep(1,20), rep(2,20)) + +# Need to decide on a target significance level +alpha <- .05 + +# Apply t-test to all genes in each of the groups using "apply" and "t.test" -> histogram +pval.ttest <- apply(data_csv, 2, function(x){t.test(x[which(groups==1)], x[which(groups==2)])$p.value}) +hist(pval.ttest) + +# Simple way of calculation adjusted p-values (using p.adjust()): +pval.fwer <- p.adjust(pval.ttest, method = "bonferroni") +pval.fdr <- p.adjust(pval.ttest, method = "BH") + +# Number of significant p-values after Bonferroni correction +sum(pval.fwer < alpha) # conservative + +# Number of significant p-values after BH correction +sum(pval.fdr < alpha) + +# Let us find the significant genes after correction +sign.genes.Bonf <- which(pval.fwer < alpha) +sign.genes.BH <- which(pval.fdr < alpha) + +# Plot of the overall means with significant genes +genes.means <- apply(data_csv, 2, mean) # compute the means using "apply" +plot(genes.means, xlab = 'Genes', main = 'Mean across samples', ylab = 'Mean expression') +points(sign.genes.Bonf, genes.means[sign.genes.Bonf], col=3, pch=16) +points(sign.genes.BH, genes.means[sign.genes.BH], col=4, pch=16) +legend('topright', col=3:4, bty='n', lwd=2, legend = c('Bonferroni','B-H')) # add colorcode + + +###-------------------------### +### 2. PCA: GENOMIC EXAMPLE ### +###-------------------------### + +## Steps: +## 1. Load data +## 2. Make sure you have the variables column-wise (look at the data!!) +## 3. Do PCA with "prcomp", and we typically set "scale=TRUE" inside "prcomp" +## 4. Visualize the results: +## a) Compute and plot proportion of variance explained (PVE), try to decide on +## how many principal components you would choose. +## b) Plot the first principal components. What do you see? +## Extra: add a color for the group labels (if they exist). + +# Do PCA with "prcomp" +pr.out <- prcomp(data_csv, scale=TRUE) + +# Proportion of variance explained +pr.var <- pr.out$sdev^2 +pve <- pr.var/sum(pr.var) +pve <- 100*pve +par(mfrow=c(1,2)) +plot(pve, type="o", ylab="PVE", xlab="Principal Component", col="blue") +plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="red") + +# How many principal components would you keep to achieve a good dimension reduction, +# while keeping most of the variability in the data set? +mysel80 <- which(cumsum(pve) > 80)[1] # explains 80% of the variability +mysel70 <- which(cumsum(pve) > 70)[1] # explains 70% of the variability + +# Setting a threshold at 80% PVE is very common, which would here result in 29 +# components (much more managable than 1000!) +# Now what? +# If we decide to keep 29 components, we can for ex. detect highly expressed genes by +# inspecting the different principal components (the very first principal components +# should typically have high weights (both positive and negative weight) for the most +# interesting variables/genes, and so on..). 
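+
+# (Optional sketch, added for illustration) One way to do such an inspection: list the
+# genes with the largest absolute weights (loadings) on the first principal component.
+# "pr.out" is the prcomp object already created above in this script.
+pc1.loadings <- pr.out$rotation[, 1]
+head(sort(abs(pc1.loadings), decreasing = TRUE), 10)  # the 10 genes contributing most to PC 1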
+ +# Visualize results +# Helper function for colors (for the different groups) +Cols=function(vec){ + cols=rainbow(length(unique(vec))) + return(cols[as.numeric(as.factor(vec))]) +} + +# Plot the different components (color for each group) +par(mfrow=c(1,2)) +plot(pr.out$x[,1:2], col=Cols(groups), pch=19, xlab="PC 1", ylab=" PC 2") +plot(pr.out$x[,c(1,3)], col=Cols(groups), pch=19,xlab="PC 1",ylab=" PC 3") +legend('topleft', col=rainbow(length(unique(groups))), legend=paste('group ',unique(groups),sep=''), bty='n', lwd=2, cex=.6) + + +###---------------### +### 3. CLUSTERING ### +###---------------### + +## Steps: +## 1. Load data +## 2. Make sure you have the variables column-wise (--> look at the data!!) +## 3.1 Hierarchical clustering: +## a) Compute the distance matrix for your dataset using "dist" with a chosen distance measure, +## for ex. "euclidian" (but also good to try several distances) +## b) Do the clustering with "hclust" with a linkage method, for ex. "complete" (but also here +## good to try several methods and chose the best at the end). +## c) Plot the dendograms to decide which method to chose, and how many clusters to chose. +## d) Use "cutree" to cut the dendogram so that you finally get the clustering of the data. +## Now you can f.ex compare the results from different methods, or, if you already have +## the group labels, you can now see if you were able to detect them! +## 3.2 K-means clustering: +## a) Set a random seed +## b) Run clustering with "kmeans" where you need to specify: the number of clusters, and number +## of initial starts. + + +## Hierarchical clustering +# We continue to use the same dataset, however with the samples shuffled +# (since they are now ordered by group) +myshuffle <- sample(dim(data_csv)[1]) +data_csv <- data_csv[myshuffle,] + +# Calculate the distance matrix (default = Euclidean): +data.dist <- dist(data_csv) + +# Alternatively, try all distances +data.dist.e <- dist(data_csv, method="euclidean") # this is now exactly the same as data.dist +data.dist.c <- dist(data_csv, method="canberra") +data.dist.m <- dist(data_csv, method="manhattan") + +# Perform clustering with different linkage methods: +hc.complete <- hclust(data.dist, method="complete") +hc.average <- hclust(data.dist, method="average") +hc.single <- hclust(data.dist, method="single") + +## Compare the dendrograms and comment on the results in light of how +## the different linkage methods work. +par(mfrow=c(1,3)) +plot(hc.single, labels=groups, main="Single Linkage", xlab="", sub="") +plot(hc.complete, labels=groups, main="Complete Linkage", xlab="", sub="") +plot(hc.average, labels=groups, main="Average Linkage", xlab="", sub="") + +# How to decide between linkage methods? We look for the most "compact" clusters. +par(mfrow=c(1,3)) +plot(hc.single, main='Euclidian single', xlab='', labels=F, sub='') +rect.hclust(hc.single, k=2) +rect.hclust(hc.single, k=3) +rect.hclust(hc.single, k=4) +plot(hc.complete, main='Euclidean complete',xlab='', labels=F, sub='') +rect.hclust(hc.complete, k=2) +rect.hclust(hc.complete, k=3) +rect.hclust(hc.complete, k=4) +plot(hc.average, main='Euclidean average',xlab='', labels=F, sub='') +rect.hclust(hc.average, k=2) +rect.hclust(hc.average, k=3) +rect.hclust(hc.average, k=4) + +# We can see that we do not gain very much by having 3 clusters, compared to 2.. But we +# can still compare the results. 
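+
+# (Optional sketch, added for illustration) A quick way to see what the extra cluster
+# "buys" us: compare the cluster sizes when the complete-linkage tree is cut into 2 vs.
+# 3 groups. If the third cluster only peels off a handful of samples, that supports k = 2.
+table(cutree(hc.complete, k = 2))
+table(cutree(hc.complete, k = 3))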
+ +# We use the function "cutree" to "cut" the dendrogram at a given number of groups +# Let's try with both k=2 and k=3 and see what is best +cluster.ec <- cutree(hc.complete, k=c(2,3)) # complete linkage, euclidian +cluster.ea <- cutree(hc.average, k=c(2,3)) # average linkage, euclidian +# Compare the two clustering results with each other with "table" +table(cluster.ec[,"2"], cluster.ea[,"2"]) # they completely agree! +table(cluster.ec[,"3"], cluster.ea[,"3"]) # here there are some minor disagreements --> conclusion: k=2!! + +# How are the true labels distributed between clusters: +table(cluster.ec[,"2"], groups[myshuffle]) # completely correct, the groups are detected! + +## K-means clustering +set.seed(4) +km.out2 <- kmeans(data_csv, 2, nstart=20) # 2 clusters +km.out3 <- kmeans(data_csv, 3, nstart=20) # 3 clusters + +# We can print the cluster values +km.out2$cluster + +# We can also directly compare the k-means result (along rows) +# with the hierarchical clustering result (along columns) +table(km.out2$cluster, cluster.ec[,"2"], deparse.level=2) # complete agreement! +table(km.out3$cluster, cluster.ec[,"3"], deparse.level=2) + +# How are the true labels distributed between clusters: +table(km.out2$cluster, groups[myshuffle]) # completely correct, the groups are detected! + +# Since there seems to be overall best agreement on the clusters when k=2 across all methods, we +# can we be fairly certain that this is the true clustering. However, in many other situations +# you will not be so certain. So: you should try several methods, and chose the result that seems +# to be most consistent across methods. \ No newline at end of file diff --git a/index.qmd b/index.qmd index b177db5..c8145eb 100644 --- a/index.qmd +++ b/index.qmd @@ -71,7 +71,7 @@ If there is a conflict of information, please refer to the official schedule. |:-------------:|:------------------------------------:|:-----------------:| | 9:00 - 9:45 | Lab: Clustering and heatmaps | | | 10:00 - 11:30 | Exercises | [Lab: clustering](https://ocbe-uio.github.io/course_med3007/lab/lab_day4_clustering.html), [Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R) | -| 13:00 - 14:00 | Take-home exam simulation | | +| 13:00 - 14:00 | Take-home exam simulation | [Exam]() | | 14:00 - 16:00 | Q&A | | diff --git a/lab/code/MED3007_Lab3_clustering.R b/lab/code/MED3007_Lab3_clustering.R index a95f586..7f1c9e5 100644 --- a/lab/code/MED3007_Lab3_clustering.R +++ b/lab/code/MED3007_Lab3_clustering.R @@ -1,5 +1,84 @@ # code for clustering +# food ---- +# food example for clustering +# the clustered groups are easier to interpret, hence we use it for teaching + +# load data, and change the title of pulse to Legume +food <- read.table('./lab/data/Food.txt', header=T) +colnames(food)[7] <- 'Legume' +head(food) # first 6 lines of code + +# the food data has columns (features) that have quite big difference +# Scale the data to zero mean and unit variance, call it food_s +food_s <- scale(food) + + +## hierarchical clustering ---- + +# HC requires distances (not original data) as input +# Calculate the distance matrix on the scaled data +# (default method for distance = Euclidean) +food_dist <- dist(food_s) +# food_dist <- dist(food_s, method="euclidean") +# food_dist + +# food_dist is a vector of 300 elements +length(food_dist) # 300 + +# this is the number of pairs of data +# we have 25 data points, one for each country +# Albania vs Austria, Albania vs Belg.Lux, ... 
+# in total you have 25 * 24 / 2 pairs + + + +# Perform clustering with different linkage methods: +hc.complete <- hclust(food_dist, method="complete") +plot(hc.complete, labels=rownames(food), main="Complete Linkage", xlab="", sub="") + +# single linkage +hc.single <- hclust(food_dist, method="single") +plot(hc.single, labels=rownames(food), main="Single Linkage", xlab="", sub="") + +# average linkage +hc.average <- hclust(food_dist, method="average") +plot(hc.average, labels=rownames(food), main="Average Linkage", xlab="", sub="") + + +# unscaled data, complete linkage +hc.unscaled <- hclust(dist(food), method="complete") +plot(hc.unscaled, labels=rownames(food), main="Complete linkage with unscaled features", xlab="", sub="") + + +# correlation on scaled data, complete linkage +cor(food_s) # this is by col (food) +cor(t(food_s)) # by country + +# 1 minus makes them all positive +dd <- as.dist(1-cor(t(food_s))) +hc.corr <- hclust(dd, method="complete") +plot(hc.corr, labels=rownames(food), main="Complete linkage with correlation-based distance", xlab="", sub="") + + +## heatmap ---- +# heatmap (default) +# the default heatmap does clustering for both row and col variables +# dendrograms are also shown +heatmap(food_s) + +# you can specify the arguments so that no clustering is done +# the original order of col and row are kept +# heatmap with no clustering +heatmap(food_s, Rowv = NA, Colv = NA) + +# can also do clustering for only row (or col) +heatmap(food_s, Colv = NA) + + + + +# _________ ---- # NCI 60 ---- library(ISLR) nci.labs <- NCI60$labs # Sample labels (tissue type) @@ -148,45 +227,25 @@ plot(pr.out$x[,1:2], col=(km.out4$cluster+1), main="K-Means with K=4", # heatmap ---- - +# find out how to set argument for heatmap by reading the documentation ?heatmap -## let create a small heatmap, to fix ideas. -## We use the scores of the PCA on the NCI60 data, to reduce dimension +# We use the scores of the PCA on the NCI60 data, to reduce dimension -# default choices +# default heatmap (clustering done on both col and row) heatmap(pr.out$x) -# I use the previous dendrogram for better ordering of the patients, -# and I remove the dendrogram for the components +# clustering (on correlation) done for rows heatmap(pr.out$x, Rowv = as.dendrogram(hc.corr), Colv = NA) # I now plot less components for the sake of clarity, -# I add the patient's tumor type, and I give a title +# I add tumor type, and I give a title par(cex.main = .7) heatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA, labRow = nci.labs, main = 'Heatmap of the scores of the first 40 PCs on the NCI60 data') -## elbow plot - -# names(km.out2) -# # the within cluster sum-of-squares is within the object "withinss" -# # but we need to run much more k-means in order to decide... -# -# which.clust <- 1:15 -# within.clust.var <- NULL -# for(k in which.clust){ -# myresult <- mean(kmeans(sd.data, k, nstart=10)$withinss) -# within.clust.var <- c(within.clust.var, myresult) -# } -# # let's plot the values and look for the elbow -# par(mfrow=c(1,1)) -# plot(which.clust, within.clust.var, type = 'b', lwd = 2, -# xlab = 'number of clusters', -# ylab = 'within-cluster sum-of-squares', -# main = 'k-means clustering of NCI60 data') @@ -195,6 +254,7 @@ heatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA, # CH10Ex11 ---- # load in the data using read.csv(). You will need to select header=F. +# set the right path to load the data! 
 data <- read.csv("lab/data/Ch12Ex13.csv", header=FALSE)
 data <- t(data) # want each row to represent a sample ... should have n=40 samples/rows
@@ -216,6 +276,7 @@ table(predicted, true.groups )  # very well!!
 
 # kmeans ----
 
+
 predicted.kmean <- kmeans(data, 2, nstart=20)$cluster
 table(predicted.kmean, true.groups )  # also very well!
 
diff --git a/lab/lab_day4_clustering.qmd b/lab/lab_day4_clustering.qmd
index f94d130..ec09e09 100644
--- a/lab/lab_day4_clustering.qmd
+++ b/lab/lab_day4_clustering.qmd
@@ -10,10 +10,190 @@ Download datasets [here](https://github.com/ocbe-uio/course_med3007/tree/main/la
 
 R script: [Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R)
 
-Presentation: [Slides]()
+Presentation: [Slides](presentation/Lab_clustering.pdf)
 
+## Exercise 1: Food
 
-## Exercise 1: NCI60
+We use the same `Food.txt` data as in the PCA lab to illustrate two concepts: hierarchical clustering and heatmaps.
+
+This is not a genomics dataset, but it is easy to interpret, so we use it for teaching purposes.
+
+Let us load the dataset.
+
+```{r}
+#| label: hc-food-loaddata
+#| warning: false
+#| echo: true
+
+food <- read.table('data/Food.txt', header=T)
+# we change the name from pulses to a more common name, legume
+colnames(food)[7] <- 'Legume'
+head(food) # print first 6 lines
+```
+
+We **scale** the data (sometimes also called standardizing or normalizing) so that each column (feature, variable) has mean 0 and variance 1. We call the scaled data `food_s`.
+
+```{r}
+#| label: hc-food-scale
+#| warning: false
+#| echo: true
+
+food_s <- scale(food)
+head(food_s) # print first 6 lines
+```
+
+
+
+### Distances
+
+To do hierarchical clustering, the most convenient command is `hclust()`. As input you need the pairwise **distances** between the subjects (patients, or countries in this example). We compute them on the scaled data.
+
+The command to compute pairwise distances is `dist()`. By default it computes the Euclidean distance (*details optional*). Euclidean distance is probably the most commonly used metric, but there are others. See `?dist()` for more options.
+
+We can present the pairwise distances in matrix format. You can see that this matrix is symmetric, with 0 on the diagonal. This should be intuitive: the distance between A and B is the same as between B and A, and the distance between A and itself is 0.
+
+
+```{r}
+#| label: hc-food-dist
+#| warning: false
+#| echo: true
+
+# compute distance
+food_dist <- dist(food_s)
+# round(food_dist, digits = 2) # try this yourself to see what it does
+# alternatively, look at this as a matrix
+food_dist_matrix <- as.matrix(food_dist)
+round(food_dist_matrix[1:5, 1:5], digits = 2) # first 5 rows and 5 columns
+```
+
+::: callout-note
+## Optional: Euclidean distance
+
+You can check that the Euclidean distance between Albania and Austria is indeed 5.95. This distance is the square root of the sum of squared differences between two subjects across all their measurements.
+
+:::
+
+```{r}
+#| label: hc-food-dist2
+#| warning: false
+#| echo: true
+
+round(food_s[1:2,], digits = 2) # we only keep the first 2 digits
+# take the data for the two countries
+albania <- round(food_s[1,], digits = 2)
+austria <- round(food_s[2,], digits = 2)
+# compute the difference for each column
+d <- albania - austria
+d
+# euclidean distance: square each element, sum them up, and take the square root
+sqrt(sum(d^2))
+```
+
+
+### Hierarchical clustering
+
+Now that we have computed the distances `food_dist`, we plug them into the clustering algorithm, `hclust()`.
+
+We try the complete linkage method by specifying `method = 'complete'`. The result is saved as `hc.complete`. You can visualize it, adding the country names as labels to make it easier to read.
+
+```{r}
+#| label: hc-food-complete
+#| warning: false
+#| echo: true
+
+hc.complete <- hclust(food_dist, method="complete")
+plot(hc.complete, labels=rownames(food), main="Complete Linkage", xlab="", sub="")
+```
+
+
+### Linkage, dissimilarity, scaling
+
+Hierarchical clustering is a class of methods, and there are a variety of options to set.
+
+* Linkage (by setting `method` inside `hclust()`): complete, single, average
+* Dissimilarity: Euclidean, correlation, ...
+* Scaling: scaled data (mean 0, variance 1) or unscaled, original data
+
+There is no definitive guide on which combination works best, so try them out and see which makes the most sense. Again, in unsupervised learning the data have no outcome labels, so the interpretation is left to the domain experts.
+
+
+```{r}
+#| label: hc-food-linkage
+#| warning: false
+#| echo: true
+
+# single linkage
+hc.single <- hclust(food_dist, method="single")
+plot(hc.single, labels=rownames(food), main="Single Linkage", xlab="", sub="")
+
+# average linkage
+hc.average <- hclust(food_dist, method="average")
+plot(hc.average, labels=rownames(food), main="Average Linkage", xlab="", sub="")
+```
+
+
+
+```{r}
+#| label: hc-food-unscaled
+#| warning: false
+#| echo: true
+
+# unscaled data, complete linkage
+hc.unscaled <- hclust(dist(food), method="complete")
+plot(hc.unscaled, labels=rownames(food), main="Complete linkage with unscaled features", xlab="", sub="")
+```
+
+
+```{r}
+#| label: hc-food-cor
+#| warning: false
+#| echo: true
+
+# correlation as dissimilarity, rather than Euclidean distance
+dd <- as.dist(1-cor(t(food_s))) # compute the metric
+hc.corr <- hclust(dd, method="complete") # cluster
+plot(hc.corr, labels=rownames(food), main="Complete linkage with correlation-based distance", xlab="", sub="")
+```
+
+
+### Heatmap
+
+A heatmap is a visualization tool that plots similar values in similar colors, so that you can visually identify whether there is any pattern. It can also be combined with hierarchical clustering - this is in fact the default behaviour: dendrograms are displayed for both rows and columns.
+
+```{r}
+#| label: hc-food-heatmap1
+#| warning: false
+#| echo: true
+
+# make heatmap on the scaled data
+heatmap(food_s)
+```
+
+To preserve the original ordering of the columns and rows, you can specify `Rowv = NA, Colv = NA`.
+
+```{r}
+#| label: hc-food-heatmap2
+#| warning: false
+#| echo: true
+
+# no clustering for row or col, this preserves the original ordering
+heatmap(food_s, Rowv = NA, Colv = NA)
+```
+
+You can also cluster only the rows (or only the columns).
+ +```{r} +#| label: hc-food-heatmap3 +#| warning: false +#| echo: true + +# only clustering for row +heatmap(food_s, Colv = NA) +``` + + + +## Exercise 2: NCI60 We look at the NCI60 data again. First load the dataset. @@ -190,8 +370,7 @@ Set a different random seed, say 3 (as long as it's different from the one you u # different starting centroids improve the clustering: set.seed(3) km.out <- kmeans(sd.data, centers = 4, nstart=1) -km.out$cluster -km.out$tot.withinss +km.out$cluster # cluster label ``` ### Compare with hierarchical clustering @@ -295,7 +474,7 @@ heatmap(pr.out$x[,1:40], Rowv = as.dendrogram(hc.corr), Colv = NA, -## Exercise 2: Gene expression data +## Exercise 3: Gene expression data (CH12Ex13 from statistical learning) diff --git a/lab/overview.qmd b/lab/overview.qmd index 56bab37..ef7f7f5 100644 --- a/lab/overview.qmd +++ b/lab/overview.qmd @@ -37,7 +37,7 @@ day 4 (clustering) | Day 2 | [Day 2: Multiple testing](lab_day2_testing.qmd) | [Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab1.R), [Code (solution)](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab1_exercise_solution.R) | | | Day 3 | [Day 3: Principal Component Analysis](lab_day3_pca.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab2_pca.R) | [Slides](presentation/Lab_pca.pdf) | -| Day 4 | [Day 4: Clustering](lab_day4_clustering.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R) | | +| Day 4 | [Day 4: Clustering](lab_day4_clustering.qmd) |[Code](https://github.com/ocbe-uio/course_med3007/blob/main/lab/code/MED3007_Lab3_clustering.R) | [Slides](presentation/Lab_clustering.pdf)| diff --git a/lab/presentation/Lab_clustering.pdf b/lab/presentation/Lab_clustering.pdf new file mode 100644 index 0000000..ea019d2 Binary files /dev/null and b/lab/presentation/Lab_clustering.pdf differ