# clustering.R
#
# Purpose: Introduction to clustering - clustering of expression data
#
# Version: 3.0
#
# Date: 2019 05
# Author: Boris Steipe (boris.steipe@utoronto.ca)
#
# V 3.0 Reconceived for 2019 workshop, use LPS data
# V 2.1 Use 2 - correlation distance metric
# V 2 2018 restructuring
# V 1.1 2017 updates
# V 1.0 First code 2016
#
# TODO:
# - Plot heatmap as "standard" red/green heatmap view with annotation bars
#
# == HOW TO WORK WITH THIS FILE ================================================
#
# This file contains scenarios and tasks; we will discuss them in detail in
# class. If this file is named "myEDAintroduction.R", then edit it profusely,
# write code, experiment with options, or just play.
# Especially play.
#
# If there is anything you don't understand, use R's help system,
# Google for an answer, or ask. Especially ask. Don't continue if you don't
# understand what's going on. That's not how it works ...
#
# ==============================================================================
#
# C L U S T E R I N G
#
# ==============================================================================
#TOC> ==========================================================================
#TOC>
#TOC>   Section  Title                                              Line
#TOC>   ----------------------------------------------------------------
#TOC>   1        LOAD THE DATA                                        66
#TOC>   2        HEATMAPS                                             92
#TOC>   3        HIERARCHICAL CLUSTERING                             141
#TOC>   3.1      Exploring distance metrics                          167
#TOC>   3.2      Clusters From Dendrograms                           221
#TOC>   3.2.1    Setting Cut-Levels Dynamically                      298
#TOC>   4        PARTITIONING CLUSTERING                             373
#TOC>   4.1      K-means                                             376
#TOC>   4.2      K-medoids                                           396
#TOC>   5        AFFINITY PROPAGATION CLUSTERING                     411
#TOC>   6        CLUSTER QUALITY METRICS                             439
#TOC>
#TOC> ==========================================================================
updateTOC() # <<<--- Execute this to update the TOC
# In this unit we will explore clustering. Our goal is to
# discover groups of genes with related properties.
# source("./sampleSolutions/clusteringSampleSolutions-ShowPlot.R")
# = 1 LOAD THE DATA =======================================================
# We will attempt to find groups in our LPS expression data, using clustering
# methods.
load("./data/LPSdat.RData")
# To focus on genes that are actually interesting, we select the 250 genes
# with the largest standard deviations of expression values:
mySD <- apply(LPSdat[ , 2:15], 1, sd)
hist(mySD)
sel <- order(mySD, decreasing = TRUE)[1:250]
mySD[sel[1]]
mySD[sel[length(sel)]]
# It's convenient for clustering to use data objects that are simple
# numeric matrices
dat <- matrix(numeric(250 * 14), nrow = 250)
for (i in 1:length(sel)) {
  dat[i, ] <- as.numeric(LPSdat[sel[i], 2:15])
}
rownames(dat) <- LPSdat$genes[sel]
colnames(dat) <- colnames(LPSdat[2:15])
str(dat)
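# Aside (a sketch, assuming all of columns 2:15 are plain numeric columns):
# the same matrix can also be built in one step, without an explicit loop.
dat2 <- as.matrix(LPSdat[sel, 2:15])
rownames(dat2) <- LPSdat$genes[sel]
all.equal(dat, dat2)   # TRUE if the assumption holds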
# = 2 HEATMAPS ============================================================
# Heatmaps are a staple of gene expression analysis.
# You can tweak many of the parameters, but for a first look
# we'll just heatmap the data with default parameters.
# This is a standard view that can be applied to all manner
# of multidimensional data, not just genes.
heatmap(dat)
# Just for illustration and readability let's map only
# every fifth gene
heatmap(dat[seq(1, nrow(dat), by=5), ])
# What's the actual range of values?
range(dat[,1])
# TODO:
# TASK: In biology we are usually looking at red/black/green heatmaps ...
# what are these colours? Can we colour our data like that?
#
# Hint: https://www.r-graph-gallery.com/215-the-heatmap-function/
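# One possible approach (a minimal sketch, not the only answer): heatmap()
# accepts a vector of colours via its "col" argument, so a green/black/red
# ramp can be built with colorRampPalette(); the number of steps (255) is an
# arbitrary choice.
redGreen <- colorRampPalette(c("green", "black", "red"))(255)
heatmap(dat, col = redGreen)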
# Study the heatmap, and consider what it tells you.
# For example, there seem to be genes that are low under LPS
# but high in the controls ...
set1 <- c("Cd274", "Cd69", "Daxx", "Zbp1", "Usp18")
# ... and there are genes for which the inverse is true:
set2 <- c("Gm2a", "Tmem176b", "Ctsh", "Alox5ap")
# We can use a "parallel coordinates" plot - matplot()
# to look at the actual expression levels. Note that
# matplot expects the values column-wise ordered, thus
# we have to transpose - t() - the data!
matplot(t(dat[set1, ]),
        type = "l", lwd = 2, col = "skyblue", lty = 1,
        xlab = "conditions", ylab = "log expression value")
# Then we can use lines() to superimpose the genes for set2.
# No transpose here :-)
for (i in 1:length(set2)) {
  lines(dat[set2[i], ], type = "l", lwd = 2, col = "firebrick")
}
# Indeed, these genes - visibly different in the heatmap -
# have similar expression profiles within each set.
# = 3 HIERARCHICAL CLUSTERING =============================================
# Hierarchical clustering is probably the most basic technique.
# The dendrograms on the rows and columns of the heatmap
# were created by hierarchical clustering.
# For hierarchical clustering, first we need to produce
# a distance table. There are many ways to define distances;
# let's just go with the default: "Euclidean distance".
distDat <- dist(dat)
# Then we use the distance matrix to produce a dendrogram
# in which the most similar genes are connected first, and then
# similar genes or connected groups are added. There are
# several ways to define "most similar"; let's just go with the
# default for now: "complete linkage" hierarchical clustering.
hc <- hclust(distDat)
plot(hc)
# Not bad. But do note that distance as well as clustering
# method matter, and there is not really a "best" way that
# works for all data. You'll need to explore: what you are looking for
# is a distance metric that gives the clearest block structure.
# == 3.1 Exploring distance metrics =========================================
dEu <- function(x) dist(x, method="euclidean")
heatmap(dat, distfun = dEu)
dCan <- function(x) dist(x, method="canberra")
heatmap(dat, distfun = dCan)
dMax <- function(x) dist(x, method="maximum")
heatmap(dat, distfun = dMax)
dMink <- function(x) dist(x, method="minkowski")
heatmap(dat, distfun = dMink)
# You are not confined to the default distance functions; it
# is quite straightforward to define your own, for example
# using correlation properties. Here is a distance function
# defined as 1 - abs(Pearson correlation) ...
dCorAbs <- function(x) as.dist(1 - abs(cor(t(x))))
heatmap(dat, distfun = dCorAbs)
# ... which is useful if you are interested in similarity of shape, regardless
# of the absolute value, or ...
dCor <- function(x) as.dist(2 - cor(t(x)))
heatmap(dat, distfun = dCor)
# ... which uses the entire range of possible correlations, i.e. it
# differentiates between correlated and anticorrelated profiles.
# In the case of our LPS dataset it does indeed seem that the
# dCor function (2 minus correlation) gives us the clearest block-structure.
# Let's use the function to produce a distance matrix:
# A distance matrix is simply a square matrix in which cell (i, j)
# contains the distance between elements i and j.
N <- nrow(dat)
myDist <- matrix(numeric(N*N), nrow = N)
rownames(myDist) <- rownames(dat)
colnames(myDist) <- rownames(dat)
for (i in 1:N) {
  for (j in i:N) {
    d <- 2 - cor(dat[i, ], dat[j, ])
    myDist[i, j] <- d
    myDist[j, i] <- d
  }
}
myDist <- as.dist(myDist)
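# Aside: the double loop above computes exactly what the dCor() function
# defined earlier does in one vectorized step, because cor() returns all
# pairwise correlations at once. A quick sanity check (differences ~ 0):
range(myDist - dCor(dat))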
hc <- hclust(myDist)
plot(hc)
# == 3.2 Clusters From Dendrograms =========================================
# To get clusters from a dendrogram, we need to "cut" it at some
# level. The tree then falls apart into sub-trees and each of these
# is one "cluster"...
# Draw rectangles at different cut-levels, to give the desired number
# of clusters.
rect.hclust(hc, k = 2)
rect.hclust(hc, k = 5)
rect.hclust(hc, k = 10)
rect.hclust(hc, k = 20)
rect.hclust(hc, k = 50)
# Now retrieve the actual indices and use them to generate
# parallel coordinate plots.
class <- cutree(hc, k = 20)
# Explain the output...
class
# The table() function allows us to count the number of
# occurrences in each class ...
table(class)
sort(table(class), decreasing = TRUE)
# Let's plot the four largest classes (in parallel, into the same window)
# Look at this carefully. See how the selection statement on the object "class"
# generates a logical vector: TRUE in all rows for which the statement is true,
# and how this is used to select the rows of dat that we want to plot ...
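# A small illustration of that mechanism: "class == 1" is a logical vector
# with one TRUE/FALSE per gene, and sum() counts the TRUEs, i.e. the number
# of genes in cluster 1.
head(class == 1)
sum(class == 1)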
oPar <- par(mfrow=c(2,2))
matplot(t(dat[class==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(dat[class==6,]), type="l", xlab="time", ylab="log expression value")
matplot(t(dat[class==5,]), type="l", xlab="time", ylab="log expression value")
matplot(t(dat[class==8,]), type="l", xlab="time", ylab="log expression value")
par(oPar)
# As an alternative, try Ward's linkage clustering (and read up on the
# options: single-, complete- and average-linkage clustering).
hc.ward <- hclust(distDat, method = "ward.D", members = NULL)
plot(hc.ward)
# draw rectangles
rect.hclust(hc.ward, k = 9)
# This looks reasonable ...
# Now retrieve the actual indices and use them to generate
# parallel coordinate plots.
class.ward <- cutree(hc.ward, k = 9)
sort(table(class.ward))
# get some nice colors
if (!require(RColorBrewer, quietly=TRUE)) {
  install.packages("RColorBrewer")
  library(RColorBrewer)
}
# what spectra are there in the package .. ?
display.brewer.all()
niceCols <- brewer.pal(9, "Paired")
oPar <- par(mfrow=c(3,3))
for (i in 1:9) {
  matplot(t(dat[class.ward == i, ]),
          type = "l", col = niceCols[i],
          xlab = "time", ylab = "log expression value")
}
par(oPar)
# === 3.2.1 Setting Cut-Levels Dynamically ==================================
# While this may be aesthetically somewhat satisfactory, it is clear that the
# clusters are not as homogeneous as we might need them to be for biological
# interpretation. This is a general problem with clustering methods that fix
# the number of cluster centres, either directly as in k-means (see below),
# or indirectly by cutting trees at a fixed level. It is also a problem with
# the data, where differences in absolute values might override separation
# into clusters that might better be defined in terms of relative values.
# Rather than "cutting" at a fixed level, we could adjust the level
# dynamically, according to some objective function that may give us
# "better" clusters.
# Here is a package that addresses this problem with dynamic tree cutting.
# Read about it here:
# http://cran.r-project.org/web/packages/dynamicTreeCut/dynamicTreeCut.pdf
if (!require(dynamicTreeCut, quietly=TRUE)) {
  install.packages("dynamicTreeCut")
  library(dynamicTreeCut)
}
hc <- hclust(myDist) # recreate hc
class.dynamic <- cutreeDynamic(dendro = hc,
                               distM = as.matrix(myDist),
                               cutHeight = 100)
table(class.dynamic)
niceCols <- brewer.pal(7, "Spectral")
oPar <- par(mfrow=c(3,3))
for (i in 1:7) {
  matplot(t(dat[class.dynamic == i, ]),
          type = "l",
          col = niceCols[i],
          xlab = "time",
          ylab = "log expression value")
}
par(oPar)
# Note that the number of members in each class is now much more homogeneous,
# and the groups themselves do not seem a lot noisier.
# One thing our clustering algorithms do is to pull apart profiles
# that have similar shape, but different absolute levels. This is
# because we have not normalized our data. Let's thus try
# clustering merely based on profile shape, i.e.
# relative expression levels, by scaling all rows between zero
# and one.
datNorm <- t(apply(dat, 1, function(x)(x-min(x))/(max(x)-min(x))))
distDatNorm <- dist(datNorm)
hc.Norm <- hclust(distDatNorm)
class.dynamic <- cutreeDynamic(dendro = hc.Norm,
                               distM = as.matrix(distDatNorm),
                               cutHeight = 15)
table(class.dynamic)
niceCols <- brewer.pal(8, "Spectral")
oPar <- par(mfrow=c(3,3))
for (i in 1:8) {
  matplot(t(datNorm[class.dynamic == i, ]),
          type = "l",
          col = niceCols[i],
          xlab = "time",
          ylab = "log expression value")
}
par(oPar)
# = 4 PARTITIONING CLUSTERING =============================================
# == 4.1 K-means ===========================================================
# K-means clusters by assigning elements to a fixed
# number of cluster centres, so that similarity
# within a cluster is maximized.
?kmeans
k <- 4
cl <- kmeans(dat, k)
niceCols <- brewer.pal(k, "Spectral")
plot(dat[,1], dat[,4], col = niceCols[cl$cluster])
points(cl$centers[, c(1, 4)], col = niceCols[1:k], pch = 8, cex = 2)
# But: be aware ...
# ... K-means does not guarantee a globally optimal solution,
# merely a locally converged one.
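# A common way to mitigate this (a sketch, not part of the original workflow):
# run kmeans() with several random starts via its "nstart" argument and keep
# the best of those runs; set a seed for reproducibility.
set.seed(112358)
cl <- kmeans(dat, k, nstart = 25)
cl$tot.withinss   # total within-cluster sum of squares of the best run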
# == 4.2 K-medoids =========================================================
# load library "cluster" for K-medoid partitioning
if (!require(cluster, quietly=TRUE)) {
  install.packages("cluster")
  library(cluster)
}
set.seed(112358)
k <- 4
cl <- pam(dat, k)
plot(dat[, 1], dat[, 4], col = niceCols[cl$cluster])
plot(cl) # shows boundary and silhouette plots
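# The average silhouette width of the partitioning is also available as a
# single number; values close to 1 indicate compact, well separated clusters
# (a quick numeric complement to the silhouette plot above):
cl$silinfo$avg.width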
# = 5 AFFINITY PROPAGATION CLUSTERING =====================================
# Based on B. J. Frey and D. Dueck. Clustering by
# passing messages between data points.
# Science, 315(5814):972–976, 2007
if (!require(apcluster, quietly=TRUE)) {
  install.packages("apcluster")
  library(apcluster)
}
apRes <- apcluster(negDistMat(r=2), dat)
apRes
heatmap(apRes)
length(apRes)
cutree(apRes)
oPar <- par(mfrow=c(3,3))
for (i in 1:9) {
  matplot(t(dat[unlist(apRes[i]), ]),
          type = "l", xlab = "time", ylab = "log expression value")
}
par(oPar)
# = 6 CLUSTER QUALITY METRICS =============================================
# So ... which method should we use?
# I don't think that there is an obvious biological
# criterion to decide. Typically we should take some
# orthogonal information (e.g. shared transcription
# factor binding sites), check whether functionally
# related genes end up in the same clusters, and judge
# our clustering success based on its biological
# predictive value.
#
# But we can certainly say something about whether
# our clusters look good in a mathematical sense.
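# For instance, the mean silhouette width (package "cluster", loaded above)
# is a simple internal quality score; here, as a quick sketch, for the
# 9-cluster Ward solution computed earlier:
sil <- silhouette(class.ward, distDat)
mean(sil[, "sil_width"])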
if (!require(clValid, quietly=TRUE)) {
  install.packages("clValid")
  library(clValid)
}
# Package information:
# library(help = clValid) # basic information
# browseVignettes("clValid") # available vignettes
# data(package = "clValid") # available datasets
if (!require(kohonen, quietly=TRUE)) {
  install.packages("kohonen")
  library(kohonen)
}
# Package information:
# library(help = kohonen) # basic information
# browseVignettes("kohonen") # available vignettes
# data(package = "kohonen") # available datasets
if (!require(mclust, quietly=TRUE)) {
  install.packages("mclust")
  library(mclust)
}
# Package information:
# library(help = mclust) # basic information
# browseVignettes("mclust") # available vignettes
# data(package = "mclust") # available datasets
?clValid
# This is pretty nice: we _can_ use biological
# knowledge for validation...
# But for our example, we'll try internal validation
# on all available methods.
valClust <- clValid(dat,
                    nClust = 2:20,
                    clMethods = c("hierarchical",
                                  "kmeans",
                                  "diana",
                                  "model",
                                  "sota",
                                  "pam",
                                  "clara",
                                  "agnes"),
                    validation = "internal")
summary(valClust)
plot(valClust)
vignette("clValid")
# Task:
# 1 - What appears to be the best clustering method?
# 2 - How can you actually apply it to the data?
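# A possible starting point for task 2 (a sketch; "best" depends on which
# validation measure you weigh most): clValid reports the top-scoring
# method / cluster number for each measure, and that method can then simply
# be re-run on dat with the reported number of clusters.
optimalScores(valClust)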
# [End]