---
title: "A study in boolean model parameterization"
author: "[John Zobolas](https://github.com/bblodfon)"
date: "Last updated: `r format(Sys.time(), '%d %B, %Y')`"
description: "Investigations related to link operators mutations in boolean models"
url: 'https\://druglogics.github.io/bool-param-maps/'
github-repo: "druglogics/bool-param-maps"
bibliography: references.bib
link-citations: true
site: bookdown::bookdown_site
---
# Intro {-}
The purpose of this analysis is to investigate the relation between **boolean model parameterization** (based on the [standardized equation form](https://druglogics.github.io/druglogics-doc/gitsbe-description.html#default-equation) inspired by [@Mendoza2006]) and various other model attributes, e.g. their drug combination prediction performance, their ability to predict synergies and their fitness to a specific cancerous cell line activity profile, as well as more general questions, such as the identification of **essential nodes that drive the change of dynamics** throughout the parameterization space.
The dataset used in this analysis is derived from the *CASCADE 1.0* topology [@cascade2020].
The cancer signaling network has a total of $77$ nodes, $23$ of which have both activating and inhibiting regulators and thus a link operator in their respective boolean equation.
Using the [abmlog](https://github.com/druglogics/abmlog) software, we generated all $2^{23} = 8388608$ possible link operator mutated models for the CASCADE 1.0 topology.
We use the relatively new, *non-linear dimension reduction* method UMAP [@McInnes2018a] to place all the generated boolean models in a **boolean parameterization map** and most of our conclusions are based on observing patterns in the produced maps.
:::{.green-box}
The next picture gives a general overview:
```{r overview-img, fig.align='center'}
knitr::include_graphics(path = "img/abmlog_to_maps.png")
```
:::
:::{.note}
The model dataset is available in Zenodo: <https://doi.org/10.5281/zenodo.4022783>.
:::
Libraries used in this analysis:
```{r Load libraries, message = FALSE}
library(xfun)
library(knitr)
library(dplyr)
library(tidyr)
library(tibble)
library(ggpubr)
library(stringr)
library(ggplot2)
library(DT)
library(usefun)
library(emba)
library(forcats)
library(scales)
suppressPackageStartupMessages(library(ComplexHeatmap))
suppressPackageStartupMessages(library(circlize))
library(glmnet)
library(randomForest)
library(ranger)
library(uwot)
library(Ckmeans.1d.dp)
```
# CASCADE 1.0 - General {-}
## Network Properties {-}
:::{.blue-box}
In this section we demonstrate the **scale-free properties of the CASCADE 1.0 network**.
We show that both in- and out-degree distributions are asymptotically power-law.
:::
Use the script [get_distribution_stats.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/get_distribution_stats.R) to generate the degree distribution stats.
We load the results:
```{r load-degree-distribution-stats}
dd_stats = readRDS(file = "data/dd_stats.rds")
```
```{r in-degree-fig, warning=FALSE, message=FALSE, fig.show='hold', out.width='50%', cache=TRUE, dpi=300, fig.cap='Degree Distribution (CASCADE 1.0)'}
dd_stats %>% group_by(in_degree) %>% tally() %>%
ggplot(aes(x = in_degree, y = n)) +
geom_bar(stat = "identity", fill = "steelblue", width = 0.7) +
geom_smooth(aes(color = "red"), se = FALSE, show.legend = FALSE) +
theme_classic() +
labs(title = "In-Degree Distribution (CASCADE 1.0)", x = "In Degree", y = "Number of Nodes")
dd_stats %>% group_by(out_degree) %>% tally() %>%
ggplot(aes(x = out_degree, y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_smooth(aes(color = "red"), se = FALSE, span = 0.58, show.legend = FALSE) +
theme_classic() +
labs(title = "Out-Degree Distribution (CASCADE 1.0)", x = "Out Degree", y = "Number of Nodes")
```
## Model Stable State Statistics {-}
The `gitsbe` files of the model dataset include also the fixpoint attractors of each model (`.bnet` files have only the equations).
Thus we can find the *frequency distribution* of the number of fixpoints across all produced models (use the script [count_models_ss.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/count_models_ss.R)).
The model stable state (fixpoint) statistics are as follows:
```{r ss-stats-all, fig.align='center', dpi=300, cache=TRUE, fig.cap='Stable States Distribution across all link-operator parameterized models (CASCADE 1.0)'}
models_ss_stats = readRDS(file = "data/models_ss_stats.rds")
models_ss_stats %>% group_by(ss_num) %>% tally() %>%
ggplot(aes(x = ss_num, y = n, fill = as.factor(ss_num))) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_y_continuous(labels = scales::label_number_si()) +
geom_text(aes(label = n), vjust = -0.5) +
geom_text(aes(label = paste0(100 * round(n/nrow(models_ss_stats), digits = 2), "%")), size = 10, vjust = c(2.5, 2.5, -2)) +
theme_classic2() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Stable States Distribution", x = "Number of Stable States", y = "Number of models")
```
:::{.green-box}
Less than $50\%$ of all possible parameterized models have a single fixpoint attractor, which corresponds to a single stable phenotype behavior.
:::
# CASCADE 1.0 - Parameterization vs #fixpoints {-}
:::{.blue-box}
In this section we identify the **key nodes** whose parameterization affects the *change of dynamics* of the CASCADE 1.0 network, i.e. the nodes responsible for the **change in the number of fixpoint attractors ($0$, $1$ and $2$)** across all link-operator mutated models.
:::
We will use several statistical methods, one in each of the sub-sections below.
:::{.orange-box}
The training data is a **link-operator matrix**, where rows are models ($2^{23}$), columns/features/variables are link-operator nodes ($23$ in total) and the parameterization values correspond to $0$ (`AND-NOT`) or $1$ (`OR-NOT`).
The ternary response for each model is a number denoting the number of fixpoints ($0,1$ or $2$).
:::
The matrix can be generated with the script [get_lo_mat.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/get_lo_mat.R) and the response is part of the data previously produced by the script [count_models_ss.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/count_models_ss.R).
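As a quick sanity check, the expected shapes of the training data can be sketched as follows (a minimal sketch, assuming the `.rds` files produced by the two scripts above):
```{r lo-mat-check-sketch, eval=FALSE}
# Load the link-operator matrix and the fixpoint response (files produced by
# the get_lo_mat.R and count_models_ss.R scripts):
lo_mat = readRDS(file = "data/lo_mat.rds") # 2^23 x 23 binary matrix: 0 = AND-NOT, 1 = OR-NOT
ss_num = readRDS(file = "data/models_ss_stats.rds") %>% pull(ss_num) # response: 0, 1 or 2 fixpoints
stopifnot(nrow(lo_mat) == 2^23, ncol(lo_mat) == 23, all(ss_num %in% 0:2))
```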
## Multinomial LASSO {-}
Use the script [param_ss_glmnet.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_glmnet.R) to fit a **multinomial LASSO model** for the data [@Friedman2010].
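A minimal sketch of the fit, assuming the `lo_mat` and `ss_num` objects defined above (the exact `glmnet` arguments are in the script):
```{r glmnet-fit-sketch, eval=FALSE}
# Multinomial LASSO on the full link-operator matrix (assumed call):
fit_a1 = glmnet(x = lo_mat, y = ss_num, family = "multinomial")
saveRDS(fit_a1, file = "data/fit_a1.rds")
```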
We now simply load the result object:
```{r param-ss-glmnet, fig.show='hold', out.width='50%', cache=TRUE, fig.cap='Euclidean Norm of glmnet coefficients vs lambda and deviance explained'}
fit_a1 = readRDS(file = "data/fit_a1.rds")
plot(fit_a1, xvar = "dev", type.coef = "2norm")
plot(fit_a1, xvar = "lambda", type.coef = "2norm")
```
As we can see, no $\lambda$ explains more than $44\%$ of the deviance and a lot of non-zero coefficients are associated with the smaller values of $\lambda$.
For example, choosing $\lambda = 0.0142$ (tested prediction accuracy $\approx 0.72$ on a random subset of the data), we have the following coefficients, shown in a heatmap:
```{r param-ss-glmnet-2, cache=TRUE, dpi=300, fig.align='center', fig.cap='Heatmap of coefficients of multinomial model (glmnet)'}
# choose a lambda
lambda1 = fit_a1$lambda[32]
fit_a1_coef = coef(fit_a1, s = lambda1) %>%
lapply(as.matrix) %>%
Reduce(cbind, x = .) %>% t()
rownames(fit_a1_coef) = names(coef(fit_a1, s = lambda1)) # 0, 1 and 2 stable states
imp_nodes_colors = rep("black", length(colnames(fit_a1_coef)))
names(imp_nodes_colors) = colnames(fit_a1_coef)
imp_nodes_colors[names(imp_nodes_colors) %in% c("MEK_f", "PTEN", "MAPK14")] = "green4"
ComplexHeatmap::Heatmap(matrix = fit_a1_coef, name = "Coef", row_title = "Number of Fixpoints",
row_names_side = "right", row_dend_side = "right", row_title_side = "right",
column_title = "Glmnet Coefficient Scores (λ = 0.0142)",
column_dend_height = unit(20, "mm"), column_names_gp = gpar(col = imp_nodes_colors))
# check accuracy
# lo_mat = readRDS(file = "data/lo_mat.rds")
# models_ss_stats = readRDS(file = "data/models_ss_stats.rds")
# ss_num = models_ss_stats %>% pull(ss_num)
# set.seed(42)
# model_indexes = sample(x = 1:nrow(lo_mat), size = 100000)
# pred = predict(object = fit_a1, newx = lo_mat[model_indexes, ],
# type = "class", s = lambda1) %>% as.integer()
# acc = sum(pred == ss_num[model_indexes])/length(pred)
# print(acc) # should be ~ 0.72
```
:::{.green-box}
Even though the *glmnet* classifier might not be accurate enough, we still find that the nodes `PTEN` and `MAPK14` are the **most important (largest coefficients)** for distinguishing between models with a different number of fixpoints.
Additional nodes (like `MEK_f` and `CTNNB1`) are likely to be important as well.
:::
Using the same script, we also cross-validated the regularization parameter $\lambda$: we randomly selected smaller model samples ($100000$ models, $20$ times in total) and chose the $\lambda_{1se}$ of each run to extract the coefficients.
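A minimal sketch of one such cross-validation run, under the same assumptions as above (see the script for the exact code):
```{r glmnet-cv-sketch, eval=FALSE}
# One of the 20 CV runs on a random sample of 100000 models (assumed call):
set.seed(42)
model_indexes = sample(x = 1:nrow(lo_mat), size = 100000)
cvfit = cv.glmnet(x = lo_mat[model_indexes, ], y = ss_num[model_indexes], family = "multinomial")
# coef(cvfit) returns the coefficients at lambda.1se by default
```
We load the saved CV fits, assemble the coefficient matrices and visualize them: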
```{r param-ss-glmnet-cv-data, cache=TRUE}
cvfit_data = readRDS(file = "data/cvfit_data.rds")
cvfit_mat_list = lapply(cvfit_data, function(cvfit) {
co = coef(cvfit) %>%
lapply(as.matrix) %>%
Reduce(cbind, x = .) %>%
t()
rownames(co) = names(coef(cvfit)) # 0,1 and 2 stable states
return(co)
})
cvfit_mat = do.call(rbind, cvfit_mat_list)
```
```{r param-ss-glmnet-cv-fig, cache=TRUE, dpi=300, fig.align='center', fig.cap='Heatmap of coefficients of multinomial model (glmnet - 20 CV models)'}
imp_nodes_colors = rep("black", length(colnames(cvfit_mat)))
names(imp_nodes_colors) = colnames(cvfit_mat)
imp_nodes_colors[names(imp_nodes_colors) %in% c("MEK_f", "PTEN", "MAPK14", "CTNNB1", "mTORC1_c")] = "green4"
ComplexHeatmap::Heatmap(matrix = cvfit_mat, name = "Coef", row_title = "Number of Fixpoints",
row_dend_side = "right", row_title_side = "left",
column_title = "Glmnet Coefficient Scores", row_km = 3, row_km_repeats = 10,
show_row_names = FALSE, column_dend_height = unit(20, "mm"),
column_names_gp = gpar(col = imp_nodes_colors),
left_annotation = rowAnnotation(foo = anno_block(
labels = c("2", "1", "0"), # with `show_row_names = TRUE` you can check this
labels_gp = gpar(col = "black", fontsize = 12))))
```
:::{.green-box}
The top 5 most important nodes are seen in green in the above heatmap: `MAPK14`, `PTEN`, `CTNNB1`, `MEK_f` and `mTORC1_c`.
:::
## Random Forests {-}
We used the [param_ss_randf.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_randf.R) script to tune and train a random forest classifier on the dataset [@Liaw2002].
First we tuned the `mtry` parameter, the number of variables randomly selected at each tree split:
```{r param-ss-rf-mtry, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest Tuning (mtry)'}
mtry_data = readRDS(file = "data/mtry_data.rds")
mtry_data %>%
ggplot(aes(x = mtry, y = OOBError, group = mtry)) +
geom_boxplot(fill = "green4") +
labs(title = "Tuning Random Forest mtry parameter") +
theme_classic(base_size = 14) + theme(plot.title = element_text(hjust = 0.5))
```
:::{.green-box}
A value between $14$ and $18$ for `mtry` seems to minimize the Out-Of-Bag error, so we choose $16$ for the rest of the analysis.
For the number of trees, we kept the default value ($500$), which a few test runs showed to be more than enough.
:::
Next, we randomly selected $100000$ models from the dataset - a total of $20$ times - to train the random forest classifier and find the **importance score of each node**.
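A minimal sketch of one such training run, assuming the same `lo_mat` and `ss_num` objects as before (the exact call is in the script):
```{r rf-train-sketch, eval=FALSE}
# Train a random forest on a random sample of 100000 models (assumed call):
set.seed(42)
model_indexes = sample(x = 1:nrow(lo_mat), size = 100000)
rf_fit = randomForest(x = lo_mat[model_indexes, ], y = as.factor(ss_num[model_indexes]),
  mtry = 16, ntree = 500, importance = TRUE)
# importance(rf_fit) holds the MeanDecreaseAccuracy and MeanDecreaseGini scores per node
```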
We load the result data and tidy up a bit:
```{r param-ss-rf-imp}
rf_imp_data = readRDS(file = "data/rf_imp_data.rds")
# make a list of tibbles
tbl_list = lapply(rf_imp_data, function(mat) {
nodes = rownames(mat)
tibble::as_tibble(mat) %>% mutate(nodes = nodes)
})
# OneForAll
imp_res = dplyr::bind_rows(tbl_list)
# Get the importance stats
imp_stats = imp_res %>%
group_by(nodes) %>%
summarise(mean_acc = mean(MeanDecreaseAccuracy),
sd_acc = sd(MeanDecreaseAccuracy),
mean_gini = mean(MeanDecreaseGini),
sd_gini = sd(MeanDecreaseGini),
.groups = 'keep') %>%
ungroup()
```
The importance scores for each node were the **mean decrease in accuracy** and the **mean decrease in node impurity** (Gini index).
We calculate the mean importance and standard deviation scores across all random samples for both measures of importance:
```{r param-ss-rf-imp-fig, warning=FALSE, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest: Mean Decrease Accuracy per node'}
# color first 5 nodes in x axis
imp_col = c(rep("green4", 5), rep("grey30", nrow(imp_stats)-5))
imp_stats %>%
mutate(nodes = forcats::fct_reorder(nodes, desc(mean_acc))) %>%
ggplot(aes(x = nodes, y = mean_acc, fill = mean_acc)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "steelblue", high = "red") +
geom_errorbar(aes(ymin=mean_acc-sd_acc, ymax=mean_acc+sd_acc), width = 0.2) +
theme_classic(base_size = 14) +
theme(axis.text.x = element_text(angle = 90, colour = imp_col)) +
labs(title = "Random Forest Variable Importance (Accuracy)",
x = "Nodes", y = "Mean Decrease Accuracy")
```
```{r param-ss-rf-imp-fig2, warning=FALSE, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest: Mean Decrease in Node Impurity (Gini Index)'}
imp_stats %>%
mutate(nodes = forcats::fct_reorder(nodes, desc(mean_gini))) %>%
ggplot(aes(x = nodes, y = mean_gini, fill = mean_gini)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "steelblue", high = "red") +
geom_errorbar(aes(ymin = mean_gini-sd_gini, ymax = mean_gini+sd_gini), width = 0.2) +
theme_classic(base_size = 14) +
theme(axis.text.x = element_text(angle = 90, colour = imp_col)) +
labs(title = "Random Forest Variable Importance (Gini Index)",
x = "Nodes", y = "Mean Decrease in Node Impurity")
```
:::{.green-box}
The top 5 important nodes by either of the two random forest importance measures largely overlap with the nodes found as important by the LASSO method: `MAPK14`, `ERK_f`, `MEK_f`, `PTEN`, `mTORC1_c`.
:::
We get the same results using `ranger` [@Wright2017], a faster, more memory-efficient random forest R package with multi-thread support.
Use the script [param_ss_ranger.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_ranger.R) to reproduce the results.
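A minimal sketch of the assumed `ranger` call (the exact arguments are in the script):
```{r ranger-sketch, eval=FALSE}
# Impurity-based importance with ranger (assumed call):
ranger_res = ranger(x = lo_mat, y = as.factor(ss_num), num.trees = 500,
  mtry = 16, importance = "impurity", num.threads = 4)
# ranger_res$variable.importance holds the Gini-based importance score per node
```
We load the result and plot the importance scores: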
```{r param-ss-rf-imp-fig3, warning=FALSE, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest (ranger): Mean Decrease in Node Impurity (Gini Index)'}
ranger_res = readRDS(file = "data/ranger_res.rds")
imp_res = tibble(nodes = names(ranger_res$variable.importance),
gini_index = ranger_res$variable.importance)
imp_res %>%
mutate(nodes = forcats::fct_reorder(nodes, desc(gini_index))) %>%
ggplot(aes(x = nodes, y = gini_index, fill = gini_index)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "steelblue", high = "red") +
scale_y_continuous(labels = scales::label_number_si()) +
theme_classic(base_size = 14) +
theme(axis.text.x = element_text(angle = 90, colour = imp_col)) +
labs(title = "Random Forest (Ranger) Variable Importance (Gini Index)",
x = "Nodes", y = "Mean Decrease in Node Impurity")
```
## Parameterization Maps {-}
:::{.blue-box}
We use UMAP [@McInnes2018a] to **reduce the dimension of our dataset** from $23$ (number of nodes with link operators) to $2$ and visualize it, to see if there is any apparent *visual relation* between the model parameterization and number of fixpoints.
:::
We used the [param_ss_umap.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_umap.R) script to run the UMAP implementation offered by the `uwot` R package.
We tried various values for the `n_neighbors` parameter: larger values result in **more global views** of the dataset, while smaller values preserve **more local structure**.
Also, the *distance metric* between the model parameterization vectors was mostly set to the default (*euclidean*), but we also tried the *hamming* distance, which seemed appropriate given the binary nature of the dataset.
The figures are produced afterwards from the resulting UMAP data with the [param_ss_umap_vis.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_umap_vis.R) script.
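A minimal sketch of the assumed `uwot` call (the exact parameters are in `param_ss_umap.R`):
```{r umap-run-sketch, eval=FALSE}
# Unsupervised UMAP on the full link-operator matrix (assumed call):
umap_res = uwot::umap(X = lo_mat, n_neighbors = 20, metric = "euclidean", verbose = TRUE)
# metric = "hamming" was also tried, given the binary parameterization vectors
```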
See all the produced figures [here](https://github.com/druglogics/bool-param-maps/tree/master/img/all_models_maps).
### Unsupervised UMAP {-}
First we used UMAP in **unsupervised mode** (no *a priori* knowledge of the number of fixpoints per model provided or of any other information/label per model for that matter).
So, UMAP is given all binary numbers from $0$ ($23$ $0$'s) to $2^{23}-1$ ($23$ $1$'s) representing each possible link operator mutated model ($0$'s map to `AND-NOT`, $1$'s to `OR-NOT`) and places them in the 2D plane.
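For illustration, this is how a model's binary parameterization vector translates to the decimal (base-10) model number used for coloring below (a toy example, assuming the first node maps to the most significant bit):
```{r model-num-sketch}
# One of the 2^23 possible parameterizations (0 = AND-NOT, 1 = OR-NOT):
lo_vec = c(1, 0, 1, rep(0, 20))
# Its decimal (base-10) model number:
model_num = sum(lo_vec * 2^(rev(seq_along(lo_vec)) - 1))
model_num # 5242880
```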
The following figures show us the 2D parameterization map of all CASCADE 1.0 models, **colored either by their decimal (base-10) number** (converted from the binary link-operator model representation) or **by their respective number of fixpoints**:
```{r umap-unsup-figs, fig.cap='CASCADE 1.0 Model Parameterization Maps (euclidean distance)', fig.show='hold', out.width='50%'}
knitr::include_graphics(path = "img/all_models_maps/umap_20nn_model_num.png")
knitr::include_graphics(path = "img/all_models_maps/umap_20nn.png")
```
```{r umap-unsup-figs-2, fig.cap='CASCADE 1.0 Model Parameterization Maps (hamming distance)', fig.show='hold', out.width='50%'}
knitr::include_graphics(path = "img/all_models_maps/ham_umap_20nn_model_num.png")
knitr::include_graphics(path = "img/all_models_maps/ham_umap_20nn.png")
```
:::{.green-box}
- Using the *hamming* distance metric, the visualization of the dataset produces many more small clusters than the *euclidean* representation, which yields $8$ **neighborhoods/superclusters of similarly parameterized models**.
These seem to follow the numerical representation (i.e. models with close decimal numbering seem to be clustered together).
- Models with **similar parameterization seem to also have the same number of fixpoints** (i.e. there is some order in the placement of models that belong to the same fixpoint class - this phenomenon does not manifest chaotically - e.g. models in the same sub-cluster in the hamming parameterization map tend to have the same number of fixpoints).
- There is no distinct pattern that can match model parameterization with number of attractors.
In other words, models with different number of fixpoints can manifest in no particular order whatsoever across the parameterization map/space.
:::
### Supervised UMAP {-}
Next, we used **UMAP in supervised mode** - i.e. the association between each model and its corresponding fixpoint group was given as input to UMAP.
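In `uwot`, supervision amounts to passing the response via the `y` argument; a minimal, assumed sketch:
```{r sumap-run-sketch, eval=FALSE}
# Supervised UMAP: the fixpoint group of each model is given as the response
# (assumed call; exact parameters in param_ss_umap.R):
sumap_res = uwot::umap(X = lo_mat, y = as.factor(ss_num), n_neighbors = 20)
```
The resulting supervised maps: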
```{r umap-sup-fig, fig.cap='CASCADE 1.0 Model Parameterization Supervised Maps', fig.show='hold', out.width='50%'}
knitr::include_graphics(path = "img/all_models_maps/sumap_14nn_0_3_min_dist.png")
knitr::include_graphics(path = "img/all_models_maps/sumap_20nn.png")
```
:::{.green-box}
With **increased model complexity** (meaning dynamical complexity - i.e. **more attractors**), the subsequent fixpoint superclusters form sub-clusters that are very differently parameterized - i.e. they form **distinct families of models** and thus appear to be more *spread out* in the supervised parameterization map.
The simple argument for this is as follows: the largest distance between two models within the $0$ or $1$-fixpoint (supervised) superclusters is smaller than many of the distances between models within the $2$-fixpoint sub-clusters.
:::
## Embedding Important Nodes in the Map {-}
Using random forest and the regularized LASSO method, we found important nodes whose parameterization affects the change of dynamics (number of fixpoints) in the CASCADE 1.0 signaling network.
Using UMAP we observed that closely parameterized models form clusters.
We will now color the UMAP parameterization maps according to the link-operator values of the top $5$ most important nodes found by the aforementioned methods, as well as the $2$ least important nodes reported by random forests (use the [param_ss_umap_imp_nodes.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/param_ss_umap_imp_nodes.R) script and see all the produced figures [here](https://github.com/druglogics/bool-param-maps/tree/master/img/imp_nodes_param_ss_maps)).
The 3 most important nodes:
```{r map-imp-nodes-umap-fig, fig.show='hold', out.width='33%'}
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_MAPK14.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_ERK_f.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_MEK_f.png")
```
The next 2 most important nodes:
```{r map-less-imp-nodes-umap-fig, fig.show='hold', out.width='50%'}
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_mTORC1_c.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_PTEN.png")
```
`CFLAR` and `CYCS` were the **least important nodes** for assessing the number of fixpoints of a model from its parameterization:
```{r map-least-imp-nodes-umap-fig, fig.show='hold', out.width='50%'}
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_CFLAR.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/unsup_CYCS.png")
```
The same trend can be seen (maybe a little more clearly) in the corresponding supervised maps:
```{r map-imp-nodes-sumap-fig, fig.show='hold', out.width='33%'}
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/sup_MAPK14.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/sup_mTORC1_c.png")
knitr::include_graphics(path = "img/imp_nodes_param_ss_maps/sup_CFLAR.png")
```
:::{.green-box}
We can see a **visual link** between node importance (related to #fixpoints) and link operator assignment: **the less important a node is, the more randomly (chaotically) its link-operator values are distributed across the parameterization map**.
A collection of important nodes can be used to more accurately define **families of closely parameterized models**, and as we have seen above, this also translates to models belonging to the same fixpoint class.
:::
# CASCADE 1.0 Analysis - 1 ss models {-}
## Stable States Data {-}
:::{.note}
In this section, **only the boolean models that have 1 stable state** will be used in the analysis.
All variables of interest (stable state, link-operator parameterization, fitness to steady state, performance MCC score, etc.) will relate only to the 1 stable state models from now on.
To load the stable state data for the models that have **1 stable state**, use the Zenodo dataset (<https://doi.org/10.5281/zenodo.4022783>) and the script [get_ss_data.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/get_ss_data.R).
:::
## Parameterization Maps {-}
In this section we present the results of using UMAP [@McInnes2018a] on the link-operator parameterization data of the CASCADE 1.0 models with 1 stable state.
We created several such *parameterization maps* by adjusting the *n_neighbors* parameter input (from $2$ to $20$), which is responsible for the **size of the local neighborhood** (in terms of number of neighboring sample points) used for the manifold approximation.
As the documentation says, larger values result in **more global views** of the manifold, while smaller values result in **more local data** being preserved.
To get these map images and the reduced-dimension dataset, see the script [1ss_models_umap.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/1ss_models_umap.R).
:::{.blue-box}
Note that in all these mappings to the 2D space, **models that share similar link-operator parameterization will reside in the same area/cluster in the map**.
The larger the *n_neighbors* is, the more the smaller clusters merge into larger ones.
The images for $\ge 14$ *n_neighbors* are almost exactly the same.
:::
We present some of these maps below:
```{r param-maps-1ss-models-1, fig.show='hold', out.width='50%', fig.cap='2D Parameterization map for 1 stable state models (2 and 3 neighbors)'}
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_2.png")
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_3.png")
```
```{r param-maps-1ss-models-2, fig.show='hold', out.width='50%', fig.cap='2D Parameterization map for 1 stable state models (5 and 6 neighbors)'}
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_5.png")
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_6.png")
```
```{r param-maps-1ss-models-3, fig.show='hold', out.width='50%', fig.cap='2D Parameterization map for 1 stable state models (8 and 11 neighbors)'}
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_8.png")
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_11.png")
```
```{r param-maps-1ss-models-4, fig.show='hold', out.width='50%', fig.cap='2D Parameterization map for 1 stable state models (12 and 15 neighbors)'}
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_12.png")
knitr::include_graphics(path = "img/1ss_umap/1ss_umap_unsup_15.png")
```
:::{.green-box}
We observe the existence of **two large families (superclusters) of similarly parameterized models**, especially for more *global* views of the dataset ($\text{n_neighbors}\ge 8$).
**Distinct smaller sub-clusters of closely parameterized models** also manifest for different number of neighbors, which suggests that there is some order in the parameterization of the 1 stable state models (as exemplified by the UMAP method) across multiple visualization scales.
:::
## Gitsbe Models on the Map {-}
[Gitsbe](https://druglogics.github.io/druglogics-doc/gitsbe.html) uses a genetic algorithm approach to produce boolean models that are fitted to activity-based, biomarker training data.
We used Gitsbe and tested the produced models' performance (ensemble-wise drug combination predictions) against synergy data from [@Flobak2015] in another report (AGS paper Sup. Material).
The calibrated models performed very well in terms of both ROC and PR-AUC.
:::{.blue-box}
Here we want to check whether models produced by a method such as a genetic algorithm-based one **have similar parameterization** - i.e. they belong in the same neighborhood in the parameterization map.
:::
We will use models from $1000$ gitsbe simulations, calibrated to steady state (a total of $3000$ models, choosing the $3$ best-fit models from each simulation).
The results are provided in [this data file](https://github.com/druglogics/bool-param-maps/blob/master/data/cascade_1.0_ss_1000sim_fixpoints_hsa.tar.gz) and to reproduce them, follow the instructions [here](https://druglogics.github.io/ags-paper/reproduce-data-simulation-results.html), keeping the default configuration options for CASCADE 1.0 and changing only the number of simulations to $1000$.
All the Gitsbe models had a large fitness to the steady state AGS data (their stable states fitting almost exactly the states of the manually-curated 24 nodes), as can be seen in the next figure (see [gitsbe_models_fit.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/gitsbe_models_fit.R)):
```{r gitsbe-fit-fig, fig.cap='Gitsbe model fitness to AGS steady state', fig.height=4, fig.width=6}
knitr::include_graphics(path = "img/gitsbe_fit_density.png")
```
To generate the next figures (same map, same Gitsbe models, different number of neighbors) use the [gitsbe_model_embedding.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/gitsbe_model_embedding.R) script - all figures are available [here](https://github.com/druglogics/bool-param-maps/tree/master/img/gitsbe_umaps):
```{r gitsbe-maps-1, fig.show='hold', out.width='50%', fig.cap='Gitsbe models in Parameterization map (2 and 4 neighbors)'}
knitr::include_graphics(path = "img/gitsbe_umaps/2nn.png")
knitr::include_graphics(path = "img/gitsbe_umaps/4nn.png")
```
```{r gitsbe-maps-2, fig.show='hold', out.width='50%', fig.cap='Gitsbe models in Parameterization map (8 and 14 neighbors)'}
knitr::include_graphics(path = "img/gitsbe_umaps/8nn.png")
knitr::include_graphics(path = "img/gitsbe_umaps/14nn.png")
```
:::{.green-box}
Gitsbe-generated models that fit the biomarker steady state data for the AGS cell line have a **diverse structure that spans across the parameterization map** but nonetheless appear to gather in **smaller parameterization-specific sub-clusters** (better seen in the Figure with 14 neighbors which gives a more global view of the dataset).
Observing the distribution of the Gitsbe models in the parameterization map, we see that most of them are **placed in one of the two superclusters**.
:::
Of course, there are areas in the map that Gitsbe models do not cover, which may well be high-performance model areas.
Since we have generated all possible link-operator models with CASCADE 1.0, we can proceed to generate a performance map (atop the parameterization one) and cross-check if the gitsbe models fall into high-performance areas or not.
## Performance Maps {-}
Every model generated via `abmlog` ([Model Stable State Statistics]) was tested against the synergy dataset of [@Flobak2015].
Among $21$ drug combinations, $4$ were found synergistic in that dataset, namely:
```{r cascade1-obs-synergies, results='asis'}
obs_syn = emba::get_observed_synergies(file = "data/observed_synergies_cascade_1.0")
usefun::pretty_print_vector_values(obs_syn, vector.values.str = "synergies")
```
:::{.blue-box}
Using the [drabme](https://druglogics.github.io/druglogics-doc/drabme.html) software module, we tested every CASCADE 1.0 model against this dataset and got each model's predictions for each drug combination.
The *HSA* rule was used to define if a model is synergistic for a particular combination.
The results are available in the Zenodo dataset (<https://doi.org/10.5281/zenodo.4022783>), file `cascade_1.0_hsa_fixpoints.tar.gz`.
As previously said, we are going to use the 1 stable state model predictions only.
:::
We used the emba R package [@Zobolas2020] to perform a biomarker analysis on the 1 stable state models dataset and their predictions from `drabme` (see script [emba_analysis.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/emba_analysis.R)).
Part of the results from the emba analysis is the calculation of the **Matthews correlation coefficient (MCC) performance score** for each model.
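For reference, the MCC of a model is computed from its confusion counts over the drug combination predictions; a self-contained sketch with hypothetical counts:
```{r mcc-def-sketch}
# MCC from a model's confusion counts (TP, TN, FP, FN):
mcc = function(tp, tn, fp, fn) {
  (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}
mcc(tp = 3, tn = 15, fp = 2, fn = 1) # ~0.58
```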
We use these MCC model scores to draw the next figures (see the [mcc_figures.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/mcc_figures.R) script).
Splitting all the 1 stable state models into $4$ MCC classes, we can see that **a lot of models have performance close to random prediction or even worse**:
```{r mcc-histogram, fig.show='hold', out.width='50%', fig.cap = 'MCC Classes Histograms'}
knitr::include_graphics(path = "img/mcc_hist.png")
knitr::include_graphics(path = "img/mcc_hist2.png")
```
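The class split itself could be derived with univariate k-means - a guess based on the `Ckmeans.1d.dp` package loaded above (`mcc_scores` is a hypothetical vector of the models' MCC values; see the `mcc_figures.R` script for the actual classification):
```{r mcc-class-sketch, eval=FALSE}
# Optimal univariate k-means split of the MCC scores into 4 classes (assumed call):
res = Ckmeans.1d.dp(x = mcc_scores, k = 4)
mcc_classes = res$cluster # class 1 (worst) to class 4 (best)
```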
:::{.green-box}
Most of the 1 stable state models have MCC performance close to random or worse, making it difficult for any algorithm to find the *best* performance models (i.e. the models of the 3rd and 4th MCC classes, which amount to only $5\%$ of the total number of models with 1 stable state, given the standardized boolean equation parameterization inspired by [@Mendoza2006]).
:::
### Unsupervised UMAP {-}
If we draw the parameterization maps for different number of neighbors and **color the points/models according to their MCC score**, we get these images (to see all figures, check [here](https://github.com/druglogics/bool-param-maps/tree/master/img/mcc_maps)):
```{r mcc-maps-1, fig.show='hold', out.width='50%', fig.cap='MCC Parameterization map (2 and 14 neighbors)'}
knitr::include_graphics(path = "img/mcc_maps/2nn.png")
knitr::include_graphics(path = "img/mcc_maps/14nn.png")
```
:::{.green-box}
We observe that the two large parameterization superclusters (especially outlined by the figures with $2$ and $\ge 8$ neighbors) are closely related to the MCC performance metric.
Specifically, these **2 superclusters dichotomize the models' performance landscape** into 2 areas, where only one of them has the majority of good performance models (i.e. those with an MCC score $>0$).
Also, we observe that closely parameterized models tend to have the same performance (**existence of smaller parameterization clusters of same-performance models**).
:::
:::{.orange-box}
Comparing the corresponding **Gitsbe parameterization maps with the MCC maps**, we can now clearly verify that the Gitsbe models **might not always be the best performing ones** (in terms of MCC score) since they appear in both superclusters (and that is only a coarse, first-pass measure for determining performance based on structure) - but at least they **tend to lie more in the higher performance supercluster**.
Possible reasons why the Gitsbe models are not always in the high-performance supercluster:
- The nature of the genetic algorithm, which is a *heuristic* method and as such produces *local* optima
- The training dataset does not provide enough *constraints* for the optimization problem, i.e. more steady state nodes need to be fitted in the models' stable states, which would result in models with stricter parameterization patterns
- A combination of the two above
:::
### Supervised UMAP {-}
:::{.blue-box}
We perform **supervised dimension reduction** using UMAP on the 1 stable state model dataset.
The only difference with the previous performance maps is that we provide UMAP with the models' performance (the MCC score) as input, rather than just overlaying it (via coloring) afterwards.
The main question here is **whether UMAP can do a better job placing the models on the map**, given this additional information about their performance.
:::
We tried UMAP with a different *number of neighbors* (from a local to a more global view of the dataset), different values of the *target weight* (the weighting factor between data topology ($0$) and target/response topology ($1$), or somewhere in between) and the response set either to the MCC score as a continuous, numeric value or to the MCC classes (a discrete variable) as [shown above](#fig:mcc-histogram).
See script [mcc_sumap.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/mcc_sumap.R) for more details.
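A minimal sketch of the two assumed calls (`lo_mat_1ss`, `mcc_scores` and `mcc_classes` are hypothetical objects; see the script for the exact code):
```{r mcc-sumap-sketch, eval=FALSE}
# MCC as a continuous response:
sumap_cont = uwot::umap(X = lo_mat_1ss, y = mcc_scores,
  n_neighbors = 14, target_weight = 0.5)
# MCC as a discrete class response:
sumap_class = uwot::umap(X = lo_mat_1ss, y = as.factor(mcc_classes),
  n_neighbors = 14, target_weight = 0.5)
```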
First, we show some maps, where the response **MCC was treated as a continuous variable**:
```{r mcc-sumaps-1, fig.show='hold', out.width='33%', fig.cap='MCC Parameterization map (4 neighbors, MCC continuous)'}
knitr::include_graphics(path = "img/mcc_sumaps/4nn_0w.png")
knitr::include_graphics(path = "img/mcc_sumaps/4nn_0.5w.png")
knitr::include_graphics(path = "img/mcc_sumaps/4nn_1w.png")
```
```{r mcc-sumaps-2, fig.show='hold', out.width='50%', fig.cap='MCC Parameterization map (14 neighbors, MCC continuous)'}
knitr::include_graphics(path = "img/mcc_sumaps/14nn_0.5w.png")
knitr::include_graphics(path = "img/mcc_sumaps/14nn_1w.png")
```
```{r mcc-sumaps-3, fig.show='hold', out.width='50%', fig.cap='MCC Parameterization map (20 neighbors, MCC continuous)'}
knitr::include_graphics(path = "img/mcc_sumaps/20nn_0.5w.png")
knitr::include_graphics(path = "img/mcc_sumaps/20nn_1w.png")
```
:::{.green-box}
Very artistic and knotty figures!
UMAP achieves some sort of distinction between the different MCC values, especially when the topology is heavily based on the response MCC value (*target weight* is $1$).
:::
Next, we show some maps where the response **MCC was treated as a categorical variable**, denoting the MCC class the models belong to (the higher, the better):
```{r mcc-sumaps-4, fig.show='hold', out.width='50%', fig.cap='MCC Parameterization map (4 and 6 neighbors, MCC Classes)'}
knitr::include_graphics(path = "img/mcc_sumaps/class/4nn_0.5w_class.png")
knitr::include_graphics(path = "img/mcc_sumaps/class/6nn_0.5w_class.png")
```
```{r mcc-sumaps-5, fig.show='hold', out.width='50%', fig.cap='MCC Parameterization map (8 and 10 neighbors, MCC Classes)'}
knitr::include_graphics(path = "img/mcc_sumaps/class/8nn_0.5w_class.png")
knitr::include_graphics(path = "img/mcc_sumaps/class/10nn_0.5w_class.png")
```
```{r mcc-sumaps-6, fig.show='hold', out.width='33%', fig.cap='MCC Parameterization map (14 neighbors, MCC Classes)'}
knitr::include_graphics(path = "img/mcc_sumaps/class/14nn_0w_class.png")
knitr::include_graphics(path = "img/mcc_sumaps/class/14nn_0.5w_class.png")
knitr::include_graphics(path = "img/mcc_sumaps/class/14nn_1w_class.png")
```
:::{.green-box}
We observe that the higher the number of neighbors ($\ge 10$, with a balanced *target weight* of $0.5$), the better UMAP separates the models into distinct (super)clusters representing the different MCC classes.
:::
## Performance vs Fitness {-}
:::{.blue-box}
We try to see if there is any correlation between **performance (MCC)** and **fitness to the AGS steady state**.
We study **two AGS steady states**: one with a total of $24$ nodes, which is the result of manual literature curation (S4 Table in @Flobak2015) and serves as the *gold-standard*, and one with all $77$ nodes, which is the single fixpoint of the AGS boolean model in @Flobak2015.
Note that the second, larger steady state has the same activity values as the first one for the common $24$ nodes.
Use the [fit_figures.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/fit_figures.R) script to reproduce the figures for the literature-curated steady state and the [fit_vs_performance_ags_pub.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/fit_vs_performance_ags_pub.R) for the AGS model fixpoint/steady state.
:::
First, we check if the fitness density of all 1 stable state models covers the whole *fitness spectrum*:
```{r 1ss-models-fit-density-fig, fig.show='hold', out.width='50%', fig.cap='All 1 stable state models fitness to AGS steady state(s)'}
knitr::include_graphics(path = "img/1ss_models_fit_density.png")
knitr::include_graphics(path = "img/1ss_models_fit_density_pub.png")
```
The Pearson correlation between the two fitness vectors is high:
```{r fit-cor-ags, cache = TRUE, results='asis'}
fit_data = readRDS(file = "data/fit_data.rds")
fit_data_pub = readRDS(file = "data/fit_data_pub.rds")
# data check
stopifnot(all(names(fit_data) == names(fit_data_pub)))
# Pearson correlation
cor(x = fit_data, y = fit_data_pub, method = "pearson") %>% round(digits = 2)
```
:::{.green-box}
We observe that **most of the models have at least half of the nodes** in the same state as in the AGS steady state, no matter which steady state we fit the stable state data against.
This makes the **fitness score distribution skewed** towards the higher fitness values.
Of course, although the two steady states completely agree on the values of the common $24$ nodes, the remaining nodes might in reality have activity values that differ slightly from the ones found in the fixpoint attractor of the AGS model in @Flobak2015.
If that is the case, the fitness distribution might be more uniform.
:::
We follow the same classification scheme as [above](#fig:mcc-histogram), i.e. splitting the 1 stable state models into $4$ MCC classes and comparing the fitness scores between these classes:
```{r fit-vs-perf-fig, fig.show='hold', out.width='50%', fig.cap='MCC performance vs Fitness to AGS steady state(s) for all 1 stable state models. Left is fitness to the AGS-curated steady state (24 nodes), right is for the full, 77-node steady state.'}
knitr::include_graphics(path = "img/mcc_vs_fit.png")
knitr::include_graphics(path = "img/mcc_vs_fit_pub.png")
```
:::{.green-box}
The last MCC class has a **statistically significantly higher median fitness** to the AGS steady state compared to all lower MCC classes, even though the correlation between fitness and performance is not linear and most of the models have a fitness higher than $0.5$.
Note also that the models with fitness significantly lower than $0.5$ in each steady state case belong to the first two, lower performance classes.
The last two figures point to the fact that more constraints are needed in the fitness calculation of the Gitsbe genetic algorithm (or any other, for that matter) - i.e. more literature-curated nodes in the AGS steady state, where now only $24/77\approx31\%$ are included - in order to define more restrictively parameterized models that would allow a much more *uniform* fitness density spectrum (or at least one **not skewed towards the higher fitness values**).
Such a fitness spectrum would (hopefully) allow for more granularity in the corresponding performance behavior between the different MCC classes, and thus a more distinctive correlation.
:::
## Fitness Maps {-}
If we draw the parameterization maps for different number of neighbors and **color the points/models according to their fitness to the AGS steady state (24 nodes)**, we get these images (see [fit_figures.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/fit_figures.R) script):
```{r fit-maps-1, fig.show='hold', out.width='50%', fig.cap='Fitness Parameterization map (2 and 4 neighbors)'}
knitr::include_graphics(path = "img/fit_maps/2nn.png")
knitr::include_graphics(path = "img/fit_maps/4nn.png")
```
```{r fit-maps-2, fig.show='hold', out.width='50%', fig.cap='Fitness Parameterization map (11 and 14 neighbors)'}
knitr::include_graphics(path = "img/fit_maps/11nn.png")
knitr::include_graphics(path = "img/fit_maps/14nn.png")
```
:::{.green-box}
Higher fitness models manifest in both superclusters, suggesting again the need for more literature-curated nodes in the training data (AGS steady state).
Also, closely parameterized models tend to have the same fitness (**existence of smaller parameterization clusters of same-fitness models**).
No apparent correlation can be observed between fitness and performance (MCC) maps.
:::
## Performance biomarkers {-}
:::{.blue-box}
We assess **important nodes (biomarkers)** whose *activity* and/or link-operator value (*parameterization*) affects the models' performance in terms of the achieved MCC score.
:::
### emba biomarkers {-}
We use the `emba` results with $4$ MCC classes (see script [emba_analysis.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/emba_analysis.R) and MCC classification histograms [above](#fig:mcc-histogram)).
What emba [@Zobolas2020] does is group the models into $4$ MCC classes and compare the average stable state activities and link operator values of each node between **every possible pair of MCC classes**.
Nodes with **higher average differences** (positive or negative, in any comparison pair) are considered **more important** and are thus annotated as *biomarkers* based on a given threshold.
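A toy illustration of the average state difference of one node between two MCC classes (hypothetical stable state values across $4$ models per class):
```{r state-diff-sketch}
# The node's stable state values in 4 models of the higher and the lower MCC class:
ss_higher_class = c(1, 1, 0, 1) # active in 3/4 models
ss_lower_class = c(0, 1, 0, 0) # active in 1/4 models
mean(ss_higher_class) - mean(ss_lower_class) # 0.5 => candidate 'active' biomarker
```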
In the next heatmap, a positive (resp. negative) state difference for a node denotes that its activity value was larger (resp. lower) in the higher MCC class of the corresponding comparison pair:
```{r emba-activity-biomarkers, cache = TRUE, fig.width=7, fig.height=5, fig.align='center', fig.cap='Heatmap of Average State Differences (emba, 4 MCC Classes)'}
mcc_res = readRDS(file = "data/emba_mcc_res/mcc_4_res.rds")
# color some important nodes and lower the threshold
activity_biomarkers = emba::get_biomarkers(mcc_res$diff.state.mcc.mat, threshold = 0.48)
imp_nodes_colors = rep("black", ncol(mcc_res$diff.state.mcc.mat))
names(imp_nodes_colors) = colnames(mcc_res$diff.state.mcc.mat)
imp_nodes_colors[names(imp_nodes_colors) %in% activity_biomarkers$biomarkers.pos] = "green4"
imp_nodes_colors[names(imp_nodes_colors) %in% activity_biomarkers$biomarkers.neg] = "red4"
# define coloring function of state differences
col_fun = circlize::colorRamp2(c(min(mcc_res$diff.state.mcc.mat), 0, max(mcc_res$diff.state.mcc.mat)),
c("red", "white", "green"))
set.seed(42)
state_heat = ComplexHeatmap::Heatmap(matrix = mcc_res$diff.state.mcc.mat,
name = "State Difference", col = col_fun,
row_title = "MCC Classes Pairs", row_names_side = "left",
row_title_side = "left", row_dend_side = "right",
column_title = "Average State Differences between MCC Classes",
column_names_gp = gpar(fontsize = 6, col = imp_nodes_colors),
heatmap_legend_param = list(direction = "horizontal"))
biomarkers_legend = Legend(title = "Activity State Biomarkers",
labels = c("Active", "Inhibited"),
legend_gp = gpar(fill = c("green4", "red4")))
draw(state_heat, annotation_legend_list = biomarkers_legend,
heatmap_legend_side = "bottom", annotation_legend_side = "bottom",
merge_legend = TRUE)
```
So, the following nodes are identified as *active* biomarkers (they tend to be active in the higher performance models):
```{r active-state-biomarkers, results='asis'}
usefun::pretty_print_vector_values(activity_biomarkers$biomarkers.pos)
```
and the following as *inhibited* biomarkers:
```{r inhibited-state-biomarkers, results='asis'}
usefun::pretty_print_vector_values(activity_biomarkers$biomarkers.neg)
```
Note that some nodes tend to have either no actual state differences between the MCC classes (so they are not important) or they take both negative and positive differences between the different MCC class pair comparisons (without surpassing the given threshold, so also not important for us here).
```{r emba-lo-biomarkers, cache = TRUE, fig.width=7, fig.height=5, fig.align='center', fig.cap='Heatmap of Average Link Operator Differences (emba, 4 MCC Classes)'}
# color some important nodes and lower the threshold
param_biomarkers = emba::get_biomarkers(mcc_res$diff.link.mcc.mat, threshold = 0.48)
imp_nodes_colors = rep("black", ncol(mcc_res$diff.link.mcc.mat))
names(imp_nodes_colors) = colnames(mcc_res$diff.link.mcc.mat)
imp_nodes_colors[names(imp_nodes_colors) %in% param_biomarkers$biomarkers.pos] = "green4"
imp_nodes_colors[names(imp_nodes_colors) %in% param_biomarkers$biomarkers.neg] = "red4"
# define coloring function of state differences
col_fun = circlize::colorRamp2(c(min(mcc_res$diff.link.mcc.mat), 0, max(mcc_res$diff.link.mcc.mat)),
c("red", "white", "green"))
set.seed(42)
state_heat = ComplexHeatmap::Heatmap(matrix = mcc_res$diff.link.mcc.mat,
name = "LO Difference", col = col_fun,
row_title = "MCC Classes Pairs", row_names_side = "left",
row_title_side = "left", row_dend_side = "right",
column_title = "Average Link Operator Differences between MCC Classes",
column_names_gp = gpar(fontsize = 11, col = imp_nodes_colors),
heatmap_legend_param = list(direction = "horizontal"))
biomarkers_legend = Legend(title = "Link Operator Biomarkers",
labels = c("OR-NOT (1)", "AND-NOT (0)"),
legend_gp = gpar(fill = c("green4", "red4")))
draw(state_heat, annotation_legend_list = biomarkers_legend,
heatmap_legend_side = "bottom", annotation_legend_side = "bottom",
merge_legend = TRUE)
```
:::{.green-box}
We observe that `ERK_f` will have the `OR-NOT` link operator in most of the higher performance models and `MAPK14` the `AND-NOT` link operator.
This of course relates to the fact that these nodes were also found as *active* and *inhibited* biomarkers respectively, and that they show a very large agreement between stable state activity value and link operator parameterization (see [analysis here](https://druglogics.github.io/brf-bias/cascade-1-0-data-analysis.html#fig:ss-lo-agreement-prop)).
Interestingly, these two nodes (`ERK_f` and `MAPK14`) were $2$ of the top most important nodes influencing the change of dynamics (number of attractors) in the link operator parameterization space of the CASCADE 1.0 network.
:::
### Random Forest biomarkers {-}
We use the `ranger` R package [@Wright2017] to find **important nodes/variables** that determine the difference in performance (MCC score) between the input models.
Both the stable state data as well as the link operator parameterization data will be used as training sets (see the script [perf_biomarkers_ranger.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/perf_biomarkers_ranger.R)).
```{r rf-mcc-state-imp, warning=FALSE, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest Important Nodes for Performance Classification (MCC) based on stable state data'}
ranger_ss_mcc_res = readRDS(file = "data/ss_mcc_ranger_res.rds")
imp_ss_rf = tibble(nodes = names(ranger_ss_mcc_res$variable.importance),
gini_index = ranger_ss_mcc_res$variable.importance)
# color first 14 nodes in x axis
imp_col = c(rep("green4", 14), rep("grey30", nrow(imp_ss_rf) - 14))
imp_ss_rf %>%
mutate(nodes = forcats::fct_reorder(nodes, desc(gini_index))) %>%
ggplot(aes(x = nodes, y = gini_index, fill = gini_index)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "steelblue", high = "red") +
theme_classic(base_size = 14) +
theme(axis.text.x = element_text(angle = 90, size = 7, colour = imp_col)) +
labs(title = "RF Importance for MCC Classification (stable states)",
x = "Nodes", y = "Mean Decrease in Node Impurity")
```
:::{.green-box}
We observe many common important activity biomarkers (important nodes) between the random forest and the `emba` results: `ERK_f`, `MAP3K7`, `MAP2K7`, `NLK`, `TAB_f`, to name a few.
:::
```{r rf-mcc-param-imp, warning=FALSE, cache=TRUE, dpi=300, fig.align='center', fig.cap='Random Forest Important Nodes for Performance Classification (MCC) based on link operator data'}
ranger_lo_mcc_res = readRDS(file = "data/lo_mcc_ranger_res.rds")
imp_lo_rf = tibble(nodes = names(ranger_lo_mcc_res$variable.importance),
gini_index = ranger_lo_mcc_res$variable.importance)
# color first 8 nodes in x axis
imp_col = c(rep("green4", 8), rep("grey30", nrow(imp_lo_rf) - 8))
imp_lo_rf %>%
mutate(nodes = forcats::fct_reorder(nodes, desc(gini_index))) %>%
ggplot(aes(x = nodes, y = gini_index, fill = gini_index)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_gradient(low = "steelblue", high = "red") +
theme_classic(base_size = 14) +
theme(axis.text.x = element_text(angle = 90, size = 12, colour = imp_col)) +
labs(title = "RF Importance for MCC Classification (parameterization)",
x = "Nodes", y = "Mean Decrease in Node Impurity")
```
:::{.green-box}
`ERK_f` is a major link operator biomarker in both the `emba` and the random forest results.
:::
### emba vs RF Results {-}
We check the correlation between `emba` (sum of absolute state differences for all comparison pairs per node, i.e. summing up the absolute values of the columns in the heatmaps above) and random forest (Gini Index) results:
```{r cor-emba-rf, cache=TRUE, dpi=300, fig.show='hold', out.width='50%', fig.cap="Correlation between emba and Random Forest Importance Results"}
# compute total emba importance
imp_ss_emba = mcc_res$diff.state.mcc.mat %>% abs() %>% colSums()
imp_lo_emba = mcc_res$diff.link.mcc.mat %>% abs() %>% colSums()
# make sure it's in the same order
imp_ss_emba = imp_ss_emba[imp_ss_rf %>% pull(nodes)]
imp_lo_emba = imp_lo_emba[imp_lo_rf %>% pull(nodes)]
imp_ss_rf %>% add_column(emba_score = imp_ss_emba) %>%
ggplot(aes(x = emba_score, y = gini_index)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm") +
ggpubr::stat_cor(method = "kendall", cor.coef.name = "tau") +
theme_classic() +
labs(title = "Random Forest vs emba variable importance results (Activity State)",
x = "emba (Sum of absolute state differences)", y = "RF (Gini Index)")
imp_lo_rf %>% add_column(emba_score = imp_lo_emba) %>%
ggplot(aes(x = emba_score, y = gini_index)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm") +
ggpubr::stat_cor(method = "kendall", cor.coef.name = "tau") +
theme_classic() +
labs(title = "Random Forest vs emba variable importance results (Parameterization)",
x = "emba (Sum of absolute link operator differences)", y = "RF (Gini Index)")
```
:::{.green-box}
- We use the non-parametric Kendall correlation method and its rank-based correlation coefficient, $\tau$.
- There is **a correlation between the importance results from the two different methods**, even though the `emba` scores are more *artificially constructed* (like adding up absolute coefficient scores from linear models as a measure of importance) compared to the impurity metric, which is a *built-in* feature of tree-based algorithms such as random forests.
As a result, for example, we observe much greater granularity of the importance measurements in the activity state correlation figure.
- `emba` results are **more informative** though, in the sense that they allow *directionality* in their interpretation: we get which nodes must be activated or inhibited in the higher performance models, or which should be parameterized with `OR-NOT` or `AND-NOT` - information which the random forests do not provide.
- **Larger correlation** is observed **for the activity state** compared to the parameterization results.
:::
### Embedding Link Operator Biomarkers in the Map {-}
:::{.blue-box}
Using as a base the **performance parameterization maps (supervised and unsupervised)** for the models with 1 stable state, we color the points (models) according to the **link operator value** of some of the **performance biomarkers** found in the previous analysis with `emba` and random forests.
:::
We give some examples of what the distribution of link operator values looks like in the parameterization maps, both for the more important nodes (e.g. `ERK_f`, `MAPK14`) and for the least important ones (e.g. `CYCS`, `CFLAR`).
See the script [perf_biomarkers_embedding.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/perf_biomarkers_embedding.R) for more details; a sketch of the coloring logic follows below.
All the images produced by the script are accessible [here](https://github.com/druglogics/bool-param-maps/tree/master/img/nodes_lo_maps).
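The coloring itself is straightforward; a minimal sketch, assuming `lo_umap` holds the 2D UMAP embedding of the models and `lo_mat` the models $\times$ nodes link operator matrix ($0$ = `AND-NOT`, $1$ = `OR-NOT`). Both object names are hypothetical stand-ins for what the script builds:
```{r node-lo-map-sketch, eval=FALSE}
library(dplyr)
library(ggplot2)

node = "ERK_f"
tibble(X = lo_umap[, 1], Y = lo_umap[, 2],
  lo = factor(lo_mat[, node], levels = c(0, 1),
    labels = c("AND-NOT", "OR-NOT"))) %>%
  ggplot(aes(x = X, y = Y, color = lo)) +
  geom_point(size = 0.1) + # one point per model, colored by link operator
  labs(title = paste0(node, " link operator parameterization map"),
    color = "Link operator") +
  theme_classic()
```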
Using as a base the **unsupervised parameterization MAP with 14 neighbors**, we have:
```{r nodes-lo-maps-1, fig.show='hold', out.width='50%', fig.cap='MCC Unsupervised Parameterization maps colored by link-operator values of different performance biomarkers'}
knitr::include_graphics(path = "img/nodes_lo_maps/unsup_MAPK14.png")
knitr::include_graphics(path = "img/nodes_lo_maps/unsup_ERK_f.png")
knitr::include_graphics(path = "img/nodes_lo_maps/unsup_CFLAR.png")
knitr::include_graphics(path = "img/nodes_lo_maps/unsup_CYCS.png")
```
Using as a base the **supervised parameterization MAP with 14 neighbors (MCC as continuous variable)**, we have:
```{r nodes-lo-maps-2, fig.show='hold', out.width='50%', fig.cap='MCC Supervised Parameterization maps (MCC continuous) colored by link-operator values of different performance biomarkers'}
knitr::include_graphics(path = "img/nodes_lo_maps/sup_cont_MAPK14.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_cont_ERK_f.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_cont_CFLAR.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_cont_CYCS.png")
```
Using as a base the **supervised parameterization MAP with 14 neighbors (MCC as discrete class variable)**, we have:
```{r nodes-lo-maps-3, fig.show='hold', out.width='50%', fig.cap='MCC Supervised Parameterization maps (MCC discrete) colored by link-operator values of different performance biomarkers'}
knitr::include_graphics(path = "img/nodes_lo_maps/sup_MAPK14.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_ERK_f.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_CFLAR.png")
knitr::include_graphics(path = "img/nodes_lo_maps/sup_CYCS.png")
```
:::{.green-box}
We observe that, both in the unsupervised parameterization maps and in the MCC-supervised ones (especially the continuous one), **the less important a node is, the more chaotically its link operator value manifests across the map**.
:::
## Synergy Maps {-}
As stated in a [previous section](#performance-maps), the CASCADE 1.0 models produced by `abmlog` were tested for synergy against the drug combination dataset in @Flobak2015.
Using the *HSA* method to classify a drug combination as synergistic or antagonistic (a sketch of the rule follows below), we first present some useful statistics (see the script [synergy_maps.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/synergy_maps.R)):
```{r syn-stats, fig.cap='Synergy Statistics across all CASCADE 1.0 models with 1 stable state'}
knitr::include_graphics(path = 'img/synergy_stats.png')
```
:::{.orange-box}
For the 'NA' models, either one of the two single-drug perturbed models or the double-perturbed model did not have any fixpoint attractors.
:::
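The HSA rule itself reduces to a simple comparison; a minimal sketch, assuming each perturbed model is scored by a global output value where lower means less growth (the function and argument names are hypothetical):
```{r hsa-rule-sketch, eval=FALSE}
# HSA classification per model and drug pair: the combination is
# synergistic if the double perturbation scores lower (less growth)
# than the best (lowest) of the two single perturbations
hsa_class = function(g_ab, g_a, g_b) {
  # no fixpoint attractor in any of the three perturbed models => 'NA'
  if (anyNA(c(g_ab, g_a, g_b))) return(NA_character_)
  if (g_ab < min(g_a, g_b)) "synergistic" else "antagonistic"
}

hsa_class(g_ab = 0.2, g_a = 0.5, g_b = 0.4) # "synergistic"
```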
We observe that **only very few models** (compared to the total parameterization space) can predict the observed drug combinations as *synergistic*.
This can also be visualized in the next synergy maps:
```{r syn-maps-1, fig.show='hold', out.width='50%', fig.cap='AK-5Z Synergy Parameterization Maps'}
knitr::include_graphics(path = "img/synergy_maps/AK-5Z_2nn.png")
knitr::include_graphics(path = "img/synergy_maps/AK-5Z_14nn.png")
```
```{r syn-maps-2, fig.show='hold', out.width='50%', fig.cap='PD-AK Synergy Parameterization Maps'}
knitr::include_graphics(path = "img/synergy_maps/PD-AK_2nn.png")
knitr::include_graphics(path = "img/synergy_maps/PD-AK_14nn.png")
```
```{r syn-maps-3, fig.show='hold', out.width='50%', fig.cap='PI-PD Synergy Parameterization Maps'}
knitr::include_graphics(path = "img/synergy_maps/PI-PD_2nn.png")
knitr::include_graphics(path = "img/synergy_maps/PI-PD_14nn.png")
```
```{r syn-maps-4, fig.show='hold', out.width='50%', fig.cap='PI-PZ Synergy Parameterization Maps'}
knitr::include_graphics(path = "img/synergy_maps/PI-5Z_2nn.png")
knitr::include_graphics(path = "img/synergy_maps/PI-5Z_14nn.png")
```
:::{.green-box}
As we have seen, the UMAP results with a higher number of neighbors cluster the similarly parameterized models better.
This makes it easier to spot the **small clusters** of models that predict each of the respective observed synergies as *synergistic*.
We note that all of these **synergistic sub-clusters** are part of the **high-performance supercluster** (the right one in the images above).
:::
Interestingly, if we consider **all possible observed synergy subsets**, we see that there are models that can predict the double synergy sets `PI-PD,PD-AK` and `PI-5Z,AK-5Z`, but there is **no model at all that can predict three or four (all) of the observed synergies**:
```{r syn-subset-stats, results='asis'}
synergy_subset_stats = readRDS(file = "data/synergy_subset_stats.rds")
usefun::pretty_print_vector_names_and_values(synergy_subset_stats)
```
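Such subset statistics can be derived by labeling each model with the exact set of synergies it predicts and counting the labels; a minimal sketch on a toy logical matrix (`syn_mat` and its contents are hypothetical):
```{r syn-subset-sketch, eval=FALSE}
# toy models x drug-pairs matrix: TRUE = model predicts the pair as synergistic
set.seed(42)
syn_mat = matrix(sample(c(TRUE, FALSE), size = 40, replace = TRUE), ncol = 4,
  dimnames = list(NULL, c("PI-PD", "PI-5Z", "PD-AK", "AK-5Z")))

# label every model with its predicted synergy subset and count the models
subset_labels = apply(syn_mat, 1, function(row)
  paste(colnames(syn_mat)[row], collapse = ","))
table(subset_labels[subset_labels != ""])
```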
These numbers are further evidence that this parameterization scheme is *restrictive*: there is not even one parameterization that represents the biological reality (i.e. that treatment with the four aforementioned drug combinations is synergistic).
Also, looking at the number of models for each set of synergies, we see that the models predicting the `PD-AK` synergy form a proper subset of the `PI-PD,PD-AK` models, and similarly for `PI-5Z` and `PI-5Z,AK-5Z`.
Thus we include only the parent sets in the next synergy maps, where we have combined all synergies in one:
```{r syn-maps-5, fig.show='hold', out.width='50%', fig.cap='Combined Synergy Parameterization Maps'}
knitr::include_graphics(path = "img/synergy_maps/all_syn_11nn.png")
knitr::include_graphics(path = "img/synergy_maps/all_syn_14nn.png")
```
:::{.green-box}
We observe that there are **two synergy sub-clusters**, one primarily related to the `PD` drug and the other to the `5Z` drug.
Since the synergies belong to $2$ mutually exclusive (in terms of parameterization) clusters, this is a visual cue that there cannot be a model predicting all $4$ of these synergies.
:::
## Stable State Patterns in Synergistic models {-}
:::{.blue-box}
We want to identify if there are **activity patterns** in the models that predict the observed synergies.
To do this, we take a random sample of the synergistic models that predict each of the $4$ observed synergies and show the corresponding heatmaps of their stable state activities.
More details can be found in the [synergy_heatmaps.R](https://github.com/druglogics/bool-param-maps/blob/master/scripts/synergy_heatmaps.R) script; a sketch of the heatmap logic follows below.
:::
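A minimal sketch of how such a heatmap can be produced, assuming `ss_mat` is the models $\times$ nodes stable state matrix and `pd_ak_models` the names of the models predicting the `PD-AK` synergy; both names are hypothetical stand-ins for the objects built in the script:
```{r synergy-heatmap-sketch, eval=FALSE}
library(ComplexHeatmap)

# random sample of synergistic models and their stable state activities
set.seed(42)
sample_models = sample(x = pd_ak_models, size = 500)

Heatmap(ss_mat[sample_models, ], name = "Activity State",
  col = c("red", "green4"), # 0 = inhibited, 1 = active
  show_row_names = FALSE, show_row_dend = FALSE,
  column_names_gp = grid::gpar(fontsize = 6),
  column_title = "PD-AK synergistic models: stable state activity")
```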
```{r synergy-heatmaps, fig.cap='PD-AK stable state activity heatmap'}
knitr::include_graphics(path = "img/synergy_heatmaps/PD-AK_ss_heat.png")
```
```{r synergy-heatmaps-2, fig.cap='PI-PD stable state activity heatmap'}
knitr::include_graphics(path = "img/synergy_heatmaps/PI-PD_ss_heat.png")
```
:::{.green-box}
- All the models that predict the `PD-AK` synergy show one **distinct activity state pattern**.
- The `PI-PD` heatmap is more *heterogeneous* (though patterns still exist).
The reason might be that the corresponding parameterization is more spread out across the map in the combined synergy [figure above](#fig:syn-maps-5), i.e. some of these models lie in the low-performance supercluster.
:::
```{r synergy-heatmaps-3, fig.cap='AK-5Z stable state activity heatmaps'}
knitr::include_graphics(path = "img/synergy_heatmaps/AK-5Z_ss_heat.png")
```
```{r synergy-heatmaps-4, fig.cap='PI-5Z stable state activity heatmaps'}
knitr::include_graphics(path = "img/synergy_heatmaps/PI-5Z_ss_heat.png")
```
:::{.green-box}
All the synergistic models in the `5Z` sub-cluster seem to follow the **same stable state activity patterns** (we get pretty much the same node names in the columns after the clustering in the heatmaps).
There exist $2$ **main such patterns**, specified by the state vector (ON,{ON/OFF},{OFF/ON},ON), where ON and OFF denote vectors of *active*/*inhibited* nodes that are clustered together.
:::
Note that in all $4$ heatmaps above, the `Prosurvival` node is always *active*, denoting proliferative activity patterns in the boolean models where the synergies can manifest.