%\documentclass{article}
\documentclass[12pt]{article}
\usepackage{times}
%\usepackage{natbib}
%\usepackage{multicol}
\RequirePackage{natbib}
\usepackage{amsmath, amssymb, fullpage, amsthm, array, algorithm2e,graphicx,asa}
%\usepackage[dvips]{graphics}
%\usepackage{hyperref} % for hyper reference
\graphicspath{{images/}}
% \usepackage{pifont} % this package is used to print check mark \checkmark
% \linespread{1.6} % factor 1.6 = double space
\usepackage{setspace, url}
\doublespacing
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\topmargin}{-0.4in}
\setlength{\textheight}{9in}
\setlength{\evensidemargin}{\oddsidemargin} % keep even- and odd-page margins equal
\newtheorem{thm}{Theorem}[section]
\newtheorem{dfn}{Definition}[section]
\newtheorem{cor}{Corollary}[thm]
\newtheorem{con}{Conjecture}[thm]
\newtheorem{lemma}[thm]{Lemma}
%\topmargin -0.10in % when making pdf
%\textheight 9.15in % when making pdf
\pdfminorversion=4 % as instructed by JASA file upload
\begin{document}
% Article top matter
\title{Validation of Visual Statistical Inference, Applied to Linear Models}
%\author{{Mahbubul Majumder, Heike Hofmann, Dianne Cook}
%\thanks{Mahbubul Majumder is a PhD student (e-mail: mahbub72@gmail.com) , Heike Hofmann is an Associate Professor and Dianne Cook is a Professor in the Department of Statistics and Statistical Laboratory, Iowa State University, Ames, IA 50011-1210. This research is supported in part by the National Science Foundation Grant \# DMS 1007697.}}
\date{\vspace{-.7in}}
%\date{\today} %\today is replaced with the current date
\maketitle
\begin{abstract}
Statistical graphics play a crucial role in exploratory data analysis, model checking and diagnosis. A new method, called the lineup protocol, enables determining the statistical significance of visual findings, bridging a wide gulf between exploratory and inferential statistics. In this paper, inferential methods for statistical graphics are developed further by refining the terminology of visual inference, and by framing the lineup protocol in a context that allows direct comparison with conventional tests, in scenarios where a conventional test exists. This framework is used to compare the performance of the lineup protocol against conventional statistical testing in the scenario of fitting linear models. A human-subjects experiment is conducted using simulated data to provide controlled conditions. Results suggest that the lineup protocol gives results comparable to the conventional tests and, as expected, outperforms them when the data are contaminated, a scenario in which the assumptions required for a conventional test are violated. Moreover, some super-individuals yield better power than the conventional test even on the most difficult tasks.
{\bf Keywords: \sf statistical graphics, lineup, non-parametric test, data mining, visualization, exploratory data analysis}
\end{abstract}
%\begin{multicols}{2}
%\twocolumn
\section{Introduction}
%\checkmark
Statistical graphics nourish the discovery process in data analysis by revealing the unexpected, uncovering structure that was not anticipated, or, conversely, by contradicting prevailing hypotheses. Graphics are often associated with exploratory data analysis, pioneered by \cite{tukey:eda}, and are particularly pertinent in today's data-rich world, where discovery during data mining has become an important activity. Graphics are also used in many places where numerical summaries simply do not suffice: model checking, diagnosis, and the communication of findings.
Statistical graphics research and development is now well established, cemented by several advances in recent years. Early studies on how well statistical plots are perceived and read by the human eye \citep{cleveland:1984} have been repeated and expanded \citep{simkin:1987,spence:1991,heer:2010}, with findings supporting the original results. This body of work provides a framework for evaluating new statistical graphics. In a complementary direction, research on formalizing statistical graphics with language characteristics makes it easier to abstractly define, compare and contrast data plots: \cite{wilkinson:1999} developed a grammar of graphics, which was enhanced by \cite{hadley:2009}. These methods provide a mechanism to abstract the way data are mapped to graphical form. Finally, technological advances, particularly software systems such as R \citep{R}, make it easy for everyone to draw beautiful data graphics that can be tightly coupled with statistical modeling.
However, measuring the strength of patterns seen in plots, and differences in individual perceptual ability, is difficult, and this perhaps handicaps the use of graphics among statisticians, for whom measuring probabilities is of primary importance. This, too, has been addressed in recent research. \citet{buja:2009} propose a protocol that allows the testing of discoveries made from statistical graphics. This work represents a major advance for graphics, because it bridges the gulf between conventional statistical inference procedures and exploratory data analysis. One of the protocols, the lineup, places the actual data plot among a page of plots of null data, and asks a human judge to pick the plot that is most different. Figure \ref{fig:test_category} shows an example lineup. Which plot do you think is the most different from the others? (The position of the actual data plot is provided in Section \ref{sec:category}.) Wrapped in a process that mirrors conventional inference, with an explicit, a priori, null hypothesis, picking the plot of the data out of the null plots represents a rejection of that null hypothesis. The null hypothesis typically derives from the task at hand, or the type of plot being made. The alternative encompasses all types of patterns that might be detected in the actual data plot, accounting for all possible deviations from the null without the requirement to specify them ahead of time. The probability of rejection can be quantified, along with Type I and Type II errors, and $p$-value and power can be defined and estimated.
\begin{figure}[htp]
%\begin{figurehere}
\centering
\includegraphics[width=0.95\textwidth]{lineup_category.pdf}
\caption{Lineup plot ($m=20$) using side-by-side boxplots for testing $H_0: \beta_k=0$. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the largest shift or location difference between the boxplots? (The position of the actual data plot is provided in Section \ref{sec:category}.)}
\label{fig:test_category}
\end{figure}
%\end{figurehere}
Until now the protocol has only been tested informally. In the work described in this paper, the lineup protocol is compared head to head with the equivalent conventional test. Specifically, the lineup is examined in a linear model setting, where the question is whether a variable should be included in the model. This is not the envisioned environment for the lineup; indeed, it is likely the worst-case scenario for visual inference. The intended use of lineups is where no test exists, and where a numerical test is unlikely ever to exist. The rationale, though, is that the conventional setting provides a benchmark for how well the lineup protocol works under controlled conditions, and offers some assurance that it will work in scenarios where there is no benchmark. Testing is done with a human-subjects experiment using Amazon's Mechanical Turk \citep{turk}, with simulation providing controlled conditions for assessing lineups. The results are compared with those of the conventional test.
The paper is organized as follows. Section \ref{sec:visual_test} defines terms as used in visual inference, and describes how to estimate the important quantities from experimental data. The effect of the lineup size and number of observers on the power of the test is discussed in Section \ref{sec:size}. Section \ref{sec:regression} focuses on the application of visual inference to linear models. Section \ref{sec:simulation} describes three user studies based on simulation experiments conducted to compare the power of the lineup protocol with the equivalent conventional test and Section \ref{sec:results} presents an analysis of the resulting data. %The last section describes the next steps in this research.
\section{Definitions and Explanations for Visual Statistical Inference} \label{sec:visual_test}
Table \ref{tbl:compare} illustrates the lineup protocol in relation to conventional hypothesis testing. Both start from the same place, the same set of hypotheses. The conventional test statistic is the $t$-statistic, where the parameter estimate is divided by its standard error. In the lineup protocol, the test statistic is a plot of the data. Here, side-by-side boxplots are used, because the variable of interest is categorical and takes just two values. In conventional hypothesis testing the value of the test statistic is compared with all possible values of the sampling distribution, the distribution of the statistic if the null hypothesis is true. If it is extreme on this scale then the null hypothesis is rejected. In contrast in visual inference, the plot of the data is compared with plots of a finite number of samples drawn from the null distribution. If the actual data plot is selected as the most different, then this results in rejection of the null hypothesis.
\begin{table*}[hbtp]
\caption{Comparison of visual inference with conventional inference.}
\centering
\begin{tabular}{llll}
\hline \hline
& Conventional Inference & Lineup Protocol \\ %[0.5ex] % inserts table %heading
\hline
Hypothesis & $H_0: \beta=0$ vs $H_1: \beta > 0$& $H_0: \beta=0$ vs $H_1: \beta > 0$\\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Test statistic & $T(y)=\frac{\hat{\beta}}{se(\hat{\beta})}$ & $T(y)=$ \begin{minipage}[h]{1cm} \begin{center} \scalebox{0.45}{\includegraphics{stat_category.pdf}} \end{center} \end{minipage} \\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Sampling Distribution & $f_{T(y)}(t); $\begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.55}{\includegraphics{stat_mathematical_test.pdf}} \end{center} \end{minipage} & $f_{T(y)}(t); $ \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.32}{\includegraphics{lineup_category_small.pdf}} \end{center} \end{minipage} \\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Reject $H_0$ if & actual $T$ is extreme & actual plot is identifiable \\
\hline
\end{tabular}
\label{tbl:compare}
\end{table*}
In general, we define $\theta$ to be a population parameter of interest, with $\theta \in \Theta$, the parameter space. Any null hypothesis $H_0$ then partitions the parameter space into $\Theta_0$ and $\Theta_0^c$, with $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$. A test statistic, $T(y)$, is a function that maps the sample into a numerical summary that can be used to test the null hypothesis. The hypothesis test maps the test statistic into $\{0, 1\}$, based on whether $T(y)$ falls into the acceptance region or the rejection region, respectively. $T(y)$ is assessed relative to null values of this statistic, $T(y_0)$, which are possible values of $T$ if $\theta \in \Theta_0$.
For visual inference, unlike in the conventional hypothesis test, the statistic is not a single value, but a graphical representation of the data chosen to display the strength of the parameter of interest, $\theta$. When the alternative hypothesis is true, it is expected that the plot of the actual data, the test statistic, will have visible feature(s) consistent with $\theta \in \Theta_0^c$, and that visual artifacts will not distinguish the test statistic as different when $H_1$ is not true. We will call a plot with this property a {\it visual statistic} for $\theta$. More formally,
\begin{dfn} \label{dfn:test}
A \textbf{visual test statistic}, $T(\cdot)$, is a function of a sample that produces a plot. $T(y)$ maps the actual data to the plot, which we call the \textbf{(actual) data plot}, and $T(y_0)$ maps a sample drawn from the null distribution into the same plot form. Plots of this type are called \textbf{null plots}.
\end{dfn}
\noindent Ideally, the visual test statistic is defined and constructed using the grammar of graphics \citep{wilkinson:1999,hadley:2009}, specifying the plot type and aesthetics so that it is completely reproducible. \\
\noindent
The visual test statistic is compared with values $T(y_0)$ using a lineup, which is defined as:
\begin{dfn}\label{dfn:lplot}
A \textbf{lineup} is a layout of $m$ randomly placed visual statistics, consisting of
\begin{itemize}\itemsep-3pt
\item $m-1$ statistics, $T(y_0)$, simulated from the model specified by $H_0$ (null plots) and
\item the test statistic, $T(y)$, produced by plotting the actual data, possibly arising from $H_1$.
\end{itemize}
\end{dfn}
\noindent The $(m-1)$ null plots are members of the sampling distribution of the test statistic assuming that the null hypothesis is true. If $H_1$ is true, we expect this to be reflected as a feature in the test statistic, i.e. the plot of the data, that makes it visually distinguishable from the null plots. A careful visual inspection of the lineup by independent observers follows; observers are asked to point out the plot most different from the others. If the test statistic is identified in the lineup, this is considered evidence against the null hypothesis.
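The construction of a lineup can be sketched in a few lines of code. This is an illustrative sketch only, not the software used for the study; the function and variable names are ours, and the null model shown (standard normal draws) is a made-up example of simulating under $H_0$:

```python
import random

def make_lineup(data, simulate_null, m=20, seed=None):
    """Randomly place the actual data among m-1 null datasets.
    simulate_null(rng) draws one sample from the model specified by H0.
    Returns the m datasets in lineup order and the 1-based position
    of the actual data plot."""
    rng = random.Random(seed)
    datasets = [simulate_null(rng) for _ in range(m - 1)]
    pos = rng.randrange(m)          # random slot for the actual data
    datasets.insert(pos, data)
    return datasets, pos + 1

# Illustrative use: nulls drawn from N(0, 1)
actual = [0.3, 2.1, 1.8, 2.5, 0.9]  # made-up "actual" sample
lineup, pos = make_lineup(actual,
                          lambda r: [r.gauss(0, 1) for _ in range(5)],
                          m=20, seed=1)
```

An observer would then be shown the $m$ plots without knowing `pos`, and rejection corresponds to the observer pointing at position `pos`.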
This leads us to a definition for the $p$-value of a lineup: under the null hypothesis, each observer has a $1/m$ chance of picking the test statistic from the lineup. For $K$ independent observers, let $X$ be the number of observers picking the test statistic from the lineup. Under the null hypothesis $X \sim \text{Binom}_{K, 1/m}$, therefore:
\begin{dfn}\label{dfn:pvalue}
The $p$-value of a lineup of size $m$ evaluated by $K$ observers is given as
\begin{equation}\label{binom}
P(X \ge x) = 1 - \text{Binom}_{{K, 1/m}} (x-1) = \sum_{i=x}^{K} {K \choose i} \left(\frac{1}{m}\right)^{i} \left(\frac{m-1}{m}\right)^{K-i}
\end{equation}
\noindent with $X$ defined as above, and $x$ is the number of observers selecting the actual data plot.
\end{dfn}
\noindent Note that for $x=0$ the $p$-value is, mathematically, equal to 1. From a practical point of view it makes more sense to report the $p$-value as being larger than $P(X \ge 1)$ in this situation. Increasing either $m$ or $K$ allows this value to be determined with higher precision.
Table \ref{pvalue} shows $p$-values for different numbers of observers for lineups of size $m = 20$. It can be seen that, if the null hypothesis is true, it is unlikely that multiple observers would pick the actual data plot.
\begin{table}[htp]
\caption{Possible $p$-values for different numbers of observers, $K$, for fixed size $m = 20$ lineups.}
\begin{center}
\scalebox{.9}{
\begin{tabular}{|rrr|rrr|rrr|rrr|rrr|}\hline
$K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value\\ \hline %\cline{1-3}\cline{4-6}\cline{7-9}\cline{10-12}\cline{17-19}
1 & 1 & 0.0500 & 2 & 1 & 0.0975 & 3 & 1 & 0.1426 & 4 & 1 & 0.1855 & 5 & 1 & 0.2262 \\%\cline{1-3}
&&& 2 & 2 & 0.0025 & 3 & 2 & 0.0073 & 4 & 2 & 0.0140 & 5 & 2 & 0.0226 \\
&&& &&& 3 & 3 & 0.0001 & 4 & 3 & 0.0005 & 5 & 3 & 0.0012 \\%\cline{1-6}
&&& & & & & && 4 & 4 & $< 0.0001$ & 5 & 4 & $< 0.0001$ \\\hline
\end{tabular}
}
\end{center}
\label{pvalue}
\end{table}
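The $p$-values in Table \ref{pvalue} follow directly from Definition \ref{dfn:pvalue}. As a sanity check, a short script reproduces the tabled values (a sketch; the function name is ours):

```python
from math import comb

def lineup_p_value(x, K, m=20):
    """P(X >= x) for X ~ Binomial(K, 1/m): the probability that at least
    x of K independent observers pick the actual data plot by chance."""
    p = 1.0 / m
    return sum(comb(K, i) * p**i * (1 - p)**(K - i)
               for i in range(x, K + 1))

# Reproduce entries of the p-value table (m = 20):
p_11 = round(lineup_p_value(1, 1), 4)   # 0.05
p_13 = round(lineup_p_value(1, 3), 4)   # 0.1426
p_25 = round(lineup_p_value(2, 5), 4)   # 0.0226
```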
\begin{dfn}\label{dfn:test}
The \textbf{visual test}, $V_{\theta}$, of size $m$ and significance level $\alpha$, is defined as follows:
\begin{itemize}\itemsep-3pt
\item \textbf{Reject} $H_0$ if out of $K$ observers at least $x_{\alpha}$ correctly identify the actual data plot in the lineup, and
\item \textbf{Fail to reject} $H_0$ otherwise.
\end{itemize}
where $x_{\alpha}$ is such that $P(X \ge x_{\alpha}) \le \alpha$.
\end{dfn}
\noindent
Associated with any test is the risk of Type I or II errors, which for visual inference are defined as follows:
\begin{dfn}\label{dfn:error}
The \textbf{Type I error} associated with visual test $V_{\theta}$ is the probability of rejecting $H_0$ when it is true; the probability for that is $P(X \ge x_{\alpha})$, which is controlled by $\alpha$.\\ \noindent
The \textbf{Type II error} is the probability of failing to identify the actual data plot, when $H_0$ is not true, $P( X < x_{\alpha})$.
\end{dfn}
\noindent Because $X$ takes only discrete values, we cannot always control for $\alpha$ exactly.
For example, when there is only one observer, $1/m$ is the minimal value at which we can set $\alpha$. It can be made smaller, even arbitrarily small, by increasing $K$, the number of observers.
Type II error is harder to calculate, as is usually the case. In visual inference, individual abilities need to be incorporated to calculate the Type II error.
Here, we need to estimate the probability that an observer sees the actual data plot as different, when it really is different. This involves understanding the individual's visual skills. Thus, let $X_i$ be a binary random variable with $X_i = 1$, if individual $i \ (=1, \dots , K)$ identifies the actual data plot from the lineup, and $X_i = 0$ otherwise. Let $p_i$ be the probability that individual $i$ picks out the actual data plot from the lineup. If all individuals have the same ability, with the probability, $p$, for picking out the actual data plot, then $X = \sum_i X_i$ has distribution $\text{Binom}_{K, p}$, and we can estimate $p$ by $\hat{p} = x/K$, where $x$ is the number of observers (out of $K$), who pick out the actual data plot from the lineup.
If there is evidence that individual skills influence the probability $p_i$, then $X_i \sim \text{Binom}_{1, p_i}$ and $X$ is a sum of independent Bernoulli random variables with different success rates $p_i$. The distribution of $X$ is then, by definition, Poisson-Binomial (see \citet{butler93} for details). Ways to estimate $p_i$ will be discussed in the following sections.
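The Poisson-Binomial distribution of $X$ has no simple closed form, but its probability mass function can be computed exactly with a dynamic-programming pass over the $p_i$. The sketch below is illustrative; the $p_i$ values shown are made up, not estimates from the study:

```python
def poisson_binomial_pmf(probs):
    """PMF of X = sum of independent Bernoulli(p_i), by dynamic
    programming: after processing each p_i, pmf[k] holds
    P(exactly k successes so far)."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1 - p)      # observer misses the data plot
            nxt[k + 1] += q * p        # observer identifies the data plot
        pmf = nxt
    return pmf

# Five observers with unequal skills (illustrative values):
pmf = poisson_binomial_pmf([0.2, 0.4, 0.5, 0.7, 0.9])
p_at_least_3 = sum(pmf[3:])            # P(X >= 3)
```

With equal $p_i$ this reduces to the ordinary binomial, which provides a convenient check on the implementation.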
\begin{dfn} \label{dfn:power}
The \textbf{power} of a visual test, $V_{\theta}$, is defined as the probability to reject the null hypothesis for a given parameter value $\theta$:
\begin{equation*}
\text{Power}_V(\theta)= Pr(\text{Reject } H_0 \mid \theta)
\end{equation*}
\end{dfn}
\noindent An important difference between conventional and visual testing is that the outcome of a lineup depends on the observers' evaluations. Thus, the number of observers affects the estimation of power. With $X$ the number of observers (out of $K$) who identify the actual data plot from the lineup, power is estimated by
\[
\widehat{Power}_{V} (\theta) = {Power}_{V, K} (\theta) = 1 - F_{X, \theta} (x_{\alpha} - 1),
\]
where $F_{X, \theta}$ is the distribution of $X$ and $x_\alpha$ is such that $P(X \ge x_{\alpha}) \le \alpha$. Note that the distribution $F_X$ depends on which hypothesis is true:
under the null hypothesis, $X \sim \text{Binom}_{K, 1/m}$, leading to
\[
{Power}_{V, K}(\theta) = 1 - \text{Binom}_{K, 1/m} (x_\alpha - 1).
\]
If the alternative hypothesis is true, with a fixed parameter value $\theta$, we can assume that an individual's probability of identifying the data plot depends on the parameter value, so that $X_i \sim \text{Binom}_{1, p_i(\theta)}$. Assessing an individual's skill in identifying the actual data plot requires that the individual evaluate multiple lineups.
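These quantities can be computed directly. The sketch below uses our own notation and assumes, purely for illustration, a common success probability $p_\theta$ shared by all observers under the alternative; it finds the smallest $x_\alpha$ with $P(X \ge x_\alpha) \le \alpha$ under the null and then evaluates $1 - F_{X,\theta}(x_\alpha - 1)$:

```python
from math import comb

def tail(K, p, x):
    """P(X >= x) for X ~ Binomial(K, p)."""
    return sum(comb(K, i) * p**i * (1 - p)**(K - i)
               for i in range(x, K + 1))

def visual_test_power(K, m, alpha, p_theta):
    # Smallest x_alpha with P(X >= x_alpha) <= alpha under H0 (p = 1/m);
    # x_alpha = K + 1 (power 0) if no attainable critical value exists.
    x_alpha = next(x for x in range(1, K + 2)
                   if tail(K, 1.0 / m, x) <= alpha)
    # Estimated power: P(X >= x_alpha) when each of the K observers
    # identifies the data plot with probability p_theta.
    return tail(K, p_theta, x_alpha)

# K = 5 observers, m = 20, alpha = 0.05 gives x_alpha = 2
# (cf. the p-value table above), so with p_theta = 0.5 the power is
# P(X >= 2) for Binom(5, 0.5) = 26/32 = 0.8125.
power = visual_test_power(5, 20, 0.05, 0.5)
```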
Power is an important consideration in deciding which test to use for a problem.
Here we use it to compare the performance of the visual test with the conventional test, but in practice, for visual inference, it will mostly matter in choosing which plot to use. Analysts typically have a choice of plots, and a myriad of options, such as reference grids, for any particular purpose. This is akin to the choice among statistics in conventional hypothesis testing, for example, mean, median, or trimmed mean: one is typically better than another. For two visual test statistics of the same actual data, one is considered better if its $T(y)$ is more easily distinguishable by the observer. Power is typically used to measure this characteristic of a test.
\section{Effect of Observer Skills and Lineup Size}~\label{sec:size}
\subsection{Subject-specific abilities}\label{sec:model}
%Calculating the power is intricately related to the probability to identify the actual data plot from the lineup.
Suppose each of $K$ independent observers evaluates multiple lineups, and responses are recorded as binary random variables, $X_{\ell i} \sim Binom_{1, p_{\ell i}}$, where $X_{\ell i} = 1$ if subject $i$ correctly identifies the actual data plot in lineup $\ell$, $1 \le \ell \le L$, and 0 otherwise. A mixed effects logistic regression model is used for $P(X_{\ell i} = 1) = p_{\ell i} = E(X_{\ell i})$, accommodating both different abilities of observers and differences in the difficulty of lineups.
%Let $X_{\ell i} \sim Binom_{1, p_{\ell i}}$ be the random variable, with $X_{ \ell i}= 1$, if subject $i$ correctly identifies the actual data plot from lineup $\ell$, and 0 otherwise.
The model can be fit as:% $p_{ \ell i}$ with a logistic regression with subject-specific random effects as:
\begin{equation} \label{eqn:mixed}
g( p_{\ell i} )= W_{\ell i} \delta + Z_{\ell i} \tau_{\ell i},
\end{equation}
\noindent where $g(.)$ denotes the {\it logit} link function $g(\pi)=\log(\pi) - \log(1-\pi); \ 0 < \pi < 1$.
$W$ is a design matrix of covariates corresponding to specifics of lineup $\ell$ and subject $i$, and $\delta$ is the vector of corresponding parameters. Covariates could include demographic information on individuals, such as age, gender, or education level, as well as lineup-specific elements, e.g. effect size or difficulty level.
%
$Z_{\ell i}$, $1 \le i \le K$, $1 \le \ell \le L$, is a design matrix corresponding to random effects specific to individual $i$ and lineup $\ell$; and
$\tau$ is a vector of independent normally distributed random variables $\tau_{\ell i}$ with variance matrix $\sigma^2_\tau I_{KL \times KL}$. $\tau$ will usually include a component incorporating an individual's ability or skill to evaluate lineups.
Note that $\tau_{\ell i}$ usually only includes a partial interaction; for a full interaction of subjects' skills and lineup-specific difficulty we would need replicates of the same subject evaluating the same lineup, which in practice is not feasible without losing independence.
\noindent Applying the inverse {\it logit} link function, $g^{-1}(.)$, to Equation \ref{eqn:mixed} yields the estimate of the subject- and lineup-specific probability of a successful evaluation as
\begin{equation} \label{eqn:mixed_power}
\hat p_{\ell i} = g^{-1}(W_{\ell i} \hat {\delta} + Z_{\ell i} \hat {\tau}_{\ell i}).
\end{equation}
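As a concrete illustration, the estimate in Equation \ref{eqn:mixed_power} is simply the inverse logit applied to the sum of the fixed-effect and random-effect contributions. A minimal sketch, with purely hypothetical values standing in for $W_{\ell i} \hat{\delta}$ and $Z_{\ell i} \hat{\tau}_{\ell i}$:

```python
import math

def inv_logit(eta):
    """Inverse logit link: g^{-1}(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical estimates (illustrative values only):
fixed_part = 0.8      # stands in for W_{li} * delta-hat
random_part = -0.3    # stands in for Z_{li} * tau-hat_{li}

# Estimated subject- and lineup-specific success probability
p_hat = inv_logit(fixed_part + random_part)
```

The link maps the unbounded linear predictor back onto a probability in $(0,1)$, with a linear predictor of 0 corresponding to a success probability of one half.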
\subsection{Lineup size, $m$}
The finite number, $m-1$, of representatives of the null distribution used for comparison against the test statistic is a major difference between visual inference and conventional testing. The choice of $m$ has an obvious impact on the test.
The following properties can only be derived in the setting of a fully parameterized simulation study, as conducted in this paper. They allow a direct comparison of lineup tests against their conventional counterparts, and also allow us to identify properties relevant for a quality assessment of lineups when they are used in practical settings. Two assumptions are critical:
\begin{enumerate} \itemsep 0in
\item the plot setup is structured in a way that makes it possible for an observer to identify a deviation from the null hypothesis,
\item an observer is able to identify the plot with the strongest `signal' (or deviation from $H_0$) from a lineup.
\end{enumerate}
Evidence in support of the second assumption will be seen in the data from the study discussed in Section~\ref{sec:simulation}; the degree to which the first assumption is fulfilled is reflected by the power of a lineup. The better suited a design is for a particular task, the higher its power will be.
In order to compare the power of conventional and visual tests side-by-side, it is necessary to assume that we are in the controlled environment of a simulation, with tests corresponding to a known parameter value $\theta \in R$ and associated distribution function $F_t$ of the test statistic.
\begin{lemma}~\label{lemma}
Suppose $F_{|t|}(.)$ is the distribution function of $|t|$, the absolute value of the conventional test statistic, and suppose the conventional test statistic is observed as $t_{obs}$ with $p$-value $p_D$.
The probability of picking the data plot from a lineup depends on the size $m$ of the lineup and the strength of the signal in the data plot.
Under the above assumptions, this probability can be expressed as
\[
P(p_D < p_0) = E\left[ (1 - p_D)^{m-1}\right],
\]
where $p_D$ is the $p$-value associated with the data in the test statistic, and $p_0$ is the minimum of the $p$-values associated with the data sets generating the null plots.
\end{lemma}
\begin{proof}
%We will now make use of properties of the data going into the lineup and assume that the properties are reflected in the lineup display:
By definition,
$$p_D=Pr\left(|t| \ge |t_{obs}| \mid H_0\right)=1-F_{|t|}(|t_{obs}|) \ \ \Rightarrow \ \ |t_{obs}|=F_{|t|}^{-1}(1-p_D).$$
\noindent Then the distribution function of the $p$-value, $p_D$, under $H_0$, is uniform, since:
\begin{eqnarray}\label{dist_p}
F_{p_D}(p) &=& Pr(p_D \le p)=1-Pr(1-p_D \le 1-p) \nonumber \\
&=& 1-Pr\left(F_{|t|}^{-1}(1-p_D) \le F_{|t|}^{-1}(1-p) \right) \nonumber \\
&=& 1-Pr\left(|t_{obs}| \le F_{|t|}^{-1}(1-p)\right) \nonumber \\
&=& 1-F_{|t|}\left( F_{|t|}^{-1}(1-p)\right)=p \mbox{ ; under $H_0$}
\end{eqnarray}
Let $p_{0,i}$, $i=1, \ldots, m-1$, denote the $p$-values associated with the data sets corresponding to the $m-1$ null plots. Since these data are generated consistently with the null hypothesis, the $p$-values are independent and follow a standard Uniform distribution, $p_{0,i} \sim U[0,1]$, $i= 1, \ldots, m-1$. The minimum $p_0 = \min_{1 \le i \le m-1} \ p_{0,i}$ then follows a Beta distribution with shape parameters 1 and $m-1$, with corresponding distribution function
\[
F_{p_0} (x) = 1- (1- x) ^{m-1} \text{ for } x \in [0,1].
\]
\noindent Thus
\begin{eqnarray*}
P(p_D < p_{0}) &=& 1 - P(p_{0} \le p_D) = 1- \int_0^1 P(p_{0} \le p_D \mid p_D=t) f_{p_D}(t) dt \\
&=& 1 - \int_0^1 F_{p_{0}}(t) f_{p_D}(t) dt = 1 - \int_0^1 f_{p_D}(t) dt + \int_0^1 (1-t)^{m-1} f_{p_D}(t) dt \\
&=& E\left[ (1 - p_D)^{m-1}\right].
\end{eqnarray*}
\end{proof}
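The identity in Lemma \ref{lemma} can be checked by simulation: draw $p_D$ from any distribution on $[0,1]$, draw $p_0$ as the minimum of $m-1$ independent uniforms, and compare the empirical probability $P(p_D < p_0)$ with the Monte Carlo estimate of $E[(1-p_D)^{m-1}]$. A sketch, using an arbitrary right-skewed stand-in distribution for $p_D$ (the identity holds for any choice):

```python
import random

random.seed(1)
m = 20            # lineup size
reps = 100_000

wins = 0
moment = 0.0
for _ in range(reps):
    # arbitrary right-skewed stand-in for the data plot's p-value
    p_d = random.random() ** 4
    # minimum of m-1 independent Uniform(0,1) p-values, i.e. Beta(1, m-1)
    p_0 = min(random.random() for _ in range(m - 1))
    wins += p_d < p_0
    moment += (1.0 - p_d) ** (m - 1)

lhs = wins / reps      # empirical P(p_D < p_0)
rhs = moment / reps    # Monte Carlo estimate of E[(1 - p_D)^(m-1)]
```

Up to simulation noise, the two quantities agree, regardless of the distribution chosen for $p_D$.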
The above lemma allows two immediate conclusions for the use of lineups.
First, the probability that an observer correctly identifies the data plot is closely connected to the size $m$ of the lineup: since the right-hand side of the above equation decreases as $m$ grows, the probability of correctly identifying the actual data plot decreases with $m$.
Second, the rate of this decrease depends strongly on the distribution of $p_D$ -- if the density of $p_D$ is very right skewed, the expectation term on the right-hand side will be large and less affected by an increase in $m$. This can be seen in Figure \ref{fig:pval_power}, which illustrates Lemma \ref{lemma}: it shows the probability of picking the actual data plot for lineups of different sizes. As $m$ increases, the probability of observing a more highly structured null plot by chance increases. It can also be seen that for a $p$-value $p_D$ of about 0.15 for the data plot, the signal in the plot is so weak that it cannot be distinguished from the null plots in a lineup of size $m=20$.
\begin{figure}[htbp] % figure placement: here, top, bottom, or page
\centering
\includegraphics[width=3in]{images/powerplot.pdf}
\caption{Probability that the data plot has the smallest $p$-value in a lineup of size $m$. With increasing $p$-value the probability drops -- when it reaches $1/m$ a horizontal line is drawn to emphasize insufficient sensitivity of the test due to the lineup size.}
\label{fig:pval_power}
\end{figure}
\section{Application to Linear Models} \label{sec:regression}
To make these concepts more concrete consider how this would operate in the linear models setting. Consider a linear regression model
\begin{equation}\label{multi} Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + ... + \epsilon_i
\end{equation}
where $\epsilon_i \stackrel{iid}{ \sim } N(0,\sigma^2)$, $i=1,2,\ldots, n$. The covariates ($X_j$, $j=1,\ldots,p$) can be continuous or discrete.
In this setting, there are many established graphics that are used to evaluate and diagnose the fit of a regression model (e.g. \citet{cook:99}). Table \ref{tbl:stat_multiple} lists several common hypotheses related to the regression setting, and commonly used statistical plots that might be used as corresponding visual test statistics.
%These plots should be familiar, because they are all in common use.
For example, to examine the effect of variable $X_j$ on $Y$, we would plot the residuals obtained from fitting the model without $X_j$ against $X_j$, or, for a single covariate, we may plot $Y$ against $X_j$ (cases 1--4). To assess whether the linearity assumption is appropriate we would draw a plot of residuals against fitted values (case 5). For the purposes of comparing visual against conventional inference, we focus on cases 2 and 3, with a continuous and a categorical explanatory variable respectively.
\begin{table*}[hbtp]
\caption{Visual test statistics for testing hypotheses related to the model $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + ... + \epsilon_i $ }
\centering
\begin{tabular}{m{0.5cm}m{3cm}m{2cm}m{3cm}m{5.5cm}}
\hline\hline
Case & Null Hypothesis & Statistic & Test Statistic & Description \\ [0.5ex] % inserts table %heading
\hline
1 & $H_0: \beta_0=0$ & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}}\end{minipage} & Scatter plot with least squares line overlaid. For lineup plot, we simulate data from fitted null model. \\
2 & $H_0: \beta_k=0$ & Residual plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_beta_k.pdf}}\end{minipage} & Residual vs $X_k$ plots. For lineup plot, we simulate data from normal with mean 0 variance $\hat{\sigma}^2$. \\
3 & $H_0: \beta_k=0$ (for categorical $X_k$) & Box plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_category.pdf}} \end{minipage} & Box plot of residuals grouped by category of $X_k$. For lineup plot, we simulate data from normal with mean 0 variance $\hat{\sigma}^2$. \\
4 & $H_0: \beta_k=0$ (interaction with categorical $X_k$) & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_interection.pdf}} \end{minipage} & Scatter plot with least squares lines of each category overlaid. For lineup plot, we simulate data from fitted null model.\\[1ex] % [1ex] adds vertical space
5 & $H_0: X$ Linear & Residual Plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_nonlinear.pdf}} \end{minipage} & Residual vs predictor plots with loess smoother overlaid. For lineup plot, we simulate residual data from normal with mean 0 variance $\hat{\sigma}^2$. \\
6 & $H_0: \sigma^2=\sigma^2_0$ & Box plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_sigma_box.pdf}}\end{minipage} & Box plot of standardized residual divided by $\sigma^2_0$. For lineup plot, we simulate data from standard normal. \\
7 & $H_0: \rho_{X,Y|Z}=\rho$ & Scatter Plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}} \end{minipage} & Scatter plot of Residuals obtained by fitting partial regression. For lineup plot, we simulate data (mean 0 and variance 1) with specific correlation $\rho$. \\
8 & $H_0:$ Model Fits & Histogram & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_goodness_simple.pdf}} \end{minipage} & Histogram of the response data. For lineup plot, we simulate data from fitted model. \\[1ex] % [1ex] adds vertical space
9 & Special case $p=1$ $H_0: \rho_{X,Y} =\rho$ & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}} \end{minipage} & Scatter plot with least squares line overlaid. For lineup plot, we simulate data with correlation $\rho$.\\
\hline
\end{tabular}
\label{tbl:stat_multiple}
\end{table*}
Suppose $X_k$ is a categorical variable with two levels, and we test the hypothesis $H_0:\beta_k=0$ vs $H_1: \beta_k \ne 0$. If the responses for the two levels of the categorical variable $X_k$ in the model are different, the residuals from fitting the null model should show a significant difference between the two groups. For a visual test, we draw boxplots of the residuals conditioned on the two levels of $X_k$. If $\beta_k\ne 0$ the boxplots should show a vertical displacement.
The conventional test in this scenario uses $T= \hat{\beta}_k/ se(\hat{\beta}_k)$ and rejects the null hypothesis if $T$ is extreme on the scale of a $t$ distribution with $n-p$ degrees of freedom. It forms the benchmark against which we evaluate the visual test. To calculate what we might expect for the power of the visual test under perfect conditions, first assume that the observer is able to pick the plot with the smallest $p$-value from a lineup. This leads to the decision to reject $H_0$ when $p_{D} < p_0$, where $p_{D}$ is the conventional $p$-value, as detailed in Lemma \ref{lemma}. Thus the expected power of the visual test in this scenario is
\begin{equation}\label{power_exp}
Power(\beta)=Pr(p_{D} < p_0) \quad \text{for} \quad \beta \ne 0
\end{equation}
Figure \ref{fig:power_expected} shows the power of the conventional test in comparison to the expected power of the visual test obtained from Equation \ref{power_exp}. Notice that the expected power of the visual test is close to the power of the conventional test.
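This comparison can be reproduced approximately by simulation. The sketch below simplifies the categorical-covariate setting (case 3) to a two-sample comparison evaluated with a large-sample $z$-test; the sample size, $\sigma$, effect size, and $m$ are illustrative choices, not the exact study settings:

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(2)
n, sigma, m = 100, 12.0, 20   # illustrative settings
beta = 5.0                    # hypothetical group difference under H_1
reps = 4000
Z = NormalDist()

p_vals = []
for _ in range(reps):
    # two balanced groups whose means differ by beta, as for a binary X_k
    g1 = [random.gauss(0.0, sigma) for _ in range(n // 2)]
    g2 = [random.gauss(beta, sigma) for _ in range(n // 2)]
    se = math.sqrt(statistics.variance(g1) / (n // 2)
                   + statistics.variance(g2) / (n // 2))
    z_stat = (statistics.fmean(g2) - statistics.fmean(g1)) / se
    # two-sided large-sample p-value
    p_vals.append(2.0 * (1.0 - Z.cdf(abs(z_stat))))

conventional_power = sum(p < 0.05 for p in p_vals) / reps
# expected visual power, Pr(p_D < p_0) = E[(1 - p_D)^(m-1)], per the lemma
expected_visual_power = sum((1.0 - p) ** (m - 1) for p in p_vals) / reps
```

As in the figure, the expected visual power tracks the conventional power closely, sitting slightly below it for a lineup of size $m=20$.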
\begin{figure}[hbtp]
\centering
\scalebox{.6}{\includegraphics{power_expected.pdf}}
\caption{Expected power of the visual test and power of the conventional test for sample size $n =100$, $\sigma = 12$ and $m=20$. }
\label{fig:power_expected}
\end{figure}
\section{Human Subjects Experiments with Simulated Data} \label{sec:simulation}
Three experiments were conducted to evaluate the effectiveness of the lineup protocol relative to the equivalent test statistic used in the regression scenario. The first two are conventional settings where we would not expect the lineup protocol to do better than the conventional test, but we hope that it performs favorably. The third experiment is a scenario where assumptions required for the conventional test are violated, and we would expect the lineup protocol to outperform the conventional test. (Data and lineups used in the experiments are available in the supplementary material.)
After many small pilot studies with local personnel, it was clear that some care was needed in setting up the human subjects experiments. It worked best for an observer to see a block of 10 lineups of varying difficulty, with a reasonable number of ``easy'' lineups. The description of each experiment below includes an explanation of how the lineups were sampled and presented to the subjects.
Participants for all the experiments were recruited through Amazon's online web service, Mechanical Turk \citep{turk}. A summary of the data obtained for all three experiments is shown in Table \ref{tbl:summary}. Participants were asked to select the plot they thought best matched the question given, provide a reason for their choice, and say how confident they were in their choice. Gender, age, education and geographic location of each participant were also collected.
For each of the experiments one of the lineups was used as a test plot (an easy plot) which everyone should get correct, so that a measure of the quality of a subject's effort could be made.
Note that no participant was shown the same lineup twice.
\subsection{Discrete covariate}\label{sec:category}
The experiment is designed to study the ability of human subjects to detect the effect of a single categorical variable $X_2$ (corresponding to parameter $\beta_2$) in a two-variable ($p=2$) regression model (Equation \ref{multi}). Data are simulated for a range of values of $\beta_2$ $(0, 1, 3, 5, 7, 8, 10, 16)$, two different sample sizes ($n=100, 300$) and two standard deviations of the error ($\sigma=5, 12$). The range of $\beta_2$ values was chosen so that estimates of the power would produce reasonably continuous power curves, comparable to those calculated for the theoretical conventional test. Values were fixed for the other regression parameters, $\beta_0 = 5$ and $\beta_1=15$, and the values for $X_1$ were randomly generated from a Poisson $(\lambda=30)$ distribution, which is almost Gaussian. Data sets were generated with the parameter values shown in Table \ref{tbl:experiment_params}. Three replicates at each level were generated. This produced 60 different ``actual data sets'', and thus 60 different lineups. To produce the lineups, the null model was fit to each actual data set to obtain residuals and parameter estimates. The actual data plot was generated by making side-by-side boxplots of the residuals (Table \ref{tbl:stat_multiple}, case 3). The 19 null data sets were generated by simulating from $N(0, {\hat{\sigma}}^2)$, and plotted in the same way. The actual data plot was randomly placed among the null data plots to produce the lineup. Figure \ref{fig:test_category} is an example of one of these lineups. The actual data plot is panel 11, which was generated with $n=300$, $\beta_2=15$ and $\sigma=12$.
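The construction of a single lineup of this kind can be sketched as follows. This is a simplified stand-in: the study's null model also contains $X_1$, whereas here an intercept-only null model is used, and all numeric settings are illustrative:

```python
import random
import statistics

random.seed(3)
n, beta2, sigma, m = 100, 5.0, 12.0, 20   # illustrative settings

# "Actual" data: a two-level categorical X2 shifts the mean by beta2
x2 = [i % 2 for i in range(n)]
y = [5.0 + beta2 * x + random.gauss(0.0, sigma) for x in x2]

# Fit the null model (here simplified to intercept-only) and keep residuals
y_bar = statistics.fmean(y)
resid_actual = [yi - y_bar for yi in y]
sigma_hat = statistics.stdev(resid_actual)

# m-1 null residual sets simulated from N(0, sigma_hat^2)
null_sets = [[random.gauss(0.0, sigma_hat) for _ in range(n)]
             for _ in range(m - 1)]

# Randomly place the actual residual set among the null panels
pos = random.randrange(m)
panels = null_sets[:pos] + [resid_actual] + null_sets[pos:]
```

Each element of `panels` would then be drawn as side-by-side boxplots of residuals grouped by the level of $X_2$, with only the panel at position `pos` based on the actual data.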
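The construction of one lineup's data can be sketched programmatically. The study itself was run in R; the following is only a minimal Python/numpy illustration (function and variable names are ours), with the plotting and the $4\times 5$ panel layout omitted:

```python
import numpy as np

def make_lineup_data(n=100, beta2=5.0, sigma=5.0, m=20, seed=1):
    """Sketch of the discrete-covariate lineup: simulate actual data,
    fit the null model (X2 excluded), and draw m-1 null residual sets
    from N(0, sigma_hat^2)."""
    rng = np.random.default_rng(seed)
    beta0, beta1 = 5.0, 15.0
    x1 = rng.poisson(30, n).astype(float)     # X1 ~ Poisson(lambda=30)
    x2 = rng.integers(0, 2, n).astype(float)  # two-level covariate X2
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, n)

    # Fit the null model y = b0 + b1*x1 by least squares.
    X = np.column_stack([np.ones(n), x1])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))

    # One randomly chosen panel holds the actual residuals;
    # the other m-1 panels hold null draws.
    pos = int(rng.integers(1, m + 1))
    panels = {k: rng.normal(0, sigma_hat, n) for k in range(1, m + 1)}
    panels[pos] = resid
    return panels, pos, x2

panels, pos, x2 = make_lineup_data()
```

In the experiment each panel would be rendered as side-by-side boxplots of the (actual or null) residuals grouped by $X_2$.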
The number of evaluations required for each lineup to provide reasonable estimates of the power is determined by the variance of the estimated power. Suppose $\gamma$ denotes the conventional test power for a parameter combination shown in Table \ref{tbl:experiment_params}. Since the expected power of visual inference is very close to the power of the conventional test (Figure \ref{fig:power_expected}), the proportion of successful evaluations of lineups should give a rough estimate of $\gamma$.
For a given proportion $\gamma$ it is desired to have a margin of error (ME) of at most 0.05. Thus $ME =1.96 \sqrt{ \gamma(1-\gamma) / n_{\gamma} } \le 0.05$, which yields the minimum number of evaluations $$n_{\gamma} \geq \frac{\gamma(1-\gamma)}{(0.05/1.96)^2}.$$
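This bound is straightforward to evaluate; a small sketch (the function name is ours, not from the paper):

```python
def min_evaluations(gamma, me=0.05, z=1.96):
    """Minimum number of lineup evaluations so that the margin of
    error z*sqrt(gamma*(1-gamma)/n) stays at or below `me`."""
    return gamma * (1 - gamma) / (me / z) ** 2

# The worst case, gamma = 0.5, requires the most evaluations:
n_req = min_evaluations(0.5)  # -> 384.16, so at least 385 evaluations
```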
Each subject viewed at least 10 lineups, with the option to evaluate more.
Depending on the parameter combinations, the lineups were grouped into difficulty levels: easy, medium, hard and mixed (actual numbers are given in the supplementary material). For each difficulty level a specific number of lineups was randomly picked for evaluation, chosen so that the total number of evaluations for each lineup in that group exceeds the threshold $n_{\gamma}$. To satisfy this plan we needed to recruit at least 300 subjects.
\begin{table}[hbtp]
\caption{Combination of parameter values, $\beta_2$, $n$ and $\sigma$, used for each of the simulation experiments.} % title name of the table
\centering
\scalebox{.9}{
\begin{tabular}{c c c c c}
% creating 10 columns
\hline\hline
% inserting double-line
& & \multicolumn{3}{c}{Slope ($\beta$)} \\
\cline{3-5}
Sample size & Error SD & \multicolumn{1}{c} {Experiment 1} & \multicolumn{1}{c} {Experiment 2} & \multicolumn{1}{c} {Experiment 3} \\
($n$) & ($\sigma$) & Discrete covariate & Continuous covariate & Contaminated data
\\ [0.5ex]
\hline
% inserts single-line
% Entering 1st row
& 5 & 0, 1, 3, 5, 8 & 0.25, 0.75, 1.25, 1.75, 2.75 & 0.1, 0.4, 0.75, 1.25, 1.5, 2.25\\[-1ex]
\raisebox{1.5ex}{100} &12
& 1, 3, 8, 10, 16 & 0.5, 1.5, 3.5, 4.5, 6 &\\[1ex]
% Entering 2rd row
& 5 & 0, 1, 2, 3, 5 & 0.1, 0.4, 0.7, 1, 1.5&\\[-1ex]
\raisebox{1.5ex}{300} & 12
& 1, 3, 5, 7, 10 & 0, 0.8, 1.75, 2.3, 3.5 &\\[1ex]
% [1ex] adds vertical space
\hline
% inserts single-line
\end{tabular}
}
\label{tbl:experiment_params}
\end{table}
For added control, to ensure a signal in the simulated actual data, a blocking structure was used to filter data sets. One thousand data sets were generated for each parameter combination, and the conventional $t$-statistic and $p$-value associated with $H_0: \beta_2=0$ were calculated. The three replicates were drawn from each of three blocks of $p$-values: (0.0--$q_{33}$), ($q_{33}$--$q_{66}$), ($q_{66}$--1), where $q_i$ is the $i$th percentile of the distribution of the $p$-values. Additional control was applied to the 19 null plots: because these $p$-values should follow a Uniform(0,1) distribution, data sets were binned on this range by $p$-value, and a data set was randomly selected from each bin.
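The blocked selection of replicates by $p$-value terciles can be sketched as follows, assuming the 1000 $p$-values for one parameter combination have already been computed (a hypothetical helper, not the authors' code):

```python
import numpy as np

def pick_replicates(pvals, rng=None):
    """Given p-values for the simulated data sets under one parameter
    combination, draw one replicate from each tercile block."""
    rng = rng if rng is not None else np.random.default_rng(0)
    pvals = np.asarray(pvals)
    q33, q66 = np.quantile(pvals, [1 / 3, 2 / 3])
    blocks = [(0.0, q33), (q33, q66), (q66, 1.0)]
    picks = []
    for lo, hi in blocks:
        idx = np.flatnonzero((pvals >= lo) & (pvals <= hi))
        picks.append(int(rng.choice(idx)))
    return picks

pvals = np.random.default_rng(1).uniform(size=1000)
picks = pick_replicates(pvals)
```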
\subsection{Continuous covariate} \label{sec:continuous}
This experiment is very similar to the previous one, except that there is a single continuous covariate and no second covariate (Equation \ref{multi} with $p=1$), following the test in Table \ref{tbl:stat_multiple}, case 2. Data are simulated with two sample sizes ($n=100, 300$), two error standard deviations ($\sigma=5, 12$), and a variety of slopes ($\beta$), as given in Table \ref{tbl:experiment_params}. We arbitrarily fix $\beta_0 = 6$, and values for $X_1$ are simulated from $N(0,1)$. For each combination of parameters, three different actual data sets are produced, yielding a total of 65 lineups; some parameter combinations have more than three replicates, because additional lineups were produced for parameter values that exhibited large variation in the initial data.
The actual data plot is generated by making a scatterplot of $Y$ vs $X_1$ with the least squares regression line overlaid. To produce the null plots in the lineup null data was simulated from $N(X \hat{\beta}, {\hat{\sigma}}^2$) and plotted using the same scatterplot method as the actual data.
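For this experiment the null panels are not residual plots but simulated responses from $N(X\hat\beta, \hat\sigma^2)$. A minimal Python/numpy sketch of one lineup's data (the study itself used R; plotting omitted):

```python
import numpy as np

def continuous_lineup(n=100, beta=1.25, sigma=5.0, m=20, seed=3):
    """Sketch of the continuous-covariate lineup: simulate
    Y = 6 + beta*X + e, fit the least-squares line, and draw m-1
    null responses from N(X beta_hat, sigma_hat^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, n)
    y = 6.0 + beta * x + rng.normal(0, sigma, n)

    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    sigma_hat = np.sqrt(resid @ resid / (n - 2))

    # One randomly chosen panel shows the actual data; the rest
    # show null responses against the same x values.
    pos = int(rng.integers(1, m + 1))
    panels = {k: fitted + rng.normal(0, sigma_hat, n)
              for k in range(1, m + 1)}
    panels[pos] = y
    return x, panels, pos

x, panels, pos = continuous_lineup()
```

Each panel would be drawn as a scatterplot of the panel's responses against $x$ with its own least-squares line overlaid.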
To select 10 lineups for a subject, each combination of sample size ($n$) and error SD ($\sigma$) is given a difficulty value based on the slope ($\beta$) parameters: for the smallest slope the difficulty is 4 (hardest) and for the largest slope it is 0 (easiest).
Figure \ref{fig:test_continuous} shows an example lineup for this experiment from difficulty level 4, generated with sample size $n$=100, slope $\beta$=1.25 and error SD $\sigma$=5. The actual data plot is in location 11.
For one combination of sample size and standard deviation, the subject sees a randomly selected lineup at each difficulty level, covering all five slopes, giving 5 lineups. Another four lineups are chosen from a second combination of sample size and standard deviation, with difficulty levels 0--3, and the last lineup is a randomly selected lineup of difficulty 0. The order in which the lineups are shown to participants is randomized. Each subject therefore saw a substantial number of difficult lineups.
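The assignment scheme above can be sketched as follows; this is an illustrative reconstruction (names and data structure are ours), not the code used to run the study:

```python
import random

def assign_lineups(lineups_by_level, seed=0):
    """Sketch of the experiment-2 assignment: five lineups
    (difficulty 0-4) from one (n, sigma) group, four (difficulty 0-3)
    from a second group, plus one extra easy lineup, shown in random
    order. `lineups_by_level` maps (group, difficulty) -> lineup ids."""
    rnd = random.Random(seed)
    groups = sorted({g for g, _ in lineups_by_level})
    g1, g2 = rnd.sample(groups, 2)
    chosen = [rnd.choice(lineups_by_level[(g1, d)]) for d in range(5)]
    chosen += [rnd.choice(lineups_by_level[(g2, d)]) for d in range(4)]
    chosen.append(rnd.choice(lineups_by_level[(g1, 0)]))  # extra easy one
    rnd.shuffle(chosen)  # randomize presentation order
    return chosen

# Hypothetical pool: 4 (n, sigma) groups x 5 difficulty levels x 3 replicates.
lineups = {(g, d): [f"{g}-{d}-{i}" for i in range(3)]
           for g in ("n100_s5", "n100_s12", "n300_s5", "n300_s12")
           for d in range(5)}
chosen = assign_lineups(lineups)
```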
\begin{figure}[p]
%\begin{figurehere}
\centering
\includegraphics[width=0.95\textwidth]{lineup_continuous.pdf}
\caption{Lineup plot ($m=20$) using scatter plots for testing $H_0: \beta_k=0$ where covariate $X_k$ is continuous. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the steepest slope? (The position of the actual data plot is provided in Section \ref{sec:continuous}.)}
\label{fig:test_continuous}
\end{figure}
\subsection{Contaminated data} \label{sec:contamination}
The first two simulation experiments use data generated under a normal error model, satisfying the conditions for conventional test procedures. In these situations a conventional test exists, and there would, in general, be no need to use visual inference. The simulation is conducted in the hope that the visual test procedure will at least compare favorably with the conventional test, without any ambition of performing equally well.
This third simulation is closer to the mark for the purpose of visual inference. The assumptions for the conventional test are violated by contaminating the data. The contamination makes the estimated slope effectively 0, even though the true slope parameter is not. The data is generated from the following model:
\[
Y_i = \left\{
\begin{array}{l l}
\alpha+\beta X_i + \epsilon_i & \quad X_i \sim N(0,1) \quad i =1,...,n\\
\lambda+ \eta_i & \quad X_i \sim N(\mu,1/3) \quad i=1, ...,n_c\\
\end{array} \right.
\]
where $\epsilon_i \stackrel{iid}\sim N(0,\sigma^2)$, $\eta_i \stackrel{iid}\sim N(0,(\sigma/3)^2)$ and $\mu = -1.75$; $n_c$ is the size of the contaminated portion of the data. For the experiment we use $n=100$ and $n_c=15$, producing actual data sets with $115$ points. Further, $\alpha=0$, $\lambda=10$, and $\sigma$ is chosen to be approximately 3.5, so that the error standard deviation across both groups of the data is $5$.
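A minimal Python/numpy sketch of this contamination model (the study used R; we treat the $1/3$ in the contaminated cluster as a variance, which is an assumption):

```python
import numpy as np

def contaminated_data(beta, n=100, n_c=15, mu=-1.75, lam=10.0,
                      sigma=3.5, seed=2):
    """Sketch of the contamination model: n points follow the linear
    model Y = beta*X + e (alpha = 0), and n_c points form a cluster
    with X ~ N(mu, 1/3) and mean response lam."""
    rng = np.random.default_rng(seed)
    x_lin = rng.normal(0, 1, n)
    y_lin = beta * x_lin + rng.normal(0, sigma, n)
    x_con = rng.normal(mu, np.sqrt(1 / 3), n_c)   # 1/3 taken as variance
    y_con = lam + rng.normal(0, sigma / 3, n_c)
    x = np.concatenate([x_lin, x_con])
    y = np.concatenate([y_lin, y_con])
    return x, y

x, y = contaminated_data(beta=0.4)  # 115 points in total
```

With $\mu=-1.75$ the cluster sits to the left of, and well above, the bulk of the linear data, which is what pulls the least-squares slope toward zero.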
A linear model (Equation \ref{multi} with $p=1$ and intercept $\beta_0=0$) is fit to the contaminated data. This experiment follows the test in Table \ref{tbl:stat_multiple}, case 2. The actual data plot shows a scatterplot of the residuals vs $X_1$, and the null plots are scatterplots of simulated residuals from $N(0, {\hat{\sigma}}^2)$ plotted against $X_1$.
Experiment three consists of a total of 30 lineups, made up of five replicates for each of the six slopes as shown in Table \ref{tbl:experiment_params}. We use the slope directly as a measure of difficulty, with difficulty = 0 for the largest slope and difficulty = 5 for the smallest slope. Subjects were exposed to a total of ten lineups, with two lineups from each of the difficulty levels 0 through 3, and one lineup each from levels 4 and 5.
An example lineup for slope $\beta=0.4$ is shown in Figure \ref{fig:test_contaminated}. Can you pick which plot is different? The actual data plot location is 7.
\begin{figure}[p]
%\begin{figurehere}
\centering
\includegraphics[width=0.95\textwidth]{lineup_contaminated.pdf}
\caption{Lineup plot ($m=20$) using scatter plots for testing $H_0: \beta_k=0$ where covariate $X_k$ is continuous, but contamination of the data violates the normality assumption on the errors. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the steepest slope? (The position of the actual data plot is provided in Section \ref{sec:contamination}.)}
\label{fig:test_contaminated}
\end{figure}
\section{Results} \label{sec:results}
\subsection{Data Cleaning}
Amazon Mechanical Turk workers are paid for their efforts; the payments are not substantial, but are on the scale of the US minimum wage. Some workers try to maximize their earnings for minimum effort, which can affect the quality of the data: for example, a worker may simply pick a plot at random, without actively examining the plots in the lineup. To identify such participants and clean the data, we take action based on a subject's answer to one of the very easy lineups that everybody was exposed to:
if the subject failed to identify the actual data plot, we remove all of this subject's data from the analysis; if the answer on that lineup is correct, we remove only this lineup from the analysis and keep all of the remaining answers. In addition, a subject occasionally submitted more than one response for the same lineup (this happens when a subject accidentally uses the browser's ``back'' button and the same data is saved twice); in these cases only the very first response on that lineup was kept.
Table \ref{tbl:summary} tabulates the number of subjects, genders and lineups evaluated after applying the data screening procedure.
\begin{table}[hbtp]
\caption{Number of subjects, gender, total lineup evaluations, and number of distinct lineups for all three experiments, after screening. In some rows the numbers of male and female participants do not add up to the total number of participants, due to missing demographic information. For experiment 3, responses remain for only 29 of the 30 lineups after the screening process.}
\begin{center}
\begin{tabular}{ccrrcc}
\hline \hline
Experiment & Subject & Male & Female & Responses &Lineup\\
\hline
1 & 239 & 121 & 107 & 2249 & 60 \\
2 & 351 & 185 & 164 & 3636 & 70 \\
3 & 155 & 103 & 52 & 1511 & 29 \\
\hline
\end{tabular}
\end{center}
\label{tbl:summary}
\end{table}
\subsection{Model fitting}
For each parameter combination, the effect is computed as
\[
E=\sqrt {n} \cdot \beta/\sigma.
\]
\noindent
The model in Equation \ref{eqn:mixed} is fit using $E$ as the only fixed-effect covariate, without an intercept, i.e.
$W_{\ell i} = E_{\ell i}$. Instead of an intercept, we use a fixed offset of $\log(0.05/0.95)$, so that the estimated power has a fixed lower limit of 0.05 (the Type-I error rate) when $E=0$. Different skill levels of subjects are accounted for by allowing subject-specific random slopes for the effect $E$.
For experiment 3 we do fit intercepts, both a fixed effect and subject-specific random effects, since forcing the power to be fixed at 0.05 when $E=0$ is not justified by the data.
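The role of the offset is easy to verify numerically. This sketch (our own illustration, using the experiment-1 fixed-effect slope estimate of 0.39 reported in Table \ref{tbl:model_par}) computes the effect and the implied power curve at the fixed-effects level, ignoring the random effects:

```python
import math

def effect(n, beta, sigma):
    """Effect size E = sqrt(n) * beta / sigma."""
    return math.sqrt(n) * beta / sigma

def estimated_power(E, slope, offset=math.log(0.05 / 0.95)):
    """Power from the fitted logistic model with fixed offset: the
    offset pins the power at 0.05 (Type-I error) when E = 0."""
    eta = offset + slope * E
    return 1 / (1 + math.exp(-eta))

E = effect(n=100, beta=5, sigma=5)        # -> 10.0
p0 = estimated_power(0.0, slope=0.39)     # -> 0.05 exactly, by design
p1 = estimated_power(E, slope=0.39)       # power at effect E
```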
For computation we use the package {\tt lme4} \citep{lme4:2011} and R 2.15.0 \citep{R}; $p$-value calculations are based on asymptotic normality.
Table \ref{tbl:model_par} shows the parameter estimates of the mixed-effects model, including the subject-specific variation. The fixed-effects estimates indicate that for all experiments the proportion of correct responses increases with the effect; this increase is less pronounced for experiment 3. The subject-specific variability is smallest for experiment 1 and relatively large for experiment 3.
\begin{table}[hbtp]
\caption{Parameter estimates of the model in Equation \ref{eqn:mixed}. All estimates are highly significant, with $p$-values $<$ 0.0001, for all three experiments.}
\begin{center}
\begin{tabular}{cr@{.}lcc}
\hline \hline
& \multicolumn{3}{c} {Fixed effect} & Random effect\\
\cline{2-4}
Experiment & \multicolumn{2}{l}{Estimate} &Std. error & Variance\\
\hline
1 & 0&39 & 0.0094 & 0.0080 \\
2 & 1&21 & 0.0197 & 0.0443 \\
3 & 0&59 (Intercept) & 0.1668 & 1.9917\\
& 0&21 (Slope) & 0.0511 & 0.0245\\
&-0&78 (correlation) & & \\
\hline
\end{tabular}
\end{center}
\label{tbl:model_par}
\end{table}
\subsection{Power comparison}
Figure \ref{fig:power_loess_effect} shows an overview of estimated power against effect for the three experiments. Responses from each experiment are summarized by effect size and represented as dots, with size indicating the number of responses.
A loess fit to the data gives an estimate of the observed power for different effect sizes, with grey bands indicating simultaneous bootstrap confidence bands \citep{buja}. For comparison, the dashed lines show the corresponding power curves of the conventional tests. Not surprisingly, the observed power for experiments 1 and 2 is lower than the power of the conventional test. It is encouraging to see that visual inference mirrors the power vs effect relationship of conventional testing, and comes quite close in experiment 2, where scatterplots are used as the test statistic.
Note that for effect $E=0$, the power is close to 0.05 (Type-I error) for both experiments 1 and 2, making the fixed offset a reasonable assumption.
\begin{figure}[hbtp]
\centering
\scalebox{0.55}{\includegraphics{power_loess_effect.pdf}}
\caption{Power in comparison to effect for the three experiments. Points indicate subject responses, with size indicating count. Responses are 1 or 0 according to success or failure, respectively, in identifying the actual plot in the lineup. The loess curve (solid line) estimates the observed power, and the surrounding grey band shows a simultaneous bootstrap confidence band. Conventional test power is drawn as a dashed line. For experiment 3, conventional power is based on the slopes of the linear part of the non-contaminated data; power for the contaminated data is shown by cross marks.}
\label{fig:power_loess_effect}
\end{figure}
Results for experiment 3 are quite different. This is the situation where we expect to see the potential of visual inference, and indeed we do: the power of visual inference is always high, and much higher than that of the conventional test at small effect sizes. There is no true conventional power in this situation, because the test's assumptions are violated: the dashed line shows conventional power based on the uncontaminated data, whereas the cross marks show effective power based on the coefficient estimated from the contaminated data.